Laurie Burchell laurieburchell

Senior Research Engineer @commoncrawl

33 followers · 47 following

Starred repositories

harvard-lil / warcbench

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Python · 14 stars · Updated Jul 30, 2025

LBeaudoux / iso639

A fast, comprehensive, ISO 639 library.

Python · 46 stars · 7 forks · Updated Aug 12, 2025

eric-muller / udhr

Universal Declaration of Human Rights

XSLT · 20 stars · 11 forks · Updated Nov 7, 2025

DOLMA-NLP / PARME

Parallel corpora for Middle Eastern languages - ACL2025

Python · 8 stars · Updated Aug 28, 2025

impresso / ocr-robust-multilingual-embeddings

This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…

Python · 5 stars · 1 fork · Updated Apr 24, 2026

duckdb / duckdb

DuckDB is an analytical in-process SQL database management system

C++ · 38,231 stars · 3,236 forks · Updated May 15, 2026

jelmervdl / picturehouse-ics

Calendar feeds for Picturehouse cinemas

JavaScript · 2 stars · Updated Jul 9, 2025

transducens / PILAR

6 stars · Updated May 28, 2025

johncoxon / octothorpe

1 star · 2 forks · Updated May 7, 2026

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

Python · 454 stars · 94 forks · Updated Mar 26, 2026

google / myanmar-tools

Detect and convert the Zawgyi-One font encoding in C++, Java, JavaScript, PHP, and Ruby

Java · 262 stars · 84 forks · Updated Mar 13, 2025

sinaahmadi / PersoArabicLID

PALI: Language identification for Perso-Arabic Scripts

Python · 11 stars · Updated Jul 11, 2023

hendrycks / error-detection

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Jupyter Notebook · 233 stars · 40 forks · Updated Dec 27, 2018

facebookresearch / text_characterization_toolkit

A library for computing diverse text characteristics and using them to analyze data sets and models with ease.

Python · 41 stars · 2 forks · Updated Aug 18, 2022

cdt-data-science / cluster-scripts

A collection of useful scripts, templates, and examples for clusters using SLURM https://slurm.schedmd.com/

Shell · 114 stars · 18 forks · Updated Oct 8, 2024

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python · 3,050 stars · 263 forks · Updated May 6, 2026

commoncrawl / web-languages

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

68 stars · 88 forks · Updated May 9, 2026