Laurie Burchell laurieburchell

Senior Research Engineer @commoncrawl

33 followers · 47 following

Starred repositories

23 stars written in Python

facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications

Python 10,366 843 Updated May 16, 2026

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python 5,955 371 Updated Sep 12, 2025

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python 3,051 263 Updated May 6, 2026

commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark

Python 454 94 Updated Mar 26, 2026

google-research / url-nlp

Python 272 37 Updated Aug 1, 2025

allenai / wimbd

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets

Python 227 22 Updated Nov 16, 2024

cisnlp / Glot500

[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Python 106 4 Updated Apr 14, 2026

berlino / tensor2struct-public

Semantic parsers based on encoder-decoder framework

Python 92 22 Updated Mar 8, 2023

davidjurgens / equilid

Socially-Equitable Language Identification

Python 77 15 Updated Mar 25, 2023

microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020

Python 76 56 Updated May 28, 2024

LBeaudoux / iso639

A fast, comprehensive, ISO 639 library.

Python 46 7 Updated Aug 12, 2025

MarsPanther / Amharic-English-Machine-Translation-Corpus

Amharic English Machine Translation Corpus prepared through website crawling and custom preprocessing.

Python 45 28 Updated Aug 2, 2018

facebookresearch / text_characterization_toolkit

A library for computing diverse text characteristics and using them to analyze data sets and models with ease.

Python 41 2 Updated Aug 18, 2022

tomhosking / separator

Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)

Python 27 5 Updated Nov 8, 2023

zomux / tree2code

tree2code: Learning Discrete Syntactic Codes for Structural Diverse Translation

Python 26 2 Updated Dec 27, 2019

cyb3rk0tik / pyfranc

Text language detection based on trigrams.

Python 16 4 Updated Oct 2, 2023

harvard-lil / warcbench

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Python 14 Updated Jul 30, 2025

sinaahmadi / PersoArabicLID

PALI: Language identification for Perso-Arabic Scripts

Python 11 Updated Jul 11, 2023

DOLMA-NLP / PARME

Parallel corpora for Middle Eastern languages - ACL2025

Python 8 Updated Aug 28, 2025

thunderpoot / isogloss

ISO 639 and IETF Language Code Lookup Tool

Python 7 1 Updated Oct 7, 2024

impresso / ocr-robust-multilingual-embeddings

This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…

Python 5 1 Updated Apr 24, 2026

ricardorei / MT-Marathon-Tutorial-2024

Tutorial about using LLMs for translation and performing MBR on top

Python 3 1 Updated Sep 5, 2024

AMR-KELEG / ADI-under-scrutiny

Codebase for the "Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification" paper accepted to the ArabicNLP conference (co-located with EMNLP 2023)

Python 3 Updated Dec 17, 2023

Starred topics

Python