Common Crawl Foundation
- Edinburgh
- laurieburchell.github.com
- https://orcid.org/0000-0003-0724-350X
- @very-laurie.bsky.social
Starred repositories
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
Parallel corpora for Middle Eastern languages - ACL2025
This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…
DuckDB is an analytical in-process SQL database management system
Calendar feeds for Picturehouse cinemas
Process Common Crawl data with Python and Spark
Detect and convert the Zawgyi-One font encoding in C++, Java, JavaScript, PHP, and Ruby
PALI: Language identification for Perso-Arabic Scripts
A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
A library for computing diverse text characteristics and using them to analyze data sets and models with ease.
A collection of useful scripts, templates, and examples for clusters using SLURM https://slurm.schedmd.com/
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
A tool to help researchers and product teams understand datasets with the goal of improving data quality, and mitigating fairness and bias issues.
A Survey on Data Selection for Language Models
Tutorial about using LLMs for translation and performing MBR on top
Library for fast text representation and classification.
A LaTeX Class for Informatics theses at The University of Edinburgh
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
Universal Romanizer that can convert any Unicode script to Roman (Latin) script
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.



