laurieburchell

Organizations

@commoncrawl

Starred repositories

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Python 14 Updated Jul 30, 2025

A fast, comprehensive, ISO 639 library.

Python 46 7 Updated Aug 12, 2025

Universal Declaration of Human Rights

XSLT 20 11 Updated Nov 7, 2025

Parallel corpora for Middle Eastern languages - ACL2025

Python 8 Updated Aug 28, 2025

This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…

Python 5 1 Updated Apr 24, 2026

DuckDB is an analytical in-process SQL database management system

C++ 38,231 3,236 Updated May 15, 2026

Calendar feeds for Picturehouse cinemas

JavaScript 2 Updated Jul 9, 2025
6 Updated May 28, 2025

Process Common Crawl data with Python and Spark

Python 454 94 Updated Mar 26, 2026

Detect and convert the Zawgyi-One font encoding in C++, Java, JavaScript, PHP, and Ruby

Java 262 84 Updated Mar 13, 2025

PALI: Language identification for Perso-Arabic Scripts

Python 11 Updated Jul 11, 2023

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

Jupyter Notebook 233 40 Updated Dec 27, 2018

A library for computing diverse text characteristics and using them to analyze data sets and models with ease.

Python 41 2 Updated Aug 18, 2022

A collection of useful scripts, templates, and examples for clusters using SLURM https://slurm.schedmd.com/

Shell 114 18 Updated Oct 8, 2024

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python 3,050 263 Updated May 6, 2026

Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

68 88 Updated May 9, 2026

ISO 639 and IETF Language Code Lookup Tool

Python 7 1 Updated Oct 7, 2024

A tool to help researchers and product teams understand datasets with the goal of improving data quality, and mitigating fairness and bias issues.

CSS 294 23 Updated Oct 29, 2022

A Survey on Data Selection for Language Models

258 15 Updated Apr 29, 2025

Tutorial about using LLMs for translation and performing MBR on top

Python 3 1 Updated Sep 5, 2024

Library for fast text representation and classification.

HTML 26,528 4,815 Updated Mar 22, 2024

A LaTeX Class for Informatics theses at The University of Edinburgh

TeX 34 11 Updated Feb 10, 2023

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets

Python 227 22 Updated Nov 16, 2024

Sharding program for Paracrawl

Go 2 2 Updated Sep 24, 2025

Universal Romanizer that can convert any Unicode script to Roman (Latin) script

Perl 248 23 Updated Jul 26, 2024

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python 5,951 371 Updated Sep 12, 2025

Text language detection based on trigrams.

Python 16 4 Updated Oct 2, 2023

[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Python 106 4 Updated Apr 14, 2026

Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.

Jupyter Notebook 4 1 Updated Jan 31, 2026