laurieburchell

Organizations

@commoncrawl

Starred repositories

23 stars written in Python

Hydra is a framework for elegantly configuring complex applications

Python 10,366 843 Updated May 16, 2026

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python 5,955 371 Updated Sep 12, 2025

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Python 3,051 263 Updated May 6, 2026

Process Common Crawl data with Python and Spark

Python 454 94 Updated Mar 26, 2026
Python 272 37 Updated Aug 1, 2025

What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets

Python 227 22 Updated Nov 16, 2024

[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Python 106 4 Updated Apr 14, 2026

Semantic parsers based on encoder-decoder framework

Python 92 22 Updated Mar 8, 2023

Socially-Equitable Language Identification

Python 77 15 Updated Mar 25, 2023

A benchmark for code-switched NLP, ACL 2020

Python 76 56 Updated May 28, 2024

A fast, comprehensive, ISO 639 library.

Python 46 7 Updated Aug 12, 2025

Amharic English Machine Translation Corpus prepared through website crawling and custom preprocessing.

Python 45 28 Updated Aug 2, 2018

A library for computing diverse text characteristics and using them to analyze data sets and models with ease.

Python 41 2 Updated Aug 18, 2022

Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)

Python 27 5 Updated Nov 8, 2023

tree2code: Learning Discrete Syntactic Codes for Structural Diverse Translation

Python 26 2 Updated Dec 27, 2019

Text language detection based on trigrams.

Python 16 4 Updated Oct 2, 2023

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

Python 14 Updated Jul 30, 2025

PALI: Language identification for Perso-Arabic Scripts

Python 11 Updated Jul 11, 2023

Parallel corpora for Middle Eastern languages - ACL2025

Python 8 Updated Aug 28, 2025

ISO 639 and IETF Language Code Lookup Tool

Python 7 1 Updated Oct 7, 2024

This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…

Python 5 1 Updated Apr 24, 2026

Tutorial about using LLMs for translation and performing MBR on top

Python 3 1 Updated Sep 5, 2024

Codebase for the "Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification" paper accepted to the ArabicNLP conference (co-located with EMNLP 2023)

Python 3 Updated Dec 17, 2023