Common Crawl Foundation
- Edinburgh
- laurieburchell.github.com
- https://orcid.org/0000-0003-0724-350X
- @very-laurie.bsky.social
Starred repositories
Hydra is a framework for elegantly configuring complex applications
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Process Common Crawl data with Python and Spark
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Semantic parsers based on encoder-decoder framework
Amharic-English machine translation corpus prepared through website crawling and custom preprocessing.
A library for computing diverse text characteristics and using them to analyze data sets and models with ease.
Code for the paper "Factorising Meaning and Form for Intent-Preserving Paraphrasing", Tom Hosking & Mirella Lapata (ACL 2021)
tree2code: Learning Discrete Syntactic Codes for Structural Diverse Translation
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
PALI: Language identification for Perso-Arabic Scripts
Parallel corpora for Middle Eastern languages - ACL 2025
This repository provides datasets, adapted models, and starter code for the ACL 2025 paper "Cheap Character Noise for OCR-Robust Multilingual Embeddings." It supports research on multilingual embed…
Tutorial about using LLMs for translation and performing MBR on top
Codebase for the "Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification" paper accepted to the ArabicNLP conference (co-located with EMNLP 2023)



