Learn to build Large Language Models from the ground up using historical London texts (1500-1850). This comprehensive 4-part series walks you through every step: data collection, custom tokenization, model training, evaluation, and deployment. We build two models with the same architecture and training pipeline - only the size differs (117M vs 354M parameters). Includes working code, published models, and production-ready inference.
Want to understand the core LLM concepts? This project focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out Generative AI in Action by Amit Bahree. You can learn more about the book here.
Ready to Use: The London Historical SLM is already published and available on Hugging Face!
Blog Series: Building LLMs from Scratch - Part 1 - A complete 4-part series covering the end-to-end process of building historical language models from scratch.
```mermaid
graph LR
    A[Data Collection<br/>218+ sources<br/>1500-1850] --> B[Data Cleaning<br/>Text normalization<br/>Filtering]
    B --> C[Tokenizer Training<br/>30k vocab<br/>150+ special tokens]
    C --> D[Model Training<br/>Two Identical Models<br/>SLM: 117M / Regular: 354M]
    D --> E[Evaluation<br/>Historical accuracy<br/>ROUGE, MMLU]
    E --> F[Deployment<br/>Hugging Face<br/>Local Inference]
    F --> G[Use Cases<br/>Historical text generation<br/>Educational projects<br/>Research applications]
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
This isn't just a model repository; it's a complete educational journey that teaches you how to build LLMs from scratch:
- Data Collection: Gather and process 218+ historical sources from Archive.org
- Custom Tokenization: Build specialized tokenizers for historical language patterns
- Model Architecture: Implement and train GPT-style models from scratch
- Training Infrastructure: Multi-GPU training, checkpointing, and monitoring
- Evaluation: Comprehensive testing with historical accuracy metrics
- Deployment: Publish to Hugging Face and build production inference systems
- Working Code: Every component is fully implemented and documented
- Live Models: Use published models immediately or train your own
- Real Data: 500M+ characters of authentic historical English (1500-1850)
- Production Ready: Professional-grade code with error handling and logging
Documentation Index: 08_documentation/README.md - Browse all guides here!
We build two models that share the same architecture, tokenizer, and training process. The only difference is the number of parameters:
| Model | Parameters | Iterations | Training Time* | Use Case | Best For |
|---|---|---|---|---|---|
| SLM (Small) | 117M | 60,000 | ~8-12 hours | Fast inference, resource-constrained | Development, testing, mobile |
| Regular (Full) | 354M | 60,000 | ~28-32 hours | High-quality generation | Production, research, publishing |
Why Two Models? The SLM is perfect for learning, testing, and resource-constrained environments. The Regular model provides higher quality generation for production use. Both use identical code - just different configuration files!
*Times based on dual GPU training (2x A30 GPUs). Single GPU will take ~2x longer.
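The two parameter counts correspond to the standard GPT-2 Small and GPT-2 Medium shapes. The sketch below is illustrative only (the field names, context length, and exact values are assumptions; the repo's actual configuration files may use different keys):

```python
# Illustrative GPT-2-style shapes that land at roughly 117M and 354M parameters.
# Field names, block_size, and exact values are assumptions for illustration;
# see the repo's configuration files for the real settings.
SLM_CONFIG = {            # ~117M parameters (GPT-2 Small shape)
    "n_layer": 12,        # transformer blocks
    "n_head": 12,         # attention heads per block
    "n_embd": 768,        # hidden / embedding size
    "block_size": 1024,   # context length (assumed)
    "vocab_size": 30000,  # custom historical tokenizer vocabulary
}

REGULAR_CONFIG = {        # ~354M parameters (GPT-2 Medium shape)
    "n_layer": 24,
    "n_head": 16,
    "n_embd": 1024,
    "block_size": 1024,
    "vocab_size": 30000,
}
```

Everything else - data, tokenizer, and training loop - stays the same, which is what keeps the two-model comparison clean.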
Generate historical text in 2 minutes - Perfect for understanding the end result
Published Model: The London Historical SLM is ready to use!
Detailed Guide: See Inference Quick Start
Status: Both PyTorch checkpoint and Hugging Face inference are working.
Prerequisites:
- Python 3.8+ installed
- Internet connection (to download the model)
Ubuntu/Debian users: you also need the `python3-venv` package:

```bash
sudo apt install python3-venv    # For Python 3.8-3.11
sudo apt install python3.12-venv # For Python 3.12+
```
Quick Setup:

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 07_utilities/setup_inference.py
# Use the model
python3 06_inference/inference_unified.py --interactive
```

Try it with Python:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the model (automatically downloads)
tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")
# Generate historical text
prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
inputs['input_ids'],
max_new_tokens=50,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Output: "the most extraordinary sight. The Thames flowed dark beneath London Bridge,
# whilst carriages rattled upon the cobblestones with great urgency. Merchants called
# their wares from Cheapside to Billingsgate, and the smoke from countless chimneys
# did obscure the morning sun."
```

Or test with PyTorch checkpoints (if you have local models):

```bash
# Test SLM checkpoint (117M parameters)
python 06_inference/inference_pytorch.py \
--checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
--prompt "In the year 1834, I walked through the streets of London and witnessed"
# Test Regular checkpoint (354M parameters)
python 06_inference/inference_pytorch.py \
--checkpoint 09_models/checkpoints/checkpoint-60001.pt \
--prompt "In the year 1834, I walked through the streets of London and witnessed"Learn every step of LLM development - Data collection, tokenization, training, and deployment
Prerequisites:
- Python 3.8+ installed
- 8GB+ RAM (16GB+ recommended)
- 100GB+ free disk space
- CUDA GPU (recommended, CPU works but slower)
Ubuntu/Debian users: you also need the `python3-venv` package:

```bash
sudo apt install python3-venv    # For Python 3.8-3.11
sudo apt install python3.12-venv # For Python 3.12+
```
Full Setup:

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 01_environment/setup_environment.py
source activate_env.sh # Linux/Mac
# Download data (218+ historical sources)
python3 02_data_collection/historical_data_collector.py
# Train tokenizer (30k vocab + 150+ special tokens)
python3 03_tokenizer/train_historical_tokenizer.py
# Train model
torchrun --nproc_per_node=2 04_training/train_model_slm.py # SLM (recommended)
# or
torchrun --nproc_per_node=2 04_training/train_model.py # Regular
# Evaluate & test
python3 05_evaluation/run_evaluation.py --mode quick
python3 06_inference/inference_unified.py --interactive
```

Complete Training Guide: See Training Quick Start for detailed instructions
- Tudor English (1500-1600): "thou", "thee", "hath", "doth"
- Stuart Period (1600-1700): Restoration language, court speech
- Georgian Era (1700-1800): Austen-style prose, social commentary
- Victorian Times (1800-1850): Dickens-style narrative, industrial language
- Landmarks: Thames, Westminster, Tower, Fleet Street, Cheapside
- Historical Events: Great Fire, Plague, Civil War, Restoration
- Social Classes: Nobles, merchants, apprentices, beggars
- Professions: Apothecaries, coachmen, watermen, chimneysweeps
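A quick way to see these period styles and London references in action is to vary the era of the prompt when sampling from the published SLM. A minimal sketch (the prompts are invented examples, not excerpts from the training data):

```python
from transformers import pipeline

# Load the published model as a text-generation pipeline (downloads on first use).
generator = pipeline("text-generation", model="bahree/london-historical-slm")

# Invented prompts targeting different periods covered by the corpus.
prompts = [
    "In the year 1588, the apprentice spake thus:",             # Tudor
    "Upon the King's restoration in 1660, the court did",       # Stuart
    "In 1795, the gentlefolk of Mayfair assembled to discuss",  # Georgian
    "By 1848, the factories along the Thames had",              # Victorian
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True,
                    temperature=0.8, top_p=0.95, repetition_penalty=1.2)
    print(out[0]["generated_text"], "\n")
```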
This project includes one of the most comprehensive collections of historical English texts available for language model training, spanning 1500-1850 with 218+ sources and 500M+ characters.
| Period | Sources | Key Content | Language Features |
|---|---|---|---|
| Early Modern (1500-1600) | 18 sources | Street literature, civic docs, religious texts | Colloquial slang, legal language, religious rhetoric |
| Georgian (1700-1800) | 50+ sources | Novels, poetry, political works | Enlightenment prose, political discourse, scientific terminology |
| Victorian (1800-1850) | 50+ sources | Complete Austen, Dickens, BrontΓ«s | Social commentary, industrial language, romantic expression |
- Literature (80+ sources): Complete Austen collection, major Dickens works, BrontΓ« sisters, Romantic poetry
- Non-Fiction (60+ sources): Political treatises, economic texts, scientific works, religious sermons
- Periodicals (25+ sources): The Times, Edinburgh Review, Punch, specialized magazines
- Legal Documents (15+ sources): Acts of Parliament, city charters, legal treatises
- Personal Accounts (20+ sources): Diaries, letters, memoirs from historical figures
Total Sources: 218+ historical texts
Time Period: 1500-1850 (350 years)
Estimated Characters: 500M+ characters
Estimated Tokens: ~125M tokens
Languages: Historical English (1500-1850)
Geographic Focus: London and England
Text Types: 8+ major categories
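The token figure follows from the usual rule of thumb of roughly four characters per token for English text; the arithmetic is simply:

```python
# Back-of-the-envelope token estimate: ~500M characters at ~4 characters/token.
# (A common heuristic for English; the custom tokenizer's true ratio may differ.)
total_chars = 500_000_000
chars_per_token = 4
print(f"~{total_chars / chars_per_token / 1e6:.0f}M tokens")  # ~125M tokens
```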
- Custom GPT Implementation: nanoGPT-style architecture optimized for historical text
- Dual Model Support: SLM (117M) and Regular (354M) parameters
- Custom Tokenizer: 30,000 vocabulary with 150+ historical special tokens
- Multi-GPU Training: Efficient training on single/multiple GPUs
- Unified Inference: Both PyTorch checkpoints and Hugging Face models supported
- Modern Training Code: DDP, checkpointing, and WandB integration
- Professional Evaluation: Historical accuracy, ROUGE, MMLU, HellaSWAG
- Device Safety: CPU evaluation during training to avoid GPU conflicts
- Automatic GPU Detection: Smart GPU configuration and fallback
- Archive.org Integration: Automated download from 99+ sources
- Failed Download Recovery: Manual retry system for failed downloads
- Remote Machine Support: Optimized for remote server execution
- Modular Architecture: Easy to add new data sources
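The real collector lives in 02_data_collection/historical_data_collector.py and manages its own curated source list, retries, and cleaning. As a rough illustration of the kind of query it automates, here is a hedged sketch against Archive.org's public advanced-search API (the endpoint and field names are Archive.org's; the query itself is only an example):

```python
import requests

# Example query against Archive.org's public advanced-search endpoint.
# This is only an illustration; the project's collector script handles the
# actual source list, retries, and downloading.
params = {
    "q": 'subject:"London" AND mediatype:texts AND date:[1500-01-01 TO 1850-12-31]',
    "fl[]": ["identifier", "title", "year"],
    "rows": 10,
    "page": 1,
    "output": "json",
}
resp = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("year"), doc.get("identifier"), "-", doc.get("title"))
```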
- Inference Quick Start - Start here! Use the published model in 2 minutes
- Training Quick Start - Want to train? Get training up and running in 15 minutes
- Training Guide - Complete training for both model variants
- Inference Setup - Deploy and use your trained models
- Data Collection - Download and process historical data
- Synthetic Data - Generate additional training data
- Text Cleaning Process - Complete cleaning pipeline implementation
- Evaluation Quick Reference - Start here! Quick commands and metrics
- Evaluation Guide - Complete manual - How to implement it?
- Tokenizer Vocabulary - Custom tokenizer details
- GPU Tuning Guide - Precision, TF32, batch/seq sizing per GPU
- Hugging Face Publishing - Publish models to Hugging Face
- Deployment Guide - Production deployment options
- WandB Setup - Experiment tracking and monitoring
```
helloLondon/
├── config.py                 # Global configuration system
├── data/                     # Centralized data storage
│   └── london_historical/    # Historical text data
├── 01_environment/           # Environment setup and configuration
├── 02_data_collection/       # Data downloading and processing
├── 03_tokenizer/             # Custom tokenizer training
├── 04_training/              # Model training scripts
├── 05_evaluation/            # Model evaluation and testing
├── 06_inference/             # Model inference and testing
├── 06_testing/               # Test scripts and validation
├── 07_utilities/             # Utility files and lightweight setups
├── 08_documentation/         # Documentation and guides
├── 09_models/                # Trained models and tokenizers
└── 10_scripts/               # Launch scripts and automation
```
This project uses a centralized configuration system (config.py) that manages all paths, settings, and parameters:
- Single Source of Truth: All paths and settings defined in one place
- Easy Maintenance: Change settings once, affects entire project
- Flexible Overrides: Command-line arguments can override any setting
- Professional Structure: Follows industry best practices
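As a hypothetical illustration of this pattern (the class and attribute names below are invented; the real config.py defines its own), every script imports the shared settings and can still accept command-line overrides:

```python
# Hypothetical sketch of the centralized-config pattern; names and defaults
# are invented for illustration and do not match the repo's config.py exactly.
import argparse
from pathlib import Path

class ProjectConfig:
    data_dir = Path("data/london_historical")
    tokenizer_dir = Path("09_models/tokenizers")
    vocab_size = 30000

config = ProjectConfig()

parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", type=Path, default=config.data_dir,
                    help="Override the centralized data directory")
args = parser.parse_args()
print(f"Using data directory: {args.data_dir}")
```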
The custom tokenizer includes 150+ special tokens for historical London texts:
- Historical Language: `<|quoth|>`, `<|afeard|>`, `<|hither|>`, `<|thither|>`
- London Landmarks: `<|tower|>`, `<|newgate|>`, `<|thames|>`, `<|fleet|>`
- Professions: `<|apothecary|>`, `<|coachman|>`, `<|waterman|>`, `<|chimneysweep|>`
- Period Terms: `<|elizabethan|>`, `<|restoration|>`, `<|gunpowder|>`, `<|popish|>`
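To confirm how these tokens behave, you can inspect the published tokenizer directly (assuming the tokens listed above are registered under the same names in the Hugging Face tokenizer):

```python
from transformers import AutoTokenizer

# Inspect a few historical special tokens in the published tokenizer.
tok = AutoTokenizer.from_pretrained("bahree/london-historical-slm")

for token in ["<|thames|>", "<|apothecary|>", "<|quoth|>"]:
    print(token, "->", tok.convert_tokens_to_ids(token))

# A registered special token should surface as a single token rather than
# being split into sub-words.
print(tok.tokenize("The <|apothecary|> walked along the <|thames|>."))
```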
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| GPU | Any | RTX 3080+ |
| Storage | 100GB | 200GB+ |
| CPU | 4 cores | 8+ cores |
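For a rough sense of why an RTX 3080-class GPU is a comfortable fit for the Regular model, the sketch below estimates training-state memory under common assumptions (fp32 weights, gradients, and two Adam moment buffers; activations and batch size add on top of this):

```python
# Rough training-state memory for the 354M-parameter model, assuming fp32
# weights, fp32 gradients, and two fp32 Adam moment buffers. Activations,
# optimizer overhead, and framework buffers are extra.
params = 354e6
bytes_per_value = 4        # fp32
state_copies = 1 + 1 + 2   # weights + gradients + Adam m and v
gb = params * bytes_per_value * state_copies / 1024**3
print(f"~{gb:.1f} GB before activations")  # ~5.3 GB
```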
The helloLondon project follows a systematic approach to building historical language models:
- Requirements → Define model specifications and data needs
- Data Collection → Gather historical texts from 218+ sources
- Data Processing → Clean and structure historical content
- Tokenizer Training → Create custom vocabulary for historical language
- Model Training → Train SLM (117M) or Regular (354M) models
- Evaluation → Test historical accuracy and language quality
- Deployment → Publish to Hugging Face or deploy locally
The training progression shows excellent convergence and stability for both models:
SLM (117M parameters):

```
wandb: Run summary:
wandb: eval/iter        60000
wandb: eval/train_loss  2.74369
wandb: eval/val_loss    3.44089
wandb: eval/val_ppl     31.21462
wandb: train/dt_ms      10217.92054
wandb: train/iter       60000
wandb: train/loss       2.87667
wandb: train/lr         3e-05
wandb: train/mfu        7.50594
```
Regular Model (354M parameters):

```
wandb: Run summary:
wandb: eval/iter        60000
wandb: eval/train_loss  2.70315
wandb: eval/val_loss    3.61921
wandb: eval/val_ppl     37.30823
wandb: train/dt_ms      24681.64754
wandb: train/iter       60000
wandb: train/loss       2.70629
wandb: train/lr         0.0
wandb: train/mfu        7.20423
```
Key Insights:
- Perfect Training Continuity: Both models completed 60,000 steps without interruption
- Excellent Convergence: Both models achieved ~2.7 training loss
- SLM Superior: Better validation performance (3.44 vs 3.62 validation loss)
- Resource Efficiency: SLM achieved better results with 3x fewer parameters
- Stable Performance: Both models showed consistent step times and MFU
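As a quick sanity check, the reported validation perplexities are exactly what you get from perplexity = exp(validation loss):

```python
import math

# Validation perplexity equals exp(validation loss); the summaries above agree.
print(math.exp(3.44089))  # ≈ 31.21 (SLM:     reported val_ppl 31.21462)
print(math.exp(3.61921))  # ≈ 37.31 (Regular: reported val_ppl 37.30823)
```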
- Data Volume: 2-5GB of processed historical text
- Time Coverage: 1500-1850 (350 years)
- Model Size: GPT-2 Small (117M) or Medium (354M) parameters
- Training Time: 7-8 hours (SLM) / 28-32 hours (Regular) on dual GPU (60,000 iterations each)
- Vocabulary: 30,000 tokens with historical language support
- Source Coverage: 218+ historical sources and texts
- Success Rate: ~90% for no-registration sources
This project draws inspiration from two foundational works:
- Source: TimeCapsuleLLM by haykgrigo3
- Contribution: Provided the initial concept and framework for historical language model training
- What I built on: Extended with production-ready infrastructure, comprehensive evaluation frameworks, and deployment to Hugging Face
- Source: nanoGPT by Andrej Karpathy
- Contribution: Provided the core GPT architecture implementation and training methodology
- What I built on: Adapted the architecture for historical text training, added custom tokenization, and integrated modern training practices
I extend these foundational concepts with:
- Production-ready infrastructure with comprehensive error handling and logging
- Custom historical tokenizer optimized for 1500-1850 English
- Advanced data filtering and quality control systems
- Multi-GPU training with Distributed Data Parallel
- Comprehensive evaluation frameworks with historical accuracy metrics
- Hugging Face deployment with professional model cards and documentation
- Unified inference system supporting both PyTorch checkpoints and published models
You are kidding, right?
For issues and questions:
- Check the troubleshooting guide
- Review the documentation
- Check the logs in the respective folders
- Create an issue on GitHub
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to explore historical London through AI?