Learn to build Large Language Models from the ground up using historical London texts (1500-1850). This comprehensive 4-part series walks you through every step: data collection, custom tokenization, model training, evaluation, and deployment. We build two models with the same architecture and training pipeline - only the size differs (117M vs 354M parameters). Includes working code, published models, and production-ready inference.
Want to understand the core LLM concepts? This project focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out Generative AI in Action by Amit Bahree. You can learn more about the book here.
Ready to Use: The London Historical SLM is already published and available on Hugging Face!
Blog Series: Building LLMs from Scratch - Part 1 - A complete 4-part series covering the end-to-end process of building historical language models from scratch.
```mermaid
graph LR
    A[Data Collection<br/>218+ sources<br/>1500-1850] --> B[Data Cleaning<br/>Text normalization<br/>Filtering]
    B --> C[Tokenizer Training<br/>30k vocab<br/>150+ special tokens]
    C --> D[Model Training<br/>Two Identical Models<br/>SLM: 117M / Regular: 354M]
    D --> E[Evaluation<br/>Historical accuracy<br/>ROUGE, MMLU]
    E --> F[Deployment<br/>Hugging Face<br/>Local Inference]
    F --> G[Use Cases<br/>Historical text generation<br/>Educational projects<br/>Research applications]
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
This isn't just a model repository; it's a complete educational journey that teaches you how to build LLMs from scratch:
- Data Collection: Gather and process 218+ historical sources from Archive.org
- Custom Tokenization: Build specialized tokenizers for historical language patterns
- Model Architecture: Implement and train GPT-style models from scratch
- Training Infrastructure: Multi-GPU training, checkpointing, and monitoring
- Evaluation: Comprehensive testing with historical accuracy metrics
- Deployment: Publish to Hugging Face and build production inference systems
- Working Code: Every component is fully implemented and documented
- Live Models: Use published models immediately or train your own
- Real Data: 500M+ characters of authentic historical English (1500-1850)
- Production Ready: Professional-grade code with error handling and logging
Documentation Index: 08_documentation/README.md - Browse all guides here!
We build two models that share the same architecture, tokenizer, and training process. The only difference is the number of parameters:
| Model | Parameters | Iterations | Training Time* | Use Case | Best For |
|---|---|---|---|---|---|
| SLM (Small) | 117M | 60,000 | ~8-12 hours | Fast inference, resource-constrained | Development, testing, mobile |
| Regular (Full) | 354M | 60,000 | ~28-32 hours | High-quality generation | Production, research, publishing |
Why Two Models? The SLM is perfect for learning, testing, and resource-constrained environments. The Regular model provides higher quality generation for production use. Both use identical code - just different configuration files!
*Times based on dual GPU training (2x A30 GPUs). Single GPU will take ~2x longer.
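The two parameter counts correspond to the standard GPT-2 Small and GPT-2 Medium shapes. The sketch below is illustrative only (the field names, context length, and exact values are assumptions; the repo's actual configuration files may use different keys):

```python
# Illustrative GPT-2-style shapes that land at roughly 117M and 354M parameters.
# Field names, block_size, and exact values are assumptions for illustration;
# see the repo's configuration files for the real settings.
SLM_CONFIG = {            # ~117M parameters (GPT-2 Small shape)
    "n_layer": 12,        # transformer blocks
    "n_head": 12,         # attention heads per block
    "n_embd": 768,        # hidden / embedding size
    "block_size": 1024,   # context length (assumed)
    "vocab_size": 30000,  # custom historical tokenizer vocabulary
}

REGULAR_CONFIG = {        # ~354M parameters (GPT-2 Medium shape)
    "n_layer": 24,
    "n_head": 16,
    "n_embd": 1024,
    "block_size": 1024,
    "vocab_size": 30000,
}
```

Everything else - data, tokenizer, and training loop - stays the same, which is what keeps the two-model comparison clean.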
Generate historical text in 2 minutes - Perfect for understanding the end result
Published Model: The London Historical SLM is ready to use!
Detailed Guide: See Inference Quick Start
Status: Both PyTorch checkpoint and Hugging Face inference are working.
Prerequisites:
- Python 3.8+ installed
- Internet connection (to download the model)
Ubuntu/Debian users: you also need the `python3-venv` package:

```bash
sudo apt install python3-venv    # For Python 3.8-3.11
sudo apt install python3.12-venv # For Python 3.12+
```
Quick Setup:

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 07_utilities/setup_inference.py
# Use the model
python3 06_inference/inference_unified.py --interactive
```

Try it with Python:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the model (automatically downloads)
tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")
# Generate historical text
prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
inputs['input_ids'],
max_new_tokens=50,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
# Output: "the most extraordinary sight. The Thames flowed dark beneath London Bridge,
# whilst carriages rattled upon the cobblestones with great urgency. Merchants called
# their wares from Cheapside to Billingsgate, and the smoke from countless chimneys
# did obscure the morning sun."
```

Or test with PyTorch checkpoints (if you have local models):

```bash
# Test SLM checkpoint (117M parameters)
python 06_inference/inference_pytorch.py \
--checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
--prompt "In the year 1834, I walked through the streets of London and witnessed"
# Test Regular checkpoint (354M parameters)
python 06_inference/inference_pytorch.py \
--checkpoint 09_models/checkpoints/checkpoint-60001.pt \
--prompt "In the year 1834, I walked through the streets of London and witnessed"Learn every step of LLM development - Data collection, tokenization, training, and deployment
Prerequisites:
- Python 3.8+ installed
- 8GB+ RAM (16GB+ recommended)
- 100GB+ free disk space
- CUDA GPU (recommended, CPU works but slower)
Ubuntu/Debian users: you also need the `python3-venv` package:

```bash
sudo apt install python3-venv    # For Python 3.8-3.11
sudo apt install python3.12-venv # For Python 3.12+
```
Full Setup:

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 01_environment/setup_environment.py
source activate_env.sh # Linux/Mac
# Download data (218+ historical sources)
python3 02_data_collection/historical_data_collector.py
# Train tokenizer (30k vocab + 150+ special tokens)
python3 03_tokenizer/train_historical_tokenizer.py
# Train model
torchrun --nproc_per_node=2 04_training/train_model_slm.py # SLM (recommended)
# or
torchrun --nproc_per_node=2 04_training/train_model.py # Regular
# Evaluate & test
python3 05_evaluation/run_evaluation.py --mode quick
python3 06_inference/inference_unified.py --interactive
```

Complete Training Guide: See Training Quick Start for detailed instructions
- Tudor English (1500-1600): "thou", "thee", "hath", "doth"
- Stuart Period (1600-1700): Restoration language, court speech
- Georgian Era (1700-1800): Austen-style prose, social commentary
- Victorian Times (1800-1850): Dickens-style narrative, industrial language
- Landmarks: Thames, Westminster, Tower, Fleet Street, Cheapside
- Historical Events: Great Fire, Plague, Civil War, Restoration
- Social Classes: Nobles, merchants, apprentices, beggars
- Professions: Apothecaries, coachmen, watermen, chimneysweeps
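A quick way to see these period styles and London references in action is to vary the era of the prompt when sampling from the published SLM. A minimal sketch (the prompts are invented examples, not excerpts from the training data):

```python
from transformers import pipeline

# Load the published model as a text-generation pipeline (downloads on first use).
generator = pipeline("text-generation", model="bahree/london-historical-slm")

# Invented prompts targeting different periods covered by the corpus.
prompts = [
    "In the year 1588, the apprentice spake thus:",             # Tudor
    "Upon the King's restoration in 1660, the court did",       # Stuart
    "In 1795, the gentlefolk of Mayfair assembled to discuss",  # Georgian
    "By 1848, the factories along the Thames had",              # Victorian
]

for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True,
                    temperature=0.8, top_p=0.95, repetition_penalty=1.2)
    print(out[0]["generated_text"], "\n")
```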
This project includes one of the most comprehensive collections of historical English texts available for language model training, spanning 1500-1850 with 218+ sources and 500M+ characters.
| Period | Sources | Key Content | Language Features |
|---|---|---|---|
| Early Modern (1500-1600) | 18 sources | Street literature, civic docs, religious texts | Colloquial slang, legal language, religious rhetoric |
| Georgian (1700-1800) | 50+ sources | Novels, poetry, political works | Enlightenment prose, political discourse, scientific terminology |
| Victorian (1800-1850) | 50+ sources | Complete Austen, Dickens, BrontΓ«s | Social commentary, industrial language, romantic expression |
- Literature (80+ sources): Complete Austen collection, major Dickens works, BrontΓ« sisters, Romantic poetry
- Non-Fiction (60+ sources): Political treatises, economic texts, scientific works, religious sermons
- Periodicals (25+ sources): The Times, Edinburgh Review, Punch, specialized magazines
- Legal Documents (15+ sources): Acts of Parliament, city charters, legal treatises
- Personal Accounts (20+ sources): Diaries, letters, memoirs from historical figures
Total Sources: 218+ historical texts
Time Period: 1500-1850 (350 years)
Estimated Characters: 500M+ characters
Estimated Tokens: ~125M tokens
Languages: Historical English (1500-1850)
Geographic Focus: London and England
Text Types: 8+ major categories
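The token figure follows from the usual rule of thumb of roughly four characters per token for English text; the arithmetic is simply:

```python
# Back-of-the-envelope token estimate: ~500M characters at ~4 characters/token.
# (A common heuristic for English; the custom tokenizer's true ratio may differ.)
total_chars = 500_000_000
chars_per_token = 4
print(f"~{total_chars / chars_per_token / 1e6:.0f}M tokens")  # ~125M tokens
```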
- Custom GPT Implementation: nanoGPT-style architecture optimized for historical text
- Dual Model Support: SLM (117M) and Regular (354M) parameters
- Custom Tokenizer: 30,000 vocabulary with 150+ historical special tokens
- Multi-GPU Training: Efficient training on single/multiple GPUs
- Unified Inference: Both PyTorch checkpoints and Hugging Face models supported
- Modern Training Code: DDP, checkpointing, and WandB integration
- Professional Evaluation: Historical accuracy, ROUGE, MMLU, HellaSWAG
- Device Safety: CPU evaluation during training to avoid GPU conflicts
- Automatic GPU Detection: Smart GPU configuration and fallback
- Archive.org Integration: Automated download from 99+ sources
- Failed Download Recovery: Manual retry system for failed downloads
- Remote Machine Support: Optimized for remote server execution
- Modular Architecture: Easy to add new data sources
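The real collector lives in 02_data_collection/historical_data_collector.py and manages its own curated source list, retries, and cleaning. As a rough illustration of the kind of query it automates, here is a hedged sketch against Archive.org's public advanced-search API (the endpoint and field names are Archive.org's; the query itself is only an example):

```python
import requests

# Example query against Archive.org's public advanced-search endpoint.
# This is only an illustration; the project's collector script handles the
# actual source list, retries, and downloading.
params = {
    "q": 'subject:"London" AND mediatype:texts AND date:[1500-01-01 TO 1850-12-31]',
    "fl[]": ["identifier", "title", "year"],
    "rows": 10,
    "page": 1,
    "output": "json",
}
resp = requests.get("https://archive.org/advancedsearch.php", params=params, timeout=30)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc.get("year"), doc.get("identifier"), "-", doc.get("title"))
```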
- Inference Quick Start - Start here! Use the published model in 2 minutes
- Training Quick Start - Want to train? Get training up and running in 15 minutes
- Training Guide - Complete training for both model variants
- Inference Setup - Deploy and use your trained models
- Data Collection - Download and process historical data
- Synthetic Data - Generate additional training data
- Text Cleaning Process - Complete cleaning pipeline implementation
- Evaluation Quick Reference - Start here! Quick commands and metrics
- Evaluation Guide - Complete manual - How to implement it?
- Tokenizer Vocabulary - Custom tokenizer details
- GPU Tuning Guide - Precision, TF32, batch/seq sizing per GPU
- Hugging Face Publishing - Publish models to Hugging Face
- Deployment Guide - Production deployment options
- WandB Setup - Experiment tracking and monitoring
```
helloLondon/
├── config.py                 # Global configuration system
├── data/                     # Centralized data storage
│   └── london_historical/    # Historical text data
├── 01_environment/           # Environment setup and configuration
├── 02_data_collection/       # Data downloading and processing
├── 03_tokenizer/             # Custom tokenizer training
├── 04_training/              # Model training scripts
├── 05_evaluation/            # Model evaluation and testing
├── 06_inference/             # Model inference and testing
├── 06_testing/               # Test scripts and validation
├── 07_utilities/             # Utility files and lightweight setups
├── 08_documentation/         # Documentation and guides
├── 09_models/                # Trained models and tokenizers
└── 10_scripts/               # Launch scripts and automation
```
This project uses a centralized configuration system (config.py) that manages all paths, settings, and parameters:
- Single Source of Truth: All paths and settings defined in one place
- Easy Maintenance: Change settings once, affects entire project
- Flexible Overrides: Command-line arguments can override any setting
- Professional Structure: Follows industry best practices
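As a hypothetical illustration of this pattern (the class and attribute names below are invented; the real config.py defines its own), every script imports the shared settings and can still accept command-line overrides:

```python
# Hypothetical sketch of the centralized-config pattern; names and defaults
# are invented for illustration and do not match the repo's config.py exactly.
import argparse
from pathlib import Path

class ProjectConfig:
    data_dir = Path("data/london_historical")
    tokenizer_dir = Path("09_models/tokenizers")
    vocab_size = 30000

config = ProjectConfig()

parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", type=Path, default=config.data_dir,
                    help="Override the centralized data directory")
args = parser.parse_args()
print(f"Using data directory: {args.data_dir}")
```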
The custom tokenizer includes 150+ special tokens for historical London texts:
- Historical Language: `<|quoth|>`, `<|afeard|>`, `<|hither|>`, `<|thither|>`
- London Landmarks: `<|tower|>`, `<|newgate|>`, `<|thames|>`, `<|fleet|>`
- Professions: `<|apothecary|>`, `<|coachman|>`, `<|waterman|>`, `<|chimneysweep|>`
- Period Terms: `<|elizabethan|>`, `<|restoration|>`, `<|gunpowder|>`, `<|popish|>`
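To confirm how these tokens behave, you can inspect the published tokenizer directly (assuming the tokens listed above are registered under the same names in the Hugging Face tokenizer):

```python
from transformers import AutoTokenizer

# Inspect a few historical special tokens in the published tokenizer.
tok = AutoTokenizer.from_pretrained("bahree/london-historical-slm")

for token in ["<|thames|>", "<|apothecary|>", "<|quoth|>"]:
    print(token, "->", tok.convert_tokens_to_ids(token))

# A registered special token should surface as a single token rather than
# being split into sub-words.
print(tok.tokenize("The <|apothecary|> walked along the <|thames|>."))
```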
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8GB | 16GB+ |
| GPU | Any | RTX 3080+ |
| Storage | 100GB | 200GB+ |
| CPU | 4 cores | 8+ cores |
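For a rough sense of why an RTX 3080-class GPU is a comfortable fit for the Regular model, the sketch below estimates training-state memory under common assumptions (fp32 weights, gradients, and two Adam moment buffers; activations and batch size add on top of this):

```python
# Rough training-state memory for the 354M-parameter model, assuming fp32
# weights, fp32 gradients, and two fp32 Adam moment buffers. Activations,
# optimizer overhead, and framework buffers are extra.
params = 354e6
bytes_per_value = 4        # fp32
state_copies = 1 + 1 + 2   # weights + gradients + Adam m and v
gb = params * bytes_per_value * state_copies / 1024**3
print(f"~{gb:.1f} GB before activations")  # ~5.3 GB
```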
The helloLondon project follows a systematic approach to building historical language models:
- Requirements → Define model specifications and data needs
- Data Collection → Gather historical texts from 218+ sources
- Data Processing → Clean and structure historical content
- Tokenizer Training → Create custom vocabulary for historical language
- Model Training → Train SLM (117M) or Regular (354M) models
- Evaluation → Test historical accuracy and language quality
- Deployment → Publish to Hugging Face or deploy locally
The training progression shows excellent convergence and stability for both models:
SLM (117M parameters):

```
wandb: Run summary:
wandb: eval/iter        60000
wandb: eval/train_loss  2.74369
wandb: eval/val_loss    3.44089
wandb: eval/val_ppl     31.21462
wandb: train/dt_ms      10217.92054
wandb: train/iter       60000
wandb: train/loss       2.87667
wandb: train/lr         3e-05
wandb: train/mfu        7.50594
```
Regular Model (354M parameters):

```
wandb: Run summary:
wandb: eval/iter        60000
wandb: eval/train_loss  2.70315
wandb: eval/val_loss    3.61921
wandb: eval/val_ppl     37.30823
wandb: train/dt_ms      24681.64754
wandb: train/iter       60000
wandb: train/loss       2.70629
wandb: train/lr         0.0
wandb: train/mfu        7.20423
```
Key Insights:
- Perfect Training Continuity: Both models completed 60,000 steps without interruption
- Excellent Convergence: Both models achieved ~2.7 training loss
- SLM Superior: Better validation performance (3.44 vs 3.62 validation loss)
- Resource Efficiency: SLM achieved better results with 3x fewer parameters
- Stable Performance: Both models showed consistent step times and MFU
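As a quick sanity check, the reported validation perplexities are exactly what you get from perplexity = exp(validation loss):

```python
import math

# Validation perplexity equals exp(validation loss); the summaries above agree.
print(math.exp(3.44089))  # ≈ 31.21 (SLM:     reported val_ppl 31.21462)
print(math.exp(3.61921))  # ≈ 37.31 (Regular: reported val_ppl 37.30823)
```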
- Data Volume: 2-5GB of processed historical text
- Time Coverage: 1500-1850 (350 years)
- Model Size: GPT-2 Small (117M) or Medium (354M) parameters
- Training Time: 7-8 hours (SLM) / 28-32 hours (Regular) on dual GPU (60,000 iterations each)
- Vocabulary: 30,000 tokens with historical language support
- Source Coverage: 218+ historical sources and texts
- Success Rate: ~90% for no-registration sources
This project draws inspiration from two foundational works:
- Source: TimeCapsuleLLM by haykgrigo3
- Contribution: Provided the initial concept and framework for historical language model training
- What I built on: Extended with production-ready infrastructure, comprehensive evaluation frameworks, and deployment to Hugging Face
- Source: nanoGPT by Andrej Karpathy
- Contribution: Provided the core GPT architecture implementation and training methodology
- What I built on: Adapted the architecture for historical text training, added custom tokenization, and integrated modern training practices
I extend these foundational concepts with:
- Production-ready infrastructure with comprehensive error handling and logging
- Custom historical tokenizer optimized for 1500-1850 English
- Advanced data filtering and quality control systems
- Multi-GPU training with Distributed Data Parallel
- Comprehensive evaluation frameworks with historical accuracy metrics
- Hugging Face deployment with professional model cards and documentation
- Unified inference system supporting both PyTorch checkpoints and published models
You are kidding, right?
For issues and questions:
- Check the troubleshooting guide
- Review the documentation
- Check the logs in the respective folders
- Create an issue on GitHub
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to explore historical London through AI?