# gpt2-tinystories

Train a tiny GPT-2 on TinyStories. Explore architecture, optimizer, and schedule hyperparameters to minimize validation loss.

Inspired by Karpathy's autoresearch.

## Baseline

- Architecture: 2 layers, 64-dim embeddings, 2 heads, 128-token context, GELU, LayerNorm
- Optimizer: AdamW, lr=3e-4, weight decay 0.01
- Schedule: cosine decay, 100 warmup steps
- Training: batch size 8, 500 steps, 300 s wall-clock limit
- Baseline val_loss: ~3.5
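The baseline schedule (cosine decay after linear warmup) can be sketched as a small pure-Python helper. The function name `lr_at` and the `min_lr` floor are hypothetical; the defaults mirror the baseline above (lr=3e-4, 100 warmup steps, 500 total steps).

```python
import math

def lr_at(step, max_lr=3e-4, warmup=100, total=500, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Hypothetical helper matching the baseline hyperparameters above.
    """
    if step < warmup:
        # Linear warmup from ~0 up to max_lr over `warmup` steps.
        return max_lr * (step + 1) / warmup
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate peaks at `max_lr` exactly when warmup ends and reaches `min_lr` at the final step.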

## What to Explore

- Depth vs. width tradeoffs (more layers vs. larger dimensions)
- Normalization (LayerNorm vs. RMSNorm)
- Position encoding (learned vs. rotary)
- Learning rate schedules (cosine vs. linear vs. constant)
- Batch size effects
- Dropout and regularization
- Activation functions (GELU vs. SiLU vs. SwiGLU)
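To make the LayerNorm vs. RMSNorm comparison concrete, here is a minimal pure-Python sketch of both (learned scale/bias omitted for brevity; function names are illustrative). RMSNorm drops the mean subtraction, so it is cheaper and coincides with LayerNorm on zero-mean inputs.

```python
import math

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, then divide by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    # RMSNorm: divide by the root-mean-square only; no mean subtraction.
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]
```

On a zero-mean vector such as `[1, -1, 2, -2]` the two produce identical outputs, which is one intuition for why RMSNorm often works as a drop-in replacement.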

## Dataset

TinyStories: a synthetic dataset of short stories generated by GPT-3.5/4, designed for training small language models. Tokenization is character-level with an ASCII-128 vocabulary.
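A character-level ASCII-128 tokenizer can be sketched in a few lines. This assumes each character maps directly to its ASCII code point and that non-ASCII characters are dropped; the `encode`/`decode` names are illustrative, not the repo's actual API.

```python
def encode(text):
    # Map each ASCII character to its code point; drop anything outside ASCII-128.
    return [ord(c) for c in text if ord(c) < 128]

def decode(ids):
    # Inverse mapping: code points back to characters.
    return "".join(chr(i) for i in ids)
```

With a 128-token vocabulary the embedding table is tiny, which keeps parameter count dominated by the transformer blocks rather than the embeddings.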

## Leaderboard

See LEADERBOARD.md (auto-updated every 6 hours).