Train a tiny GPT-2 on TinyStories, exploring architecture, optimizer, and schedule hyperparameters to minimize validation loss. Inspired by Karpathy's autoresearch.
Baseline configuration:
- Architecture: 2 layers, 64-dim embedding, 2 attention heads, 128-token context, GELU, LayerNorm
- Optimizer: AdamW, lr=3e-4, weight decay 0.01
- Schedule: cosine decay with 100 warmup steps
- Training: batch size 8, 500 steps, 300 s wall-clock limit
- Baseline val_loss: ~3.5
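The baseline above can be collected into a single config object so sweeps only have to override one field at a time. A minimal sketch; the class and field names here are illustrative, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Architecture (baseline values from the list above)
    n_layer: int = 2
    n_embd: int = 64
    n_head: int = 2
    block_size: int = 128      # context length in tokens
    vocab_size: int = 128      # character-level, ASCII-128
    # Optimizer
    lr: float = 3e-4
    weight_decay: float = 0.01
    # Schedule / training loop
    warmup_steps: int = 100
    batch_size: int = 8
    max_steps: int = 500
    max_seconds: float = 300.0

# Example sweep point: a deeper, narrower model at the same budget
deep_cfg = TrainConfig(n_layer=4, n_embd=32)
```

Using `dataclasses.replace(cfg, lr=1e-3)` is another convenient way to derive sweep variants without mutating the baseline.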
Dimensions to explore:
- Depth vs. width trade-offs (more layers vs. a larger embedding dimension)
- Normalization (LayerNorm vs. RMSNorm)
- Position encoding (learned vs. rotary)
- Learning-rate schedules (cosine vs. linear vs. constant)
- Batch-size effects
- Dropout and other regularization
- Activation functions (GELU vs. SiLU vs. SwiGLU)
TinyStories — a synthetic dataset of short stories generated by GPT-3.5/GPT-4, designed for training small language models. This project uses character-level tokenization with an ASCII-128 vocabulary.
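With an ASCII-128 vocabulary, tokenization reduces to mapping each character to its code point. A minimal sketch, assuming non-ASCII characters are simply dropped (the function names are illustrative):

```python
def encode(text: str) -> list[int]:
    """Map each character to its ASCII code point (0..127); non-ASCII is dropped."""
    return list(text.encode("ascii", errors="ignore"))

def decode(tokens: list[int]) -> str:
    """Inverse of encode for token ids in [0, 128)."""
    return "".join(chr(t) for t in tokens)
```

The vocabulary size of 128 is what keeps the embedding and output layers tiny relative to a BPE-tokenized model of the same width.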
See LEADERBOARD.md (auto-updated every 6 hours).