This is a scratchpad project that implements different LLM components from scratch and also builds and trains small model variants of popular LLM architectures.
Implemented so far:

- Attention
 
References:

- Sebastian Raschka's amazing book Build a Large Language Model From Scratch
- Transformer paper: Attention Is All You Need (arxiv.org/abs/1706.03762)
- MQA paper: Fast Transformer Decoding: One Write-Head is All You Need (arxiv.org/abs/1911.02150)
- GQA paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (arxiv.org/abs/2305.13245) (the MQA/GQA idea is sketched right after this list)
- DeepSeek-V2 paper (proposed MLA): DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (arxiv.org/abs/2405.04434)
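
The MQA and GQA papers above change standard multi-head attention only in how many key/value heads are kept: MQA shares a single K/V head across all query heads, and GQA shares each K/V head across a group of query heads. Below is a minimal PyTorch sketch of that idea for illustration only; it is not this repo's actual code, and the class and parameter names (`GroupedQueryAttention`, `num_kv_heads`, ...) are placeholders.

```python
# Minimal sketch of grouped-query attention; num_kv_heads == 1 gives MQA,
# num_kv_heads == num_heads gives ordinary MHA. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0 and d_model % num_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one K/V head: repeat K/V to match the query heads.
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal self-attention
        out = out.transpose(1, 2).reshape(b, t, self.num_heads * self.head_dim)
        return self.o_proj(out)


# Example: 8 query heads sharing 2 K/V heads (GQA); num_kv_heads=1 would be MQA.
x = torch.randn(2, 16, 512)
attn = GroupedQueryAttention(d_model=512, num_heads=8, num_kv_heads=2)
print(attn(x).shape)  # torch.Size([2, 16, 512])
```

Setting `num_kv_heads=1` recovers MQA and `num_kv_heads=num_heads` recovers ordinary MHA; the smaller K/V head count is what shrinks the KV cache and memory traffic at decode time.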
 
Changelog:

- 2025/06/26 Project start
- 2025/06/27
- 2025/06/29
  - Add multi-query attention (MQA)
  - Add attention and MHA variants explanation in the attention README
- 2025/07/24
- 2025/08/05
  - Update MLA implementation to follow the DeepSeek-V2 official formula (core idea sketched below)
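
For context on the 2025/08/05 entry: in the DeepSeek-V2 formulation, MLA down-projects each hidden state into a small shared latent `c_kv` (which is what gets cached), then up-projects that latent back into per-head keys and values; the official formula additionally carries a decoupled RoPE key/query path and a query-side compression. The PyTorch sketch below shows only the low-rank compression core under those simplifications; it is not this repo's implementation, and names like `MLACore`, `kv_down`, and `d_latent` are placeholders.

```python
# Minimal sketch of MLA's low-rank KV compression (the core of the DeepSeek-V2 formula).
# Simplified: the decoupled RoPE key/query path and query-side compression are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLACore(nn.Module):
    def __init__(self, d_model: int, num_heads: int, head_dim: int, d_latent: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(d_model, num_heads * head_dim, bias=False)
        # Down-projection to the shared KV latent: this small vector is all that is cached.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)            # W^{DKV}
        # Up-projections reconstruct per-head keys and values from the latent.
        self.k_up = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # W^{UK}
        self.v_up = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # W^{UV}
        self.o_proj = nn.Linear(num_heads * head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        c_kv = self.kv_down(x)  # (b, t, d_latent): the compressed KV representation
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, self.num_heads * self.head_dim))
```

In this simplified form, only `c_kv` would need to be cached per token (size `d_latent` instead of `2 * num_heads * head_dim`), which is where MLA's KV-cache saving relative to MHA/GQA comes from.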