FlashMLA: Efficient Multi-head Latent Attention Kernels
DeepEP: an efficient expert-parallel communication library
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DualPipe: a bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training
3FS: a high-performance distributed file system designed to address the challenges of AI training and inference workloads