fix failed tests on upstream-2026-10-02 (#2516)
…into add-grpo

# Conflicts:
#	.github/CODEOWNERS
#	.github/workflows/integration_test_8gpu_h100.yaml
#	.github/workflows/integration_test_8gpu_models.yaml
#	.github/workflows/integration_test_8gpu_torchft.yaml
#	torchtitan/components/checkpoint.py
#	torchtitan/experiments/__init__.py
#	torchtitan/models/attention.py
#	torchtitan/models/deepseek_v3/infra/parallelize.py
#	torchtitan/models/llama3/__init__.py
#	torchtitan/models/llama3/model/args.py
#	torchtitan/models/llama3/model/model.py
#	torchtitan/tools/logging.py
#	torchtitan/train.py
add grpo :)
add inference logp IS
feat: qwen3-next
feat: seed-oss support
Accumulate n_tokens_seen from fwd/bwd step to calculate MFU/Throughput
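The accumulation itself can be sketched in a few lines; the class and field names below are illustrative stand-ins, not torchtitan's actual API. The point is to count tokens on every forward/backward micro-step (so gradient accumulation is included) and derive throughput from the running total:

```python
# Hypothetical sketch: accumulate tokens seen across fwd/bwd micro-steps
# and derive tokens/sec. Names (ThroughputTracker, n_tokens_seen) are
# illustrative, not the real torchtitan fields.
class ThroughputTracker:
    def __init__(self) -> None:
        self.n_tokens_seen = 0

    def on_fwd_bwd_step(self, batch_tokens: int) -> None:
        # Called once per micro-step so gradient accumulation is counted.
        self.n_tokens_seen += batch_tokens

    def throughput(self, elapsed_s: float) -> float:
        # Tokens per second over the measured window; MFU follows by
        # dividing achieved FLOP/s (tokens/s * FLOPs-per-token) by peak.
        return self.n_tokens_seen / elapsed_s

tracker = ThroughputTracker()
for _ in range(4):              # e.g. 4 grad-accumulation micro-steps
    tracker.on_fwd_bwd_step(8192)
print(tracker.n_tokens_seen)    # 32768
print(tracker.throughput(2.0))  # 16384.0
```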
Rebased LLEP (Least-Loaded Expert Parallelism) from the old phuc/kimi_k2_with_autotune_llep_optimized_llep branch onto the latest upstream-2026-10-02 to resolve 86 merge conflicts caused by 1000+ upstream commits since the original branch point.

New LLEP core files:
- torchtitan/distributed/llep.py: dispatch/combine with LPT routing
- torchtitan/distributed/llep_autotune.py: hyperparameter autotuning
- torchtitan/distributed/llep_kernels.py: Triton kernels

Integration points (surgical changes to upstream files):
- moe.py: LLEPConfig, fast_init_*, llep_state in GroupedExperts.forward
- expert_parallel.py: ExpertParallelLLEP class
- job_config.py: LLEP config dataclass
- llama4/parallelize.py: LLEP EP selection logic
- deepseek_v3/__init__.py: LLEP model variants
- train.py: LLEP autotune at startup
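The LPT (Longest Processing Time) routing mentioned for llep.py can be sketched as a greedy bin-packing pass: sort experts by token load descending and place each on the currently least-loaded rank. The function below is an illustrative stand-in, not the actual torchtitan implementation:

```python
import heapq

def lpt_assign(expert_token_counts: dict[int, int], n_ranks: int) -> dict[int, int]:
    # LPT greedy: heaviest experts first, each goes to the least-loaded rank.
    # A min-heap of (load, rank) gives the least-loaded rank in O(log n).
    loads = [(0, r) for r in range(n_ranks)]
    heapq.heapify(loads)
    assignment = {}
    for expert, count in sorted(expert_token_counts.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(loads)
        assignment[expert] = rank
        heapq.heappush(loads, (load + count, rank))
    return assignment

# 4 experts with skewed loads across 2 ranks: LPT balances 150 vs 150.
print(lpt_assign({0: 100, 1: 90, 2: 60, 3: 50}, 2))  # {0: 0, 1: 1, 2: 1, 3: 0}
```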
The [llep] TOML section (enabled, max_tokens_factor, etc.) was not being applied to moe_args in DeepSeekV3ModelArgs.update_from_config(), so LLEP was never actually activated. This caused OOM on the imbalanced GPU since standard EP doesn't balance memory across ranks.
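A minimal sketch of the missing propagation step, using hypothetical stand-ins for the real config classes (the actual DeepSeekV3ModelArgs and job config have many more fields):

```python
from dataclasses import dataclass, field

# Stand-ins for the real config objects; field names mirror the [llep]
# TOML section described above but are otherwise illustrative.
@dataclass
class LLEPConfig:
    enabled: bool = False
    max_tokens_factor: float = 2.0

@dataclass
class MoEArgs:
    llep: LLEPConfig = field(default_factory=LLEPConfig)

@dataclass
class ModelArgs:
    moe_args: MoEArgs = field(default_factory=MoEArgs)

    def update_from_config(self, job_config) -> None:
        # The missing step: copy the parsed [llep] section onto moe_args.
        # Without this line moe_args keeps the defaults (enabled=False)
        # and LLEP is never activated.
        self.moe_args.llep = job_config.llep

class Job:  # stand-in for the parsed TOML job config
    llep = LLEPConfig(enabled=True, max_tokens_factor=1.5)

args = ModelArgs()
args.update_from_config(Job)
print(args.moe_args.llep.enabled)  # True
```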
- Add debugmodel_ep8_llep_3b flavor (1.75B params, 64 experts, EP=8) and debug_model_ep8_llep_3b.toml for benchmarking on new upstream (the 9.5B config OOMs due to upstream memory regression)
- Copy 3 missing multinode Kimi K2 LLEP+Muon TOML configs from old branch
Keep only essential files:
- docs/llep.md (main documentation)
- debug_model_llep.toml (2-GPU smoke test)
- debug_model_ep8_llep_3b.toml (8-GPU benchmark)
- test_llep_toml_override.toml (unit test config)
- debugmodel_llep, debugmodel_ep8_llep_3b model flavors

Removed: optimization report, pr008 cleanup doc, loss comparison scripts, baseline/stresstest/mini_kimi/kimi_k2/multinode TOMLs, and their corresponding model flavor definitions.
- Get EP process group from experts' DTensor device_mesh["ep"] instead of non-existent MoE._ep_group attribute
- Use moe_module.use_llep and moe_module._llep_config instead of non-existent _llep_enabled/_llep_max_tokens_factor attributes
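The first fix amounts to changing the attribute path used to reach the process group. The sketch below illustrates that path with minimal stand-ins for torch.distributed's DeviceMesh/DTensor objects (the real classes live in torch.distributed and have much richer APIs):

```python
# Stand-ins for torch.distributed objects, purely to illustrate the
# attribute path device_mesh["ep"].get_group() used after the fix.
class FakeGroup:
    pass

class FakeMesh:
    def __init__(self, group):
        self._group = group

    def __getitem__(self, dim_name):
        # device_mesh["ep"] selects the expert-parallel sub-mesh.
        assert dim_name == "ep"
        return self

    def get_group(self):
        return self._group

class FakeDTensor:
    def __init__(self, mesh):
        self.device_mesh = mesh

def get_ep_group(experts_weight):
    # After the fix: derive the EP group from the sharded weight itself,
    # instead of reading a non-existent MoE._ep_group attribute.
    return experts_weight.device_mesh["ep"].get_group()

group = FakeGroup()
w = FakeDTensor(FakeMesh(group))
assert get_ep_group(w) is group
```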
When both expert_parallel_comm_backend="deepep" and llep.enabled=true, uses DeepEP for balanced steps and falls back to LLEP for imbalanced steps based on adaptive_threshold.

New classes:
- DeepEPLLEPExpertParallel: per-step dispatch/combine hook that checks imbalance ratio and routes to DeepEP or LLEP path accordingly
- DeepEPLLEPMoE: MoE module that passes 5-tuple routing info to experts and handles async combine overlap with shared_experts

Wiring: args.py derives moe_impl="deepep_llep" when both flags set, build_moe() creates DeepEPLLEPMoE, apply_moe_ep_tp() installs the adaptive expert parallel hooks.
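The per-step backend choice can be sketched as a simple imbalance check. The ratio definition below (max rank load over mean rank load) is an assumption about what adaptive_threshold compares against, not taken from the source:

```python
def choose_backend(tokens_per_rank: list[int], adaptive_threshold: float) -> str:
    # Assumed imbalance metric: max rank load / mean rank load.
    # Below the threshold the step is "balanced" and DeepEP is used;
    # above it, fall back to LLEP's load-balancing dispatch.
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    ratio = max(tokens_per_rank) / mean
    return "llep" if ratio > adaptive_threshold else "deepep"

print(choose_backend([100, 100, 100, 100], 1.5))  # deepep
print(choose_backend([400, 50, 50, 100], 1.5))    # llep (ratio = 400/150)
```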
Replace the 3B debugmodel_ep8_llep_3b (too small for LLEP to help) with the 9.5B debugmodel_ep8_llep (dim=2048, moe_inter_dim=1536, 16 layers, lbs=8) that stresses GPU memory and shows LLEP's benefit.

Benchmark results on 8xB200 (steps 5-20):
- Speed: +10.9% mean TPS (26,370 vs 23,780)
- Memory: 7 GiB spread vs 42 GiB, max 82% vs 97%
- Without LLEP at lbs=10: OOM. With LLEP: runs fine.
- Loss correctness: <0.001 diff by step 130
DeepEPLLEPMoE.forward() was identical to DeepEPMoE.forward(). The behavioral difference comes entirely from which ExpertParallel hooks get installed (DeepEPExpertParallel vs DeepEPLLEPExpertParallel), not from the MoE class itself.
4-config comparison on 9.5B model, 8xB200:
- LLEP (standard): 26,250 TPS, 6 GiB spread
- DeepEP+LLEP (adaptive): 25,820 TPS, 7 GiB spread
- Standard EP: 21,940 TPS, 41 GiB spread
- DeepEP only: 19,640 TPS, 52 GiB spread
The fused_silu_gate Triton kernel (fwd/bwd), FusedSiLUGate autograd class, and silu_gate_reference were not wired into the training path (llep.py never imported them). Delete the dead code from llep_kernels.py, remove test_llep_correctness.py (all tests were for the deleted kernel), and update docs/llep.md to reflect current state.
Move llep.py, llep_kernels.py, llep_autotune.py into torchtitan/distributed/llep/ and docs/llep.md as its README.md. All external imports (from torchtitan.distributed.llep import ...) remain unchanged since llep/__init__.py preserves the public API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
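The import-compatibility trick is that a package's __init__.py can re-export names from its submodules, so callers never see the module-to-package move. The self-contained demo below builds a throwaway package on disk (package and function names are illustrative, not the real llep symbols) and shows the old import path still resolving:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Simulate the refactor: a single module becomes a package whose
# __init__.py re-exports the public names, preserving old imports.
root = Path(tempfile.mkdtemp())
pkg = root / "llep_demo"          # illustrative package name
pkg.mkdir()
(pkg / "llep.py").write_text("def llep_dispatch(x):\n    return x\n")
(pkg / "__init__.py").write_text("from .llep import llep_dispatch\n")

sys.path.insert(0, str(root))
mod = importlib.import_module("llep_demo")

# The name moved into a submodule, but the package-level import works.
print(mod.llep_dispatch([1, 2, 3]))  # [1, 2, 3]
```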
…rmal_

Delete fast_init_trunc_normal_ and fast_init_normal_ from moe.py since upstream utils.py already provides equivalent trunc_normal_ and normal_. Wrap init_weights() call in test with torch.no_grad() to match how the training pipeline invokes it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Least-Loaded Expert Parallelism with new upstream
- Fix GptOssGroupedExperts.init_weights() signature: add missing n_layers param to match MoE.init_weights() call at moe/moe.py:1064
- Fix HF checkpoint integration test: make checkpoint_path dynamic based on output_dir instead of hardcoded artifacts-to-be-uploaded/ path
- Fix test_checkpoint.py: add process_group kwarg to fake_save methods
- Skip DeepSeek-V3 transformers tokenizer comparison: tokenizer.json (BPE) and tokenizer.model (SentencePiece) use different merge implementations
- Add debugmodel_moe_deepep flavor for convergence testing