fix failed tests on upstream-2026-10-02#2516

Open
xrsrke wants to merge 202 commits into pytorch:main from
NousResearch:phuc/fix_failed_tests_on_upstream-2026-10-02

Conversation


@xrsrke xrsrke commented Mar 6, 2026

No description provided.

jquesnelle and others added 30 commits August 28, 2025 04:52
…into add-grpo

# Conflicts:
#	.github/CODEOWNERS
#	.github/workflows/integration_test_8gpu_h100.yaml
#	.github/workflows/integration_test_8gpu_models.yaml
#	.github/workflows/integration_test_8gpu_torchft.yaml
#	torchtitan/components/checkpoint.py
#	torchtitan/experiments/__init__.py
#	torchtitan/models/attention.py
#	torchtitan/models/deepseek_v3/infra/parallelize.py
#	torchtitan/models/llama3/__init__.py
#	torchtitan/models/llama3/model/args.py
#	torchtitan/models/llama3/model/model.py
#	torchtitan/tools/logging.py
#	torchtitan/train.py
Accumulate n_tokens_seen from fwd/bwd step to calculate MFU/Throughput
jquesnelle and others added 22 commits February 18, 2026 17:43
Rebased LLEP (Least-Loaded Expert Parallelism) from the old
phuc/kimi_k2_with_autotune_llep_optimized_llep branch onto the
latest upstream-2026-10-02 to resolve 86 merge conflicts caused
by 1000+ upstream commits since the original branch point.

New LLEP core files:
- torchtitan/distributed/llep.py: dispatch/combine with LPT routing
- torchtitan/distributed/llep_autotune.py: hyperparameter autotuning
- torchtitan/distributed/llep_kernels.py: Triton kernels
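The LPT routing named above (longest-processing-time-first) can be sketched as a greedy bin-packing pass over per-expert token counts; the function name and call shape here are illustrative, not the actual llep.py API:

```python
import heapq

def lpt_assign(expert_token_counts, n_ranks):
    """Greedy LPT: place each expert (heaviest first) on the currently
    least-loaded rank, minimizing the max tokens any single rank processes."""
    # Min-heap of (current_load, rank_id) so the lightest rank pops first.
    heap = [(0, r) for r in range(n_ranks)]
    heapq.heapify(heap)
    assignment = {}
    # Visit experts by descending token count -- the "longest" jobs first.
    for expert, count in sorted(
        enumerate(expert_token_counts), key=lambda kv: kv[1], reverse=True
    ):
        load, rank = heapq.heappop(heap)
        assignment[expert] = rank
        heapq.heappush(heap, (load + count, rank))
    return assignment
```

For example, counts [90, 10, 40, 60] on 2 ranks yield a perfect 100/100 split (experts 0 and 1 on one rank, 3 and 2 on the other), where naive contiguous placement would give 100/100 only by luck and often far worse.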

Integration points (surgical changes to upstream files):
- moe.py: LLEPConfig, fast_init_*, llep_state in GroupedExperts.forward
- expert_parallel.py: ExpertParallelLLEP class
- job_config.py: LLEP config dataclass
- llama4/parallelize.py: LLEP EP selection logic
- deepseek_v3/__init__.py: LLEP model variants
- train.py: LLEP autotune at startup
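The startup autotune step mentioned for train.py is not shown in this PR page; a generic try-candidates-keep-the-cheapest loop of the kind llep_autotune.py presumably implements might look like this sketch, where the candidate list and cost function are placeholders:

```python
def autotune(candidates, cost_fn):
    """Evaluate each candidate hyperparameter value with cost_fn
    (e.g. a timed trial step or a simulated peak-memory estimate)
    and return the cheapest one."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        cost = cost_fn(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

# Illustrative only: pick a max_tokens_factor under a toy cost model.
best = autotune([1.0, 1.5, 2.0], cost_fn=lambda f: abs(f - 1.5))
```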
The [llep] TOML section (enabled, max_tokens_factor, etc.) was not being
applied to moe_args in DeepSeekV3ModelArgs.update_from_config(), so LLEP
was never actually activated. This caused OOM on the most heavily
loaded GPU, since standard EP does not balance memory across ranks.
- Add debugmodel_ep8_llep_3b flavor (1.75B params, 64 experts, EP=8)
  and debug_model_ep8_llep_3b.toml for benchmarking on new upstream
  (the 9.5B config OOMs due to upstream memory regression)
- Copy 3 missing multinode Kimi K2 LLEP+Muon TOML configs from old branch
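The fix described above, propagating the [llep] TOML section into moe_args, can be sketched as follows; the LLEPConfig field names are taken from the commit text, but the update_from_config shape is an assumption, not the actual torchtitan code:

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class LLEPConfig:
    enabled: bool = False
    max_tokens_factor: float = 1.5

@dataclass
class MoEArgs:
    num_experts: int = 64
    llep: LLEPConfig = field(default_factory=LLEPConfig)

def update_from_config(moe_args, job_config):
    """Copy the parsed [llep] TOML section onto moe_args so the
    dispatch path actually sees the user's settings."""
    llep_section = job_config.get("llep", {})
    moe_args.llep = replace(moe_args.llep, **llep_section)
    return moe_args
```

Without this copy, moe_args.llep keeps its defaults (enabled=False) no matter what the TOML says, which matches the "LLEP was never actually activated" symptom.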
Keep only essential files:
- docs/llep.md (main documentation)
- debug_model_llep.toml (2-GPU smoke test)
- debug_model_ep8_llep_3b.toml (8-GPU benchmark)
- test_llep_toml_override.toml (unit test config)
- debugmodel_llep, debugmodel_ep8_llep_3b model flavors

Removed: optimization report, pr008 cleanup doc, loss comparison
scripts, baseline/stresstest/mini_kimi/kimi_k2/multinode TOMLs,
and their corresponding model flavor definitions.
- Get EP process group from experts' DTensor device_mesh["ep"] instead
  of non-existent MoE._ep_group attribute
- Use moe_module.use_llep and moe_module._llep_config instead of
  non-existent _llep_enabled/_llep_max_tokens_factor attributes
When both expert_parallel_comm_backend="deepep" and llep.enabled=true,
uses DeepEP for balanced steps and falls back to LLEP for imbalanced
steps based on adaptive_threshold.

New classes:
- DeepEPLLEPExpertParallel: per-step dispatch/combine hook that checks
  imbalance ratio and routes to DeepEP or LLEP path accordingly
- DeepEPLLEPMoE: MoE module that passes 5-tuple routing info to experts
  and handles async combine overlap with shared_experts

Wiring: args.py derives moe_impl="deepep_llep" when both flags set,
build_moe() creates DeepEPLLEPMoE, apply_moe_ep_tp() installs the
adaptive expert parallel hooks.
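The per-step backend decision described above can be sketched as a pure function of the routing counts; the exact semantics of adaptive_threshold (here: max-over-mean tokens per rank) are an assumption based on the commit text:

```python
def choose_backend(tokens_per_rank, adaptive_threshold=1.3):
    """Route balanced steps to DeepEP and imbalanced steps to LLEP,
    based on how far the heaviest rank deviates from the mean load."""
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    imbalance = max(tokens_per_rank) / mean if mean > 0 else 1.0
    return "llep" if imbalance > adaptive_threshold else "deepep"
```

A hook of this shape would run in the dispatch path each step, so the choice tracks the actual routing distribution rather than a static config.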
Replace the 3B debugmodel_ep8_llep_3b (too small for LLEP to help)
with the 9.5B debugmodel_ep8_llep (dim=2048, moe_inter_dim=1536,
16 layers, lbs=8) that stresses GPU memory and shows LLEP's benefit.

Benchmark results on 8xB200 (steps 5-20):
- Speed: +10.9% mean TPS (26,370 vs 23,780)
- Memory: 7 GiB spread vs 42 GiB, max 82% vs 97%
- Without LLEP at lbs=10: OOM. With LLEP: runs fine.
- Loss correctness: <0.001 diff by step 130
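The +10.9% headline number follows directly from the two reported TPS figures:

```python
# Mean tokens/sec from the 8xB200 benchmark above (steps 5-20).
baseline_tps = 23_780  # standard EP
llep_tps = 26_370      # LLEP enabled

speedup_pct = (llep_tps - baseline_tps) / baseline_tps * 100
print(f"+{speedup_pct:.1f}% mean TPS")  # +10.9% mean TPS
```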
DeepEPLLEPMoE.forward() was identical to DeepEPMoE.forward(). The
behavioral difference comes entirely from which ExpertParallel hooks
get installed (DeepEPExpertParallel vs DeepEPLLEPExpertParallel),
not from the MoE class itself.
4-config comparison on 9.5B model, 8xB200:
- LLEP (standard): 26,250 TPS, 6 GiB spread
- DeepEP+LLEP (adaptive): 25,820 TPS, 7 GiB spread
- Standard EP: 21,940 TPS, 41 GiB spread
- DeepEP only: 19,640 TPS, 52 GiB spread
The fused_silu_gate Triton kernel (fwd/bwd), FusedSiLUGate autograd
class, and silu_gate_reference were not wired into the training path
(llep.py never imported them). Delete the dead code from llep_kernels.py,
remove test_llep_correctness.py (all tests were for the deleted kernel),
and update docs/llep.md to reflect current state.
Move llep.py, llep_kernels.py, llep_autotune.py into
torchtitan/distributed/llep/ and docs/llep.md as its README.md.
All external imports (from torchtitan.distributed.llep import ...)
remain unchanged since llep/__init__.py preserves the public API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
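Keeping the old import path working after the move relies on re-exports from the new package's __init__.py; a minimal layout sketch (the specific symbol names are assumptions, not the actual LLEP API):

```python
# torchtitan/distributed/llep/__init__.py (sketch)
# Re-export the public API so existing
#   from torchtitan.distributed.llep import ...
# imports keep resolving after llep.py moved into the package.
from .llep import llep_dispatch, llep_combine   # hypothetical names
from .llep_autotune import autotune_llep        # hypothetical name

__all__ = ["llep_dispatch", "llep_combine", "autotune_llep"]
```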
…rmal_

Delete fast_init_trunc_normal_ and fast_init_normal_ from moe.py since
upstream utils.py already provides equivalent trunc_normal_ and normal_.
Wrap init_weights() call in test with torch.no_grad() to match how the
training pipeline invokes it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Least-Loaded Expert Parallelism with new upstream
- Fix GptOssGroupedExperts.init_weights() signature: add missing n_layers
  param to match MoE.init_weights() call at moe/moe.py:1064
- Fix HF checkpoint integration test: make checkpoint_path dynamic based
  on output_dir instead of hardcoded artifacts-to-be-uploaded/ path
- Fix test_checkpoint.py: add process_group kwarg to fake_save methods
- Skip DeepSeek-V3 transformers tokenizer comparison: tokenizer.json (BPE)
  and tokenizer.model (SentencePiece) use different merge implementations
- Add debugmodel_moe_deepep flavor for convergence testing

meta-cla bot commented Mar 6, 2026

Hi @xrsrke!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
