fix failed tests on upstream-2026-10-02#2516

Open
xrsrke wants to merge 202 commits into pytorch:main from
NousResearch:phuc/fix_failed_tests_on_upstream-2026-10-02

Conversation


@xrsrke xrsrke commented Mar 6, 2026

No description provided.

jquesnelle and others added 30 commits August 28, 2025 04:52
…into add-grpo

# Conflicts:
#	.github/CODEOWNERS
#	.github/workflows/integration_test_8gpu_h100.yaml
#	.github/workflows/integration_test_8gpu_models.yaml
#	.github/workflows/integration_test_8gpu_torchft.yaml
#	torchtitan/components/checkpoint.py
#	torchtitan/experiments/__init__.py
#	torchtitan/models/attention.py
#	torchtitan/models/deepseek_v3/infra/parallelize.py
#	torchtitan/models/llama3/__init__.py
#	torchtitan/models/llama3/model/args.py
#	torchtitan/models/llama3/model/model.py
#	torchtitan/tools/logging.py
#	torchtitan/train.py
Accumulate n_tokens_seen from fwd/bwd step to calculate MFU/Throughput
jquesnelle and others added 22 commits February 18, 2026 17:43
Rebased LLEP (Least-Loaded Expert Parallelism) from the old
phuc/kimi_k2_with_autotune_llep_optimized_llep branch onto the
latest upstream-2026-10-02 to resolve 86 merge conflicts caused
by 1000+ upstream commits since the original branch point.

New LLEP core files:
- torchtitan/distributed/llep.py: dispatch/combine with LPT routing
- torchtitan/distributed/llep_autotune.py: hyperparameter autotuning
- torchtitan/distributed/llep_kernels.py: Triton kernels
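The LPT routing named above (longest-processing-time-first) can be sketched as a greedy bin-packing pass over per-expert token counts; the function name and call shape here are illustrative, not the actual llep.py API:

```python
import heapq

def lpt_assign(expert_token_counts, n_ranks):
    """Greedy LPT: place each expert (heaviest first) on the currently
    least-loaded rank, minimizing the max tokens any single rank processes."""
    # Min-heap of (current_load, rank_id) so the lightest rank pops first.
    heap = [(0, r) for r in range(n_ranks)]
    heapq.heapify(heap)
    assignment = {}
    # Visit experts by descending token count -- the "longest" jobs first.
    for expert, count in sorted(
        enumerate(expert_token_counts), key=lambda kv: kv[1], reverse=True
    ):
        load, rank = heapq.heappop(heap)
        assignment[expert] = rank
        heapq.heappush(heap, (load + count, rank))
    return assignment
```

For example, counts [90, 10, 40, 60] on 2 ranks yield a perfect 100/100 split (experts 0 and 1 on one rank, 3 and 2 on the other), where naive contiguous placement would give 100/100 only by luck and often far worse.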

Integration points (surgical changes to upstream files):
- moe.py: LLEPConfig, fast_init_*, llep_state in GroupedExperts.forward
- expert_parallel.py: ExpertParallelLLEP class
- job_config.py: LLEP config dataclass
- llama4/parallelize.py: LLEP EP selection logic
- deepseek_v3/__init__.py: LLEP model variants
- train.py: LLEP autotune at startup
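The startup autotune step mentioned for train.py is not shown in this PR page; a generic try-candidates-keep-the-cheapest loop of the kind llep_autotune.py presumably implements might look like this sketch, where the candidate list and cost function are placeholders:

```python
def autotune(candidates, cost_fn):
    """Evaluate each candidate hyperparameter value with cost_fn
    (e.g. a timed trial step or a simulated peak-memory estimate)
    and return the cheapest one."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        cost = cost_fn(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

# Illustrative only: pick a max_tokens_factor under a toy cost model.
best = autotune([1.0, 1.5, 2.0], cost_fn=lambda f: abs(f - 1.5))
```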
The [llep] TOML section (enabled, max_tokens_factor, etc.) was not being
applied to moe_args in DeepSeekV3ModelArgs.update_from_config(), so LLEP
was never actually activated. This caused OOM on the most heavily
loaded GPU, since standard EP does not balance memory across ranks.
- Add debugmodel_ep8_llep_3b flavor (1.75B params, 64 experts, EP=8)
  and debug_model_ep8_llep_3b.toml for benchmarking on new upstream
  (the 9.5B config OOMs due to upstream memory regression)
- Copy 3 missing multinode Kimi K2 LLEP+Muon TOML configs from old branch
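The fix described above, propagating the [llep] TOML section into moe_args, can be sketched as follows; the LLEPConfig field names are taken from the commit text, but the update_from_config shape is an assumption, not the actual torchtitan code:

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class LLEPConfig:
    enabled: bool = False
    max_tokens_factor: float = 1.5

@dataclass
class MoEArgs:
    num_experts: int = 64
    llep: LLEPConfig = field(default_factory=LLEPConfig)

def update_from_config(moe_args, job_config):
    """Copy the parsed [llep] TOML section onto moe_args so the
    dispatch path actually sees the user's settings."""
    llep_section = job_config.get("llep", {})
    moe_args.llep = replace(moe_args.llep, **llep_section)
    return moe_args
```

Without this copy, moe_args.llep keeps its defaults (enabled=False) no matter what the TOML says, which matches the "LLEP was never actually activated" symptom.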
Keep only essential files:
- docs/llep.md (main documentation)
- debug_model_llep.toml (2-GPU smoke test)
- debug_model_ep8_llep_3b.toml (8-GPU benchmark)
- test_llep_toml_override.toml (unit test config)
- debugmodel_llep, debugmodel_ep8_llep_3b model flavors

Removed: optimization report, pr008 cleanup doc, loss comparison
scripts, baseline/stresstest/mini_kimi/kimi_k2/multinode TOMLs,
and their corresponding model flavor definitions.
- Get EP process group from experts' DTensor device_mesh["ep"] instead
  of non-existent MoE._ep_group attribute
- Use moe_module.use_llep and moe_module._llep_config instead of
  non-existent _llep_enabled/_llep_max_tokens_factor attributes
When both expert_parallel_comm_backend="deepep" and llep.enabled=true,
uses DeepEP for balanced steps and falls back to LLEP for imbalanced
steps based on adaptive_threshold.

New classes:
- DeepEPLLEPExpertParallel: per-step dispatch/combine hook that checks
  imbalance ratio and routes to DeepEP or LLEP path accordingly
- DeepEPLLEPMoE: MoE module that passes 5-tuple routing info to experts
  and handles async combine overlap with shared_experts

Wiring: args.py derives moe_impl="deepep_llep" when both flags set,
build_moe() creates DeepEPLLEPMoE, apply_moe_ep_tp() installs the
adaptive expert parallel hooks.
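The per-step backend decision described above can be sketched as a pure function of the routing counts; the exact semantics of adaptive_threshold (here: max-over-mean tokens per rank) are an assumption based on the commit text:

```python
def choose_backend(tokens_per_rank, adaptive_threshold=1.3):
    """Route balanced steps to DeepEP and imbalanced steps to LLEP,
    based on how far the heaviest rank deviates from the mean load."""
    mean = sum(tokens_per_rank) / len(tokens_per_rank)
    imbalance = max(tokens_per_rank) / mean if mean > 0 else 1.0
    return "llep" if imbalance > adaptive_threshold else "deepep"
```

A hook of this shape would run in the dispatch path each step, so the choice tracks the actual routing distribution rather than a static config.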
Replace the 3B debugmodel_ep8_llep_3b (too small for LLEP to help)
with the 9.5B debugmodel_ep8_llep (dim=2048, moe_inter_dim=1536,
16 layers, lbs=8) that stresses GPU memory and shows LLEP's benefit.

Benchmark results on 8xB200 (steps 5-20):
- Speed: +10.9% mean TPS (26,370 vs 23,780)
- Memory: 7 GiB spread vs 42 GiB, max 82% vs 97%
- Without LLEP at lbs=10: OOM. With LLEP: runs fine.
- Loss correctness: <0.001 diff by step 130
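The +10.9% headline number follows directly from the two reported TPS figures:

```python
# Mean tokens/sec from the 8xB200 benchmark above (steps 5-20).
baseline_tps = 23_780  # standard EP
llep_tps = 26_370      # LLEP enabled

speedup_pct = (llep_tps - baseline_tps) / baseline_tps * 100
print(f"+{speedup_pct:.1f}% mean TPS")  # +10.9% mean TPS
```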
DeepEPLLEPMoE.forward() was identical to DeepEPMoE.forward(). The
behavioral difference comes entirely from which ExpertParallel hooks
get installed (DeepEPExpertParallel vs DeepEPLLEPExpertParallel),
not from the MoE class itself.
4-config comparison on 9.5B model, 8xB200:
- LLEP (standard): 26,250 TPS, 6 GiB spread
- DeepEP+LLEP (adaptive): 25,820 TPS, 7 GiB spread
- Standard EP: 21,940 TPS, 41 GiB spread
- DeepEP only: 19,640 TPS, 52 GiB spread
The fused_silu_gate Triton kernel (fwd/bwd), FusedSiLUGate autograd
class, and silu_gate_reference were not wired into the training path
(llep.py never imported them). Delete the dead code from llep_kernels.py,
remove test_llep_correctness.py (all tests were for the deleted kernel),
and update docs/llep.md to reflect current state.
Move llep.py, llep_kernels.py, llep_autotune.py into
torchtitan/distributed/llep/ and docs/llep.md as its README.md.
All external imports (from torchtitan.distributed.llep import ...)
remain unchanged since llep/__init__.py preserves the public API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
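Keeping the old import path working after the move relies on re-exports from the new package's __init__.py; a minimal layout sketch (the specific symbol names are assumptions, not the actual LLEP API):

```python
# torchtitan/distributed/llep/__init__.py (sketch)
# Re-export the public API so existing
#   from torchtitan.distributed.llep import ...
# imports keep resolving after llep.py moved into the package.
from .llep import llep_dispatch, llep_combine   # hypothetical names
from .llep_autotune import autotune_llep        # hypothetical name

__all__ = ["llep_dispatch", "llep_combine", "autotune_llep"]
```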
…rmal_

Delete fast_init_trunc_normal_ and fast_init_normal_ from moe.py since
upstream utils.py already provides equivalent trunc_normal_ and normal_.
Wrap init_weights() call in test with torch.no_grad() to match how the
training pipeline invokes it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Least-Loaded Expert Parallelism with new upstream
- Fix GptOssGroupedExperts.init_weights() signature: add missing n_layers
  param to match MoE.init_weights() call at moe/moe.py:1064
- Fix HF checkpoint integration test: make checkpoint_path dynamic based
  on output_dir instead of hardcoded artifacts-to-be-uploaded/ path
- Fix test_checkpoint.py: add process_group kwarg to fake_save methods
- Skip DeepSeek-V3 transformers tokenizer comparison: tokenizer.json (BPE)
  and tokenizer.model (SentencePiece) use different merge implementations
- Add debugmodel_moe_deepep flavor for convergence testing

meta-cla bot commented Mar 6, 2026

Hi @xrsrke!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
