Workaround AC HOP mutation issue when tracing token dispatch #1984
Conversation
stack-info: PR: #1984, branch: xmfan/stack/2
Reviewed lines (token dispatch arguments):

    input_shape,
    permuted_indices,
    input_splits,
    output_splits,
These shouldn't be exposed to single-device model code. Plus, I don't think it will work if EP is not used.
If it's getting too hard, maybe we should use local_map / to_local to re-implement MoE.
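For reference, a minimal sketch of the local_map approach, assuming recent PyTorch's torch.distributed.tensor.experimental.local_map; the token_dispatch helper and the placements are hypothetical, not from this PR:

```python
import torch
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.experimental import local_map

def token_dispatch(x: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    # Plain single-device dispatch logic on local shards; no DTensor
    # bookkeeping (input_splits / output_splits / ...) leaks into model code.
    return x[indices]

# local_map unwraps DTensor inputs to their local shards, runs the function,
# and re-wraps the output with the declared placements (the mesh is inferred
# from the inputs at call time).
dispatch = local_map(
    token_dispatch,
    out_placements=(Shard(0),),                    # assumed output sharding
    in_placements=((Shard(0),), (Replicate(),)),   # assumed input shardings
)
```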
Thank you for the fix! Do you think it would require fewer user-side changes if we reimplemented apply_ac as a graph pass?
      max_norm = 1.0  # grad norm clipping
      steps = 1000
    - dataset = "c4"  # supported datasets: c4_test (2K), c4 (177M)
    + dataset = "c4_test"  # supported datasets: c4_test (2K), c4 (177M)
minor: the toml config file was changed (dataset c4 → c4_test).
This change is only needed if you use torch.compile over torch.utils.checkpoint, so a graph pass wouldn't need it. But if you use both eager and graph-based AC, you will need it again.
Yes, what I meant is that if we're going for a compiler-based approach to distributed parallelism in SimpleFSDP, it would make sense to have a specialized apply_ac function that's also compiler-based (and users would not be allowed to use eager checkpoint to implement AC).
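For context, a minimal sketch of the eager-checkpoint-under-compile pattern referred to above, i.e. torch.compile tracing torch.utils.checkpoint into the AC HOP; the Block module is illustrative, not torchtitan's apply_ac:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.lin = nn.Linear(16, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Eager AC: recompute this region during backward instead of saving
        # activations. Under torch.compile this traces into the AC HOP.
        return checkpoint(self._inner, x, use_reentrant=False)

    def _inner(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(x))

model = torch.compile(Block(), fullgraph=True)
out = model(torch.randn(4, 16, requires_grad=True))
out.sum().backward()
```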
FIXES #1935
Stacked PRs:
tlparse: https://fburl.com/sqxd6c0w
Workaround AC HOP mutation issue when tracing token dispatch
Repro:

TORCH_COMPILE_FORCE_DISABLE_CACHES=1 HF_TOKEN=<token> HF_HUB_DISABLE_XET=1 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" with-proxy ./run_train.sh --model.name simple_fsdp.deepseek_v3

This is a problem for SimpleFSDP, where we want to fullgraph the entire model; these "mutations" cause graph breaks.
It is less of a problem outside SimpleFSDP, because we don't currently compile token dispatch.
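A minimal repro sketch of the failure mode described above, assuming the offending mutation is an in-place write to a tensor captured from outside the checkpointed region (the actual code path here is the token-dispatch bookkeeping, and the exact error varies by PyTorch version):

```python
import torch
from torch.utils.checkpoint import checkpoint

buf = torch.zeros(4, 16)  # state that lives outside the checkpointed region

def region(x: torch.Tensor) -> torch.Tensor:
    buf.copy_(x)  # in-place mutation of a captured tensor inside the AC HOP
    return torch.relu(x)

@torch.compile(fullgraph=True)
def fn(x: torch.Tensor) -> torch.Tensor:
    return checkpoint(region, x, use_reentrant=False)

try:
    fn(torch.randn(4, 16, requires_grad=True))
except Exception as e:
    # Dynamo cannot trace side effects inside a higher-order op; with
    # fullgraph=True the would-be graph break surfaces as an error.
    print(type(e).__name__, ":", e)
```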