[llama4] enable expert parallel on the same device mesh as tp (tp2ep) #1269
hann-wang wants to merge 8 commits into pytorch:main from
Conversation
Thank you for the PR! I'll take a look.
tianyu-l left a comment
Thank you very much for the PR! I think the idea sounds very interesting.
I have some high-level questions:
- Compared with PR #731, which you refer to, this tp2ep implementation is all-to-all based rather than all-gather / reduce-scatter based (see the schematic sketch after this list). Do you have any sense of which is more efficient, assuming both are correct?
- Personally I think the implementation is a bit too intrusive into the model code, whereas torchtitan tries to avoid exactly that (https://github.com/pytorch/torchtitan/blob/main/README.md?plain=1#L38). Do you think there is a chance you could make it cleaner?
- Do you have any testing showing that your implementation is correct, e.g. loss curves compared with single-device training?
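For concreteness, here is a schematic of the all-gather / reduce-scatter pattern the first question refers to, based on my reading of the #731-style approach; the function names, the use of functional collectives, and the zero-fill routing detail are illustrative assumptions rather than the actual #731 code:

```python
import torch
import torch.distributed as dist
from torch.distributed._functional_collectives import all_gather_tensor, reduce_scatter_tensor

def moe_allgather_reducescatter(x_shard: torch.Tensor, local_experts, ep_group: dist.ProcessGroup):
    """x_shard: (local_seq_len, dim), sharded along seqlen across the EP mesh."""
    # Replicate all tokens on every rank.
    x_full = all_gather_tensor(x_shard, gather_dim=0, group=ep_group)
    # Each rank runs only the experts it owns; tokens routed to experts on
    # other ranks contribute zeros here.
    partial_out = local_experts(x_full)
    # Sum the partial expert outputs and re-shard along seqlen.
    return reduce_scatter_tensor(partial_out, "sum", scatter_dim=0, group=ep_group)
```

By contrast, the all-to-all approach in this PR exchanges only the tokens each expert actually needs (sketched in the PR description below).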
| "moe": | ||
| PrepareModuleInputOutput( | ||
| input_layouts=(Shard(1), ), | ||
| desired_input_layouts=(Shard(1), ), |
If I understand correctly, the input to the router is sharded. That might break the semantics / correctness of the load-balancing algorithm, given that the update to self.tokens_per_expert is local to each EP rank.
https://github.com/pytorch/torchtitan/pull/1269/files#diff-87cc24d85c768f0b3d1f5c54cca39dc9de52ee20e8f601814c3200722901aee5R293
Thank you for pointing out this issue. We need an all_reduce across all EP groups.
Fixed in b87aa1e
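For reference, a minimal sketch of that kind of fix; the function boundary, the `ep_group` handle, and where the call is placed are assumptions for illustration, not necessarily what b87aa1e does:

```python
import torch
import torch.distributed as dist

def sync_tokens_per_expert(tokens_per_expert: torch.Tensor, ep_group: dist.ProcessGroup) -> None:
    # Sum per-expert token counts over every rank in the EP group so the
    # load-balancing statistics see the full (global) batch rather than
    # only the seqlen shard local to this rank.
    dist.all_reduce(tokens_per_expert, op=dist.ReduceOp.SUM, group=ep_group)
```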

This PR is built on top of the concept introduced in #731.
In this implementation, the input to the MoE module is sharded along the seqlen dimension rather than being replicated. After gathering tokens from different EP ranks using `all_to_all_single_autograd`, the output tokens remain sharded along the seqlen dimension. To activate this feature, set `enable_tp2ep = true` in the configuration file.

cc @tianyu-l
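To make the data flow above concrete, here is a minimal sketch of the dispatch-and-combine round trip; the helper name, the split-size bookkeeping, and the import path for `all_to_all_single_autograd` are assumptions for illustration, and the PR's actual routing code is more involved:

```python
import torch
import torch.distributed as dist
# In recent PyTorch this lives in the functional collectives module;
# the PR may import it from elsewhere.
from torch.distributed._functional_collectives import all_to_all_single_autograd

def moe_dispatch_combine(
    x_shard: torch.Tensor,      # (num_local_tokens, dim): tokens from this rank's seqlen shard
    input_splits: list[int],    # how many tokens this rank sends to each EP rank
    output_splits: list[int],   # how many tokens this rank receives from each EP rank
    local_experts,              # callable running the experts owned by this rank
    ep_group: dist.ProcessGroup,
) -> torch.Tensor:
    # Dispatch: send each token to the EP rank that owns its selected expert.
    recv = all_to_all_single_autograd(x_shard, output_splits, input_splits, ep_group)
    # Run only the local experts on the received tokens.
    out = local_experts(recv)
    # Combine: reverse the exchange, so the result is again sharded along seqlen.
    return all_to_all_single_autograd(out, input_splits, output_splits, ep_group)
```

Since both all-to-alls are autograd-aware, gradients flow back along the same routes in the backward pass.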