Conversation

Force-pushed 48b2a11 to 07c0ff4.

Need to rebase onto #1776
tianyu-l
left a comment
Looks great in general. Left some comments. May need some rebase on recent & near-future development.
Summary of current status: There are some prerequisite PRs:
Once these PRs are landed, I will refactor:
…ks but reduces mfu for 20b
fegin
left a comment
Please address all the comments before landing. I would appreciate it if you could add the reason why we cannot use AuxOutput to the code. Thanks!
    n_kv_heads: int = 8
    sliding_window_size: int = 128
    attn_mask_type: str = "causal"
    use_flex_attn: bool = True
I explicitly leave the parameter here to be compatible with https://github.com/pytorch/torchtitan/blob/refs/heads/main/torchtitan/train.py#L428, where we need to call get_attention_masks.
But I added a note here to prevent users from changing this flag to false.
tianyu-l
left a comment
I think I found a tricky numerical bug in TP. Maybe we can disable it for now.
    - Up to `window_size - 1` previous tokens

    Args:
        window_size: The maximum number of tokens to attend to (including current token).
            Must be >= 1. A window_size of 1 means attend only to self.
We need to raise a ValueError if the user didn't set window_size >= 1.
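A minimal sketch of the requested validation. The function name and where the check lives are assumptions; only the `window_size >= 1` contract comes from the docstring quoted above.

```python
def validate_window_size(window_size: int) -> int:
    """Hypothetical helper: reject window sizes the docstring disallows."""
    if window_size < 1:
        # window_size == 1 means attend only to self; anything smaller is invalid.
        raise ValueError(f"window_size must be >= 1, got {window_size}")
    return window_size
```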
    mlp1_weight = self.mlp1_weight.to_local()
    mlp1_bias = self.mlp1_bias.to_local()
    mlp2_weight = self.mlp2_weight.to_local()
    mlp2_bias = self.mlp2_bias.to_local()
This might not be correct.
When they are DTensors, x * mlp2_weight + mlp2_bias will have placements Partial + Replicate, and sharding prop can automatically convert Replicate -> Partial first and then perform the addition.
However, when we do to_local, the DTensor placement info is discarded, so instead of adding mlp2_bias once, the net effect will be adding tp_degree * mlp2_bias.
I don't have a clean way to solve this. For forward correctness, we can do mlp2_bias / tp_degree to cancel the extra reduction effect, but the backward will have an extra * tp_degree. Can we wrap mlp2_bias / tp_degree in torch.no_grad so the backward doesn't perform * tp_degree?
You can also disable TP / ETP altogether for gpt-oss for now and leave a TODO.
cc @ezyang @fmassa on difficulties of making TP correct in a local tensor region, when there is bias involved.
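The double-counting described in this comment can be reproduced in a single process with plain tensors. A toy sketch (tp_degree, the shard values, and the bias are all illustrative numbers, not the actual model):

```python
import torch

# With DTensor, a Partial activation plus a Replicate bias is handled by
# sharding prop: the bias is effectively added exactly once after reduction.
# With to_local, each of the tp_degree ranks adds the full bias before the
# all-reduce, so the reduced result contains tp_degree * bias.
tp_degree = 2
partial_shards = [torch.tensor([1.0]), torch.tensor([2.0])]  # shards sum to 3.0
bias = torch.tensor([10.0])

# DTensor semantics: reduce the Partial first, then add the bias once.
dtensor_result = sum(partial_shards) + bias

# to_local semantics: every rank adds bias, then the shard sum is all-reduced.
local_result = sum(shard + bias for shard in partial_shards)
```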
    self.mlp1_weight = nn.Parameter(torch.empty((num_experts, dim, hidden_dim * 2)))
    self.mlp1_bias = nn.Parameter(torch.empty((num_experts, hidden_dim * 2)))
    self.mlp2_weight = nn.Parameter(torch.empty((num_experts, hidden_dim, dim)))
    self.mlp2_bias = nn.Parameter(torch.empty((num_experts, dim)))
This is different from the main moe.py, where we init the weight params to have shape (num_experts, out_dim, in_dim) and transpose them before use. The point is hardware efficiency (mainly in the low-precision case). We also need to change the TP / ETP plans to adapt.
See #1517
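A toy sketch of the (num_experts, out_dim, in_dim) storage convention described above, with the transpose applied right before the matmul. All shapes and names here are illustrative, not the actual moe.py code:

```python
import torch

# Illustrative sizes only.
num_experts, in_dim, out_dim, tokens = 2, 4, 6, 3

# Stored transposed as (num_experts, out_dim, in_dim) for hardware efficiency.
w = torch.randn(num_experts, out_dim, in_dim)
x = torch.randn(num_experts, tokens, in_dim)

# Transpose back to (in_dim, out_dim) at use time for the batched matmul.
y = torch.bmm(x, w.transpose(-2, -1))
```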
    # 2. `self._compiled_flex_attn` is not correct, `self` will be passed in
    #    as the first argument, which will cause an error.
    #    `FlexAttentionWrapper._compiled_flex_attn` is correct.
    # 3. Used `return_lse` instead of `return_aux` because of easier TP module notation
Yeah, can you explain this?
This API will be removed in a future release
Thanks! I also noticed that return_lse is being deprecated. The reason we use it here is that we want to use a TP annotation to change the lse tensor back to a DTensor with placement Shard(1) (in the TP region, it's a plain tensor): https://github.com/pytorch/torchtitan/pull/1754/files#diff-3448dcaf6e8b68f3b66a8e1dd298273de3702f93de406569426cd9e03fd7f97bR222. We cannot annotate an AuxOutput() object directly using the TP APIs, and because we want to keep the model code parallelism-free, we don't want to manually turn AuxOutput.lse into a DTensor.
I think an alternative is to handle it in FlexAttentionWrapper; if that is a better way, I will create another PR to fix it.
tianyu-l
left a comment
LGTM, some minor final comments
    - Up to `window_size - 1` previous tokens

    Args:
        window_size: The maximum number of tokens to attend to (including current token).
            Must be >= 1. A window_size of 1 means attend only to self.
    self.use_grouped_mm = use_grouped_mm
    self.swiglu_limit = swiglu_limit

    self.mlp1_weight = nn.Parameter(torch.empty((num_experts, hidden_dim * 2, dim)))
Please add a comment to indicate which dim is the input dim and which is the output dim.
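A hypothetical sketch of the dim-labeling comment being requested. The sizes are illustrative; only the parameter shape follows the diff above:

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
num_experts, dim, hidden_dim = 2, 4, 8

mlp1_weight = nn.Parameter(
    # shape: (num_experts, out_dim = hidden_dim * 2, in_dim = dim);
    # transposed at use time, matching the main moe.py convention.
    torch.empty((num_experts, hidden_dim * 2, dim))
)
```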
Keep developing on top of #1559. Thanks @KhoomeiK for the initial contribution!
Initialized from the same seed checkpoint, with seed=0 and deterministic=True.
GPT-oss

Run 1: dp_shard = 2
Run 2: dp_shard = 2, TP degree = 2 (NGPU=4)

Run 3: dp_shard = 2, TP degree = 2, EP degree = 2 (NGPU=4)

Run 4: dp_shard = 2, TP degree = 2, EP degree = 2, ETP degree = 2 (NGPU=4)

Run 5: dp_shard = 2, EP degree = 2 (NGPU=2)
