
Conversation

fegin (Contributor) commented Dec 15, 2025

Stack from ghstack (oldest at bottom):

  1. CUDA and ROCm produce different loss results, so we need to read from different loss-result files. A minimal selection sketch follows below.
  2. The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust the number of compared steps. This is more of an unknown issue; AMD people may want to figure out the root cause or confirm that this is expected behavior.

[ghstack-poisoned]
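To make point 1 concrete, here is a minimal sketch (not the PR's actual change) of how a test script could pick a backend-specific reference-loss file. The `torch.version.hip` probe and the CUDA file path are assumptions; only `tests/assets/losses/llama3_rocm.txt` comes from this PR's diff.

```bash
# Minimal sketch, not this PR's actual code: pick the reference-loss file
# based on whether the installed PyTorch is a ROCm (HIP) build.
# torch.version.hip is None on CUDA builds and a version string on ROCm builds.
if python -c "import sys, torch; sys.exit(0 if torch.version.hip else 1)"; then
  LOSS_FILE="tests/assets/losses/llama3_rocm.txt"   # ROCm path, taken from this PR's diff
else
  LOSS_FILE="tests/assets/losses/llama3_cuda.txt"   # hypothetical CUDA reference-loss path
fi
```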
pytorch-bot bot commented Dec 15, 2025

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

meta-cla bot added the CLA Signed label Dec 15, 2025
pytorch-bot bot commented Dec 15, 2025

Warning: Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

fegin added a commit that referenced this pull request Dec 15, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the locally built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors in moe.py
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
fegin changed the title from "[Not Ready] Let CUDA and ROCm read different loss result" to "Let CUDA and ROCm read different loss result" Dec 15, 2025
# after 5 steps. Leave for AMD people to fix this if this is
# something that bothers users.
LOSS_FILE="tests/assets/losses/llama3_rocm.txt"
STEPS=5
Contributor

Why set it to 5 if you are already using a separate loss file?

fegin (Contributor, Author) Dec 15, 2025

It's in the comment in the YAML and in the summary of this PR.

The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust the number of compared steps. This is more of an unknown issue; AMD people may want to figure out the root cause or confirm that this is expected behavior.

Contributor

oh I see, I missed that part.
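For readers following this thread, here is a minimal sketch of the step-count adjustment discussed above, using the same ROCm probe as in the earlier sketch. The default of 10 comparison steps is an assumption; the value 5 matches this PR's diff.

```bash
# Minimal sketch, assuming a hypothetical default of 10 comparison steps.
STEPS=10
if python -c "import sys, torch; sys.exit(0 if torch.version.hip else 1)"; then
  # On ROCm, FSDP and HSDP losses diverge after step 5 (root cause unknown),
  # so only the first 5 steps are compared; 5 matches the value in this PR's diff.
  STEPS=5
fi
```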

fegin requested a review from tianyu-l December 15, 2025 21:45
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
fegin merged commit 4e1623c into gh/fegin/56/base Dec 15, 2025
9 checks passed
fegin added a commit that referenced this pull request Dec 15, 2025
fegin added a commit that referenced this pull request Dec 16, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm produce different loss results, so we need to read from
different loss-result files.
The loss results of FSDP and HSDP start to diverge after the 5th step when
running with ROCm, so we also need to adjust the number of compared steps.
This is more of an unknown issue; AMD people may want to figure out the
root cause or confirm that this is expected behavior.

**This PR is a reland of #2156 due to a landing issue with the previous PR.**