
Conversation


fegin (Contributor) commented on Dec 15, 2025

Stack from ghstack (oldest at bottom):

CUDA and ROCm produce different loss results, so we need to read from separate loss result files.
The loss results of FSDP and HSDP also start to diverge after the 5th step when running with ROCm, so we adjust for that as well. The divergence itself remains an open question; the AMD folks may want to track down the root cause or confirm that this is expected behavior.
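
The backend check itself is straightforward; below is a minimal sketch of how a test could select the per-backend golden loss file. The file names (`loss_cuda.json`, `loss_rocm.json`) and the `results_dir` parameter are hypothetical, not the actual torchtitan test assets; only the `torch.version.hip` check reflects how ROCm builds are commonly detected.

```python
import json

import torch


def load_expected_losses(results_dir: str) -> list[float]:
    """Pick the per-backend golden loss file.

    torch.version.hip is a version string on ROCm builds and None on
    CUDA builds, so it distinguishes the two backends. The file names
    here are hypothetical placeholders, not the actual test assets.
    """
    backend = "rocm" if torch.version.hip is not None else "cuda"
    with open(f"{results_dir}/loss_{backend}.json") as f:
        return json.load(f)
```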

This PR relands #2156, which hit a landing issue.

fegin added a commit that referenced this pull request on Dec 15, 2025.
meta-cla bot added the CLA Signed label on Dec 15, 2025.

pytorch-bot commented on Dec 15, 2025

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

fegin changed the base branch from gh/fegin/57/base to main on Dec 15, 2025 at 22:44.
fegin changed the title from "Let CUDA and ROCm read different loss result" to "[RELAND] Let CUDA and ROCm read different loss result" on Dec 15, 2025.
pytorch-bot added the ci-no-td label on Dec 15, 2025.
fegin merged commit f64bbad into main on Dec 16, 2025.
9 checks passed

Labels

ci-no-td, ciflow/rocm-mi300, ciflow/8gpu, CLA Signed, module: rocm
