
Conversation


fegin (Contributor) commented on Dec 15, 2025

Stack from ghstack (oldest at bottom):

CUDA and ROCm produce different loss results, so we need to read from separate loss result files.
The loss results of FSDP and HSDP also start to diverge after the 5th step when running with ROCm, so we adjust for that as well. The divergence itself remains an open question; the AMD folks may want to track down the root cause or confirm that this is expected behavior.
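
The backend check itself is straightforward; below is a minimal sketch of how a test could select the per-backend golden loss file. The file names (`loss_cuda.json`, `loss_rocm.json`) and the `results_dir` parameter are hypothetical, not the actual torchtitan test assets; only the `torch.version.hip` check reflects how ROCm builds are commonly detected.

```python
import json

import torch


def load_expected_losses(results_dir: str) -> list[float]:
    """Pick the per-backend golden loss file.

    torch.version.hip is a version string on ROCm builds and None on
    CUDA builds, so it distinguishes the two backends. The file names
    here are hypothetical placeholders, not the actual test assets.
    """
    backend = "rocm" if torch.version.hip is not None else "cuda"
    with open(f"{results_dir}/loss_{backend}.json") as f:
        return json.load(f)
```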

This PR relands #2156, which hit a landing issue.

fegin added a commit that referenced this pull request on Dec 15, 2025.
meta-cla bot added the CLA Signed label on Dec 15, 2025.

pytorch-bot commented on Dec 15, 2025

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

fegin changed the base branch from gh/fegin/57/base to main on Dec 15, 2025 at 22:44.
fegin changed the title from "Let CUDA and ROCm read different loss result" to "[RELAND] Let CUDA and ROCm read different loss result" on Dec 15, 2025.
pytorch-bot added the ci-no-td label on Dec 15, 2025.
fegin merged commit f64bbad into main on Dec 16, 2025.
9 checks passed

Labels

ci-no-td, ciflow/rocm-mi300, ciflow/8gpu, CLA Signed, module: rocm
