
Conversation

fegin (Contributor) commented Dec 15, 2025

Stack from ghstack (oldest at bottom):

  1. CUDA and ROCm produce different loss results, so we need to read from different loss-result files. A minimal selection sketch follows below.
  2. The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust the number of compared steps. This is more of an unknown issue; AMD people may want to figure out the root cause or confirm that this is expected behavior.

[ghstack-poisoned]
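To make point 1 concrete, here is a minimal sketch (not the PR's actual change) of how a test script could pick a backend-specific reference-loss file. The `torch.version.hip` probe and the CUDA file path are assumptions; only `tests/assets/losses/llama3_rocm.txt` comes from this PR's diff.

```bash
# Minimal sketch, not this PR's actual code: pick the reference-loss file
# based on whether the installed PyTorch is a ROCm (HIP) build.
# torch.version.hip is None on CUDA builds and a version string on ROCm builds.
if python -c "import sys, torch; sys.exit(0 if torch.version.hip else 1)"; then
  LOSS_FILE="tests/assets/losses/llama3_rocm.txt"   # ROCm path, taken from this PR's diff
else
  LOSS_FILE="tests/assets/losses/llama3_cuda.txt"   # hypothetical CUDA reference-loss path
fi
```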
pytorch-bot bot commented Dec 15, 2025

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

meta-cla bot added the CLA Signed label Dec 15, 2025
pytorch-bot bot commented Dec 15, 2025

Warning: Unknown label ciflow/rocm.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

fegin added a commit that referenced this pull request Dec 15, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the locally built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors in moe.py
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
fegin changed the title from "[Not Ready] Let CUDA and ROCm read different loss result" to "Let CUDA and ROCm read different loss result" Dec 15, 2025
# after 5 steps. Leave for AMD people to fix this if this is
# something that bothers users.
LOSS_FILE="tests/assets/losses/llama3_rocm.txt"
STEPS=5
Contributor

Why set it to 5 if you are already using a separate loss file?

fegin (Contributor, Author) Dec 15, 2025

It's in the comment in the YAML and in the summary of this PR.

The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust the number of compared steps. This is more of an unknown issue; AMD people may want to figure out the root cause or confirm that this is expected behavior.

Contributor

oh I see, I missed that part.
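For readers following this thread, here is a minimal sketch of the step-count adjustment discussed above, using the same ROCm probe as in the earlier sketch. The default of 10 comparison steps is an assumption; the value 5 matches this PR's diff.

```bash
# Minimal sketch, assuming a hypothetical default of 10 comparison steps.
STEPS=10
if python -c "import sys, torch; sys.exit(0 if torch.version.hip else 1)"; then
  # On ROCm, FSDP and HSDP losses diverge after step 5 (root cause unknown),
  # so only the first 5 steps are compared; 5 matches the value in this PR's diff.
  STEPS=5
fi
```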

fegin requested a review from tianyu-l December 15, 2025 21:45
[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 15, 2025
fegin merged commit 4e1623c into gh/fegin/56/base Dec 15, 2025
9 checks passed
fegin added a commit that referenced this pull request Dec 15, 2025
fegin added a commit that referenced this pull request Dec 16, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm produce different loss results, so we need to read from
different loss-result files.
The loss results of FSDP and HSDP start to diverge after the 5th step when
running with ROCm, so we also need to adjust the number of compared steps.
This is more of an unknown issue; AMD people may want to figure out the
root cause or confirm that this is expected behavior.

**This PR is a reland of #2156 due to a landing issue with the previous PR.**