Let CUDA and ROCm read different loss result #2156
Conversation
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2156
* __->__ #2155

This assumes that the locally built version has the same parent folder as torchtitan. Also fixes some pyrefly errors for moe.py.
```bash
# after 5 steps. Leave for AMD people to fix this if this is
# something that bothers users.
LOSS_FILE="tests/assets/losses/llama3_rocm.txt"
STEPS=5
```
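For context, a minimal sketch (not torchtitan's actual test code) of how a per-backend selection like this could be expressed; the CUDA loss-file path, the default step count, and the use of `torch.version.hip` to detect ROCm are assumptions for illustration:

```python
import torch

# torch.version.hip is a string on ROCm builds and None on CUDA builds
if torch.version.hip is not None:
    loss_file = "tests/assets/losses/llama3_rocm.txt"
    steps = 5   # FSDP/HSDP losses diverge after the 5th step on ROCm
else:
    loss_file = "tests/assets/losses/llama3_cuda.txt"  # assumed CUDA counterpart
    steps = 10  # hypothetical default step count
```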
Why set it to 5 if you are already using a separate loss file?
It's explained in the comment in the yaml and in the summary of this PR.
The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust this. But this is more of an unknown issue that AMD people may want to investigate, either to figure out the root cause or to confirm that this is expected behavior.
oh I see, I missed that part.
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* __->__ #2157

CUDA and ROCm produce different loss results, so we need to read from different loss result files. The loss results of FSDP and HSDP start to diverge after the 5th step when running with ROCm, so we also need to adjust this. But this is more of an unknown issue that AMD people may want to investigate, either to figure out the root cause or to confirm that this is expected behavior.

**This PR is a reland of #2156, due to a landing issue with the previous PR.**
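A minimal sketch of how a test harness could check recorded losses against the per-backend reference file up to the configured step count; the helper name, the tolerance, and the assumption that the reference file stores one loss value per line are illustrative, not torchtitan's actual API:

```python
def check_losses(actual: list[float], loss_file: str, steps: int, atol: float = 1e-4) -> None:
    # Read the expected losses (assumed: one float per line) and compare step by step.
    with open(loss_file) as f:
        expected = [float(line) for line in f if line.strip()]
    for step, (a, e) in enumerate(zip(actual[:steps], expected[:steps]), start=1):
        assert abs(a - e) <= atol, f"loss mismatch at step {step}: {a:.6f} vs {e:.6f}"
```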