Conversation
This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only). The `--run-to-run-determinism` flag in `loss_compare.py` now explicitly validates that no test-specific options are provided, raising a `ValueError` if they are.

Co-authored-by: Claude <noreply@anthropic.com>
stack-info: PR: #2339, branch: xmfan/stack/11
> (e.g., the logical progression), or if it's short just omit the bullet list
> entirely.
>
> Disclose that the PR was authored with Claude.
copied over from pytorch's claude md
With this file, we can directly ask Claude code to create a PR for us?
you can always do that. this just makes claude disclose that the PR was co-authored with Claude in the description.
> ngpu: int = 4
> disabled: bool = False
> skip_rocm_test: bool = False
> determinism_test: bool = False  # Run twice and verify losses are identical
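A minimal sketch of how a `determinism_test=True` case could be checked, assuming each run's losses are collected into a step-to-loss dict; `verify_determinism` and its exact-match policy are illustrative, not torchtitan's actual implementation:

```python
# Hypothetical sketch; field names mirror the diff above, but `verify_determinism`
# is an assumption, not torchtitan's real API.
from dataclasses import dataclass


@dataclass
class OverrideDefinitions:
    """Subset of the integration-test config shown in the diff."""

    ngpu: int = 4
    disabled: bool = False
    skip_rocm_test: bool = False
    determinism_test: bool = False  # run twice and verify losses are identical


def verify_determinism(
    losses_a: dict[int, float], losses_b: dict[int, float]
) -> None:
    """Require bitwise-identical losses at every logged step of two runs."""
    if losses_a.keys() != losses_b.keys():
        raise AssertionError("runs logged different steps")
    for step, loss in losses_a.items():
        if loss != losses_b[step]:  # exact equality, not a tolerance check
            raise AssertionError(
                f"step {step}: {loss} != {losses_b[step]}"
            )
```

The exact (rather than approximate) comparison is the point of the test: with deterministic flags set, two identical runs should produce bit-identical losses.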
The point is not only about being deterministic, but also not changing before vs. after
- pytorch nightly updates
- user commits
Is it correct that this PR doesn't address such issues?
This PR just makes sure that when you run the same command twice, it produces the same outputs. By adding this to PR-time CI, you would run H100 CI twice on each PR, both against the same PyTorch nightly.
So this test only guards that determinism is set up correctly and working correctly, right? I think if we make sure the loss doesn't change before vs. after (PyTorch nightly and user commits), it already covers the determinism check:
- If it's not run-to-run deterministic, it's impossible to achieve identical loss before and after.
Can you expand more on the setup for "loss doesn't change before vs. after"?
The existing tests that I see only cover whether the first run on a process always matches the expected loss, not whether the same process will keep producing the same loss (which is what you need when you develop locally).
> if the first run on a process always matches the expected loss

My thought is that the expected loss comes from a deterministic run (we set a deterministic seed, and use deterministic algorithms). The "first run" you are referring to is also a deterministic run. If the deterministic algorithms do not work, or any deterministic setting is missing, it is impossible for these two runs to have identical losses, as randomness will make the losses differ.
In this sense, the current test "first run on a process always matches the expected loss" already covers 1) determinism and 2) any potential changes from user commits and PyTorch nightly. It's a combined effect of both.
Some components persist caches on the machine, like compile, and maybe some other modules. The second run has a warm cache and runs different code paths than the first run. For the compile case, you can always run with TORCHINDUCTOR_FORCE_DISABLE_CACHES, and I don't know what the solution would be for others. We don't need to land this if CI is too constrained.
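The cache-disabling approach mentioned above could look like the following wrapper for a cold-cache run. Only `TORCHINDUCTOR_FORCE_DISABLE_CACHES` comes from the discussion; `CUBLAS_WORKSPACE_CONFIG` is a common deterministic-run setting added here as an assumption, not something this PR necessarily sets:

```shell
# Force torch.compile/Inductor to ignore any caches persisted from earlier
# runs, so a second run exercises the same cold-start code paths as the first.
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
# Commonly required for deterministic cuBLAS kernels on CUDA (an assumption
# here, not taken from the PR).
export CUBLAS_WORKSPACE_CONFIG=:4096:8
echo "TORCHINDUCTOR_FORCE_DISABLE_CACHES=$TORCHINDUCTOR_FORCE_DISABLE_CACHES"
```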
> some components persist cache on the machine, like compile, and maybe some other modules.

That's a valid reason to run 2 consecutive runs, but the current loss compare script is testing a non-compile model. And I feel like the correctness of cached paths should also be guaranteed by finer-granularity unit tests.
> import re
>
> def extract_losses_from_log(log_file: str) -> dict[int, float]:
What are other use cases for this function to put it in an utils file?
What do you mean? This PR uses it in 2 places; one of them is in tests.
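For context, a shared `extract_losses_from_log` could be sketched as below. The log-line format (`step: N  loss: X`) and the regex are assumptions for illustration; the real `torchtitan/tools/loss_utils.py` may use a different pattern:

```python
# Hypothetical sketch of log-based loss extraction, assuming log lines like
# "step: 10  loss: 2.5467".
import re

_LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")


def extract_losses_from_log(log_file: str) -> dict[int, float]:
    """Map training step -> loss parsed from a run's log file."""
    losses: dict[int, float] = {}
    with open(log_file) as f:
        for line in f:
            m = _LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses
```

Returning a step-keyed dict lets both callers (the integration runner and `loss_compare.py`) diff two runs step by step rather than only comparing final losses.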
Add run-to-run determinism testing to H100 CI

Co-authored-by: Claude <noreply@anthropic.com>