Conversation
This adds automatic run-to-run determinism verification for H100 integration tests. Tests marked with `determinism_test=True` will run twice with identical configuration and deterministic flags, then compare losses to ensure they match exactly.

The core loss extraction logic is factored into `torchtitan/tools/loss_utils.py` and shared between the integration test runner and the existing `loss_compare.py` script. The scripts directory is now a package to enable clean imports via `python -m scripts.loss_compare`.

The Float8 and HSDP+CP+compile+Float8 tests in the H100 suite are enabled for determinism testing (CUDA only). The `--run-to-run-determinism` flag in `loss_compare.py` now explicitly validates that no test-specific options are provided, raising a `ValueError` if they are.

Co-authored-by: Claude <noreply@anthropic.com>
stack-info: PR: #2339, branch: xmfan/stack/11
> (e.g., the logical progression), or if it's short just omit the bullet list
> entirely.
>
> Disclose that the PR was authored with Claude.
copied over from pytorch's claude md
With this file, we can directly ask Claude code to create a PR for us?
you can always do that. this just makes claude disclose that the PR was co-authored with Claude in the description.
> ngpu: int = 4
> disabled: bool = False
> skip_rocm_test: bool = False
> determinism_test: bool = False  # Run twice and verify losses are identical
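A minimal sketch of how a `determinism_test=True` case could be checked, assuming each run's losses are collected into a step-to-loss dict; `verify_determinism` and its exact-match policy are illustrative, not torchtitan's actual implementation:

```python
# Hypothetical sketch; field names mirror the diff above, but `verify_determinism`
# is an assumption, not torchtitan's real API.
from dataclasses import dataclass


@dataclass
class OverrideDefinitions:
    """Subset of the integration-test config shown in the diff."""

    ngpu: int = 4
    disabled: bool = False
    skip_rocm_test: bool = False
    determinism_test: bool = False  # run twice and verify losses are identical


def verify_determinism(
    losses_a: dict[int, float], losses_b: dict[int, float]
) -> None:
    """Require bitwise-identical losses at every logged step of two runs."""
    if losses_a.keys() != losses_b.keys():
        raise AssertionError("runs logged different steps")
    for step, loss in losses_a.items():
        if loss != losses_b[step]:  # exact equality, not a tolerance check
            raise AssertionError(
                f"step {step}: {loss} != {losses_b[step]}"
            )
```

The exact (rather than approximate) comparison is the point of the test: with deterministic flags set, two identical runs should produce bit-identical losses.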
The point is not only about being deterministic, but also not changing before vs. after
- pytorch nightly updates
- user commits
Is it correct that this PR doesn't address such issues?
This PR just makes sure that when you run the same command twice, it produces the same outputs. By adding this to PR-time CI, you would run H100 CI twice on each PR, both against the same PyTorch nightly.
So this test only guards that determinism is set up correctly and working correctly, right? I think if we make sure the loss doesn't change before vs. after (PyTorch nightly and user commits), it already covers the determinism check:
- If it's not run-to-run deterministic, it's impossible to achieve identical loss before and after.
Can you expand more on the setup for "loss doesn't change before vs. after"?
The existing tests that I see only cover whether the first run on a process always matches the expected loss, not whether the same process will keep producing the same loss (which is what you need when you develop locally).
> if the first run on a process always matches the expected loss

My thought is that the expected loss comes from a deterministic run (we set a deterministic seed, and use deterministic algorithms). The "first run" you are referring to is also a deterministic run. If the deterministic algorithms do not work, or any deterministic setting is missing, it is impossible for these two runs to have identical losses, as randomness will make the losses differ.
In this sense, the current test "first run on a process always matches the expected loss" already covers 1) determinism and 2) any potential changes from user commits and PyTorch nightly. It's a combined effect of both.
Some components persist caches on the machine, like compile, and maybe some other modules. The second run has a warm cache and runs different code paths than the first run. For the compile case, you can always run with TORCHINDUCTOR_FORCE_DISABLE_CACHES, and I don't know what the solution would be for others. We don't need to land this if CI is too constrained.
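The cache-disabling approach mentioned above could look like the following wrapper for a cold-cache run. Only `TORCHINDUCTOR_FORCE_DISABLE_CACHES` comes from the discussion; `CUBLAS_WORKSPACE_CONFIG` is a common deterministic-run setting added here as an assumption, not something this PR necessarily sets:

```shell
# Force torch.compile/Inductor to ignore any caches persisted from earlier
# runs, so a second run exercises the same cold-start code paths as the first.
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
# Commonly required for deterministic cuBLAS kernels on CUDA (an assumption
# here, not taken from the PR).
export CUBLAS_WORKSPACE_CONFIG=:4096:8
echo "TORCHINDUCTOR_FORCE_DISABLE_CACHES=$TORCHINDUCTOR_FORCE_DISABLE_CACHES"
```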
> some components persist cache on the machine, like compile, and maybe some other modules.

That's a valid reason to run 2 consecutive runs, but the current loss compare script is testing a non-compile model. And I feel like the correctness of cached paths should also be guaranteed by finer-granularity unit tests.
> import re
>
> def extract_losses_from_log(log_file: str) -> dict[int, float]:
What are other use cases for this function to put it in an utils file?
What do you mean? This PR uses it in 2 places; one of them is in tests.
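For context, a shared `extract_losses_from_log` could be sketched as below. The log-line format (`step: N  loss: X`) and the regex are assumptions for illustration; the real `torchtitan/tools/loss_utils.py` may use a different pattern:

```python
# Hypothetical sketch of log-based loss extraction, assuming log lines like
# "step: 10  loss: 2.5467".
import re

_LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")


def extract_losses_from_log(log_file: str) -> dict[int, float]:
    """Map training step -> loss parsed from a run's log file."""
    losses: dict[int, float] = {}
    with open(log_file) as f:
        for line in f:
            m = _LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses
```

Returning a step-keyed dict lets both callers (the integration runner and `loss_compare.py`) diff two runs step by step rather than only comparing final losses.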
Add run-to-run determinism testing to H100 CI

Co-authored-by: Claude <noreply@anthropic.com>