fixed validation error when using flash attention #2142
Conversation
Co-authored-by: Francesco Bertolotti <[email protected]>
wwwjn left a comment:
Thanks for the fix! Please fix the PP-enabled branch as well; I think we would do something similar to train.py: https://github.com/pytorch/torchtitan/blob/refs/heads/main/torchtitan/train.py#L507-L516
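For context, here is a rough sketch of what the PP-enabled branch might look like after the fix. Apart from `get_attention_masks`, which appears in this PR's diff, every name below (`pp_schedule`, `pp_has_first_stage`, the `step` signature, the argument passed to `get_attention_masks`) is a hypothetical placeholder modeled loosely on the linked train.py snippet, not the actual torchtitan API:

```python
from typing import Any

import torch
import torch.nn as nn


def validation_forward(
    model_parts: list[nn.Module],
    inputs: torch.Tensor,
    pp_enabled: bool,
    pp_schedule=None,          # hypothetical pipeline schedule object
    pp_has_first_stage=True,   # hypothetical flag
):
    # Build the attention masks once, then pass them in both branches.
    extra_kwargs: dict[str, Any] = {
        "attention_masks": model_parts[0].get_attention_masks(inputs)  # args assumed
    }
    if pp_enabled:
        # PP-enabled branch: forward the masks to the pipeline schedule too.
        if pp_has_first_stage:
            pp_schedule.step(inputs, **extra_kwargs)
        else:
            pp_schedule.step(**extra_kwargs)
    else:
        # Non-PP branch: call the model directly with the masks.
        return model_parts[0](inputs, **extra_kwargs)
```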
torchtitan/components/validate.py (outdated)

```python
            break

    # prepare attention mask
    extra_kwargs: dict[str, Any] = {"attention_masks": model_parts[0].get_attention_masks(
```
For readability, can you make this into a function, similar to: https://github.com/pytorch/torchtitan/blob/refs/heads/main/torchtitan/train.py#L416C9-L416C33
The new commit propagated the fix to the pipeline-parallelism case. Further, I have encapsulated the creation of the attention mask in a method (sketched below).
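For illustration, a minimal sketch of such an encapsulating helper (the name `build_attention_masks` and the argument passed to `get_attention_masks` are hypothetical; only `get_attention_masks` itself appears in the diff above):

```python
from typing import Any

import torch
import torch.nn as nn


def build_attention_masks(
    model_parts: list[nn.Module], inputs: torch.Tensor
) -> dict[str, Any]:
    # Hypothetical helper: build the extra kwargs once so that both the PP
    # and non-PP validation branches pass the same attention masks.
    return {"attention_masks": model_parts[0].get_attention_masks(inputs)}
```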
Co-authored-by: Francesco Bertolotti <[email protected]>
Force-pushed from f8693e5 to 628dbb6.
| "unequal sample counts across ranks when dataset is exhausted." | ||
| ) | ||
|
|
||
| def post_dataloading_process( |
can we reuse the one in train.py?
In principle, it would be possible, but I would need to make some modifications since their signatures differ. In particular, the trainer accesses `model_parts` through `self`.
One possible solution is to turn `post_dataloading_process` into a utility function that both the trainer and validator can call. However, this would mean the behavior can no longer be modified through inheritance, unless the current `post_dataloading_process` simply becomes a wrapper around the new utility function (see the sketch below).
If you have any suggestions, I can add a commit.
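A minimal sketch of the wrapper-around-utility idea (module layout, names, and signatures are all hypothetical; in real code the shared utility and the trainer would live in separate modules):

```python
from typing import Any

import torch
import torch.nn as nn


def post_dataloading_process_util(
    model_parts: list[nn.Module],
    inputs: torch.Tensor,
    labels: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, dict[str, Any]]:
    # Hypothetical shared utility that both trainer and validator call.
    extra_kwargs: dict[str, Any] = {
        "attention_masks": model_parts[0].get_attention_masks(inputs)  # args assumed
    }
    return inputs, labels, extra_kwargs


class Trainer:
    model_parts: list[nn.Module]

    def post_dataloading_process(self, inputs, labels):
        # The existing method becomes a thin wrapper, so subclasses can
        # still customize behavior through inheritance.
        return post_dataloading_process_util(self.model_parts, inputs, labels)
```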
I see.
If you want to "modify the behavior", we probably should have a PostDataloadingProcessor class (a sketch follows below).
If not, I think it's fine to call a util directly in both trainer and validator.
WDYT?
cc @fegin
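For illustration, a hypothetical `PostDataloadingProcessor` along the lines suggested above (the class and method names, and the argument to `get_attention_masks`, are assumptions):

```python
from typing import Any

import torch
import torch.nn as nn


class PostDataloadingProcessor:
    """Hypothetical base class shared by trainer and validator; subclasses
    override process() to customize post-dataloading behavior."""

    def process(
        self, model_parts: list[nn.Module], inputs: torch.Tensor
    ) -> dict[str, Any]:
        # Default behavior: build the attention masks once per batch.
        return {"attention_masks": model_parts[0].get_attention_masks(inputs)}
```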
I'm thinking this as well. I see the duplicated code in the trainer and validator; `model_parts` is not a big problem, you can add it. See #2144. But the duplicated logic is pretty concerning.
My initial proposal is to have a util function in distributed.utils so that both methods can call it. The only worry is that I don't know whether distributed.utils is the best place to put it. It isn't right now, because the util function purely unwraps the input, but the util function will encapsulate the CP logic after #2144, so distributed.utils seems like a good place.
@francesco-bertolotti you can do it, as your PR may land first. I can rebase on top of your change.
agreed we should share code, but not sure about distributed.utils. Let's discuss offline.
tianyu-l left a comment:
unblocking
@francesco-bertolotti CPU unit test failed, could you take a look?
Hi @tianyu-l, I looked into the error, and it turns out to be a circular import issue in […]. I see two possible ways to resolve this: […]
I'm leaning slightly toward changing the […]. Curious to hear your thoughts.
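As general background (the specific options above are elided, so this is not necessarily either of them), one common way to break a circular import in Python is to defer one of the imports to call time:

```python
# module_a.py  (hypothetical modules, for illustration only)

def make_validator():
    # Importing inside the function defers resolution until call time,
    # which breaks an import cycle between module_a and module_b.
    from module_b import Validator

    return Validator()
```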
It seems this is being fixed in #2144, with the first approach you mentioned. Are you OK if we consolidate effort there?
This PR addresses issue #2140.
Briefly, the bug is in the validation step: the validate method did not pass the attention mask to the model, which leads to an error when using flash attention.
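For reference, a minimal sketch of the shape of the fix in the non-PP validation path, based on the diff fragment quoted above (the surrounding function and the exact arguments to `get_attention_masks` are assumptions):

```python
from typing import Any

import torch
import torch.nn as nn


def validate_step(model_parts: list[nn.Module], inputs: torch.Tensor) -> torch.Tensor:
    # Before the fix, the model was called without attention masks, which
    # fails under flash attention. The fix builds the masks and forwards them.
    extra_kwargs: dict[str, Any] = {
        "attention_masks": model_parts[0].get_attention_masks(inputs)  # args assumed
    }
    with torch.no_grad():
        return model_parts[0](inputs, **extra_kwargs)
```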