[simplefsdp] fix region ac in zero2-style FSDP #1970

ruisizhang123 · 2025-10-30T23:50:39Z

As titled, this is a follow-up PR that avoids issuing an additional backward all-gather (AG) when activation checkpointing (AC) is enabled. Previously, we only tested DSV3, which is not composable with AC.

The idea is quite simple: when reshard_after_forward is False, we add an additional checkpoint policy to AC that avoids recomputing FSDP-related comms inside the AC region.
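A minimal sketch of that idea, without torch so it runs anywhere. The `Policy` enum is a stand-in for `torch.utils.checkpoint.CheckpointPolicy`, ops are modeled as plain strings taken from the save list quoted later in this thread, and `with_fsdp_save` and `full_ac` are hypothetical names, not functions from this PR.

```python
from enum import Enum, auto


class Policy(Enum):
    """Stand-in for torch.utils.checkpoint.CheckpointPolicy (assumption)."""
    MUST_SAVE = auto()
    PREFER_RECOMPUTE = auto()


# Op names from the save list in this PR, modeled as strings for illustration.
FSDP_COMM_OPS = {
    "_c10d_functional.all_gather_into_tensor.default",
    "_c10d_functional.wait_tensor.default",
    "aten._to_copy.default",
}


def with_fsdp_save(base_policy, reshard_after_forward):
    """Wrap a base AC policy so FSDP comm ops are saved in zero2-style FSDP."""
    def policy(op):
        # Only override when parameters stay unsharded after forward.
        if not reshard_after_forward and op in FSDP_COMM_OPS:
            return Policy.MUST_SAVE
        return base_policy(op)
    return policy


def full_ac(op):
    """Hypothetical base policy: recompute everything (full AC)."""
    return Policy.PREFER_RECOMPUTE


zero2 = with_fsdp_save(full_ac, reshard_after_forward=False)
zero3 = with_fsdp_save(full_ac, reshard_after_forward=True)
```

With `reshard_after_forward=False` the wrapper saves the all-gather result instead of recomputing it in backward. With `reshard_after_forward=True` the base policy is unchanged.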

reshard_after_fwd = False

SAC + llama3 (trace)

Full AC + llama3 (trace)

No AC + llama3 (trace)

reshard_after_fwd = True

SAC + llama3 (trace)

Full AC + llama3 (trace)

No AC + llama3 (trace)

soulitzer · 2025-10-31T21:46:22Z

Is the module-wrapping logic mostly identical to the existing one, and is it worth deduplicating?
Is the only difference between zero2-style and zero3-style FSDP that the policy for the collectives is different?

ruisizhang123 · 2025-10-31T21:51:10Z

Is the module-wrapping logic mostly identical to the existing one, and is it worth deduplicating? Is the only difference between zero2-style and zero3-style FSDP that the policy for the collectives is different?

Yes, I actually think I should reuse some functions from the general apply_ac.py. However, the additional SimpleFSDP AC policy is scattered across several functions, which makes reuse a bit hard...

tianyu-l

The problem might be harder than it sounds. That's actually one of the reasons I hadn't implemented it myself.

For now, I'm OK with erroring out when reshard_after_forward=False + SAC/full AC is used.

tianyu-l · 2025-11-01T01:16:56Z

torchtitan/experiments/simple_fsdp/activation_checkpoint.py

+_op_simple_fsdp_save_list = {
+    torch.ops._c10d_functional.all_gather_into_tensor.default,
+    torch.ops._c10d_functional.wait_tensor.default,
+    torch.ops.aten._to_copy.default,
+}


To achieve the same effect, can we just modify the input to apply_ac?
https://github.com/pytorch/torchtitan/blob/main/torchtitan/distributed/activation_checkpoint.py#L292
Do we have to duplicate other parts?

You will see that the additional simplefsdp_checkpointing_context_fn function is applied in both _apply_full_ac and _apply_op_sac. That is where this checkpoint policy actually takes effect, which is why reusing the other parts won't work, if that makes sense.

tianyu-l · 2025-11-01T01:25:08Z

torchtitan/experiments/simple_fsdp/activation_checkpoint.py

+
+# for avoid recomputing SimpleFSDP all_gather in zero2-style FSDP
+# it enforces additional policy to always mark SimpleFSDP all_gather as PREFER_SAVE
+_op_simple_fsdp_save_list = {


I'm afraid this won't give us the right semantics.

SimpleFSDP is not the only module that uses these ops (all-gather, wait, to_copy). When you specify these ops in the save list, the side effect is that all other occurrences of these ops are saved as well, which causes a memory regression compared with FSDP2 reshard_after_forward=False + SAC.

This may be worked around by using custom all-gather/wait/to_copy ops for SimpleFSDP, as suggested by @fmassa. But then the question is: how do you substitute these custom ops for the DTensor built-in collectives?

Besides, what happens if full AC is combined with reshard_after_forward=False? I don't think the latter will take effect. Using SAC with an FSDP-ops-only policy plus custom SimpleFSDP ops is a proxy workaround.

I also thought about the FSDP+TP case for reshard_after_forward=False, but it seems there is just no good way of handling it other than adding a customized AC policy... I'm leaning toward adding a warning such as "in a multi-parallelism setting, reshard_after_forward=False may cause a memory regression".

Adding a custom op can be a big change and might break things... One potential way is to get the FSDP process group from the device mesh and check the all_gather op's process group in the AC annotation here. Then we add MUST_SAVE only to the FSDP AG node, if this makes sense.

For full AC, see my previous comment.
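To make the difference between the two approaches concrete, here is a small torch-free sketch. It contrasts matching on op type alone (which also catches non-FSDP all-gathers, the memory concern raised above) with matching on op type plus process group. `CollectiveCall`, the group labels `"dp_shard"` and `"tp"`, and both helper functions are hypothetical illustrations, not APIs from torchtitan or PyTorch.

```python
from dataclasses import dataclass

ALL_GATHER = "_c10d_functional.all_gather_into_tensor.default"


@dataclass(frozen=True)
class CollectiveCall:
    """Toy record of one op call; `group` is a hypothetical group label."""
    op: str
    group: str


def save_by_op_type(call):
    # Op-type-only save list: every all-gather is saved, FSDP or not.
    return call.op == ALL_GATHER


def save_by_group(call, fsdp_group):
    # Process-group check: only all-gathers issued on the FSDP group are saved.
    return call.op == ALL_GATHER and call.group == fsdp_group


calls = [
    CollectiveCall(ALL_GATHER, "dp_shard"),  # FSDP parameter all-gather
    CollectiveCall(ALL_GATHER, "tp"),        # tensor-parallel all-gather
    CollectiveCall("aten.mm.default", ""),   # ordinary compute op
]

saved_by_type = [c.group for c in calls if save_by_op_type(c)]
saved_by_group = [c.group for c in calls if save_by_group(c, "dp_shard")]
```

Here `saved_by_type` includes the tensor-parallel all-gather, while `saved_by_group` keeps only the FSDP one. The sketch does not address `_to_copy`, which carries no process group to check.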

One potential way is to get the FSDP process group from the device mesh and check the all_gather op's process group in the AC annotation here. Then we add MUST_SAVE only to the FSDP AG node, if this makes sense.

This sounds fine for AG. But I'm more worried about _to_copy (needed in SimpleFSDP to achieve mixed precision). There are just too many other _to_copies in a transformer.

For full AC, see my previous comment.

Oh, my bad, I missed that part. It seems you are indeed using SAC with an FSDP policy to mimic full AC + reshard_after_forward=False. Because SAC and full AC are implemented differently, the two may not be identical, but I think it's an OK approximation.

Nevertheless, the coding style is made hacky because of API limitations. Let's take this chance to discuss with the team how people want to move forward.

cc @fmassa @xmfan @soulitzer @fegin

Yes, it would be much easier to tag _to_copy ops in the FX graph by checking whether the _to_copy is connected to an FSDP AG. We have a chance to get things right in compile mode, but adding AC correctly in eager mode is hard...
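The graph-based tagging could look roughly like the sketch below. It uses a toy dictionary graph rather than a real `torch.fx.Graph`, and the topology (a cast feeding the all-gather) is an illustrative assumption, not a claim about the order of ops SimpleFSDP emits. The function name `to_copies_near_all_gather` is hypothetical.

```python
ALL_GATHER = "all_gather_into_tensor"
WAIT = "wait_tensor"
TO_COPY = "_to_copy"

# Toy graph: node name -> (op, input node names). Topology is illustrative.
graph = {
    "w_shard": ("placeholder", []),
    "w_cast": (TO_COPY, ["w_shard"]),   # mixed-precision cast of the shard
    "w_ag": (ALL_GATHER, ["w_cast"]),
    "w_full": (WAIT, ["w_ag"]),
    "x": ("placeholder", []),
    "x_cast": (TO_COPY, ["x"]),         # unrelated cast on an activation
    "y": ("mm", ["x_cast", "w_full"]),
}


def to_copies_near_all_gather(graph):
    """Return _to_copy nodes that directly feed or consume an AG/wait node."""
    # Build the reverse edges (users) from the input lists.
    users = {name: [] for name in graph}
    for name, (_, inputs) in graph.items():
        for src in inputs:
            users[src].append(name)

    tagged = set()
    for name, (op, inputs) in graph.items():
        if op != TO_COPY:
            continue
        neighbors = inputs + users[name]
        if any(graph[n][0] in (ALL_GATHER, WAIT) for n in neighbors):
            tagged.add(name)
    return tagged


tagged = to_copies_near_all_gather(graph)
```

Only the cast adjacent to the all-gather is tagged, so the many other `_to_copy` ops in a transformer would stay under the base policy. A real pass would also need to confirm that the all-gather belongs to FSDP, for example with the process-group check discussed earlier.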

From another angle: maybe it's also easier to support reshard_after_forward=False in compile mode.

meta-cla bot added the CLA Signed label Oct 30, 2025

ruisizhang123 force-pushed the ruisi/zero2_fix branch from 9c4b454 to 429bbb7 Compare October 30, 2025 23:51

ruisizhang123 marked this pull request as draft October 30, 2025 23:56

ezyang requested a review from soulitzer October 31, 2025 12:44

ruisizhang123 force-pushed the ruisi/zero2_fix branch 2 times, most recently from 275259e to 284695c Compare October 31, 2025 18:55

ruisizhang123 marked this pull request as ready for review October 31, 2025 18:55

ruisizhang123 requested a review from tianyu-l October 31, 2025 18:55

[simplefsdp] fix region ac in zero 2

b59290b

ruisizhang123 force-pushed the ruisi/zero2_fix branch from 284695c to b59290b Compare October 31, 2025 18:59

tianyu-l requested changes Nov 1, 2025

ruisizhang123 mentioned this pull request Nov 3, 2025

SimpleFSDP Status Tracking #1980

Open
