
Adding prefetching of first shards to train script when fsdp enabled#1955

Closed
chelsea0x3b wants to merge 2 commits into pytorch:main from chelsea0x3b:fsdp-prefetching

Conversation

@chelsea0x3b
Contributor

If the model is sharded, calling `.unshard()` will prefetch the first shard. I placed this before the data loader and other preprocessing so the all-gather should overlap with them.

Sources:

Issuing the 1st all-gather earlier: implicit prefetching happens at the time of calling `model(x)`, so the 1st all-gather gets exposed. We can call `model.unshard()` explicitly to issue the 1st all-gather earlier.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 28, 2025
Comment on lines +507 to +509
if self.parallel_dims.fsdp_enabled:
    # NOTE: issue the first all-gather early so it overlaps with data loading
self.model_parts[0].unshard(async_op=True)
Contributor


Curious how much benefit we get from this? Could you show some traces?

I have the following concerns:

  1. When we don't do logging (where moving the loss to the CPU incurs a d2h sync), the GPU runs ahead of the CPU and can already overlap with data loading, so this isn't saving anything.
  2. torchtitan has other FSDP implementations, and in general we should avoid FSDP2-only code. There is a workaround, testing whether `model_parts[0]` is an `FSDPModule`, but it's not so clean.

Let's see if the benefit justifies the complexity. WDYT?
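The guard the reviewer mentions could look roughly like the sketch below. This is a hedged illustration, not code from the PR: `maybe_prefetch_first_shard` is a hypothetical helper, and the duck-typed `hasattr` check stands in for the stricter `isinstance(model, FSDPModule)` test against the FSDP2 class, so the snippet runs without a torch dependency.

```python
def maybe_prefetch_first_shard(model_parts, fsdp_enabled):
    """Hypothetical helper: issue the first all-gather early, but only
    for models that actually expose the FSDP2 unshard() API."""
    if not fsdp_enabled:
        return False
    model = model_parts[0]
    # Duck-typed stand-in for isinstance(model, FSDPModule); avoids a
    # hard dependency on the FSDP2 class for other FSDP implementations.
    if hasattr(model, "unshard"):
        # async_op=True queues the all-gather without blocking, so it can
        # overlap with the data loader and other preprocessing.
        model.unshard(async_op=True)
        return True
    return False
```

The return value is just for illustration; the point is that non-FSDP2 model parts fall through without error, which is the "not so clean" workaround being weighed against the benefit.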

Contributor Author


Ah yeah, turns out there is little benefit, you're right! I think I had logging higher up when I was noticing an improvement. Will close, ty!
