Bug description
Gradient accumulation is incompatible with the PipelineSchedule(..., scale_grads=True) option, which defaults to True.
When this option is set, all gradients are divided by the number of micro-batches at every step. That is correct for a single gradient accumulation step, but with multiple accumulation steps the already-accumulated total is rescaled at every step, rather than once at the end of accumulation.
The result is that the accumulated gradient behaves like an exponential moving average rather than a sum. The resulting gradients are much smaller than they should be, so gradient accumulation with PP is not equivalent to gradient accumulation without it: the loss curves diverge substantially, and the gradient norms are way off.
A secondary consequence is that dividing all gradients by n_microbatches at every step is computationally expensive for a large model.
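The effect is easy to demonstrate with a toy simulation. This is a hypothetical sketch, not the actual PipelineSchedule code: it assumes each accumulation step contributes a raw gradient of 1.0 per micro-batch and compares scaling once at the end against rescaling the running total at every step.

```python
# Toy numbers (assumed for illustration only).
n_microbatches = 8
accum_steps = 4

# Correct behaviour: accumulate raw gradients, scale once at the end.
grad_correct = 0.0
for _ in range(accum_steps):
    grad_correct += n_microbatches * 1.0  # sum over this step's micro-batches
grad_correct /= n_microbatches  # single scale before the optimizer step

# Buggy behaviour (scale_grads=True applied each step): the in-place division
# also rescales everything accumulated so far, so earlier contributions are
# divided repeatedly -- a geometric decay, not a sum.
grad_buggy = 0.0
for _ in range(accum_steps):
    grad_buggy += n_microbatches * 1.0
    grad_buggy /= n_microbatches  # applied every step, not just the last

print(grad_correct)  # 4.0 -- the sum of the per-step means
print(grad_buggy)    # ~1.14 -- far smaller than the correct 4.0
```

With these numbers the buggy path yields roughly 1.14 instead of 4.0, which matches the observed shrunken gradient norms.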
I identified the same issue in my own pipeline trainer implementation a week or two ago. While checking how torchtitan addressed it, I discovered that torchtitan likely has the same bug.
I confirmed the issue today and have submitted #1732 to resolve it.
Versions
torch 2.10.0.dev20250915+cu126
For anyone interested, I have added torchtitan support to my configuration framework, which is what I used to reproduce the issue.
https://github.com/jdinalt/forgather/tree/main/examples/torchtitan