
Conversation

@tinuademargaret tinuademargaret commented Apr 28, 2025

This PR implements optimizer-in-backward fusion (suggested in this issue), which integrates optimizer updates directly into the backward pass to decrease peak memory usage by eliminating the need to keep all gradients in memory until the optimizer step. It uses PyTorch's _apply_optimizer_in_backward to fuse optimizer steps with backpropagation.
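For reviewers unfamiliar with the PyTorch helper, here is a minimal sketch of how _apply_optimizer_in_backward is typically wired up; the model, optimizer class, and hyperparameters below are illustrative, not the exact ones used in this PR:

```python
import torch
from torch.distributed.optim import _apply_optimizer_in_backward

model = torch.nn.Linear(1024, 1024)

# Register a per-parameter optimizer that fires as each gradient is
# produced during backward, so a gradient can be released right after
# its parameter is updated instead of being held for the whole model.
_apply_optimizer_in_backward(
    torch.optim.AdamW,
    model.parameters(),
    optimizer_kwargs={"lr": 1e-5},
)

x = torch.randn(8, 1024)
loss = model(x).sum()
loss.backward()  # optimizer steps run inside backward; no separate optimizer.step()
```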

This feature only works with the SFT trainer.

Optimizer-in-backward fusion is incompatible with gradient accumulation and is controlled by a bwd_hook flag in the optimizer config.

Validation has been added to ensure that training configurations that rely on gradient accumulation cannot enable the optimizer fusion option.
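A rough sketch of what that guard could look like; bwd_hook is the flag added in this PR, while the gradient-accumulation field name and config layout below are placeholders for whatever the SFT trainer config actually uses:

```python
from types import SimpleNamespace

def validate_optimizer_in_backward(config):
    """Reject configs that combine optimizer-in-backward with gradient accumulation.

    `bwd_hook` is the flag introduced in this PR; the other field names
    here are illustrative placeholders.
    """
    grad_accum = getattr(config, "gradient_accumulation_steps", 1)
    if getattr(config.optim, "bwd_hook", False) and grad_accum > 1:
        raise ValueError(
            "optim.bwd_hook fuses optimizer steps into backward and cannot be "
            "combined with gradient accumulation; set "
            "gradient_accumulation_steps=1 or disable bwd_hook."
        )

# Example: this configuration would be rejected.
cfg = SimpleNamespace(
    gradient_accumulation_steps=4,
    optim=SimpleNamespace(bwd_hook=True),
)
validate_optimizer_in_backward(cfg)  # raises ValueError
```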

CLAassistant commented Apr 28, 2025

CLA assistant check
All committers have signed the CLA.

@tinuademargaret tinuademargaret marked this pull request as ready for review April 28, 2025 13:52
@eric-haibin-lin
Collaborator

Thanks! Just to double check, does this support gradient accumulation?

@tinuademargaret
Author

Thanks! Just to double check, does this support gradient accumulation?

No, it uses PyTorch's _apply_optimizer_in_backward, which does not support gradient accumulation.

@vermouth1992
Collaborator

Thanks! Just to double check, does this support gradient accumulation?

No, it uses PyTorch's _apply_optimizer_in_backward, which does not support gradient accumulation.

I guess this constraint limits the use of _apply_optimizer_in_backward, because in most cases we need gradient accumulation to avoid OOM :(

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment


Thanks for the contribution! For RL I think it's very likely that we need gradient accumulation to train with large batch sizes, but this feature would still be useful for SFT. Do you think it makes sense to keep only the changes for the SFT trainer, while keeping the RL trainer simple for now? Thx

@tinuademargaret
Author

Thanks for the contribution! For RL I think it's very likely that we need gradient accumulation to train with large batch sizes, but this feature would still be useful for SFT. Do you think it makes sense to keep only the changes for the SFT trainer, while keeping the RL trainer simple for now? Thx

I agree RL almost always needs gradient accumulation, so the immediate win for RL is limited. I've scoped the PR down to the SFT trainer for now and left the RL workers unchanged. PyTorch mentioned here that they are working on a more flexible post-backward hook; I'm not sure of its current status, but once that lands we can revisit a gradient-accumulation-friendly version for RL.
