-
Notifications
You must be signed in to change notification settings - Fork 3.2k
[ckpt] refactor: enhance FSDP checkpoint manager flexibility #1350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 5 commits
Commits
Show all changes
22 commits
Select commit
Hold shift + click to select a range
2d4a9c7
refactor(checkpoint): enhance FSDP checkpoint manager flexibility
0x404 c795f83
add hf_model
0x404 e513360
merge fsdp2 main and solve conflicts
0x404 26b7793
Merge branch 'main' into rollout_from_fsdp_ckpt
0x404 d84b0dc
Merge branch 'main' into rollout_from_fsdp_ckpt
0x404 14760eb
refactor: update checkpoint handling in FSDP and Megatron workers
0x404 b9628ed
validate checkpoint load and save contents in FSDPCheckpointManager
0x404 1802b55
Merge remote-tracking branch 'origin/main' into rollout_from_fsdp_ckpt
0x404 dcc3be5
fix
0x404 8448cbc
update ppo_megatron_trainer.yaml
0x404 a1b1a14
fix megatron
0x404 c969682
minor fix
0x404 0527aad
add assertion and docs
ETOgaosion 6f8f46e
refactor checkpoint logging
ETOgaosion 59dced7
fix load optimizer scheduler
ETOgaosion d36e018
try to refactor APIs
ETOgaosion 7b62ca6
fix missing API
ETOgaosion 6429da0
directly use omegaconf in rollout
ETOgaosion e96c1d1
fix CI
ETOgaosion 82082d8
fix flush
ETOgaosion ead0d74
Merge branch 'main' into rollout_from_fsdp_ckpt
ETOgaosion 4cd28ea
fix logging
ETOgaosion File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a little strange to use
save_xxxto control the behavior of loading checkpoints, megatron has ause_checkpoint_opt_param_scheduleroroverride_opt_param_schedulerto control optimizer scheduler loading process, can you design a new mechanism?What about divide the
self.checkpoint_contentsto load/save?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ETOgaosion, sorry for the late update, just go through a busy week. Did you mean use two list to control which content to load/save? like remove the current
checkpoint_contentsand introduce likecheckpoint_load_contentsandcheckpoint_save_contents?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe it's better to achieve finer-grained and flexible checkpoint choice
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe our default API like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I understand