[ckpt] refactor: enhance FSDP checkpoint manager flexibility by 0x404 · Pull Request #1350 · verl-project/verl

0x404 · 2025-05-01T14:53:40Z

Checklist Before Starting

Search for similar PR(s).

What does this PR do?

Add one-line overview of what this PR aims to achieve or accomplish.

This PR enables FSDPCheckpointManager to accept None for optimizer and lr_scheduler, removing some existing TODOs. FSDPCheckpointManager now saves and loads only the items listed in checkpoint_contents. This behavior is consistent with MegatronCheckpointManager.

Because optimizer and lr_scheduler may now be None, an FSDPCheckpointManager can be created for fsdp_module even when FSDPWorkers are initialized only for rollout (is_actor==False and is_rollout==True). This lets users load FSDP checkpoints directly in main_generation.py without first merging them into hf_model.
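The content-driven behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the verl implementation: `Stateful`, `SketchCheckpointManager`, the in-memory dict used as a checkpoint, and the content names other than "model" (here "optimizer" and "extra", with "extra" holding the scheduler state) are all hypothetical.

```python
from typing import Any, Dict, Optional, Sequence


class Stateful:
    """Hypothetical stand-in for a model, optimizer, or LR scheduler."""

    def __init__(self, **state: Any) -> None:
        self.state: Dict[str, Any] = dict(state)

    def state_dict(self) -> Dict[str, Any]:
        return dict(self.state)

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.state = dict(state)


class SketchCheckpointManager:
    """Illustrative only: saves and loads just what checkpoint_contents lists."""

    def __init__(
        self,
        model: Stateful,
        optimizer: Optional[Stateful] = None,
        lr_scheduler: Optional[Stateful] = None,
        checkpoint_contents: Sequence[str] = ("model", "optimizer", "extra"),
    ) -> None:
        self.model = model
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.checkpoint_contents = list(checkpoint_contents)
        # Fail early if a requested item cannot be provided.
        if "optimizer" in self.checkpoint_contents and optimizer is None:
            raise ValueError("'optimizer' requested but optimizer is None")
        if "extra" in self.checkpoint_contents and lr_scheduler is None:
            raise ValueError("'extra' requested but lr_scheduler is None")

    def save(self) -> Dict[str, Any]:
        """Collect only the requested state into an in-memory checkpoint."""
        ckpt: Dict[str, Any] = {}
        if "model" in self.checkpoint_contents:
            ckpt["model"] = self.model.state_dict()
        if "optimizer" in self.checkpoint_contents:
            ckpt["optimizer"] = self.optimizer.state_dict()
        if "extra" in self.checkpoint_contents:
            ckpt["extra"] = {"lr_scheduler": self.lr_scheduler.state_dict()}
        return ckpt

    def load(self, ckpt: Dict[str, Any]) -> None:
        """Restore only the requested state; other checkpoint entries are ignored."""
        if "model" in self.checkpoint_contents:
            self.model.load_state_dict(ckpt["model"])
        if "optimizer" in self.checkpoint_contents:
            self.optimizer.load_state_dict(ckpt["optimizer"])
        if "extra" in self.checkpoint_contents:
            self.lr_scheduler.load_state_dict(ckpt["extra"]["lr_scheduler"])


# A training worker checkpoints everything; a rollout-only worker has no
# optimizer or scheduler and restores just the model weights.
trainer = SketchCheckpointManager(Stateful(w=1.0), Stateful(step=10), Stateful(lr=0.1))
full_ckpt = trainer.save()

rollout = SketchCheckpointManager(Stateful(w=0.0), checkpoint_contents=["model"])
rollout.load(full_ckpt)
print(rollout.model.state_dict())  # {'w': 1.0}
```

In this sketch the rollout-only manager never touches the optimizer or scheduler entries, which mirrors the case where a worker is created with is_actor==False and is_rollout==True.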

This PR also adds save_xx properties to the base class to replace every "xx" in checkpoint_contents check, which makes the code easier to read.
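A rough sketch of what such properties might look like is shown below. The class name `BaseCheckpointManagerSketch` is hypothetical; `save_model` matches the name used in the diff in this PR, while `save_optimizer` and `save_extra` are assumed by analogy with the save_xx pattern.

```python
from typing import List, Sequence


class BaseCheckpointManagerSketch:
    """Hypothetical base class exposing membership checks as properties."""

    def __init__(self, checkpoint_contents: Sequence[str]) -> None:
        self.checkpoint_contents: List[str] = list(checkpoint_contents)

    @property
    def save_model(self) -> bool:
        # Replaces: "model" in self.checkpoint_contents
        return "model" in self.checkpoint_contents

    @property
    def save_optimizer(self) -> bool:
        # Assumed name, following the save_xx pattern.
        return "optimizer" in self.checkpoint_contents

    @property
    def save_extra(self) -> bool:
        # Assumed name, following the save_xx pattern.
        return "extra" in self.checkpoint_contents


manager = BaseCheckpointManagerSketch(["model", "extra"])
if manager.save_model:
    print("model state would be written")
if not manager.save_optimizer:
    print("optimizer state would be skipped")
```

Call sites then read as `if self.save_model:` instead of repeating the string literal, so a typo in a content name surfaces as an AttributeError rather than a silently false membership test.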

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Currently CI should test this PR correctly.

Additional Info.

Issue Number: Fixes issue # or discussion # if any.
Training: FSDP
Inference: VLLM

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks.
Add [BREAKING] to the PR title if it breaks any API.
Update the documentation about your changes in the docs.
Add CI test(s) if necessary.

0x404 · 2025-05-09T10:46:39Z

Hi @vermouth1992, could you help review this PR?

ETOgaosion · 2025-05-27T05:38:24Z

verl/utils/checkpoint/megatron_checkpoint_manager.py

            return

-        if "model" in self.checkpoint_contents:
+        if self.save_model:


I find it a little strange to use save_xxx to control how checkpoints are loaded. Megatron has use_checkpoint_opt_param_scheduler and override_opt_param_scheduler to control how the optimizer scheduler is loaded. Can you design a new mechanism?

What about dividing self.checkpoint_contents into load and save?

0x404 · 2025-06-05

Hi @ETOgaosion, sorry for the late update, I just got through a busy week. Did you mean using two lists to control which contents to load and save, i.e. removing the current checkpoint_contents and introducing something like checkpoint_load_contents and checkpoint_save_contents?

ETOgaosion · 2025-06-05

Yes, that may be better for finer-grained and more flexible checkpoint choices.

ETOgaosion · 2025-06-05

Maybe our default API could look like this:

checkpoint_contents:
  save: [...]
  load: ${(...).checkpoint_contents.save}

0x404 · 2025-06-05

Okay, I understand.
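The proposed default, where the load list falls back to the save list unless it is overridden, can be sketched in plain Python. This is only an illustration of the idea discussed in the thread: the function name `resolve_checkpoint_contents` and the dict-based config are assumptions, and the `${...}` interpolation from the proposal is approximated here by an explicit fallback.

```python
from typing import Any, List, Mapping, Tuple


def resolve_checkpoint_contents(
    checkpoint_cfg: Mapping[str, Any],
) -> Tuple[List[str], List[str]]:
    """Return (save_contents, load_contents); load defaults to save."""
    contents = checkpoint_cfg["checkpoint_contents"]
    save_contents = list(contents["save"])
    # When no load list is given, load exactly what would be saved.
    load_contents = list(contents.get("load", save_contents))
    return save_contents, load_contents


# Default: load mirrors save.
default_cfg = {"checkpoint_contents": {"save": ["model", "optimizer", "extra"]}}
print(resolve_checkpoint_contents(default_cfg))

# Override: save everything, but load only the model weights.
rollout_cfg = {
    "checkpoint_contents": {
        "save": ["model", "optimizer", "extra"],
        "load": ["model"],
    }
}
print(resolve_checkpoint_contents(rollout_cfg))
```

Keeping the two lists separate addresses the reviewer's concern that a save_xxx flag should not also decide what gets loaded.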

ETOgaosion · 2025-05-27T05:40:40Z

Hi @0x404, thanks for your efforts~

Could you consider my suggestion and resolve conflicts?

0x404 · 2025-05-27T05:44:48Z

Hi @ETOgaosion, no problem! I've been quite busy recently, but I will revise this as soon as possible :)

eric-haibin-lin

Is it possible to create some unit tests?

0x404 · 2025-06-09T14:36:12Z

Hi all, sorry for the late update. I think this PR is ready for review, and I will add some unit tests tomorrow. Hi @ETOgaosion, should we trigger the CI first to see if we are breaking anything?

ETOgaosion · 2025-06-11T04:39:20Z

@0x404 I made some modifications to the docs, logging, and saving logic. Please review~

ETOgaosion · 2025-06-11T11:47:16Z

@0x404 please double-check whether I missed any mistakes.

0x404

LGTM on my part, nice work!

…oject#1350)

Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>

0x404 added 2 commits May 1, 2025 14:41

refactor(checkpoint): enhance FSDP checkpoint manager flexibility

2d4a9c7

add hf_model

c795f83

0x404 changed the title ~~refactor(checkpoint): enhance FSDP checkpoint manager flexibility~~ [FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility May 1, 2025

0x404 added 2 commits May 5, 2025 04:02

merge fsdp2 main and solve conflicts

e513360

Merge branch 'main' into rollout_from_fsdp_ckpt

26b7793

ETOgaosion reviewed May 27, 2025

ETOgaosion added the status: review in process label May 27, 2025

Merge branch 'main' into rollout_from_fsdp_ckpt

d84b0dc

eric-haibin-lin reviewed Jun 1, 2025

0x404 added 3 commits June 9, 2025 17:42

refactor: update checkpoint handling in FSDP and Megatron workers

14760eb

validate checkpoint load and save contents in FSDPCheckpointManager

b9628ed

Merge remote-tracking branch 'origin/main' into rollout_from_fsdp_ckpt

1802b55

0x404 marked this pull request as draft June 9, 2025 09:52

0x404 added 4 commits June 9, 2025 19:01

fix

dcc3be5

update ppo_megatron_trainer.yaml

8448cbc

fix megatron

a1b1a14

minor fix

c969682

0x404 marked this pull request as ready for review June 9, 2025 14:33

ETOgaosion changed the title ~~[FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility~~ [ckpt] refactor: enhance FSDP checkpoint manager flexibility Jun 10, 2025

ETOgaosion added 3 commits June 11, 2025 10:44

add assertion and docs

0527aad

refactor checkpoint logging

6f8f46e

fix load optimizer scheduler

59dced7

ETOgaosion requested review from ccclyu and vermouth1992 June 11, 2025 04:39

vermouth1992 previously approved these changes Jun 11, 2025

try to refactor APIs

d36e018

ETOgaosion dismissed vermouth1992’s stale review via d36e018 June 11, 2025 05:02

ETOgaosion added 4 commits June 11, 2025 13:07

fix missing API

7b62ca6

directly use omegaconf in rollout

6429da0

fix CI

e96c1d1

fix flush

82082d8

ETOgaosion previously approved these changes Jun 11, 2025

Merge branch 'main' into rollout_from_fsdp_ckpt

ead0d74

ETOgaosion dismissed their stale review via ead0d74 June 11, 2025 11:51

0x404 commented Jun 11, 2025

fix logging

4cd28ea

ETOgaosion approved these changes Jun 12, 2025

ETOgaosion merged commit a1a152e into verl-project:main Jun 12, 2025
36 checks passed
