[ckpt] refactor: enhance FSDP checkpoint manager flexibility #1350

Merged
ETOgaosion merged 22 commits into verl-project:main from 0x404:rollout_from_fsdp_ckpt
Jun 12, 2025

@0x404
Collaborator

@0x404 0x404 commented May 1, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Add one-line overview of what this PR aims to achieve or accomplish.

This PR enables FSDPCheckpointManager to accept None for optimizer and lr_scheduler, removing some existing TODOs. FSDPCheckpointManager now saves and loads strictly according to checkpoint_contents: only the items listed there are saved or loaded. This behavior is consistent with MegatronCheckpointManager.

Because optimizer and lr_scheduler may now be None, we can create an FSDPCheckpointManager for fsdp_module when FSDPWorkers are initialized only for rollout (is_actor==False and is_rollout==True). This allows users to load FSDP checkpoints directly with main_generation.py, without first merging them into hf_model.
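
A minimal sketch of the behavior described above, not verl's actual implementation. `SketchCheckpointManager`, its constructor arguments, the content names other than "model", and the plain dicts standing in for real state dicts are all hypothetical. It shows a manager that tolerates `optimizer=None` and `lr_scheduler=None` and only touches the components named in `checkpoint_contents`.

```python
from typing import Any, Dict, Iterable, Optional


class SketchCheckpointManager:
    """Hypothetical stand-in for illustration; not verl's FSDPCheckpointManager."""

    def __init__(
        self,
        model: Dict[str, Any],
        optimizer: Optional[Dict[str, Any]] = None,
        lr_scheduler: Optional[Dict[str, Any]] = None,
        checkpoint_contents: Iterable[str] = ("model",),
    ) -> None:
        # Plain dicts stand in for real state dicts (assumption for the sketch).
        self.state: Dict[str, Optional[Dict[str, Any]]] = {
            "model": model,
            "optimizer": optimizer,
            "lr_scheduler": lr_scheduler,
        }
        self.checkpoint_contents = list(checkpoint_contents)
        # Sketch-level design choice: fail early if a listed component is missing.
        for name in self.checkpoint_contents:
            if name not in self.state:
                raise ValueError(f"unknown checkpoint content: {name!r}")
            if self.state[name] is None:
                raise ValueError(f"{name!r} is listed in checkpoint_contents but is None")

    def save(self) -> Dict[str, Dict[str, Any]]:
        # Only the listed components end up in the checkpoint.
        return {name: dict(self.state[name]) for name in self.checkpoint_contents}

    def load(self, checkpoint: Dict[str, Dict[str, Any]]) -> None:
        # Components that are not listed are ignored, even if the checkpoint has them.
        for name in self.checkpoint_contents:
            self.state[name] = dict(checkpoint[name])


# Rollout-only case: no optimizer or lr_scheduler exists, and only the model is loaded.
rollout_manager = SketchCheckpointManager(model={"w": 0.0}, checkpoint_contents=["model"])
rollout_manager.load({"model": {"w": 1.5}, "optimizer": {"step": 10}})
```

In this sketch the optimizer entry of the checkpoint is simply skipped, which mirrors the rollout-only scenario where the worker has no optimizer to restore.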

This PR also adds save_xx properties to the base class to replace all "xx" in checkpoint_contents checks, which makes the code easier to read.
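
The property pattern can be sketched roughly as follows. `BaseCheckpointManagerSketch` is a hypothetical class name, and `save_optimizer` is an assumed example that follows the `save_xx` naming; only `save_model` appears in the diff under review.

```python
from typing import Iterable


class BaseCheckpointManagerSketch:
    """Hypothetical base class; only the property pattern is the point."""

    def __init__(self, checkpoint_contents: Iterable[str]) -> None:
        self.checkpoint_contents = list(checkpoint_contents)

    @property
    def save_model(self) -> bool:
        # Replaces inline checks of the form: "model" in self.checkpoint_contents
        return "model" in self.checkpoint_contents

    @property
    def save_optimizer(self) -> bool:
        # Assumed counterpart for the optimizer, following the same naming.
        return "optimizer" in self.checkpoint_contents


manager = BaseCheckpointManagerSketch(["model"])
if manager.save_model:
    print("model will be saved")
```

Call sites then read `if self.save_model:` instead of repeating the string membership test.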

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Currently CI should test this PR correctly.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: FSDP
  • Inference: VLLM

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@0x404 0x404 changed the title refactor(checkpoint): enhance FSDP checkpoint manager flexibility [FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility May 1, 2025
@0x404
Collaborator Author

0x404 commented May 9, 2025

Hi @vermouth1992, could you help review this PR?

return

- if "model" in self.checkpoint_contents:
+ if self.save_model:
Collaborator

@ETOgaosion ETOgaosion May 27, 2025

I find it a little strange to use save_xxx to control the behavior of loading checkpoints. Megatron has use_checkpoint_opt_param_scheduler or override_opt_param_scheduler to control the optimizer scheduler loading process. Can you design a new mechanism?

What about dividing self.checkpoint_contents into load and save?

Collaborator Author

Hi @ETOgaosion, sorry for the late update, I just got through a busy week. Did you mean using two lists to control which contents to load and save? For example, removing the current checkpoint_contents and introducing something like checkpoint_load_contents and checkpoint_save_contents?

Collaborator

Yes, that may be better: it gives finer-grained, more flexible control over checkpoint contents.

Collaborator

@ETOgaosion ETOgaosion Jun 5, 2025

Maybe our default API could look like this:

checkpoint_contents:
    save: [...]
    load: ${(...).checkpoint_contents.save}
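
As a rough illustration of that default in plain Python (the YAML above expresses it with a config interpolation), the sketch below makes `load` fall back to `save` when it is not set. `CheckpointContents` is a hypothetical name, and the content lists in the usage lines are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckpointContents:
    """Hypothetical config object with separate save and load lists."""

    save: List[str]
    load: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # Default: load exactly what is saved, unless the user overrides it.
        if self.load is None:
            self.load = list(self.save)


# Training run: load mirrors save by default.
train_contents = CheckpointContents(save=["model", "optimizer"])

# Rollout-only run: load just the model from a full checkpoint.
rollout_contents = CheckpointContents(save=["model", "optimizer"], load=["model"])
```

Keeping the two lists separate lets a rollout-only job read a full training checkpoint while ignoring optimizer state.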

Collaborator Author

okay, I understand

@ETOgaosion
Collaborator

Hi @0x404 , thanks for your efforts~

Could you consider my suggestion and resolve conflicts?

@0x404
Collaborator Author

0x404 commented May 27, 2025

Hi @ETOgaosion, no problem! I have been quite busy recently, but I will revise this as soon as possible :)

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

Is it possible to create some unit tests?

@0x404 0x404 marked this pull request as draft June 9, 2025 09:52
@0x404 0x404 marked this pull request as ready for review June 9, 2025 14:33
@0x404
Collaborator Author

0x404 commented Jun 9, 2025

Hi all, sorry for the late update. I think this PR is ready for review, and I will add some unit tests tomorrow. Hi @ETOgaosion, should we trigger the CI first to see if we are breaking anything?

@ETOgaosion ETOgaosion changed the title [FSDPCheckpoint] refactor: enhance FSDP checkpoint manager flexibility [ckpt] refactor: enhance FSDP checkpoint manager flexibility Jun 10, 2025
@ETOgaosion
Collaborator

@0x404 I added some modifications to the docs, logging, and saving logic, please review~

vermouth1992 previously approved these changes Jun 11, 2025
ETOgaosion previously approved these changes Jun 11, 2025
@ETOgaosion
Collaborator

ETOgaosion commented Jun 11, 2025

@0x404 please double-check whether there are any mistakes we missed.

Collaborator Author

@0x404 0x404 left a comment

LGTM from my part, nice work!

@ETOgaosion ETOgaosion merged commit a1a152e into verl-project:main Jun 12, 2025
36 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 13, 2025
…oject#1350)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

> Add one-line overview of what this PR aims to achieve or accomplish. 

This PR enables `FSDPCheckpointManager` to accept None for `optimizer`
and `lr_scheduler`, removing some existing TODOs. `FSDPCheckpointManager`
now saves and loads strictly according to `checkpoint_contents`: only
the items listed there are saved or loaded. This behavior is consistent
with `MegatronCheckpointManager`.

Because `optimizer` and `lr_scheduler` may now be None, we can create
an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are
initialized only for rollout (`is_actor==False and is_rollout==True`).
This allows users to load FSDP checkpoints directly with
`main_generation.py`, without first merging them into hf_model.

This PR also adds `save_xx` properties to the base class to replace all
`"xx" in checkpoint_contents` checks, which makes the code easier to
read.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

Currently CI should test this PR correctly.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: FSDP
- **Inference**: VLLM

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.

---------

Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
…oject#1350)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…oject#1350)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…oject#1350)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…oject#1350)

4 participants