[ckpt] refactor: enhance FSDP checkpoint manager flexibility#1350
[ckpt] refactor: enhance FSDP checkpoint manager flexibility#1350ETOgaosion merged 22 commits intoverl-project:mainfrom
Conversation
|
Hi @vermouth1992, could u help review this PR? |
| return | ||
|
|
||
| if "model" in self.checkpoint_contents: | ||
| if self.save_model: |
There was a problem hiding this comment.
I find it a little strange to use save_xxx to control the behavior of loading checkpoints, megatron has a use_checkpoint_opt_param_scheduler or override_opt_param_scheduler to control optimizer scheduler loading process, can you design a new mechanism?
What about divide the self.checkpoint_contents to load/save?
There was a problem hiding this comment.
Hi @ETOgaosion, sorry for the late update, just go through a busy week. Did you mean use two list to control which content to load/save? like remove the current checkpoint_contents and introduce like checkpoint_load_contents and checkpoint_save_contents?
There was a problem hiding this comment.
Yes, maybe it's better to achieve finer-grained and flexible checkpoint choice
There was a problem hiding this comment.
Maybe our default API like this:
checkpoint_contents:
save: [...]
load: ${(...).checkpoint_contents.save}|
Hi @0x404 , thanks for your efforts~ Could you consider my suggestion and resolve conflicts? |
|
hi @ETOgaosion, no problem! Quite busy recently, I would revise this as soon as as possible:) |
eric-haibin-lin
left a comment
There was a problem hiding this comment.
is it possible to create some unit tests?
|
Hi all, Sorry for the late update. I think this PR is ready for review, and I will add some unit tests tomorrow. Hi @ETOgaosion, Should we trigger the CI first to see if we are breaking anything? |
|
@0x404 I add some modification to docs, loggings and saving logic, please review~ |
|
@0x404 please double check if there are any missing mistakes? |
0x404
left a comment
There was a problem hiding this comment.
LGTM from my part, nice work!
…oject#1350) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? > Add one-line overview of what this PR aims to achieve or accomplish. This PR enables `FSDPCheckpointManager` to accept optimizer and `lr_scheduler` as None, removing some existing TODO. Now `FSDPCheckpointManager` performs saving and loading according to `checkpoint_contents`, only saving/loading content in `checkpoint_contents`. This behavior is consistent with `MegatronCheckpointManager`. When allowing `optimizer` and `lr_scheduler` to be None, we can create an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are initialized only for rollout (`is_actor==False and is_rollout==True`). This allows users to use `main_generation.py` to directly load FSDP checkpoints without merging them into hf_model. Also, added `save_xx` property in the base class to replace all `"xx" in checkpoint_contents` statements, making the code look better. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. Currently CI should test this PR correctly. ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: FSDP - **Inference**: VLLM ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] Add CI test(s) if neccessary. --------- Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn> Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
…oject#1350) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? > Add one-line overview of what this PR aims to achieve or accomplish. This PR enables `FSDPCheckpointManager` to accept optimizer and `lr_scheduler` as None, removing some existing TODO. Now `FSDPCheckpointManager` performs saving and loading according to `checkpoint_contents`, only saving/loading content in `checkpoint_contents`. This behavior is consistent with `MegatronCheckpointManager`. When allowing `optimizer` and `lr_scheduler` to be None, we can create an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are initialized only for rollout (`is_actor==False and is_rollout==True`). This allows users to use `main_generation.py` to directly load FSDP checkpoints without merging them into hf_model. Also, added `save_xx` property in the base class to replace all `"xx" in checkpoint_contents` statements, making the code look better. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. Currently CI should test this PR correctly. ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: FSDP - **Inference**: VLLM ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] Add CI test(s) if neccessary. --------- Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn> Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
…oject#1350) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? > Add one-line overview of what this PR aims to achieve or accomplish. This PR enables `FSDPCheckpointManager` to accept optimizer and `lr_scheduler` as None, removing some existing TODO. Now `FSDPCheckpointManager` performs saving and loading according to `checkpoint_contents`, only saving/loading content in `checkpoint_contents`. This behavior is consistent with `MegatronCheckpointManager`. When allowing `optimizer` and `lr_scheduler` to be None, we can create an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are initialized only for rollout (`is_actor==False and is_rollout==True`). This allows users to use `main_generation.py` to directly load FSDP checkpoints without merging them into hf_model. Also, added `save_xx` property in the base class to replace all `"xx" in checkpoint_contents` statements, making the code look better. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. Currently CI should test this PR correctly. ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: FSDP - **Inference**: VLLM ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] Add CI test(s) if neccessary. --------- Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn> Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
…oject#1350) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? > Add one-line overview of what this PR aims to achieve or accomplish. This PR enables `FSDPCheckpointManager` to accept optimizer and `lr_scheduler` as None, removing some existing TODO. Now `FSDPCheckpointManager` performs saving and loading according to `checkpoint_contents`, only saving/loading content in `checkpoint_contents`. This behavior is consistent with `MegatronCheckpointManager`. When allowing `optimizer` and `lr_scheduler` to be None, we can create an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are initialized only for rollout (`is_actor==False and is_rollout==True`). This allows users to use `main_generation.py` to directly load FSDP checkpoints without merging them into hf_model. Also, added `save_xx` property in the base class to replace all `"xx" in checkpoint_contents` statements, making the code look better. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. Currently CI should test this PR correctly. ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: FSDP - **Inference**: VLLM ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] Add CI test(s) if neccessary. --------- Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn> Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
…oject#1350) ### Checklist Before Starting - [x] Search for similar PR(s). ### What does this PR do? > Add one-line overview of what this PR aims to achieve or accomplish. This PR enables `FSDPCheckpointManager` to accept optimizer and `lr_scheduler` as None, removing some existing TODO. Now `FSDPCheckpointManager` performs saving and loading according to `checkpoint_contents`, only saving/loading content in `checkpoint_contents`. This behavior is consistent with `MegatronCheckpointManager`. When allowing `optimizer` and `lr_scheduler` to be None, we can create an `FSDPCheckpointManager` for `fsdp_module` when FSDPWorkers are initialized only for rollout (`is_actor==False and is_rollout==True`). This allows users to use `main_generation.py` to directly load FSDP checkpoints without merging them into hf_model. Also, added `save_xx` property in the base class to replace all `"xx" in checkpoint_contents` statements, making the code look better. ### High-Level Design > Demonstrate the high-level design if this PR is complex. ### Specific Changes > List the specific changes. ### API > Demonstrate how the API changes if any. ### Usage Example > Provide usage example(s) for easier usage. ```python # Add code snippet or script demonstrating how to use this ``` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluatuion results, etc. Currently CI should test this PR correctly. ### Additional Info. - **Issue Number**: Fixes issue # or discussion # if any. - **Training**: FSDP - **Inference**: VLLM ### Checklist Before Submitting - [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting). - [x] Add `[BREAKING]` to the PR title if it breaks any API. - [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs). - [x] Add CI test(s) if neccessary. --------- Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn> Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
Checklist Before Starting
What does this PR do?
This PR enables
FSDPCheckpointManagerto accept optimizer andlr_scheduleras None, removing some existing TODO. NowFSDPCheckpointManagerperforms saving and loading according tocheckpoint_contents, only saving/loading content incheckpoint_contents. This behavior is consistent withMegatronCheckpointManager.When allowing
optimizerandlr_schedulerto be None, we can create anFSDPCheckpointManagerforfsdp_modulewhen FSDPWorkers are initialized only for rollout (is_actor==False and is_rollout==True). This allows users to usemain_generation.pyto directly load FSDP checkpoints without merging them into hf_model.Also, added
save_xxproperty in the base class to replace all"xx" in checkpoint_contentsstatements, making the code look better.High-Level Design
Specific Changes
API
Usage Example
# Add code snippet or script demonstrating how to use thisTest
Currently CI should test this PR correctly.
Additional Info.
Checklist Before Submitting
[BREAKING]to the PR title if it breaks any API.