[megatron] feat: use mbridge as megatron adaptor #2064
ETOgaosion merged 14 commits into verl-project:main
Conversation
```yaml
seed: 42
override_transformer_config: {}  # additional transformer config like: num_layers_in_first(/last)_pipeline_stage
profile:  # profile the actor model in `update_policy`
use_mbridge: False
```
Actually, shouldn't use_dist_checkpointing and mbridge be an either-or relation? Maybe we should use naming like io_methods.loading_backend/saving_backend to choose between huggingface/dist_checkpointing/mbridge?
Also, we may need to consider how this combines with the checkpoint configuration. Maybe merge these directly into checkpoint?
@ccclyu @dataproblems, could you give some advice on the API design?
How should use_dist_checkpointing and use_mbridge work together for better integration? My original thinking:
```yaml
checkpoint:
  pre_load:  # first-time load
    format: [hf, dist_ckpt]  # hf by default uses mbridge
  load:
    format: [hf, dist_ckpt]
  save:
    format: [hf, dist_ckpt]
```
But maybe this will break some APIs.
I think the current way is OK in the config, since there can be a relationship between the load and save operations (the actor saves the model and the rollout loads it, in the case where the two are not colocated). However, we would need validation when the config is read to make sure the load and save options are compatible with each other.
Implementation-wise, I would add an abstraction that pulls the checkpoint-saving logic out of the checkpoint manager and the workers. That way the checkpoint manager and workers rely on a stable interface, and you can add more options while modifying less code. Is that what you were looking for, or am I missing the point here?
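For illustration, a minimal sketch of what that validation and abstraction could look like; every name here is hypothetical, not verl's actual API:

```python
from abc import ABC, abstractmethod

VALID_FORMATS = {"hf", "dist_ckpt"}


class CheckpointBackend(ABC):
    """Hypothetical stable interface the checkpoint manager and workers depend on."""

    @abstractmethod
    def load(self, path: str):
        ...

    @abstractmethod
    def save(self, model, path: str):
        ...


def validate_checkpoint_config(load_format: str, save_format: str) -> None:
    """Run at config-read time to reject incompatible load/save combinations."""
    for fmt in (load_format, save_format):
        if fmt not in VALID_FORMATS:
            raise ValueError(f"unknown checkpoint format: {fmt!r}")
    # If the actor saves and a non-colocated rollout loads, the formats must line up.
    if save_format != load_format:
        raise ValueError(
            f"save format {save_format!r} is incompatible with load format {load_format!r}"
        )
```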
Thanks, your latter point makes sense to me; it's a refactoring point, but here I'd like to focus on the API design.
So use_mbridge is a more functional option that includes model initialization, so it should work as @ISEEKYAN's implementation does. The question is then whether use_dist_checkpointing should migrate into the checkpoint config to act as a first-time-loading option. Since API migration shouldn't be part of this PR's changes, we will separate the feature development from the interface refactor, is that OK?
It looks good to me.
In more detail, mbridge will include model init, parameter resharding, save/load of HF format, forward with sequence packing / fused kernels (to be added), and other potential improvements on the Megatron side, as a solution from NVIDIA for using Megatron in RL frameworks.
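For context, a minimal usage sketch along the lines of the mbridge README linked above (exact call names follow that README at the time of writing and should be treated as illustrative):

```python
from mbridge import AutoBridge

# Assumes torch.distributed and Megatron parallel state are already initialized.
# Build a bridge from an HF checkpoint, instantiate the Megatron-Core model,
# and load HF-format weights online (no offline dist_ckpt conversion).
bridge = AutoBridge.from_pretrained("Qwen/Qwen2-7B-Instruct")
model = bridge.get_model(wrap_with_ddp=False)
bridge.load_weights(model, "Qwen/Qwen2-7B-Instruct")

# Per-tensor export back to HF naming, e.g. for feeding a rollout engine.
for name, weight in bridge.export_weights(model):
    print(name, tuple(weight.shape))
```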
Current config LGTM. Long-term, if we migrate to mbridge, will use_dist_checkpointing be deprecated so that we only load the HF format?
Personally I prefer to use the HF format for the whole lifetime of training.
But supporting dist_checkpointing or other formats like bytecheckpoint would make it more flexible if a user has a private pre-trained model. So the config might look like:
```yaml
checkpoint:
  pre_load:  # first-time load
    format: [hf, dist_ckpt, bytecheckpoint]  # hf by default uses mbridge
  load_save:
    format: [hf, dist_ckpt, bytecheckpoint]
```
We would deprecate use_dist_checkpointing but keep it for a while and remind users to switch to the new way, and we would update the example scripts accordingly.
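A possible shape for that deprecation path, sketched against the proposed (not current) config keys, with all names hypothetical:

```python
import warnings


def resolve_first_load_format(megatron_cfg: dict, checkpoint_cfg: dict) -> str:
    """Hypothetical shim: honor the legacy flag but steer users to the new config."""
    if megatron_cfg.get("use_dist_checkpointing", False):
        warnings.warn(
            "`use_dist_checkpointing` is deprecated; set `checkpoint.pre_load.format` instead",
            DeprecationWarning,
        )
        return "dist_ckpt"
    return checkpoint_cfg.get("pre_load", {}).get("format", "hf")
```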
eric-haibin-lin left a comment
Not necessarily for this PR, but is it possible to create some unit tests?
verl/workers/megatron_workers.py (Outdated)
```python
if self.bridge is not None:
    from verl.models.mcore.mbridge import freeze_moe_router

    post_model_creation_callbacks = []
    if override_model_config.get("moe_config", {}).get("freeze_moe_router", False):
        post_model_creation_callbacks.append(freeze_moe_router)

# Step 3: initialize the megatron model
def make_model(wrap_with_ddp=False):
    if self.bridge is not None:
        return self.bridge.get_model(
            post_model_creation_callbacks=post_model_creation_callbacks, wrap_with_ddp=wrap_with_ddp
        )
```
Do you think we can move the `post_model_creation_callbacks` definition into the `make_model` method?
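For illustration, that move could look roughly like this, based only on the diff above (not the code that was actually committed):

```python
def make_model(wrap_with_ddp=False):
    if self.bridge is not None:
        # Build the callback list where it is used, instead of at worker init.
        from verl.models.mcore.mbridge import freeze_moe_router

        post_model_creation_callbacks = []
        if override_model_config.get("moe_config", {}).get("freeze_moe_router", False):
            post_model_creation_callbacks.append(freeze_moe_router)
        return self.bridge.get_model(
            post_model_creation_callbacks=post_model_creation_callbacks,
            wrap_with_ddp=wrap_with_ddp,
        )
```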
Just updated the implementation.
Sure, I would open another PR with a small refactor to clean up megatron_workers.py and unify the unit tests of the Megatron adaptation.
### What does this PR do?

MBridge provides a seamless bridge between Hugging Face models and Megatron-Core's optimized implementation for efficient distributed training and inference. It also offers the tools and processes needed to integrate Reinforcement Learning (RL) with Megatron. See https://github.com/ISEEKYAN/mbridge

mbridge is developed and maintained by NVIDIA, providing functions for:

- modeling HF models with Megatron
- loading/saving HF-format weights with no memory overhead
- online export of parameters to the rollout engine via a per-tensor generator
- RL-specific optimizations and friendly APIs on the Megatron side, plus early access to some Megatron features

With mbridge, the direct improvements are:

- a clean interface for Megatron
- no offline dist_ckpt conversion needed
- no offline model merger needed

### Test

Tested with GSM8K on Qwen2-7B-Instruct.

<img width="486" alt="image" src="https://github.com/user-attachments/assets/dd271e8a-9167-470f-8b0c-dde2bcfe1800" />

### High-Level Design

Add an option `actor_rollout_ref.actor.megatron.use_mbridge`, default `False`; set it to `True` to enable. When enabled, model instantiation, initial weight loading, checkpoint save/load, and the per-tensor generator are taken over by mbridge.

### Specific Changes

> List the specific changes.

### API

> Demonstrate how the API changes if any.

### Usage Example

Add this line to the script:

```
actor_rollout_ref.actor.megatron.use_mbridge=True \
```

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that cover the code path.
Hi! I'd like to ask: does the mbridge mode currently support resuming training from a checkpoint?
Mbridge supports the load/save of the weights part, but the optimizer states should be saved in distributed_checkpointing format.
@ISEEKYAN, hello.
@rj42 The optimizer-saving process in the mbridge implementation still needs some fixes to save optimizer states.