
[megatron] feat: use mbridge as megatron adaptor #2064

Merged
ETOgaosion merged 14 commits into verl-project:main from ISEEKYAN:use_mbridge on Jul 3, 2025

Conversation

@ISEEKYAN (Collaborator)

What does this PR do?

MBridge provides a seamless bridge between Hugging Face models and Megatron-Core's optimized implementation for efficient distributed training and inference. It also offers the tools and processes needed to integrate Reinforcement Learning (RL) with Megatron. See https://github.com/ISEEKYAN/mbridge
mbridge is developed and maintained by NVIDIA and provides functions for:

  • modeling HF models with Megatron
  • loading/saving HF-format weights with no memory overhead
  • online export of parameters to the rollout engine via a per-tensor generator
  • RL-specific optimizations and friendly APIs on the Megatron side, plus some early-access Megatron features

With mbridge, the direct improvements are:

  • a clean interface for Megatron
  • no offline dist_ckpt conversion needed
  • no offline model merger needed
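
For context, a minimal usage sketch of mbridge's bridge API, adapted from what the mbridge README describes; the exact method names (`AutoBridge.from_pretrained`, `get_model`, `load_weights`, `export_weights`) are assumptions to be verified against the repository linked above:

```python
# Minimal mbridge usage sketch (method names adapted from the mbridge README;
# verify against https://github.com/ISEEKYAN/mbridge).
from mbridge import AutoBridge

HF_MODEL_PATH = "Qwen/Qwen2-7B-Instruct"  # any HF checkpoint path

# Build a bridge from the HF config, then materialize a Megatron-Core model.
bridge = AutoBridge.from_pretrained(HF_MODEL_PATH)
model = bridge.get_model()

# Load HF-format weights directly into the Megatron model
# (no offline dist_ckpt conversion step).
bridge.load_weights(model, HF_MODEL_PATH)

# Stream parameters tensor by tensor to the rollout engine
# (the per-tensor generator mentioned above).
for name, weight in bridge.export_weights(model):
    print(name, tuple(weight.shape))
```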

Test

Tested with GSM8K using Qwen2-7B-Instruct.
[screenshot: training results]

High-Level Design

Add an option actor_rollout_ref.actor.megatron.use_mbridge (default: False); set it to True to enable. When enabled, model instantiation, initial weight loading, checkpoint save/load, and the per-tensor generator are taken over by mbridge.
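
A rough sketch of what this takeover could look like in a worker; the helper names here (`legacy_megatron_model_provider`, `load_dist_checkpoint`) are hypothetical placeholders, not verl's actual code:

```python
# Hypothetical sketch of the use_mbridge branch; helper names are
# illustrative placeholders, not verl's actual worker implementation.
def build_model(config, hf_model_path):
    if config.actor_rollout_ref.actor.megatron.use_mbridge:
        from mbridge import AutoBridge  # method names assumed, see mbridge repo
        bridge = AutoBridge.from_pretrained(hf_model_path)
        model = bridge.get_model()
        bridge.load_weights(model, hf_model_path)  # HF weights, no dist_ckpt
        return model, bridge
    # Legacy path: hand-written Megatron model plus offline dist_ckpt loading.
    model = legacy_megatron_model_provider(config)            # hypothetical helper
    load_dist_checkpoint(model, config.checkpoint.load_path)  # hypothetical helper
    return model, None
```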

Specific Changes

List the specific changes.

API

Demonstrate how the API changes if any.

Usage Example

Add this line to the script:

    actor_rollout_ref.actor.megatron.use_mbridge=True \

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

Review comment on the megatron config (excerpt from the diff adding the new option):

    seed: 42
    override_transformer_config: {} # additional transformer config like: num_layers_in_first(/last)_pipeline_stage
    profile: # profile the actor model in `update_policy`
    use_mbridge: False
Collaborator:

Actually, shouldn't use_dist_checkpointing and mbridge be an either-or relation? Maybe we should use naming like io_methods.loading_backend/saving_backend to choose between huggingface/dist_checkpointing/mbridge?

Also, we may need to consider how this combines with the checkpoint configuration. Maybe merge these directly into checkpoint?

Collaborator:

@ccclyu @dataproblems , could you give some advice on the API design?

How should use_dist_checkpointing and use_mbridge work together for better integration? My original thinking:

checkpoint:
    pre_load:    # first time load
        format: [hf, dist_ckpt]   # hf default use_mbridge
    load:
        format: [hf, dist_ckpt]
    save:
        format: [hf, dist_ckpt]

But maybe this will break some APIs.

Collaborator:

I think the current way is OK in the config, since there can be a relationship between the load and save operations (the actor saves the model and the rollout loads it, in the case where the two are not colocated). However, we would need validation when the config is read to make sure the load and save options are compatible with each other.

Implementation-wise, I would add an abstraction that moves the checkpoint-saving logic out of the checkpoint manager and the workers. That way, the checkpoint manager and workers rely on a stable interface, and you can provide more options while modifying less code. Is that what you were looking for, or am I missing the point here?
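
A minimal sketch of what such a read-time validation could look like; the format names and the simple same-format rule are assumptions for illustration, not verl's actual policy:

```python
# Hypothetical config validation sketch; format names and the same-format
# rule are illustrative assumptions, not verl's actual policy.
VALID_FORMATS = {"hf", "dist_ckpt"}

def validate_checkpoint_config(load_format: str, save_format: str) -> None:
    for fmt in (load_format, save_format):
        if fmt not in VALID_FORMATS:
            raise ValueError(f"unknown checkpoint format: {fmt!r}")
    # If the actor saves and a non-colocated rollout loads, the formats
    # must line up; here we simply require them to match.
    if load_format != save_format:
        raise ValueError(
            f"load format {load_format!r} is incompatible with "
            f"save format {save_format!r}"
        )

validate_checkpoint_config("hf", "hf")  # passes; mismatched formats raise
```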

Collaborator:

Thanks, your latter part makes sense to me; that is a refactoring point, while here I'd like to focus on the API design.

So use_mbridge is a broader functional option that also covers model initialization, so it should work as in @ISEEKYAN's implementation. The remaining question is whether use_dist_checkpointing should migrate into the checkpoint config as a first-time-loading option. Since the API migration should not be part of this PR's changes, we will separate the feature development from the interface refactor. Is that OK?

cc @ISEEKYAN @dataproblems @ccclyu

@ISEEKYAN (Author):

It looks good to me.
More detail about mbridge: it will cover model init, parameter resharding, HF-format save/load, forward with sequence packing/fused kernels (to be added), and other potential Megatron-side improvements, as NVIDIA's solution for using Megatron in RL frameworks.

Collaborator:

The current config LGTM. Long term, if we migrate to mbridge, will use_dist_checkpointing be deprecated so that only the HF format is loaded?

@ISEEKYAN (Author):

Personally, I prefer using the HF format for the entire training lifetime.
But supporting dist_checkpointing or other formats like bytecheckpoint would make it more flexible when a user has a private pre-trained model. So the config might look like:

checkpoint:
    pre_load:    # first time load
        format: [hf, dist_ckpt, bytecheckpoint]   # hf default use_mbridge
    load_save:
        format: [hf, dist_ckpt, bytecheckpoint]

We would deprecate use_dist_checkpointing but keep it for a while, reminding users to switch to the new way, and we would update the example scripts accordingly.
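
A sketch of the kind of deprecation shim this implies; the config keys mirror the checkpoint layout proposed above but are still hypothetical, not verl's actual schema:

```python
# Hypothetical deprecation shim for use_dist_checkpointing; the config keys
# mirror the checkpoint layout proposed above, not verl's actual schema.
import warnings

def resolve_pre_load_format(cfg: dict) -> str:
    if cfg.get("use_dist_checkpointing"):
        warnings.warn(
            "use_dist_checkpointing is deprecated; set "
            "checkpoint.pre_load.format=dist_ckpt instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return "dist_ckpt"
    return cfg.get("checkpoint", {}).get("pre_load", {}).get("format", "hf")

print(resolve_pre_load_format({"use_dist_checkpointing": True}))  # dist_ckpt
```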

@ETOgaosion ETOgaosion requested a review from ccclyu June 26, 2025 01:50
@eric-haibin-lin (Collaborator) left a comment:

Not necessarily for this PR, but is it possible to create some unit tests?

Comment on lines 207 to 219
    if self.bridge is not None:
        from verl.models.mcore.mbridge import freeze_moe_router

        post_model_creation_callbacks = []
        if override_model_config.get("moe_config", {}).get("freeze_moe_router", False):
            post_model_creation_callbacks.append(freeze_moe_router)

    # Step 3: initialize the megatron model
    def make_model(wrap_with_ddp=False):
        if self.bridge is not None:
            return self.bridge.get_model(
                post_model_creation_callbacks=post_model_creation_callbacks, wrap_with_ddp=wrap_with_ddp
            )
Collaborator:

Do you think we can move the post_model_creation_callbacks definition into the make_model method?
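
For illustration, the suggested refactor might look roughly like this (a sketch against the snippet above, not the final committed code):

```python
# Sketch of the suggested refactor: the callback list is built inside
# make_model instead of beforehand (not the final committed code).
def make_model(wrap_with_ddp=False):
    if self.bridge is not None:
        from verl.models.mcore.mbridge import freeze_moe_router

        post_model_creation_callbacks = []
        if override_model_config.get("moe_config", {}).get("freeze_moe_router", False):
            post_model_creation_callbacks.append(freeze_moe_router)
        return self.bridge.get_model(
            post_model_creation_callbacks=post_model_creation_callbacks,
            wrap_with_ddp=wrap_with_ddp,
        )
```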

@ISEEKYAN (Author):

Sure.

@ISEEKYAN (Author):

Just updated the implementation.

@ISEEKYAN (Author) commented Jul 2, 2025:

Not necessarily for this PR, but is it possible to create some unit tests?

Sure, I'll submit another PR with a small refactor to clean up megatron_worker.py and unified unit tests for the Megatron adaptation.

@ETOgaosion ETOgaosion merged commit 433544f into verl-project:main Jul 3, 2025
38 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 4, 2025
SuperCB pushed a commit to SuperCB/verl that referenced this pull request Jul 7, 2025
@Qin10 commented Jul 11, 2025:

Hi! I'd like to ask: does the mbridge mode currently support resuming training from a checkpoint?

@ISEEKYAN (Author):

Hi! I'd like to ask: does the mbridge mode currently support resuming training from a checkpoint?

mbridge supports load/save for the weights part, but the optimizer states should be saved in the distributed_checkpointing format.

@rj42 (Contributor) commented Jul 13, 2025:

@ISEEKYAN, hello.
Could you tell me, please, what needs to be done so that "optimizer states are saved in distributed_checkpointing format"? Is this done at the config level? Is there a ready-made working example?
I'd appreciate it.

@ETOgaosion (Collaborator):

@rj42 The optimizer-saving process in the mbridge implementation still needs a fix in order to save optimizer states.

oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
