[megatron] feat: fused kernel support for new model engine #5191
HollowMan6 merged 1 commit into verl-project:main
Conversation
Pull request overview
This PR wires up fused-kernel support for the Megatron-based training engine when using the “no padding” (nested tensor) data path, aligning the training engine with the model’s fused-kernel capabilities.
Changes:
- Propagates `use_fused_kernels` from the model config into the engine config and worker defaults so fused-kernel usage is consistently configured across components.
- Adds a Megatron engine hook (`_maybe_enable_fused_kernels`) that conditionally patches Megatron GPT models with fused forward kernels (disabled for value models and when MTP is enabled).
- Introduces a fused no-padding forward path in `verl.models.mcore` (`fused_forward_no_padding_gen` + `get_mcore_forward_fused_no_padding_fn`) and routes `MegatronEngine`'s `forward_step` through it when `use_fused_kernels` is enabled under `DatasetPadMode.NO_PADDING` (see the sketch after this list).
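A minimal sketch of how that routing could look, assuming `get_mcore_forward_fused_no_padding_fn` accepts the HF config; the actual selection lives in `MegatronEngine.forward_step` and may differ in signature and arguments:

```python
from verl.models.mcore import get_mcore_forward_fused_no_padding_fn


def select_forward_fn(engine_config, hf_config, no_padding: bool, default_forward_fn):
    """Illustrative only: pick the fused no-padding forward path when enabled.

    `engine_config`, `hf_config`, `no_padding`, and `default_forward_fn` stand in
    for whatever objects the real engine threads through `forward_step`;
    `no_padding` corresponds to running under `DatasetPadMode.NO_PADDING`.
    """
    if getattr(engine_config, "use_fused_kernels", False) and no_padding:
        # The registry resolves a VLM vs. standard-LM variant from the HF architecture.
        return get_mcore_forward_fused_no_padding_fn(hf_config)
    return default_forward_fn
```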
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `verl/workers/engine_workers.py` | Syncs `engine_config.use_fused_kernels` with `model_config.use_fused_kernels` and uses it when injecting default non-tensor flags into training/inference batches. |
| `verl/workers/engine/megatron/transformer_impl.py` | Adds `_maybe_enable_fused_kernels` to patch the Megatron GPT forward with fused kernels, and extends `forward_step` to call a new fused no-padding forward function when `use_fused_kernels` is requested. |
| `verl/models/mcore/registry.py` | Registers `get_mcore_forward_fused_no_padding_fn` to return the appropriate fused no-padding forward generator based on the HF architecture (VLM vs. standard LM). |
| `verl/models/mcore/model_forward_fused.py` | Implements `fused_forward_no_padding_gen`, which runs a fused, no-padding (nested tensor) forward pass with optional VLM handling, and reuses existing THD preprocess/postprocess utilities. |
| `verl/models/mcore/__init__.py` | Re-exports `get_mcore_forward_fused_no_padding_fn` as part of the public `mcore` API so engines can obtain the fused no-padding forward function. |
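To make the gating described for `_maybe_enable_fused_kernels` concrete, here is a hedged sketch of its control flow; `is_value_model`, `mtp_enabled`, and `patch_fused_forward` are hypothetical stand-ins for the flags and patch helper the engine actually uses:

```python
def patch_fused_forward(model) -> None:
    # Hypothetical placeholder for the real utility that swaps the Megatron GPT
    # forward for the fused-kernel variant; a no-op in this sketch.
    pass


def _maybe_enable_fused_kernels(model, use_fused_kernels: bool,
                                is_value_model: bool, mtp_enabled: bool) -> None:
    """Sketch of the hook's gating logic, not the actual implementation."""
    if not use_fused_kernels:
        return  # fused kernels were not requested
    if is_value_model or mtp_enabled:
        return  # fused forward is skipped for value models and when MTP is enabled
    patch_fused_forward(model)
```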
Force-pushed from fdf8c8b to 2e48299.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Add `fused_forward_no_padding` method so that the model engine can also support fused kernels. Tested on 8×A800 GPUs with Triton 3.5.1, running DAPO on the DAPO-Math-17k dataset with Moonlight-16B-A3B: the training curves closely match with fused kernels disabled or enabled, and enabling fused kernels even shows a lower rollout-training mismatch. Signed-off-by: Hollow Man <hollowman@opensuse.org>
Force-pushed from 2e48299 to 1c54bc8.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
Hi @HollowMan6, we found that the fused kernel's speed regressed after Triton 3.3 about 3 months ago, and since then we no longer maintain or recommend the fused kernel. Do you observe a slowdown after enabling this fused kernel? As an option, my colleagues have rewritten the fused kernel and integrated it into MCore; see NVIDIA/Megatron-LM#3076 and NVIDIA/Megatron-LM#2739. But unfortunately, this kernel currently only supports Blackwell, not even Hopper. And we have another PR to adopt the MCore version of the fused kernel at #5130. I think the best way to use this kernel is to enable it through
ISEEKYAN left a comment:
The code changes are OK; it only depends on the fused kernel perf.
Hi @ISEEKYAN! Thanks for reviewing, and that's interesting to know! Actually, I didn't find any speed degradation on A800 when training Moonlight-16B-A3B with Triton 3.5.1; please check the last figure in the PR description. On the contrary, it improves the speed a lot (step time from around 340 down to around 270)! If you benchmarked this on Hopper GPUs 3 months ago and saw the degradation, note that a PR from about a month and a half ago improved it, with a very noticeable speedup as well: #4576. I've also tested it with Triton 3.5.0 together with the support fix #4676 on L20 GPUs and didn't notice any performance degradation at that time either. Unfortunately, I don't have Blackwell GPUs at hand and can't really help try out/test the new kernel, but since I still observe a performance improvement when enabling the fused kernel on the verl side, and #4576 also brings improvements on the Hopper side, maybe we can go back to recommending the fused kernel (though I no longer have Hopper GPUs on my side, so I can't really tell how good it is there). Since I didn't make the fused kernel the default option for the model engine in this PR, I think it's safe, and it's also a good opportunity to ask other people to try it out and check the performance benefit, so I'll go ahead and merge this one.
Thanks @HollowMan6, good to know about the great work in #4576.
What does this PR do?
Add `fused_forward_no_padding` method so that the model engine can also support fused kernels.

Checklist Before Starting
- The PR title is in the format `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Tested on 8×A800 GPUs with Triton 3.5.1, running DAPO on the DAPO-Math-17k dataset with Moonlight-16B-A3B. The training curves closely match whether fused kernels are disabled or enabled, and enabling fused kernels even shows a lower rollout-training mismatch.
API and Usage Example
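As a hedged usage illustration, enabling the feature amounts to turning on `use_fused_kernels` in the model config, which this PR then propagates into the engine config and worker defaults; the nesting of the flag below mirrors existing verl configs and is an assumption, since the new model engine may expose it elsewhere:

```python
from omegaconf import OmegaConf

# Assumed config fragment: `actor_rollout_ref.model.use_fused_kernels` mirrors the
# existing verl flag; the engine-side copy is derived from it by this PR.
overrides = OmegaConf.create(
    {"actor_rollout_ref": {"model": {"use_fused_kernels": True}}}
)
print(OmegaConf.to_yaml(overrides))
```

The same flag should also be settable as a command-line override (e.g. `actor_rollout_ref.model.use_fused_kernels=True`) in the usual Hydra-based launch scripts.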
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- For CI, use the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.