
[megatron] feat: fused kernel support for new model engine #5191

Merged
HollowMan6 merged 1 commit into verl-project:main from HollowMan6:fused_kernels_engine
Feb 4, 2026

Conversation

@HollowMan6
Collaborator

What does this PR do?

Add a fused_forward_no_padding method so that the model engine can also support fused kernels.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Tested on 8×A800 GPUs with Triton 3.5.1, using the DAPO-Math-17k dataset with DAPO and the Moonlight-16B-A3B model. The training curves closely match whether fused kernels are enabled or disabled, and enabling fused kernels even yields a lower rollout-training mismatch.

[Figures: training curves with fused kernels enabled vs. disabled; the last figure shows step time]

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

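A minimal usage sketch of the new accessor, assuming the existing use_fused_kernels flag on the model config; get_mcore_forward_fused_no_padding_fn is the function this PR adds, but its exact signature is an assumption here:

```python
# Hedged sketch, not the PR's verbatim code: the accessor's signature and the
# config attribute names are assumptions based on the PR summary.
from verl.models.mcore import get_mcore_forward_fused_no_padding_fn


def select_forward_fn(engine_config, model_config, hf_config):
    # The worker syncs the engine flag from the model config so both sides
    # agree on whether the fused path is active.
    engine_config.use_fused_kernels = model_config.use_fused_kernels
    if engine_config.use_fused_kernels:
        # Returns a fused no-padding forward function appropriate for the
        # HF architecture (VLM-aware vs. standard LM).
        return get_mcore_forward_fused_no_padding_fn(hf_config)
    return None  # caller falls back to the regular forward path
```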

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copilot AI review requested due to automatic review settings February 3, 2026 21:25
@HollowMan6 HollowMan6 requested a review from ISEEKYAN as a code owner February 3, 2026 21:25
@HollowMan6 HollowMan6 requested a review from wuxibin89 February 3, 2026 21:26
Contributor

Copilot AI left a comment


Pull request overview

This PR wires up fused-kernel support for the Megatron-based training engine when using the “no padding” (nested tensor) data path, aligning the training engine with the model’s fused-kernel capabilities.

Changes:

  • Propagates use_fused_kernels from model config into engine config and worker defaults so fused-kernel usage is consistently configured across components.
  • Adds a Megatron engine hook (_maybe_enable_fused_kernels) that conditionally patches Megatron GPT models with fused forward kernels (disabled for value models and when MTP is enabled); a rough sketch follows this list.
  • Introduces a fused no-padding forward path in verl.models.mcore (fused_forward_no_padding_gen + get_mcore_forward_fused_no_padding_fn) and routes MegatronEngine's forward_step through it when use_fused_kernels is enabled under DatasetPadMode.NO_PADDING.
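The gating described above, as a rough sketch; the real hook lives in verl/workers/engine/megatron/transformer_impl.py, and the attribute names and the patch helper below are assumptions, not code from the PR:

```python
# Hypothetical sketch of _maybe_enable_fused_kernels; is_value_model, the MTP
# flag, and patch_fused_forward are assumed names.
def _maybe_enable_fused_kernels(self):
    # Skip entirely when fused kernels are not requested.
    if not self.engine_config.use_fused_kernels:
        return
    # The fused forward computes token log-probs for causal-LM training, so
    # value models and MTP-enabled models keep the regular forward.
    if self.is_value_model or getattr(self.tf_config, "mtp_num_layers", None):
        return
    patch_fused_forward(self.model)  # assumed helper that swaps in the fused forward
```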

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

  • verl/workers/engine_workers.py: Syncs engine_config.use_fused_kernels with model_config.use_fused_kernels and uses it when injecting default non-tensor flags into training/inference batches.
  • verl/workers/engine/megatron/transformer_impl.py: Adds _maybe_enable_fused_kernels to patch Megatron GPT forward with fused kernels, and extends forward_step to call a new fused no-padding forward function when use_fused_kernels is requested.
  • verl/models/mcore/registry.py: Registers get_mcore_forward_fused_no_padding_fn to return the appropriate fused no-padding forward generator based on the HF architecture (VLM vs. standard LM); a rough sketch of this dispatch follows below.
  • verl/models/mcore/model_forward_fused.py: Implements fused_forward_no_padding_gen, which runs a fused, no-padding (nested tensor) forward pass with optional VLM handling, and reuses existing THD preprocess/postprocess utilities.
  • verl/models/mcore/__init__.py: Re-exports get_mcore_forward_fused_no_padding_fn as part of the public mcore API so engines can obtain the fused no-padding forward function.
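A rough sketch of that registry dispatch, assuming an architecture-keyed lookup; the actual registry.py may key and dispatch differently, and the generator signature and architecture set are assumptions:

```python
# Hypothetical sketch of the VLM-vs-LM dispatch; fused_forward_no_padding_gen
# is the generator from model_forward_fused.py, its signature assumed here.
VLM_ARCHITECTURES = {"Qwen2VLForConditionalGeneration"}  # assumed example entry


def get_mcore_forward_fused_no_padding_fn(hf_config):
    # VLMs need image/video inputs threaded through the fused no-padding
    # forward; plain LMs take the text-only generator.
    is_vlm = hf_config.architectures[0] in VLM_ARCHITECTURES
    return fused_forward_no_padding_gen(vlm=is_vlm)
```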


@HollowMan6 HollowMan6 force-pushed the fused_kernels_engine branch 2 times, most recently from fdf8c8b to 2e48299 Compare February 3, 2026 21:37
@HollowMan6 HollowMan6 requested a review from Copilot February 3, 2026 21:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.



Add `fused_forward_no_padding` method so that the model engine
can also support fused kernels.

Tested on 8×A800 GPUs with Triton 3.5.1 using the
DAPO-Math-17k dataset with DAPO and Moonlight-16B-A3B;
the training curves closely match whether fused kernels are
enabled or disabled, and enabling fused kernels even yields a
lower rollout-training mismatch.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@ISEEKYAN
Collaborator

ISEEKYAN commented Feb 4, 2026

hi @HollowMan6, we found the fused kernel's speed regressed after Triton 3.3 about 3 months ago, and since then we no longer maintain or recommend the fused kernel. Did you observe a slowdown after enabling this fused kernel?

As an option, my colleagues have rewritten the fused kernel and integrated it into MCore; see NVIDIA/Megatron-LM#3076 and NVIDIA/Megatron-LM#2739. But unfortunately this kernel currently only supports Blackwell, not even Hopper. And we have another PR to adopt the MCore version of the fused kernel at #5130.

I think the best way to use this kernel is to enable it through transformer_config, but that can't help Hopper and Ampere users.
The existing way is OK, but I'd like you to check whether the speed slows down significantly, thanks!

Collaborator

@ISEEKYAN ISEEKYAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes are OK; it only depends on the fused kernel's performance.

@HollowMan6
Collaborator Author

Hi @ISEEKYAN! Thanks for your review, and that's interesting to know! Actually, I didn't find any speed degradation on A800 when training Moonlight-16B-A3B with Triton 3.5.1; please check the last figure in the PR description. On the contrary, it improves the speed a lot (step time around 340 -> around 270)!

If you benchmarked this on Hopper GPUs 3 months ago and found the speed degradation: there was actually a PR about a month and a half ago that improves it, and the speedup was also very obvious: #4576. I've also tested it with Triton 3.5.0 together with the support fix #4676 on L20 GPUs, and I didn't notice any performance degradation at that time either.

Unfortunately I don't have Blackwell GPUs at hand and can't really try out/test the new kernel, but since I still observe a performance improvement when enabling the fused kernel on the verl side, and #4576 also brings some improvement on the Hopper side, maybe we can go back to recommending the fused kernel (though I no longer have Hopper GPUs on my side and can't really tell how good it is there).

Since I didn't make the fused kernel the default option for the model engine in this PR, I think it's safe, and it's also good to ask other people to try this out and check the performance benefit, so I'll go ahead and merge this one.

@HollowMan6 HollowMan6 merged commit e5f5ea6 into verl-project:main Feb 4, 2026
101 of 117 checks passed
@HollowMan6 HollowMan6 deleted the fused_kernels_engine branch February 4, 2026 06:58
@ISEEKYAN
Collaborator

ISEEKYAN commented Feb 4, 2026

thanks @HollowMan6, good to know about the great work in #4576
