[megatron] feat: fused kernel support for new model engine #5191
HollowMan6 merged 1 commit into verl-project:main
Conversation
Pull request overview
This PR wires up fused-kernel support for the Megatron-based training engine when using the “no padding” (nested tensor) data path, aligning the training engine with the model’s fused-kernel capabilities.
Changes:
- Propagates `use_fused_kernels` from the model config into the engine config and worker defaults so fused-kernel usage is consistently configured across components.
- Adds a Megatron engine hook (`_maybe_enable_fused_kernels`) that conditionally patches Megatron GPT models with fused forward kernels (disabled for value models and when MTP is enabled).
- Introduces a fused no-padding forward path in `verl.models.mcore` (`fused_forward_no_padding_gen` + `get_mcore_forward_fused_no_padding_fn`) and routes `MegatronEngine`'s `forward_step` through it when `use_fused_kernels` is enabled under `DatasetPadMode.NO_PADDING` (see the sketch after this list).
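A minimal sketch of how that routing could look, assuming `get_mcore_forward_fused_no_padding_fn` accepts the HF config; the actual selection lives in `MegatronEngine.forward_step` and may differ in signature and arguments:

```python
from verl.models.mcore import get_mcore_forward_fused_no_padding_fn


def select_forward_fn(engine_config, hf_config, no_padding: bool, default_forward_fn):
    """Illustrative only: pick the fused no-padding forward path when enabled.

    `engine_config`, `hf_config`, `no_padding`, and `default_forward_fn` stand in
    for whatever objects the real engine threads through `forward_step`;
    `no_padding` corresponds to running under `DatasetPadMode.NO_PADDING`.
    """
    if getattr(engine_config, "use_fused_kernels", False) and no_padding:
        # The registry resolves a VLM vs. standard-LM variant from the HF architecture.
        return get_mcore_forward_fused_no_padding_fn(hf_config)
    return default_forward_fn
```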
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `verl/workers/engine_workers.py` | Syncs `engine_config.use_fused_kernels` with `model_config.use_fused_kernels` and uses it when injecting default non-tensor flags into training/inference batches. |
| `verl/workers/engine/megatron/transformer_impl.py` | Adds `_maybe_enable_fused_kernels` to patch the Megatron GPT forward with fused kernels, and extends `forward_step` to call a new fused no-padding forward function when `use_fused_kernels` is requested. |
| `verl/models/mcore/registry.py` | Registers `get_mcore_forward_fused_no_padding_fn` to return the appropriate fused no-padding forward generator based on the HF architecture (VLM vs. standard LM). |
| `verl/models/mcore/model_forward_fused.py` | Implements `fused_forward_no_padding_gen`, which runs a fused, no-padding (nested tensor) forward pass with optional VLM handling, and reuses existing THD preprocess/postprocess utilities. |
| `verl/models/mcore/__init__.py` | Re-exports `get_mcore_forward_fused_no_padding_fn` as part of the public `mcore` API so engines can obtain the fused no-padding forward function. |
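To make the gating described for `_maybe_enable_fused_kernels` concrete, here is a hedged sketch of its control flow; `is_value_model`, `mtp_enabled`, and `patch_fused_forward` are hypothetical stand-ins for the flags and patch helper the engine actually uses:

```python
def patch_fused_forward(model) -> None:
    # Hypothetical placeholder for the real utility that swaps the Megatron GPT
    # forward for the fused-kernel variant; a no-op in this sketch.
    pass


def _maybe_enable_fused_kernels(model, use_fused_kernels: bool,
                                is_value_model: bool, mtp_enabled: bool) -> None:
    """Sketch of the hook's gating logic, not the actual implementation."""
    if not use_fused_kernels:
        return  # fused kernels were not requested
    if is_value_model or mtp_enabled:
        return  # fused forward is skipped for value models and when MTP is enabled
    patch_fused_forward(model)
```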
Force-pushed from fdf8c8b to 2e48299.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Add `fused_forward_no_padding` method so that the model engine can also support fused kernels. Tested on 8×A800 GPUs with Triton 3.5.1, running DAPO on the DAPO-Math-17k dataset with Moonlight-16B-A3B: the training curves closely match with fused kernels disabled or enabled, and enabling fused kernels even shows a lower rollout-training mismatch. Signed-off-by: Hollow Man <hollowman@opensuse.org>
Force-pushed from 2e48299 to 1c54bc8.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
Hi @HollowMan6, we found that the fused kernel's speed regressed after Triton 3.3 about 3 months ago, and since then we no longer maintain or recommend the fused kernel. Do you observe a slowdown after enabling this fused kernel? As an option, my colleagues have rewritten the fused kernel and integrated it into MCore; see NVIDIA/Megatron-LM#3076 and NVIDIA/Megatron-LM#2739. But unfortunately, this kernel currently only supports Blackwell, not even Hopper. And we have another PR to adopt the MCore version of the fused kernel at #5130. I think the best way to use this kernel is to enable it through
ISEEKYAN left a comment:
The code changes are OK; it only depends on the fused kernel perf.
Hi @ISEEKYAN! Thanks for reviewing, and that's interesting to know! Actually, I didn't find any speed degradation on A800 when training Moonlight-16B-A3B with Triton 3.5.1; please check the last figure in the PR description. On the contrary, it improves the speed a lot (step time from around 340 down to around 270)! If you benchmarked this on Hopper GPUs 3 months ago and saw the degradation, note that a PR from about a month and a half ago improved it, with a very noticeable speedup as well: #4576. I've also tested it with Triton 3.5.0 together with the support fix #4676 on L20 GPUs and didn't notice any performance degradation at that time either. Unfortunately, I don't have Blackwell GPUs at hand and can't really help try out/test the new kernel, but since I still observe a performance improvement when enabling the fused kernel on the verl side, and #4576 also brings improvements on the Hopper side, maybe we can go back to recommending the fused kernel (though I no longer have Hopper GPUs on my side, so I can't really tell how good it is there). Since I didn't make the fused kernel the default option for the model engine in this PR, I think it's safe, and it's also a good opportunity to ask other people to try it out and check the performance benefit, so I'll go ahead and merge this one.
Thanks @HollowMan6, good to know about the great work in #4576.
What does this PR do?
Add `fused_forward_no_padding` method so that the model engine can also support fused kernels.

Checklist Before Starting
- The PR title is in the format `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Tested on 8×A800 GPUs with Triton 3.5.1, running DAPO on the DAPO-Math-17k dataset with Moonlight-16B-A3B. The training curves closely match whether fused kernels are disabled or enabled, and enabling fused kernels even shows a lower rollout-training mismatch.
API and Usage Example
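As a hedged usage illustration, enabling the feature amounts to turning on `use_fused_kernels` in the model config, which this PR then propagates into the engine config and worker defaults; the nesting of the flag below mirrors existing verl configs and is an assumption, since the new model engine may expose it elsewhere:

```python
from omegaconf import OmegaConf

# Assumed config fragment: `actor_rollout_ref.model.use_fused_kernels` mirrors the
# existing verl flag; the engine-side copy is derived from it by this PR.
overrides = OmegaConf.create(
    {"actor_rollout_ref": {"model": {"use_fused_kernels": True}}}
)
print(OmegaConf.to_yaml(overrides))
```

The same flag should also be settable as a command-line override (e.g. `actor_rollout_ref.model.use_fused_kernels=True`) in the usual Hydra-based launch scripts.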
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- For CI, use the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.