[megatron] fix: VLMs using fused kernels #3849
Merged
Currently, we get an error about an unexpected keyword argument `visual_pos_masks`. This is because mbridge also customizes the `GPTModel` forward for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_preprocess` and `_postprocess`, and our patch focuses on `_postprocess`, I also cleaned up that function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess`, because we also need to pass the `temperature` argument through:

```logs
output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
```

In addition, there is a shape mismatch error when calculating `mrope` if we pass `position_ids` in `fused_forward_qwen2_5_vl`. I tried to debug this, but the shape passed there doesn't make sense, and since https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 says the model calculates `position_ids` itself, I followed that code and stopped passing the position ids. This works for both Qwen2.5VL and Qwen3VL without throwing further errors.

I found another issue in the original upstream codebase: the `temperature` parameter doesn't get correctly passed to `_fused_GPTModel_forward`. To make this work, I added `**kwargs` so that the models accept additional arbitrary kwargs and forward them all to `self.language_model`, both on the verl side (`Qwen2_5VLModel`) and in all the vision models under mbridge.

Given the current state of the functions in `verl/models/mcore/model_forward.py` and `verl/models/mcore/model_forward_fused.py` (code duplication, several unused and untested condition branches, and functions that accept arbitrary kwargs but never pass them anywhere), the code is very hard to debug and maintain. So I also refactored and cleaned up those two files, using a closure to unify the vision models and the plain "GPT" (language) models, for better maintainability.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
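The `**kwargs` pass-through described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual verl/mbridge code; the class names and the way the inner model consumes `temperature` are assumptions for the sake of the example:

```python
# Sketch of the **kwargs forwarding pattern: the VLM wrapper accepts
# arbitrary extra keyword arguments and passes them all to its inner
# language model, so model-specific arguments (e.g. deepstack's
# `visual_pos_masks` or the fused kernel's `temperature`) no longer
# raise "unexpected keyword argument" TypeErrors.
# Class names here are hypothetical, not the real verl/mbridge classes.

class ToyLanguageModel:
    def forward(self, input_ids, position_ids=None, **kwargs):
        # Accepts arbitrary extra kwargs; consumes only the ones it knows.
        temperature = kwargs.get("temperature", 1.0)
        return {"input_ids": input_ids, "temperature": temperature}

class ToyVLModel:
    def __init__(self):
        self.language_model = ToyLanguageModel()

    def forward(self, input_ids, pixel_values=None, **kwargs):
        # Pure pass-through: whatever the wrapper does not consume
        # is forwarded to the inner language model unchanged.
        return self.language_model.forward(input_ids, **kwargs)

# Extra kwargs no longer raise a TypeError:
out = ToyVLModel().forward([1, 2, 3], temperature=0.7, visual_pos_masks=None)
```

Without the `**kwargs` on both `forward` signatures, the `visual_pos_masks=None` call above would fail exactly like the traceback shown earlier.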
commit de86db90a8dfb709efa1df34ebef0f60c018fb84
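The closure-based unification of vision and language model forwards mentioned in the description can be sketched like this. The factory and flag names are hypothetical, chosen only to illustrate the technique, not taken from the refactored files:

```python
# Sketch of the closure approach: one factory builds the forward function,
# with the vision/language distinction captured in the closure instead of
# duplicated across near-identical function bodies.
# `make_forward_fn` and `is_vlm` are illustrative names, not the real API.

def make_forward_fn(is_vlm: bool):
    def forward_fn(model, input_ids, position_ids=None, **kwargs):
        if is_vlm:
            # VLMs compute position_ids (e.g. for mrope) internally,
            # so we deliberately do not pass them down.
            return model(input_ids=input_ids, **kwargs)
        # Plain "GPT" language models receive position_ids as usual.
        return model(input_ids=input_ids, position_ids=position_ids, **kwargs)
    return forward_fn

vlm_forward = make_forward_fn(is_vlm=True)
gpt_forward = make_forward_fn(is_vlm=False)
```

The extra `**kwargs` on `forward_fn` keeps the pass-through property from the previous fix, so arguments like `temperature` still reach the model through either closure.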