[megatron] fix: VLMs using fused kernels#3849
[megatron] fix: VLMs using fused kernels#3849wuxibin89 merged 1 commit intoverl-project:mainfrom HollowMan6:mega_fused_forward
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the fused forward pass for Qwen VL models by patching _postprocess instead of the entire forward method. This is a solid improvement for maintainability and resolves an issue with unexpected keyword arguments. The modification to pass position_ids=None in fused_forward_qwen2_5_vl is also appropriate. I have identified one critical issue where the retrieval of output weights is incorrect when share_embeddings_and_output_weights is enabled, which would lead to a crash. A code suggestion to rectify this is provided.
|
Update: I fixed the previous issue, we still need cc: @ISEEKYAN Although it doesn't throw any errors when I do training with this code, I would appreciate some review/feedback here. |
There was a problem hiding this comment.
Code Review
This pull request addresses a TypeError caused by an unexpected keyword argument in the _fused_GPTModel_forward function and a shape mismatch error when calculating mrope in fused_forward_qwen2_5_vl. The changes involve modifying the function signature of _fused_GPTModel_forward and adjusting the arguments passed to the model in fused_forward_qwen2_5_vl.
|
@HollowMan6 Good job! I really appreciate your work on megatron supports. But once this PR is merged, we will lose compatibility to I recommend:
BTW, this changes is also relative to #3522 |
|
Thank you for your valuable feedback, @ISEEKYAN! I added an assertion to check if mcore is greater/equal to 0.13.0 when I briefly checked #3522 and left a comment there, which also applies to one of the problems this PR is targeting: #3522 (comment). Currently, I'm more focused on dense models, but I'll definitely try that out when I need to train MoE models and have enough nodes. |
There was a problem hiding this comment.
Code Review
This pull request refactors the model forward pass to fix an issue with Qwen3VL and fused forward, and to align with mcore >= 0.13.0. The introduction of model_forward_gen and fused_forward_model_gen unifies the forward pass logic for different models, which is a good improvement for maintainability. However, the review identified two major concerns. A critical issue is the removal of Mixture-of-Token Parallelism (MTP) support in the fused forward pass, which could break models relying on this feature. Another high-severity issue is the removal of the non-packed sequence forward path, which reduces functionality and could be a breaking change.
|
I found another issue in the original upstream codebase: the Given the current situation of functions under I've tested locally for Qwen3VL, either w/ or w/o the fused kernel enabled, with the patched mbridge mentioned above, and they work correctly. I can see CI has also been passed, the current failed ones are related to network and not related to this PR, so I think we are good to go! |
|
Now all the CI has passed! |
Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. I found another issue in the original upstream codebase: the temperature parameter doesn't get correctly passed to `_fused_GPTModel_forward`. To make this work normally, I added **kwargs to allow models to accept additional arbitrary kwargs and pass these **kwargs all to self.language_model, for both the verl side (Qwen2_5VLModel) and all the vision models under mbridge Given the current situation of functions under `verl/models/mcore/model_forward.py` and `verl/models/mcore/model_forward_fused.py` (code duplications, several unused and untested condition branches, as well as their allowance for arbitrary kwargs but not getting them passed to anywhere), it is very hard to debug and maintain. So I also refactored the functions and cleaned up the code under those 2 files to use closure for unifying vision models and normal "GPT" (language) models, for better maintainability. Signed-off-by: Hollow Man <hollowman@opensuse.org>
|
When I use the main branch code of mbridge and verl to train Qwen3-VL-30B-A3B-Instruct with CP=2, I get the following error: |
|
@xichengpro I didn't test the video part, and it looks like the video data support of Qwen3VL hasn't been added to mbridge. Feel free to contribute to mbridge for this! |
|
|
@xichengpro Qwen3VL model context parallel support hasn't been merged to main branch of mbridge, you will need to use this PR's fork to get it working: ISEEKYAN/mbridge#36 Note that if you directly install from that PR, remember to manually patch the code there with ISEEKYAN/mbridge#37 to get fused kernel properly working. Anyway, any other issues can be followed under that PR. |
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that #3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
|
@asirgogogo Are you sure you are using the latest main branch version of mbridge? Please make sure you have included ISEEKYAN/mbridge@74093e3 in your local installed mbridge. This PR should have already fixed your problem: ISEEKYAN/mbridge#37 |
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/14af8fe14b5c115e58b5967bb4f313ff0ab994a4/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/8c13e08d8a6c250c1b63da73fbdd120d27ab6820/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project/verl#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/62e68885e570a0c9297f9cf414645455d2f46c83/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/7cb00e5933e686411c829205e8dfc8e7adc99585/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>
…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. ### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>

What does this PR do?
Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the
GPTModelforward as well for Qwen3VL to support deepstack:https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84
Since mcore v0.13.0 introduced
_postprocessand_preprocess, and our patch focuses on_postprocess, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch_postprocessas we will need to passtemperatureargument as well:In addition, there will be shape mismatch error when calculating
mrope, if we passposition_idsinfused_forward_qwen2_5_vl, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)