[megatron] fix: VLMs using fused kernels by HollowMan6 · Pull Request #3849 · verl-project/verl

HollowMan6 · 2025-10-21T11:46:15Z

What does this PR do?


Currently, using fused kernels with Qwen3VL fails with an unexpected keyword argument 'visual_pos_masks' error. This happens because mbridge also customizes the GPTModel forward for Qwen3VL to support deepstack:
https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced _preprocess and _postprocess, and our patch only concerns _postprocess, I also cleaned up the function for better maintainability and to fix the extra deepstack argument issue. We can't simply patch _postprocess alone, because we also need to pass the temperature argument. The original error:

output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
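
The failure mode in the traceback can be reproduced without Megatron. The sketch below is a minimal stand-in, not verl or mbridge code: `LanguageModel`, `VisionWrapper`, `strict_fused_forward`, `tolerant_fused_forward` and `run_with_patch` are all hypothetical names. It shows a replacement forward with a fixed signature rejecting the extra `visual_pos_masks` keyword that the wrapper passes, while a replacement that accepts `**kwargs` tolerates it.

```python
class LanguageModel:
    """Hypothetical stand-in for a GPT-style language model."""

    def forward(self, input_ids, visual_pos_masks=None):
        return {"input_ids": input_ids, "visual_pos_masks": visual_pos_masks}

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class VisionWrapper:
    """Hypothetical VLM wrapper that passes a deepstack-style keyword."""

    def __init__(self):
        self.language_model = LanguageModel()

    def __call__(self, input_ids):
        return self.language_model(input_ids, visual_pos_masks=[True, False])


def strict_fused_forward(self, input_ids, temperature=1.0):
    # Fixed signature: any keyword not listed here raises TypeError.
    return {"input_ids": input_ids, "temperature": temperature}


def tolerant_fused_forward(self, input_ids, temperature=1.0, **kwargs):
    # Extra keywords (e.g. visual_pos_masks) are accepted and kept.
    return {"input_ids": input_ids, "temperature": temperature, **kwargs}


def run_with_patch(patched_forward):
    model = VisionWrapper()
    # Bind the replacement to the instance, similar in spirit to patching forward.
    model.language_model.forward = patched_forward.__get__(model.language_model)
    return model([1, 2, 3])


try:
    run_with_patch(strict_fused_forward)
    strict_error = None
except TypeError as exc:
    strict_error = str(exc)

tolerant_output = run_with_patch(tolerant_fused_forward)
```

With the strict replacement the call fails in the same way as the traceback above; with the tolerant one the extra keyword simply arrives in the output dictionary.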

In addition, passing position_ids in fused_forward_qwen2_5_vl causes a shape mismatch error when calculating mrope. I tried to debug it, but the shape passed there doesn't make sense. Since https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 states that the model will calculate position_ids itself, I followed that code and no longer pass position_ids. This works for both Qwen2.5VL and Qwen3VL without throwing further errors.


gemini-code-assist

Code Review

This pull request refactors the fused forward pass for Qwen VL models by patching _postprocess instead of the entire forward method. This is a solid improvement for maintainability and resolves an issue with unexpected keyword arguments. The modification to pass position_ids=None in fused_forward_qwen2_5_vl is also appropriate. I have identified one critical issue where the retrieval of output weights is incorrect when share_embeddings_and_output_weights is enabled, which would lead to a crash. A code suggestion to rectify this is provided.

verl/models/mcore/model_forward_fused.py

HollowMan6 · 2025-10-21T11:53:21Z

Update: I fixed the previous issue. We still need to patch forward, since that is where the temperature argument has to be passed. CI passes now, apart from one failure that is not related to this patch. ~~now I modified it to use a more conservative way~~

cc: @ISEEKYAN

Although it doesn't throw any errors when I do training with this code, I would appreciate some review/feedback here.

gemini-code-assist

Code Review

This pull request addresses a TypeError caused by an unexpected keyword argument in the _fused_GPTModel_forward function and a shape mismatch error when calculating mrope in fused_forward_qwen2_5_vl. The changes involve modifying the function signature of _fused_GPTModel_forward and adjusting the arguments passed to the model in fused_forward_qwen2_5_vl.

verl/models/mcore/model_forward_fused.py

ISEEKYAN · 2025-10-22T12:27:40Z

@HollowMan6 Good job! I really appreciate your work on megatron supports.

But once this PR is merged, we will lose compatibility with megatron <= 0.12, since _preprocess and _postprocess were only introduced in mcore 0.13. Also, the fused kernel would be better patched through the _postprocess function instead of the forward function.

I recommend:

add a warning in this PR if the user is running megatron <= 0.12
leave a TODO noting that we should patch the "_postprocess" function instead.

BTW, this change is also related to #3522

HollowMan6 · 2025-10-22T13:36:53Z

Thank you for your valuable feedback, @ISEEKYAN! I added an assertion in patch_fused_forward that checks that mcore is at least 0.13.0. I've also updated the related documents/Dockerfile/CI to make sure they all use mcore 0.13.0 and above. As for patching _postprocess: we currently still need to patch forward, because temperature has to be passed explicitly to self._postprocess at the call site. There may be a better way to handle this, but I couldn't come up with an appropriate one, so I added the TODO in a comment as instructed.
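
A minimum-version guard of the kind described above might look like the following sketch. It is illustrative only: `parse_version` and `assert_min_mcore` are hypothetical helpers, the installed version is passed in as a plain string rather than read from Megatron, and the actual check in the PR may be written differently (a library such as `packaging` is usually the more robust choice for version comparison).

```python
def parse_version(text, width=3):
    """Parse leading numeric components, e.g. '0.13.0rc1' -> (0, 13, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:width]
    # Pad so that '0.13' and '0.13.0' compare as equal.
    return tuple(parts + [0] * (width - len(parts)))


def assert_min_mcore(installed, minimum="0.13.0"):
    """Fail early if the installed version is older than the minimum."""
    assert parse_version(installed) >= parse_version(minimum), (
        f"fused forward patch requires mcore >= {minimum}, found {installed}"
    )


assert_min_mcore("0.13.0")      # passes
assert_min_mcore("0.14.0rc1")   # passes

try:
    assert_min_mcore("0.12.1")
    old_version_rejected = False
except AssertionError:
    old_version_rejected = True
```

Failing at patch time gives a clearer message than the attribute error a user would otherwise hit later when `_postprocess` is missing.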

I briefly checked #3522 and left a comment there, which also applies to one of the problems this PR is targeting: #3522 (comment). Currently, I'm more focused on dense models, but I'll definitely try that out when I need to train MoE models and have enough nodes.

gemini-code-assist

Code Review

This pull request refactors the model forward pass to fix an issue with Qwen3VL and fused forward, and to align with mcore >= 0.13.0. The introduction of model_forward_gen and fused_forward_model_gen unifies the forward pass logic for different models, which is a good improvement for maintainability. However, the review identified two major concerns. A critical issue is the removal of Multi-Token Prediction (MTP) support in the fused forward pass, which could break models relying on this feature. Another high-severity issue is the removal of the non-packed sequence forward path, which reduces functionality and could be a breaking change.

verl/models/mcore/model_forward_fused.py

verl/models/mcore/model_forward.py

HollowMan6 · 2025-10-22T19:15:07Z

I found another issue in the original upstream codebase: the temperature parameter is not correctly passed to _fused_GPTModel_forward. To make this work, I added **kwargs so that models accept additional arbitrary keyword arguments and pass all of them on to self.language_model, both on the verl side (Qwen2_5VLModel) and for all the vision models under mbridge: ISEEKYAN/mbridge#37
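
The effect of forwarding `**kwargs` can be sketched with plain Python objects. The classes below are hypothetical stand-ins, not the verl or mbridge models: one wrapper accepts extra keywords but drops them, so `temperature` never reaches the inner model, while the other forwards them.

```python
class FusedLanguageModel:
    """Hypothetical inner model whose forward takes a temperature."""

    def __call__(self, logits, temperature=1.0, **kwargs):
        return [x / temperature for x in logits]


class DroppingWrapper:
    """Accepts extra keywords but never passes them on."""

    def __init__(self):
        self.language_model = FusedLanguageModel()

    def __call__(self, logits, pixel_values=None, **kwargs):
        return self.language_model(logits)


class ForwardingWrapper:
    """Passes every extra keyword through to the inner model."""

    def __init__(self):
        self.language_model = FusedLanguageModel()

    def __call__(self, logits, pixel_values=None, **kwargs):
        return self.language_model(logits, **kwargs)


dropped = DroppingWrapper()([2.0, 4.0], temperature=2.0)
forwarded = ForwardingWrapper()([2.0, 4.0], temperature=2.0)
```

In the dropping case the call succeeds silently with the default temperature, which is why this kind of bug produces no error during training.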

The functions under verl/models/mcore/model_forward.py and verl/models/mcore/model_forward_fused.py are currently very hard to debug and maintain: they contain duplicated code and several unused and untested branches, and they accept arbitrary kwargs without passing them on anywhere. So I also refactored and cleaned up the code in those 2 files, using closures to unify vision models and normal "GPT" (language) models, for better maintainability.
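
The closure approach can be illustrated roughly as follows. Only the name `model_forward_gen` comes from this thread; its signature, the `vision_model` flag, the `multi_modal_inputs` parameter and the `echo_model` stand-in are assumptions made for this sketch, and the real implementation handles much more (sequence packing, pre/post-processing).

```python
def model_forward_gen(vision_model=False):
    """Return a forward function specialised for vision or language models."""

    def model_forward(model, input_ids, multi_modal_inputs=None, **kwargs):
        model_kwargs = dict(kwargs)
        if vision_model:
            # Vision models receive the multimodal inputs and compute
            # position ids themselves, so none are passed in.
            model_kwargs.update(multi_modal_inputs or {})
            model_kwargs["position_ids"] = None
        return model(input_ids=input_ids, **model_kwargs)

    return model_forward


def echo_model(**kwargs):
    """Hypothetical model that returns the keywords it was called with."""
    return kwargs


gpt_forward = model_forward_gen(vision_model=False)
vlm_forward = model_forward_gen(vision_model=True)

gpt_out = gpt_forward(echo_model, [1, 2], position_ids=[0, 1])
vlm_out = vlm_forward(
    echo_model,
    [1, 2],
    multi_modal_inputs={"pixel_values": "img"},
    position_ids=[0, 1],
)
```

Both variants share one body, so the keyword handling only has to be fixed in a single place.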

I've tested Qwen3VL locally, both with and without the fused kernel enabled, using the patched mbridge mentioned above, and both configurations work correctly.

CI has also passed; the remaining failures are network-related and not related to this PR, so I think we are good to go!

cc: @ISEEKYAN @wuxibin89 @vermouth1992

HollowMan6 · 2025-10-23T15:23:10Z

Now all the CI has passed!


xichengpro · 2025-10-24T10:00:01Z

When I use the main branch code of mbridge and verl to train Qwen3-VL-30B-A3B-Instruct with CP=2, I get the following error:

(TaskRunner pid=30894)   File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/model.py", line 279, in forward
(TaskRunner pid=30894)     assert video_embeds is None, "not support video now"
(TaskRunner pid=30894)            ^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=30894) AssertionError: not support video now

@HollowMan6 @ISEEKYAN @wuxibin89

HollowMan6 · 2025-10-24T10:04:33Z

@xichengpro I didn't test the video part, and it looks like the video data support of Qwen3VL hasn't been added to mbridge. Feel free to contribute to mbridge for this!

xichengpro · 2025-10-24T10:22:25Z

> @xichengpro I didn't test the video part, and it looks like the video data support of Qwen3VL hasn't been added to mbridge. Feel free to contribute to mbridge for this!

@HollowMan6
Sorry, I missed one point: my data is geo3k, which is an image dataset.
Here's my script:

set -xeuo pipefail
cd /data/dbn-ceph/verl-251024/verl
project_name='DAPO'
exp_name='qwen3vl_1001_mg_1023_cp2'
export CUDA_DEVICE_MAX_CONNECTIONS=1
export VLLM_ALLREDUCE_USE_SYMM_MEM=0
export TENSORBOARD_DIR=/data/dbn-ceph/exp/qwen3vl/tensorboard/qwen3vl_1001_mg

# Paths
MODEL_PATH=/data/dbn-ceph/models/huggingface/Qwen/Qwen3-VL-30B-A3B-Instruct
CKPTS_DIR=/data/dbn-ceph/exp/qwen3vl/ckpt/${exp_name}
train_path=/data/dbn-ceph/datasets/data/geo3k/train.parquet
test_path=/data/dbn-ceph/datasets/data/geo3k/test.parquet
# ray job submit \
#     --runtime-env=verl/trainer/runtime_env.yaml \
#     --no-wait \
#     -- \
    python3 -m verl.trainer.main_ppo --config-path=config \
    --config-name='ppo_megatron_trainer.yaml'\
    algorithm.adv_estimator=grpo \
    data.train_files="$train_path" \
    data.val_files="$test_path" \
    data.train_batch_size=32 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.filter_overlong_prompts_workers=128 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.context_parallel_size=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=5120 \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=20480 \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=20480 \
    actor_rollout_ref.rollout.name=vllm \
    +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.megatron.param_offload=True \
    actor_rollout_ref.actor.megatron.optimizer_offload=True \
    actor_rollout_ref.actor.megatron.grad_offload=True \
    actor_rollout_ref.ref.megatron.param_offload=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1 \
    +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=False \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type=alltoall \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.val_before_train=False \
    trainer.logger='["console","tensorboard"]' \
    trainer.project_name='verl_grpo_example_geo3k' \
    trainer.experiment_name='qwen3_vl_30b_megatron' \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=4 \
    trainer.total_epochs=15 \
    2>&1 | tee /data/dbn-ceph/exp/qwen3vl/log/qwen3_vl_30b_megatron_$(date +'%Y%m%d_%H%M%S').log

HollowMan6 · 2025-10-24T11:53:18Z

@xichengpro Context parallel support for the Qwen3VL model hasn't been merged into the main branch of mbridge yet, so you will need to use this PR's fork to get it working: ISEEKYAN/mbridge#36. Note that if you install directly from that PR, remember to manually patch the code there with ISEEKYAN/mbridge#37 to get the fused kernel working properly.

Anyway, any other issues can be followed up under that PR.

### What does this PR do?

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to mcore >= 0.13.0, since the fused forward patch now requires it: #3849 (comment)
- Upgrade vllm to the latest version and sglang to the version indicated in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since that is the oldest version with precompiled Torch 2.8 support and also the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8)
- Fix the documentation, since apex should not be needed for the FSDP backend.
- Add a note mentioning that pyext lacks maintenance and cannot work with Python 3.12, and they can install from a separate fork when.

### Test

I installed the environments in this way and it works fine.

Signed-off-by: Hollow Man <hollowman@opensuse.org>

asirgogogo · 2025-10-27T14:46:32Z

When I use mbridge and your modified code to train Qwen3-VL-30B-A3B-Instruct, I get the following error:

HollowMan6 · 2025-10-27T14:51:26Z

@asirgogogo Are you sure you are using the latest main branch version of mbridge? Please make sure you have included ISEEKYAN/mbridge@74093e3 in your local installed mbridge. This PR should have already fixed your problem: ISEEKYAN/mbridge#37

### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... 
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/14af8fe14b5c115e58b5967bb4f313ff0ab994a4/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. 
### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>

…project#3901) ### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. - Fix typos in the verl image README: `nvidia-cudnn-cu12` - Upgrade to use mcore >= 0.13.0 since the fused forward patch now requires that verl-project#3849 (comment) - Upgrade vllm to the latest, sglang to the version indicated in `setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1. - Use flash-attn 2.8.1 since that's the oldest version that has Torch 2.8 precompiled support, also it's the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8) - Fix the documentation since apex should not be needed in FSDP backend. - Add note mentioning that pyext is lack of maintainace and cannot work with python 3.12, and they can install from a separate fork when. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. I installed the environments in this way and it works fine. 
### API and Usage Example > Demonstrate how the API changes if any, and provide usage example(s) if possible. ```python # Add code snippet or script demonstrating how to use this ``` ### Design & Code Changes > Demonstrate the high-level design if this PR is complex, and list the specific changes. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).) Signed-off-by: Hollow Man <hollowman@opensuse.org>

### What does this PR do? > Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review. Currently, we will have error regarding to unexpected keyword argument 'visual_pos_masks', this is because mbridge did some customization of the `GPTModel` forward as well for Qwen3VL to support deepstack: https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84 Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess` as we will need to pass `temperature` argument as well: ```logs output = self.forward_backward_batch( /verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch losses_reduced = forward_backward_func( miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining output_tensor, num_tokens = forward_step( ....... verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl output_orig: CausalLMOutputForPPO = model( ...... 
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward output = self.language_model( envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) /envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl return forward_call(*args, **kwargs) TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks' ``` In addition, there will be shape mismatch error when calculating `mrope`, if we pass `position_ids` in `fused_forward_qwen2_5_vl`, I tried to debug but the shape passed here doesn't make sense, and since according to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 it says model will calculate position_ids, I just follow the code there to not pass the position ids, and it works both for Qwen2.5VL and Qwen3VL without throwing further errors. ### Checklist Before Starting - [X] Search for similar PRs. Paste at least one query link here: ... - [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test > For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc. 
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>

…project#3901)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Require mcore >= 0.13.0, since the fused forward patch now depends on it (see verl-project#3849 (comment)).
- Upgrade vllm to the latest version and sglang to the version indicated in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since it is the oldest version with precompiled Torch 2.8 support and also the latest flash-attn version supported by TransformerEngine (from 2.6 up to the latest 2.8).
- Fix the documentation, since apex should not be needed for the FSDP backend.
- Add a note that pyext is unmaintained and does not work with Python 3.12; users can install it from a separate fork instead.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

I installed the environments in this way and it works fine.
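
As a rough sketch of how the minimum versions listed above could be checked in an installed environment, the snippet below compares installed distribution versions against a floor using only the standard library. The helper names (`parse_version`, `meets_minimum`, `check_package`) are hypothetical, the parser only looks at the first three numeric components, and the distribution name `megatron-core` in the usage note is an assumption about how mcore is packaged.

```python
from importlib import metadata


def parse_version(text):
    """Turn a string such as '0.13.0rc1' into (0, 13, 0), keeping leading digits only."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum (numeric parts only)."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(dist_name, minimum):
    """Return (installed_version_or_None, ok) for one distribution."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, meets_minimum(installed, minimum)
```

For example, `check_package("megatron-core", "0.13.0")` would return the installed version and whether it satisfies the floor, or `(None, False)` if that distribution is not installed. For anything beyond a quick sanity check, a full version parser is the safer choice, since this sketch ignores pre-release and local version suffixes.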
### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>







### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Currently, the fused forward fails with an unexpected keyword argument `visual_pos_masks` error. This happens because mbridge also customizes the `GPTModel` forward for Qwen3VL to support deepstack:
https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess`, because we also need to pass the `temperature` argument. The original error:

```logs
    output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
```

In addition, passing `position_ids` in `fused_forward_qwen2_5_vl` causes a shape mismatch error when calculating `mrope`. I tried to debug it, but the shape passed there doesn't make sense. Since the code at https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 says the model will calculate `position_ids` itself, I follow that code and do not pass the position ids. This works for both Qwen2.5VL and Qwen3VL without throwing further errors.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
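The `TypeError` described in this PR can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `TinyGPTModel`, `strict_fused_forward`, and `tolerant_fused_forward` are hypothetical names, not verl, mbridge, or mcore code, and accepting `**kwargs` is one common way to tolerate extra keywords, not necessarily what this PR does.

```python
class TinyGPTModel:
    """Hypothetical stand-in for a GPT-style model; not the real mcore class."""

    def forward(self, input_ids, position_ids=None, **kwargs):
        return {"input_ids": input_ids, "extra": sorted(kwargs)}


def strict_fused_forward(self, input_ids, position_ids=None, temperature=1.0):
    """Replacement forward with a fixed signature: unknown keywords raise."""
    return {"input_ids": input_ids, "temperature": temperature}


def tolerant_fused_forward(self, input_ids, position_ids=None, temperature=1.0, **kwargs):
    """Replacement forward that also accepts keywords it does not know about."""
    return {"input_ids": input_ids, "temperature": temperature, "extra": sorted(kwargs)}


def call_with_patch(patched_forward):
    model = TinyGPTModel()
    # Bind the replacement function as a method on this instance only.
    model.forward = patched_forward.__get__(model, TinyGPTModel)
    # The caller plays the role of a wrapper model passing a deepstack-style keyword.
    return model.forward([1, 2, 3], visual_pos_masks=[True, False, True])


strict_error = None
try:
    call_with_patch(strict_fused_forward)
except TypeError as exc:
    # Same failure mode as the traceback above: the patched forward rejects the keyword.
    strict_error = str(exc)

tolerant_result = call_with_patch(tolerant_fused_forward)
print(strict_error)
print(tolerant_result)
```

The strict replacement fails as soon as a caller adds a keyword, while the tolerant one keeps working and still receives `temperature` through its own signature.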

…project#3901)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to mcore >= 0.13.0, since the fused forward patch now requires it: verl-project#3849 (comment)
- Upgrade vllm to the latest version and sglang to the version indicated in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10 and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since it is the oldest version with precompiled Torch 2.8 support and also the latest flash-attn version supported by TransformerEngine (starting from 2.6 till latest 2.8)
- Fix the documentation, since apex should not be needed for the FSDP backend.
- Add a note that pyext lacks maintenance and does not work with Python 3.12, and that users can install it from a separate fork if needed.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

I installed the environments in this way and it works fine.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
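Because the fused forward patch is stated to require mcore >= 0.13.0, an environment can be checked before training starts. This is a minimal sketch under assumptions: the distribution name `megatron-core` and the helper names are illustrative, not part of verl, and the parser deliberately ignores pre-release ordering (so `0.13.0rc1` is treated like `0.13.0`).

```python
from importlib import metadata


def parse_version(text, width=3):
    """Turn '0.13.0' or '0.13.0rc1' into a fixed-width tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:width]
    # Pad so that '0.13' compares equal to '0.13.0'.
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum="0.13.0"):
    """Return True if the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_mcore_version(dist_name="megatron-core"):
    """Look up the installed version; the distribution name is an assumption."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_mcore_version()
if found is None:
    print("megatron-core is not installed in this environment")
elif not meets_minimum(found):
    print(f"megatron-core {found} is older than the 0.13.0 stated in this PR")
else:
    print(f"megatron-core {found} satisfies the stated minimum")
```

For strict PEP 440 ordering, including pre-releases, a dedicated version parser would be a better fit than this hand-rolled comparison.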


HollowMan6 requested review from ISEEKYAN and vermouth1992 as code owners October 21, 2025 11:46

gemini-code-assist bot reviewed Oct 21, 2025

verl/models/mcore/model_forward_fused.py (resolved)

gemini-code-assist bot reviewed Oct 21, 2025

verl/models/mcore/model_forward_fused.py (resolved)

verl/models/mcore/model_forward_fused.py (resolved)

HollowMan6 requested review from eric-haibin-lin and zhaochenyang20 as code owners October 22, 2025 13:18

HollowMan6 mentioned this pull request Oct 22, 2025

[Megatron] feat: 1f1b overlap/moe_a2a_overlap #3522

Merged

4 tasks

HollowMan6 requested a review from ZihengJiang as a code owner October 22, 2025 17:34

gemini-code-assist bot reviewed Oct 22, 2025

verl/models/mcore/model_forward_fused.py (resolved)

verl/models/mcore/model_forward.py (resolved)

HollowMan6 requested a review from wuxibin89 October 22, 2025 19:15

HollowMan6 changed the title ~~[megatron] fix: Qwen3VL with fused forward~~ [megatron] fix: VLMs with fused forward Oct 22, 2025

HollowMan6 changed the title ~~[megatron] fix: VLMs with fused forward~~ [megatron] fix: VLMs using fused kernels Oct 23, 2025

wuxibin89 approved these changes Oct 24, 2025

wuxibin89 merged commit ab676dc into verl-project:main Oct 24, 2025
75 checks passed

HollowMan6 deleted the mega_fused_forward branch October 24, 2025 05:21

HollowMan6 mentioned this pull request Oct 24, 2025

Qwen3-VL-Dense RL Training in verl Using Megatron #3875

Closed

4 tasks

This was referenced Oct 24, 2025

chore: add kwargs for _pre/_postprocess of GPTModel NVIDIA/Megatron-LM#1923

Open
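The upstream change referenced above (adding kwargs to `_pre/_postprocess`) relates to the `TypeError` in this PR's description, where a replaced forward function received an unexpected `visual_pos_masks` keyword. The toy sketch below illustrates that failure mode and the keyword pass-through that avoids it. Every name in it (`strict_forward`, `tolerant_forward`, `call_with_deepstack_args`) is a hypothetical stand-in and not Megatron-LM, mbridge, or verl code.

```python
def strict_forward(input_ids, temperature=1.0):
    """A patched forward with a fixed signature (hypothetical stand-in)."""
    return {"input_ids": input_ids, "temperature": temperature}


def tolerant_forward(input_ids, temperature=1.0, **kwargs):
    """Same forward, but extra keyword arguments are accepted.

    A real implementation would hand kwargs on to its preprocessing
    step; here we only record their names to show they arrived.
    """
    return {
        "input_ids": input_ids,
        "temperature": temperature,
        "extras": sorted(kwargs),
    }


def call_with_deepstack_args(forward):
    """Mimic a caller that passes an extra model-specific keyword."""
    return forward([1, 2, 3], visual_pos_masks=[True, False, True])


try:
    call_with_deepstack_args(strict_forward)
    strict_failed = False
except TypeError:
    # Same class of error as in the traceback: unexpected keyword argument.
    strict_failed = True

result = call_with_deepstack_args(tolerant_forward)
```

The fixed-signature version fails as soon as a caller adds a model-specific argument, while the `**kwargs` version stays compatible with callers it does not know about.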

[doc] chore: update installation scripts to use newer versions #3901

Merged

Conversation

HollowMan6 commented Oct 21, 2025 (edited)

gemini-code-assist bot left a comment

Code Review

HollowMan6 commented Oct 21, 2025 (edited)

gemini-code-assist bot left a comment

Code Review

ISEEKYAN commented Oct 22, 2025

HollowMan6 commented Oct 22, 2025

gemini-code-assist bot left a comment

Code Review

HollowMan6 commented Oct 22, 2025 (edited)

HollowMan6 commented Oct 23, 2025 (edited)

xichengpro commented Oct 24, 2025

HollowMan6 commented Oct 24, 2025

xichengpro commented Oct 24, 2025 (edited)

HollowMan6 commented Oct 24, 2025 (edited)

asirgogogo commented Oct 27, 2025

HollowMan6 commented Oct 27, 2025

5 participants
