
[megatron] fix: VLMs using fused kernels#3849

Merged
wuxibin89 merged 1 commit intoverl-project:mainfrom
HollowMan6:mega_fused_forward
Oct 24, 2025
Conversation

@HollowMan6
Collaborator

@HollowMan6 HollowMan6 commented Oct 21, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Currently, we get an error about an unexpected keyword argument 'visual_pos_masks'. This is because mbridge also customized the `GPTModel` forward for Qwen3VL to support deepstack:
https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our patch focuses on `_postprocess`, I also cleaned up the function for better maintainability and to fix this extra deepstack argument issue. We can't simply patch `_postprocess`, because we also need to pass the `temperature` argument:

output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
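The constraint above can be illustrated with a minimal, self-contained sketch (plain Python, no Megatron; the class and function names are illustrative stand-ins, not verl's actual code): `temperature` enters through `forward`'s kwargs, so a patch on `forward` can capture it and hand it to `_postprocess`, while model-specific extras such as `visual_pos_masks` flow through `**kwargs` instead of being rejected.

```python
# Illustrative sketch of why the PR patches forward() rather than only
# _postprocess(). All names here are hypothetical stand-ins.

class GPTModelSketch:
    """Stand-in for mcore's GPTModel (>= 0.13.0 split into pre/postprocess)."""

    def _preprocess(self, tokens):
        return [t * 2.0 for t in tokens]          # fake hidden states

    def _postprocess(self, hidden, temperature=1.0):
        return [h / temperature for h in hidden]  # fake logit scaling


def make_fused_forward(model):
    """Closure-style patch: accept temperature plus arbitrary extra kwargs."""
    def fused_forward(tokens, temperature=1.0, **extra):
        hidden = model._preprocess(tokens)
        # extras (e.g. visual_pos_masks) would be consumed here, not rejected
        return model._postprocess(hidden, temperature=temperature)
    return fused_forward


model = GPTModelSketch()
model.forward = make_fused_forward(model)
print(model.forward([1.0, 2.0], temperature=2.0, visual_pos_masks=None))
# -> [1.0, 2.0]
```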

In addition, there is a shape mismatch error when calculating `mrope` if we pass `position_ids` in `fused_forward_qwen2_5_vl`. I tried to debug it, but the shape passed here doesn't make sense. According to https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117 the model calculates `position_ids` itself, so I followed the code there and stopped passing the position ids; this works for both Qwen2.5VL and Qwen3VL without throwing further errors.
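The "let the model compute it" behavior can be sketched in a tiny runnable toy (assumed shapes; `toy_vlm_forward` is hypothetical, not verl's real code): a VLM derives its own mrope-style position ids when the caller passes `None`, which is what `fused_forward_qwen2_5_vl` now relies on.

```python
def toy_vlm_forward(input_ids, position_ids=None):
    """Derive mrope-style position_ids when the caller passes None."""
    if position_ids is None:
        seq = list(range(len(input_ids)))
        position_ids = [seq, seq, seq]  # temporal / height / width streams
    if len(position_ids) != 3:
        raise ValueError("mrope expects 3 coordinate streams")
    return position_ids

print(toy_vlm_forward([10, 11, 12]))
# -> [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
```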

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the fused forward pass for Qwen VL models by patching _postprocess instead of the entire forward method. This is a solid improvement for maintainability and resolves an issue with unexpected keyword arguments. The modification to pass position_ids=None in fused_forward_qwen2_5_vl is also appropriate. I have identified one critical issue where the retrieval of output weights is incorrect when share_embeddings_and_output_weights is enabled, which would lead to a crash. A code suggestion to rectify this is provided.

@HollowMan6
Collaborator Author

HollowMan6 commented Oct 21, 2025

Update: I fixed the previous issue; we still need `forward` because we need to pass the `temperature` argument there. CI has passed now, and the current CI failure is not related to this patch. I modified it to use a more conservative approach.

cc: @ISEEKYAN

Although it doesn't throw any errors when I do training with this code, I would appreciate some review/feedback here.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a TypeError caused by an unexpected keyword argument in the _fused_GPTModel_forward function and a shape mismatch error when calculating mrope in fused_forward_qwen2_5_vl. The changes involve modifying the function signature of _fused_GPTModel_forward and adjusting the arguments passed to the model in fused_forward_qwen2_5_vl.

@ISEEKYAN
Collaborator

@HollowMan6 Good job! I really appreciate your work on Megatron support.

However, once this PR is merged, we will lose compatibility with megatron <= 0.12, since `_preprocess` and `_postprocess` were introduced in mcore 0.13. Also, when patching the fused kernel, it is better to patch the `_postprocess` function instead of the `forward` function.

I recommend:

  1. adding a warning in this PR if the user is using megatron <= 0.12
  2. leaving a TODO noting that we should patch the `_postprocess` function.

BTW, this change is also related to #3522

@HollowMan6
Collaborator Author

Thank you for your valuable feedback, @ISEEKYAN! I added an assertion to check that mcore is >= 0.13.0 in `patch_fused_forward`. I've also updated the related documents/Dockerfile/CI to make sure they all use mcore 0.13.0 and above. As for patching `_postprocess`: we currently still need to patch `forward`, because we must pass `temperature` explicitly to `self._postprocess` when calling it. There may be a better way to handle this, but I couldn't come up with an appropriate one, so I added the TODO in the comment as instructed.

I briefly checked #3522 and left a comment there, which also applies to one of the problems this PR is targeting: #3522 (comment). Currently, I'm more focused on dense models, but I'll definitely try that out when I need to train MoE models and have enough nodes.
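The version guard described above can be sketched as follows (a minimal stand-alone sketch; the function name and message are hypothetical, and verl's actual check may read the version differently):

```python
def assert_mcore_supports_fused_patch(version: str) -> None:
    """Refuse to patch the fused forward on Megatron-Core < 0.13.0,
    where _preprocess/_postprocess do not exist."""
    major, minor = (int(x) for x in version.split(".")[:2])
    if (major, minor) < (0, 13):
        raise AssertionError(
            f"patch_fused_forward requires megatron.core >= 0.13.0, got {version}"
        )

assert_mcore_supports_fused_patch("0.13.0")  # passes silently
```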

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the model forward pass to fix an issue with Qwen3VL and fused forward, and to align with mcore >= 0.13.0. The introduction of model_forward_gen and fused_forward_model_gen unifies the forward pass logic for different models, which is a good improvement for maintainability. However, the review identified two major concerns. A critical issue is the removal of Multi-Token Prediction (MTP) support in the fused forward pass, which could break models relying on this feature. Another high-severity issue is the removal of the non-packed sequence forward path, which reduces functionality and could be a breaking change.

@HollowMan6
Copy link
Collaborator Author

HollowMan6 commented Oct 22, 2025

I found another issue in the original upstream codebase: the temperature parameter doesn't get correctly passed to `_fused_GPTModel_forward`. To make this work, I added `**kwargs` so that models accept additional arbitrary kwargs and pass them all through to `self.language_model`, both on the verl side (`Qwen2_5VLModel`) and for all the vision models under mbridge: ISEEKYAN/mbridge#37

Given the current state of the functions under `verl/models/mcore/model_forward.py` and `verl/models/mcore/model_forward_fused.py` (code duplication, several unused and untested condition branches, and acceptance of arbitrary kwargs that are never passed anywhere), they are very hard to debug and maintain. So I also refactored and cleaned up the code in those two files to use closures that unify vision models and plain "GPT" (language) models, for better maintainability.
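The closure refactor can be illustrated with a hedged sketch (the name `model_forward_gen` comes from the review comments; the body and the `multi_modal_inputs` parameter are illustrative assumptions, not verl's actual code): one generator builds a forward wrapper for either a plain language model or a VLM, instead of maintaining near-duplicate per-model functions.

```python
def model_forward_gen(is_vlm):
    """Build a forward wrapper for either a plain LM or a VLM."""
    def forward(model, input_ids, multi_modal_inputs=None, **kwargs):
        call_kwargs = dict(kwargs)  # e.g. temperature flows through untouched
        if is_vlm:
            # merge vision inputs (e.g. pixel_values) into the call
            call_kwargs.update(multi_modal_inputs or {})
        return model(input_ids, **call_kwargs)
    return forward


def dummy_model(input_ids, **kwargs):
    return sorted(kwargs)  # record which kwargs actually reached the model


vlm_forward = model_forward_gen(is_vlm=True)
print(vlm_forward(dummy_model, [1, 2], {"pixel_values": 0}, temperature=0.7))
# -> ['pixel_values', 'temperature']
```

The point of the closure is that the shared argument plumbing (including `temperature`) lives in one place, while the VLM/LM difference is reduced to a single flag.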

I've tested locally for Qwen3VL, either w/ or w/o the fused kernel enabled, with the patched mbridge mentioned above, and they work correctly.

I can see CI has also passed; the currently failing jobs are network-related and not caused by this PR, so I think we are good to go!

cc: @ISEEKYAN @wuxibin89 @vermouth1992

@HollowMan6 HollowMan6 requested a review from wuxibin89 October 22, 2025 19:15
@HollowMan6 HollowMan6 changed the title [megatron] fix: Qwen3VL with fused forward [megatron] fix: VLMs with fused forward Oct 22, 2025
@HollowMan6
Collaborator Author

HollowMan6 commented Oct 23, 2025

Now all the CI has passed!

@HollowMan6 HollowMan6 changed the title [megatron] fix: VLMs with fused forward [megatron] fix: VLMs using fused kernels Oct 23, 2025
@wuxibin89 wuxibin89 merged commit ab676dc into verl-project:main Oct 24, 2025
75 checks passed
@HollowMan6 HollowMan6 deleted the mega_fused_forward branch October 24, 2025 05:21
@xichengpro
Contributor

When I use the main branch code of mbridge and verl to train Qwen3-VL-30B-A3B-Instruct with CP=2, I get the following error:

(TaskRunner pid=30894)   File "/usr/local/lib/python3.12/dist-packages/mbridge/models/qwen3_vl/model.py", line 279, in forward
(TaskRunner pid=30894)     assert video_embeds is None, "not support video now"
(TaskRunner pid=30894)            ^^^^^^^^^^^^^^^^^^^^
(TaskRunner pid=30894) AssertionError: not support video now

@HollowMan6 @ISEEKYAN @wuxibin89

@HollowMan6
Collaborator Author

@xichengpro I didn't test the video part, and it looks like the video data support of Qwen3VL hasn't been added to mbridge. Feel free to contribute to mbridge for this!

@xichengpro
Contributor

xichengpro commented Oct 24, 2025

@xichengpro I didn't test the video part, and it looks like the video data support of Qwen3VL hasn't been added to mbridge. Feel free to contribute to mbridge for this!
@HollowMan6
Sorry, I missed one point: my data is geo3k, which is an image dataset.
Here's my script:

set -xeuo pipefail
cd /data/dbn-ceph/verl-251024/verl
project_name='DAPO'
exp_name='qwen3vl_1001_mg_1023_cp2'
export CUDA_DEVICE_MAX_CONNECTIONS=1
export VLLM_ALLREDUCE_USE_SYMM_MEM=0
export TENSORBOARD_DIR=/data/dbn-ceph/exp/qwen3vl/tensorboard/qwen3vl_1001_mg

# Paths
MODEL_PATH=/data/dbn-ceph/models/huggingface/Qwen/Qwen3-VL-30B-A3B-Instruct
CKPTS_DIR=/data/dbn-ceph/exp/qwen3vl/ckpt/${exp_name}
train_path=/data/dbn-ceph/datasets/data/geo3k/train.parquet
test_path=/data/dbn-ceph/datasets/data/geo3k/test.parquet
# ray job submit \
#     --runtime-env=verl/trainer/runtime_env.yaml \
#     --no-wait \
#     -- \
    python3 -m verl.trainer.main_ppo --config-path=config \
    --config-name='ppo_megatron_trainer.yaml'\
    algorithm.adv_estimator=grpo \
    data.train_files="$train_path" \
    data.val_files="$test_path" \
    data.train_batch_size=32 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.filter_overlong_prompts_workers=128 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.megatron.context_parallel_size=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=5120 \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=20480 \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=20480 \
    actor_rollout_ref.rollout.name=vllm \
    +actor_rollout_ref.rollout.engine_kwargs.vllm.disable_mm_preprocessor_cache=True \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.megatron.param_offload=True \
    actor_rollout_ref.actor.megatron.optimizer_offload=True \
    actor_rollout_ref.actor.megatron.grad_offload=True \
    actor_rollout_ref.ref.megatron.param_offload=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1 \
    +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True \
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=False \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type=alltoall \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
    +actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.val_before_train=False \
    trainer.logger='["console","tensorboard"]' \
    trainer.project_name='verl_grpo_example_geo3k' \
    trainer.experiment_name='qwen3_vl_30b_megatron' \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=4 \
    trainer.total_epochs=15 \
    2>&1 | tee /data/dbn-ceph/exp/qwen3vl/log/qwen3_vl_30b_megatron_$(date +'%Y%m%d_%H%M%S').log

@HollowMan6
Collaborator Author

HollowMan6 commented Oct 24, 2025

@xichengpro Qwen3VL model context parallel support hasn't been merged to main branch of mbridge, you will need to use this PR's fork to get it working: ISEEKYAN/mbridge#36 Note that if you directly install from that PR, remember to manually patch the code there with ISEEKYAN/mbridge#37 to get fused kernel properly working.

Anyway, any other issues can be followed under that PR.

wuxibin89 pushed a commit that referenced this pull request Oct 25, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to use mcore >= 0.13.0 since the fused forward patch now
requires that
#3849 (comment)
- Upgrade vllm to the latest, sglang to the version indicated in
`setup.py`, which means now we will use Torch 2.8, CUDA 12.8, cuDNN 9.10
and flashinfer 0.3.1.
- Use flash-attn 2.8.1 since that's the oldest version that has Torch
2.8 precompiled support; it's also the latest flash-attn version
supported by TransformerEngine (which supports 2.6 through 2.8).
- Fix the documentation, since apex should not be needed in the FSDP backend.
- Add a note mentioning that pyext lacks maintenance and cannot work
with Python 3.12; users can install it from a separate fork instead.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

I installed the environments in this way and it works fine.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@asirgogogo

When I use mbridge and your modified code to train Qwen3-VL-30B-A3B-Instruct, I get the following error:

[error screenshot attached]

@HollowMan6
Collaborator Author

@asirgogogo Are you sure you are using the latest main branch version of mbridge? Please make sure you have included ISEEKYAN/mbridge@74093e3 in your local installed mbridge. This PR should have already fixed your problem: ISEEKYAN/mbridge#37

wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
…project#3901)

NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 3, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Currently, we get an error about an unexpected keyword argument
`visual_pos_masks`. This is because mbridge also customized the
`GPTModel` forward for Qwen3VL to support deepstack:

https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our
patch focuses on `_postprocess`, I also cleaned up the function for
better maintainability and to fix this extra deepstack argument issue.
We can't simply patch `_postprocess`, as we also need to pass the
`temperature` argument:

```logs
output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
```
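The essence of the fix can be sketched as a kwargs-forwarding wrapper. This is a minimal, self-contained illustration with hypothetical names (`ToyGPTModel`, `fused_forward`), not the actual verl patch:

```python
# Minimal sketch (hypothetical names, not the actual verl patch): a fused
# forward wrapper must consume verl-specific arguments such as `temperature`
# while forwarding any model-specific extras (e.g. `visual_pos_masks`,
# added by mbridge's Qwen3VL deepstack customization) untouched.

class ToyGPTModel:
    """Stand-in for the customized language model forward."""

    def forward(self, input_ids, visual_pos_masks=None, **kwargs):
        # A customized forward like Qwen3VL's expects visual_pos_masks here.
        logits = [float(t) for t in input_ids]
        return {"logits": logits, "visual_pos_masks": visual_pos_masks}


def fused_forward(model, input_ids, temperature=1.0, **model_kwargs):
    # Accept **model_kwargs so new keywords (visual_pos_masks, deepstack
    # inputs, ...) do not raise TypeError, and apply temperature afterwards,
    # mirroring how the patch moves this logic into the postprocess step.
    out = model.forward(input_ids, **model_kwargs)
    out["logits"] = [x / temperature for x in out["logits"]]
    return out


out = fused_forward(ToyGPTModel(), [2, 4], temperature=2.0,
                    visual_pos_masks=[True, False])
# out["logits"] == [1.0, 2.0]; visual_pos_masks passed through intact
```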

In addition, there is a shape mismatch error when calculating `mrope`
if we pass `position_ids` in `fused_forward_qwen2_5_vl`. I tried to
debug it, but the shape passed there doesn't make sense. Since
https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117
notes that the model will calculate `position_ids` itself, I followed
the code there and stopped passing the position IDs; this works for
both Qwen2.5VL and Qwen3VL without throwing further errors.
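The position-ID behavior described above can be illustrated with a toy sketch (hypothetical code, assuming the model derives position IDs internally when none are given — the real mrope computation is more involved):

```python
# Toy illustration (hypothetical): when position_ids is None, the model
# derives them from the input itself, sidestepping the mrope shape
# mismatch that occurs when an externally shaped tensor is passed in.

def toy_model_forward(input_ids, position_ids=None):
    if position_ids is None:
        position_ids = list(range(len(input_ids)))  # model computes its own
    if len(position_ids) != len(input_ids):
        raise ValueError("mrope shape mismatch")
    return position_ids

positions = toy_model_forward([7, 8, 9])  # omit position_ids -> [0, 1, 2]
```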

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 3, 2025
…project#3901)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to mcore >= 0.13.0, since the fused forward patch now requires
it:
verl-project#3849 (comment)
- Upgrade vllm to the latest version and sglang to the version indicated
in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10,
and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since it is the oldest version with precompiled
Torch 2.8 support and the latest flash-attn version supported by
TransformerEngine (which supports flash-attn 2.6 through 2.8).
- Fix the documentation, since apex should not be needed for the FSDP
backend.
- Add a note mentioning that pyext lacks maintenance and does not work
with Python 3.12, and that users can install it from a separate fork
instead.
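The flash-attn pin above boils down to a version-range constraint, sketched here for illustration (the bounds are those stated in this PR, not read from any package metadata):

```python
# Illustrative only: encode the constraint that flash-attn must be >= 2.6
# (TransformerEngine's minimum) and <= 2.8 (its current maximum), with
# 2.8.1 chosen as the oldest release carrying precompiled Torch 2.8 wheels.

def parse(version):
    return tuple(int(part) for part in version.split("."))

def te_supports_flash_attn(version):
    major_minor = parse(version)[:2]
    return (2, 6) <= major_minor <= (2, 8)

assert te_supports_flash_attn("2.8.1")      # version pinned in this PR
assert not te_supports_flash_attn("2.5.9")  # below TE's minimum
```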

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

I installed the environment this way, and it works fine.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
leisuzz pushed a commit to leisuzz/verl that referenced this pull request Nov 18, 2025
leisuzz pushed a commit to leisuzz/verl that referenced this pull request Nov 18, 2025
chenhaiq pushed a commit to The-Hierophant/verl-1 that referenced this pull request Nov 18, 2025
chenhaiq pushed a commit to The-Hierophant/verl-1 that referenced this pull request Nov 18, 2025
…project#3901)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to mcore >= 0.13.0, since the fused forward patch now
requires it:
verl-project#3849 (comment)
- Upgrade vllm to the latest release and sglang to the version indicated
in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10
and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since that is the oldest version with Torch 2.8
precompiled support; it is also the latest flash-attn version supported
by TransformerEngine (which supports 2.6 through 2.8).
- Fix the documentation, since apex should not be needed in the FSDP
backend.
- Add a note mentioning that pyext lacks maintenance and does not work
with Python 3.12; users can install it from a separate fork when needed.
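
The version constraints above can be sanity-checked at runtime. A minimal, illustrative sketch (the `installed` dict is a hypothetical stand-in; a real check would query `importlib.metadata.version()` for each package):

```python
def parse(v):
    # Turn "0.13.0" into (0, 13, 0) for tuple comparison.
    return tuple(int(x) for x in v.split("."))

installed = {  # hypothetical environment, for illustration only
    "megatron-core": "0.13.0",
    "flash-attn": "2.8.1",
    "torch": "2.8.0",
}

constraints = {
    "megatron-core": ("0.13.0", None),  # >= 0.13.0, as required by the patch
    "flash-attn": ("2.8.1", "2.8.1"),   # pinned
    "torch": ("2.8.0", None),
}

for pkg, (lo, hi) in constraints.items():
    v = parse(installed[pkg])
    assert v >= parse(lo), f"{pkg} too old"
    assert hi is None or v <= parse(hi), f"{pkg} too new"
print("environment constraints satisfied")
```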

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

I installed the environments in this way and it works fine.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 26, 2025
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Currently, we get an error about the unexpected keyword argument
`visual_pos_masks`. This is because mbridge also customizes the
`GPTModel` forward for Qwen3VL to support deepstack:

https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our
patch focuses on `_postprocess`, I also cleaned up the function for
better maintainability while fixing this extra deepstack argument issue.
We can't simply patch `_postprocess`, as we also need to pass the
`temperature` argument through:

```logs
output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
```
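
The failure mode can be illustrated in isolation (a toy sketch, not the actual verl patch; all names here are hypothetical): a patched forward with a fixed signature raises `TypeError` when a subclass passes a new keyword like `visual_pos_masks`, while a patch that accepts and forwards `**kwargs` stays compatible.

```python
def original_forward(input_ids, position_ids=None, **kwargs):
    # Stand-in for the customized GPTModel.forward; tolerates extra
    # keywords such as visual_pos_masks.
    return {"input_ids": input_ids, "extra": sorted(kwargs)}

def fused_forward_strict(input_ids, position_ids=None, temperature=1.0):
    # A patch with a fixed signature: calling it with visual_pos_masks
    # raises TypeError before this body even runs.
    return original_forward(input_ids, position_ids)

def fused_forward_flexible(input_ids, position_ids=None, temperature=1.0, **kwargs):
    # Accept and forward unknown kwargs so model-specific extras survive.
    return original_forward(input_ids, position_ids, **kwargs)

try:
    fused_forward_strict([1, 2, 3], visual_pos_masks=[True])
except TypeError as e:
    print("strict patch fails:", e)

out = fused_forward_flexible([1, 2, 3], visual_pos_masks=[True])
print("flexible patch ok:", out["extra"])
```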

In addition, there is a shape mismatch error when calculating `mrope` if
we pass `position_ids` in `fused_forward_qwen2_5_vl`. I tried to debug
it, but the shape passed there doesn't make sense. Since
https://github.com/volcengine/verl/blob/981d781db932ff53a0c584fd501dcd73ce2a8077/verl/models/mcore/model_forward.py#L117
notes that the model will calculate `position_ids` itself, I followed
the code there and stopped passing the position IDs, which works for
both Qwen2.5VL and Qwen3VL without throwing further errors.
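
The convention can be sketched with a toy forward (hypothetical names; the real model derives mrope position IDs from the vision/text layout internally): when `position_ids` is `None`, the model computes its own, while an externally built tensor of the wrong shape would fail.

```python
def toy_vlm_forward(input_ids, position_ids=None):
    # Toy stand-in for the fused VLM forward path.
    seq_len = len(input_ids)
    if position_ids is None:
        # Model computes its own position ids when none are passed.
        position_ids = list(range(seq_len))
    assert len(position_ids) == seq_len, "shape mismatch"
    return position_ids

print(toy_vlm_forward([10, 11, 12]))  # model-computed: [0, 1, 2]
```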

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s)
if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the
specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [X] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 26, 2025
…project#3901)

paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
albertimff pushed a commit to albertimff/verl that referenced this pull request Dec 1, 2025
albertimff pushed a commit to albertimff/verl that referenced this pull request Dec 1, 2025
…project#3901)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…project#3901)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

Currently, we get an error about the unexpected keyword argument
`visual_pos_masks`. This is because mbridge also customizes the
`GPTModel` forward for Qwen3VL to support deepstack:

https://github.com/ISEEKYAN/mbridge/blob/ecbdfbdfdc8027004702149d6dc87fbad7417708/mbridge/models/qwen3_vl/gpt_model.py#L84

Since mcore v0.13.0 introduced `_postprocess` and `_preprocess`, and our
patch focuses on `_postprocess`, I also cleaned up the function for
better maintainability and to fix this extra deepstack argument issue.
We can't simply patch `_postprocess`, as we also need to pass the
`temperature` argument:

```logs
output = self.forward_backward_batch(
/verl_megatron/verl/workers/actor/megatron_actor.py", line 598, in forward_backward_batch
    losses_reduced = forward_backward_func(
miniconda/envs/qwenvl/lib/python3.10/site-packages/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
.......
verl_megatron/verl/models/mcore/model_forward_fused.py", line 136, in fused_forward_qwen2_5_vl
    output_orig: CausalLMOutputForPPO = model(
......
mbridge-main/mbridge/models/qwen3_vl/model.py", line 323, in forward
    output = self.language_model(
envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
/envs/qwenvl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: _fused_GPTModel_forward() got an unexpected keyword argument 'visual_pos_masks'
```

In addition, there is a shape mismatch error when calculating `mrope` if
we pass `position_ids` in `fused_forward_qwen2_5_vl`. I tried to debug
it, but the shape passed there doesn't make sense. According to
https://github.com/volcengine/verl/blob/7cb00e5933e686411c829205e8dfc8e7adc99585/verl/models/mcore/model_forward.py#L117
the model calculates `position_ids` itself, so I followed that code and
stopped passing the position ids; this works for both Qwen2.5VL and
Qwen3VL without throwing further errors.
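The two ideas above can be sketched as a toy stand-in. This is illustrative only: the names (`base_forward`, `fused_forward`, the `visual_pos_masks` payload) are hypothetical and do not reproduce verl's or mbridge's actual signatures; the point is that the wrapper forwards model-specific kwargs untouched and deliberately omits `position_ids`.

```python
def base_forward(input_ids, position_ids=None, temperature=1.0, **model_kwargs):
    # Stand-in for a bridge-customized GPTModel.forward: it may accept
    # model-specific extras such as `visual_pos_masks` for deepstack.
    return {
        "position_ids": position_ids,
        "temperature": temperature,
        "extras": sorted(model_kwargs),
    }


def fused_forward(model_forward, input_ids, position_ids, temperature, **kwargs):
    # 1. Do NOT forward `position_ids`: the model computes its mrope
    #    positions itself, so passing ours causes a shape mismatch.
    # 2. Pass `temperature` through, and forward any remaining kwargs
    #    (e.g. `visual_pos_masks`) untouched, so a customized forward
    #    still receives its deepstack arguments.
    return model_forward(input_ids, temperature=temperature, **kwargs)


out = fused_forward(
    base_forward,
    [1, 2, 3],
    position_ids=[[0, 1, 2]],
    temperature=0.7,
    visual_pos_masks=[True, False, True],
)
```

Here `out["position_ids"]` is `None` even though the caller supplied positions, while `visual_pos_masks` reaches the underlying forward, which is the behavior the patch needs.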


Signed-off-by: Hollow Man <hollowman@opensuse.org>
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…project#3901)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or
accomplish. Reference related GitHub issues and PRs that help with the
review.

- Fix typos in the verl image README: `nvidia-cudnn-cu12`
- Upgrade to mcore >= 0.13.0, since the fused forward patch now requires
it: verl-project#3849 (comment)
- Upgrade vllm to the latest version and sglang to the version indicated
in `setup.py`, which means we now use Torch 2.8, CUDA 12.8, cuDNN 9.10,
and flashinfer 0.3.1.
- Use flash-attn 2.8.1, since that is the oldest version with Torch 2.8
precompiled support and the latest flash-attn version supported by
TransformerEngine (which supports 2.6 through 2.8).
- Fix the documentation, since apex should not be needed in the FSDP
backend.
- Add a note mentioning that pyext lacks maintenance and cannot work
with Python 3.12; users can install it from a separate fork if needed.
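For reference, the pins above roughly correspond to an install sequence like the following. This is a sketch only: the authoritative versions come from the repo's `setup.py` and Docker files, and the exact package names (e.g. `flashinfer-python`) and CUDA-specific wheel indexes may differ in your environment.

```shell
# Approximate pins matching the list above (illustrative, not authoritative).
pip install "megatron-core>=0.13.0"
pip install "torch==2.8.*"           # CUDA 12.8 / cuDNN 9.10 build
pip install "flash-attn==2.8.1"      # oldest release with Torch 2.8 prebuilt wheels
pip install "flashinfer-python==0.3.1"
```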

Signed-off-by: Hollow Man <hollowman@opensuse.org>