[model] fix: Handle Qwen VL MTP with context parallelism #3895
Signed-off-by: Chen Cui <chcui@nvidia.com>
/ok to test e4dce6f
    expected_position_ids = torch.cat([position_ids[..., 4:8], position_ids[..., 8:12]], dim=-1)
    assert output == "ok"
    assert torch.equal(dummy.postprocess_args["input_ids"], expected_input_ids)
    assert torch.equal(dummy.postprocess_args["position_ids"], expected_position_ids)
Nit: consider adding a small test for the position_ids=None edge case with CP > 1 and MTP enabled. The production code guards this explicitly (line 193: if postprocess_position_ids is not None), but there's no test exercising that branch — a regression there would pass silently.
Something like:
    @pytest.mark.unit
    def test_mtp_postprocess_with_none_position_ids():
        dummy = _DummyModel(mtp_process=True, cp_size=2, cp_rank=0)
        input_ids = torch.arange(16, dtype=torch.long).view(1, 16)
        attention_mask = torch.ones((1, 16), dtype=torch.long)
        output = Qwen3VLGPTModel.forward(
            dummy,
            input_ids=input_ids,
            position_ids=None,
            attention_mask=attention_mask,
        )
        assert output == "ok"
        assert dummy.postprocess_args["position_ids"] is None
Light Review
Clean fix. The CP-localization of input_ids and position_ids before MTP postprocess mirrors the zigzag pattern already used in model.py for combined embeddings (lines 504-505), and the guard conditions (mtp_process and cp_size > 1 and packed_seq_params is None) are consistent. The try/finally for shadow embedding cleanup is a good hardening improvement.
One minor gap: the position_ids is None branch (text_model.py:193) is untested. See the inline comment for a suggested test.
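For readers unfamiliar with the zigzag pattern mentioned in the review, here is a minimal pure-Python sketch of the index arithmetic. It assumes the common scheme in which the sequence is cut into `2 * cp_size` equal chunks and rank `r` keeps chunk `r` plus the mirrored chunk `2 * cp_size - 1 - r`. The helper names `zigzag_chunk_ranges` and `localize` are hypothetical and are not taken from this PR or from Megatron Core; the real code operates on tensors rather than lists.

```python
from typing import List, Optional, Sequence


def zigzag_chunk_ranges(seq_len: int, cp_size: int, cp_rank: int) -> List[range]:
    """Return the two index ranges one CP rank keeps (hypothetical helper)."""
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // num_chunks
    front = cp_rank                   # chunk counted from the start
    back = num_chunks - 1 - cp_rank   # mirrored chunk counted from the end
    return [
        range(front * chunk, (front + 1) * chunk),
        range(back * chunk, (back + 1) * chunk),
    ]


def localize(
    tokens: Optional[Sequence[int]], cp_size: int, cp_rank: int
) -> Optional[List[int]]:
    """Slice a full sequence down to the part owned by one CP rank.

    None is passed through unchanged, mirroring the kind of guard the
    review describes for position_ids (an assumption about intent only).
    """
    if tokens is None:
        return None
    local: List[int] = []
    for index_range in zigzag_chunk_ranges(len(tokens), cp_size, cp_rank):
        local.extend(tokens[i] for i in index_range)
    return local


if __name__ == "__main__":
    full = list(range(16))
    for rank in range(2):
        print(rank, localize(full, cp_size=2, cp_rank=rank))
```

With 16 tokens and `cp_size=2`, rank 1 keeps indices 4 to 7 and 8 to 11 under this scheme, which lines up with the `4:8` and `8:12` slices in the test excerpt if that test runs with `cp_rank=1` (the excerpt does not show its setup).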
Summary
- Localize `input_ids` and `position_ids` before MCore postprocess when context parallelism is active.
- Shadow embedding cleanup now runs in `try/finally`.

Fixes #3881.
Validation
- `11920458`, log `/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/chcui/logs/issue3881_mtp_cp_qwen_vl/repro_pre_fix_20260519_161753.log`: ranks 0-3 failed in `MultiTokenPredictionLayer._concat_embeddings` with `Expected size 128 but got size 64`.
- `11920502`, log `/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_llm/users/chcui/logs/issue3881_mtp_cp_qwen_vl/repro_post_fix_20260519_162314.log`: completed `0:0`; `iteration 1/1`, `lm loss: 1.354103E+01`, `mtp_1 loss: 6.454011E+00`, `grad norm: 7.894`, `number of nan iterations: 0`.
- `git diff --check`
- `python3 -m py_compile src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_text_model_forward.py`
- `uv tool run ruff check tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_text_model_forward.py src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`
- `uv tool run ruff format --check tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_text_model_forward.py src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.py`

Notes
- `uv run pre-commit run --all-files` and `uv run --group test python -m pytest tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_text_model_forward.py -q` could not start on the local host because the locked `nvidia-resiliency-ext==0.6.0` wheel requires `manylinux_2_39`, while this host resolves as `manylinux_2_31_x86_64`.
- `uv` either.
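The platform mismatch in the note can be checked by hand. Under PEP 600, a `manylinux_X_Y` tag means the wheel needs glibc X.Y or newer, so a host at glibc 2.31 cannot install a `manylinux_2_39` wheel. The sketch below uses only the standard library; the function names are hypothetical, and this is not how `uv` itself resolves tags.

```python
import platform
import re
from typing import Optional, Tuple


def required_glibc(tag: str) -> Tuple[int, int]:
    """Parse the minimum glibc version out of a PEP 600 manylinux tag."""
    match = re.fullmatch(r"manylinux_(\d+)_(\d+)(?:_.+)?", tag)
    if match is None:
        raise ValueError(f"not a PEP 600 manylinux tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def host_glibc() -> Optional[Tuple[int, int]]:
    """Return the running interpreter's glibc version, or None if not glibc."""
    lib, version = platform.libc_ver()
    match = re.match(r"(\d+)\.(\d+)", version)
    if lib != "glibc" or match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def host_supports(tag: str, host: Tuple[int, int]) -> bool:
    """True if a host with glibc version `host` can use wheels tagged `tag`."""
    return host >= required_glibc(tag)


if __name__ == "__main__":
    detected = host_glibc()
    print("host glibc:", detected)
    if detected is not None:
        print("manylinux_2_39 usable:", host_supports("manylinux_2_39", detected))
```

Running this on the affected host before invoking `uv run` would show whether the locked wheel can be installed there at all.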