[data, model, training] fix: Stabilize Qwen3-VL packed SFT#3869
Conversation
bf8360c to
5e1fd46
Compare
5e1fd46 to
b6d035d
Compare
122b751 to
9823be7
Compare
|
/ok to test 9823be7 |
256bc87 to
a74299e
Compare
|
/ok to test a74299e |
|
/claude review |
Light Review — PR 3869Clean, well-motivated bugfix PR. Five distinct issues addressed with matching tests. A few observations: 1. Test gap — explicit moe_padding_mask propagation: The new test_forward_packed_thd_rope_uses_padded_cu_seqlens test covers the packed THD MRoPE path thoroughly, but does not pass an explicit moe_padding_mask. This means only the fallback path (lm_input_ids.eq(0) at model.py:672) is exercised in unit tests. The primary path — where the collate-derived moe_padding_mask is passed through preprocess_packed_seqs in two code branches (model.py lines 536-542 and 598-604) — lacks unit coverage. A companion test that supplies moe_padding_mask and asserts it arrives at language_model.last_kwargs[padding_mask] correctly after THD packing would close this gap. 2. Dist-train pack order fix is well-tested: The test_pack_dist_train_vision_module_output_matches_unpack_contract test clearly validates the deepstack-first, vision-last ordering contract. The existing test_pack_dist_train_vision_module_output assertion update is consistent. 3. Collate attention_mask preservation: The test_qwen2_5_collate_fn_preserves_attention_mask_for_mixed_image_text_batch test covers the mixed image/text merge path. The pad_to refactor to accept a pad_value argument is clean. 4. preproc_output[:6] index coupling (text_model.py:162): This assumes the base class _preprocess returns padding_mask at index 5. The test covers this, but this is a fragile coupling to the upstream Megatron-Core tuple layout — worth a defensive comment if the upstream contract is not stable. Suggested test cases: No perf tests impacted. |
Signed-off-by: jinliangl <jinliangl@nvidia.com>
a74299e to
0b9d864
Compare
|
/ok to test 0b9d864 |
|
/claude review |
Review SummarySolid bugfix PR with good test coverage across all five issues. The fixes are well-scoped and the validation methodology (THD MRoPE probe + zero-LR parity check) is thorough. Observations
Both are minor robustness notes, not blockers. Suggested test cases No perf tests impacted. |
What does this PR do ?
Fixes several Qwen3-VL packed SFT issues that can corrupt MRoPE position ids, route padding tokens through MoE experts, or swap dist-train vision/deepstack tensors. These issues are most visible with packed THD training, mixed image/text-only VLM batches, MoE, and dist-train vision-module splitting.
Bugfix Summary
lm_input_ids.eq(0)when no mask exists.attention_maskalongsideinput_ids. This lets packed SFT derivepadding_maskandmoe_padding_maskfrom real padding metadata instead of token id heuristics.vision_embedslast, matchingset_dist_train_input_tensors, which unpackschunks[:-1]as deepstack andchunks[-1]as final vision embeddings.deepstack_visual_embeds=Noneinsplit_deepstack_embsso pure-text microbatches do not crash.Validation Findings
THD MRoPE probe, micro-batch size 3
I added a tiny local probe that feeds already-packed THD
input_idswith shape[1, 18],micro_batch_size=3,seq_len=6, andcu_seqlens=[0, 6, 12, 18]. The probe stubs the language model and inspects the exactposition_idspassed fromQwen3VLModel.forwardinto the LM.With this PR:
Without this PR, at base
39b79eb78:So the fix makes packed THD MRoPE reset per packed sample using
cu_seqlens; without it, the flat THD row receives one continuous RoPE index across the whole microbatch.Zero-LR MoE BSHD/THD parity check
I also ran a fake-small Qwen3-VL MoE model with padded input,
micro_batch_size=2, 4 experts, top-2 routing, andlr=0, comparing BSHD against THD while holding weights fixed.The tiny gradient delta is float32 roundoff scale from different layout/reduction order, while forward output and loss are exactly aligned for the same padded tokens and MoE padding mask.
Tests
Local checks run:
uv run pre-commit run --all-filesuv --project /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge run python -m pytest /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge/tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py -k "packed_thd" -quv --project /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge run python -m pytest /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge/tests/unit_tests/data/vlm_datasets/test_collate.py -k "qwen2_5_collate_fn" -quv --project /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge run python -m pytest /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge/tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_text_model_forward.py -quv --project /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge run ruff check /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge/tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_model.py /lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_mcore/users/jinliangl/repos/Megatron-Bridge/src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/text_model.pyNote: pytest was run from
/tmpbecause this checkout already hasnemo_experiments/in the repo root and the global test fixture refuses to run when that directory exists.GitHub Actions CI
External branch from wplf fork. An NVIDIA developer may need to trigger CI with
/ok to test 0b9d864fif copy-pr-bot does not start automatically.Before your PR is "Ready for review"
Pre checks: