Chcui/dsv4 train pr3562 pr4518 by weijiac0619 · Pull Request #3893 · NVIDIA-NeMo/Megatron-Bridge

weijiac0619 · 2026-05-19T22:11:14Z

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

Signed-off-by: weijiac <weijiac@NVIDIA.com>

- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors
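To illustrate the "naming + length trim" and "explicit errors" items above, here is a minimal sketch of how a config reader could accept either field name and fail loudly when neither is present. The helper name `read_compress_ratios`, the choice to trim to the layer count, and the error messages are assumptions made for this example; they are not taken from the bridge code.

```python
from types import SimpleNamespace


def read_compress_ratios(hf_config, num_layers):
    """Return per-layer compression ratios from an HF-style config.

    Hypothetical helper: accepts either `compress_ratios` or
    `compress_rates`, raises instead of silently falling back, and trims
    the list to `num_layers` entries (the trim target is an assumption).
    """
    for name in ("compress_ratios", "compress_rates"):
        value = getattr(hf_config, name, None)
        if value is not None:
            break
    else:
        # Neither spelling is present: fail loudly rather than guess a default.
        raise ValueError(
            "config defines neither 'compress_ratios' nor 'compress_rates'"
        )

    if len(value) < num_layers:
        raise ValueError(
            f"'{name}' has {len(value)} entries, expected at least {num_layers}"
        )
    # Drop any trailing entries beyond the requested number of layers.
    return list(value[:num_layers])


# Usage: a mocked config that uses the alternate spelling and one extra entry.
cfg = SimpleNamespace(compress_rates=[1, 4, 4, 8])
print(read_compress_ratios(cfg, num_layers=3))
```

Raising on a missing field keeps a misnamed or absent config key from turning into a silently wrong model configuration.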

- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026

float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code applied an extra 2^(x-127), producing near-zero scales that zeroed all expert weights.

Also fix tiny model test init and seq_len.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
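The double-exponentiation described in this commit can be reproduced with plain arithmetic. The sketch below assumes the usual E8M0 interpretation (an 8-bit biased exponent with bias 127 and no mantissa bits). The function names are hypothetical and stand in for the bridge's dequantization path, which is not shown here.

```python
def decode_e8m0(exponent_byte: int) -> float:
    """Value represented by an E8M0 scale byte: 2^(e - 127).

    This mirrors what the commit says the cast to float32 already returns.
    """
    if not 0 <= exponent_byte <= 254:
        # 255 is reserved for NaN in E8M0; reject it in this sketch.
        raise ValueError("exponent byte must be in [0, 254]")
    return 2.0 ** (exponent_byte - 127)


def buggy_scale(exponent_byte: int) -> float:
    """Old behaviour: treat the already-decoded value as a raw exponent."""
    already_decoded = decode_e8m0(exponent_byte)
    return 2.0 ** (already_decoded - 127)


def fixed_scale(exponent_byte: int) -> float:
    """Fixed behaviour: use the decoded value directly."""
    return decode_e8m0(exponent_byte)


# A scale byte of 127 encodes 1.0. Applying the exponent a second time
# turns it into 2^-126, which is small enough to wipe out the weights.
for e in (120, 127, 130):
    print(e, fixed_scale(e), buggy_scale(e))
```

For any scale of ordinary magnitude, the buggy path lands near 2^-127, which matches the "near-zero scales" symptom described above.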

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: weijiac <weijiac@nvidia.com>

5 MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after MCore PR #4518 changed MTP from concatenated eh_proj to separate e_proj/h_proj. This guard prevents crash during fresh import. Revert when MTP mappings are fully implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Map the three previously-unmapped per-MTP-layer Hyper-Connection head parameters (hc_head_fn / hc_head_base / hc_head_scale) using ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.

Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with enable_hyper_connections=True instantiates these as separate ColumnParallelLinear modules; AutoMapping auto-detects column parallelism and shards along dim 0. The dead _MTPEHProjMapping class is removed.

Also clarify the bridge module docstring: parity is verified only for DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture and quant dispatch but logit parity is unmeasured. Document the two quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).

Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly with all five previously-unmapped MTP params now resolving.

Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>
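As a rough illustration of the difference between the two mapping styles named in this commit, the sketch below contrasts dim-0 sharding (what a column-parallel weight needs, since a linear weight is stored as [out_features, in_features]) with plain replication. It uses nested Python lists instead of tensors, and `shard_dim0` and `replicate` are hypothetical helpers, not the AutoMapping or ReplicatedMapping implementations.

```python
def shard_dim0(weight_rows, tp_size, tp_rank):
    """Return the slice of rows (dim 0) owned by one tensor-parallel rank."""
    num_rows = len(weight_rows)
    if num_rows % tp_size != 0:
        raise ValueError("dim 0 must be divisible by the tensor-parallel size")
    rows_per_rank = num_rows // tp_size
    start = tp_rank * rows_per_rank
    return weight_rows[start:start + rows_per_rank]


def replicate(param, tp_size):
    """Every rank receives the same full parameter."""
    return [param for _ in range(tp_size)]


# A toy 4x2 projection weight: four output rows, two input columns.
e_proj_weight = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(shard_dim0(e_proj_weight, tp_size=2, tp_rank=0))  # first two rows
print(shard_dim0(e_proj_weight, tp_size=2, tp_rank=1))  # last two rows

# A small per-layer parameter such as hc_head_scale is copied unchanged.
print(replicate([0.5], tp_size=2))
```

With a tensor-parallel size of 1, both helpers return the full parameter, so the distinction only matters once the weight is actually split across ranks.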

The two `if task is None or task.megatron_module is None:` guards in build_conversion_tasks consumers were a temporary stopgap absorbing the five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj). With those mappings now in place (preceding commit), no None tasks reach the load/export loops, so revert to the standard `if task.megatron_module is None:` check.

Missing mappings now fail loudly again with AttributeError, restoring the safety property the guard had bypassed.

Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with no None-task interceptions.

Signed-off-by: chcui <chcui@nvidia.com>
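The behavioural difference between the two checks can be shown with a toy loop. This is only a sketch: the `param_name` attribute and both function names are invented for the example, and the real load/export loops in the bridge do more than collect names.

```python
from types import SimpleNamespace


def load_with_stopgap(tasks):
    """Stopgap form: a None task (unmapped parameter) is skipped silently."""
    loaded = []
    for task in tasks:
        if task is None or task.megatron_module is None:
            continue
        loaded.append(task.param_name)
    return loaded


def load_strict(tasks):
    """Standard form: a None task raises AttributeError on attribute access."""
    loaded = []
    for task in tasks:
        if task.megatron_module is None:
            continue
        loaded.append(task.param_name)
    return loaded


mapped = SimpleNamespace(param_name="decoder.hc_head_fn", megatron_module=object())
tasks = [mapped, None]  # the None stands in for a parameter with no mapping

print(load_with_stopgap(tasks))  # the missing mapping goes unnoticed
try:
    load_strict(tasks)
except AttributeError as err:
    print("strict loop failed loudly:", err)
```

The strict form is what turns a forgotten mapping into an immediate failure instead of a silently incomplete checkpoint.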

Add unit tests that assert the DSv4 bridge's mapping_registry contains:
- decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
- mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
- mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
- no reference to the deprecated concatenated eh_proj path anywhere

Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings are absent (regression guard for environments without MTP).

mapping_registry only reads num_nextn_predict_layers from hf_config, so mocking with SimpleNamespace is sufficient; no fixtures or GPU needed.

Signed-off-by: chcui <chcui@nvidia.com>
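The shape of such a test can be sketched without the bridge installed. In the example below, `toy_mapping_names` is a hypothetical stand-in for the bridge's mapping_registry: it reads only num_nextn_predict_layers and returns parameter-name strings, which is an assumption made to keep the sketch self-contained. The mapping types are not modeled.

```python
from types import SimpleNamespace

HC_HEAD_PARAMS = ("hc_head_fn", "hc_head_base", "hc_head_scale")


def toy_mapping_names(hf_config):
    """Hypothetical stand-in for mapping_registry: returns mapped param names."""
    names = [f"decoder.{param}" for param in HC_HEAD_PARAMS]
    for layer in range(hf_config.num_nextn_predict_layers):
        prefix = f"mtp.layers.{layer}"
        names += [f"{prefix}.{param}" for param in HC_HEAD_PARAMS]
        names += [f"{prefix}.e_proj.weight", f"{prefix}.h_proj.weight"]
    return names


def test_mtp_mappings_present():
    # SimpleNamespace is enough because only one config field is read.
    names = toy_mapping_names(SimpleNamespace(num_nextn_predict_layers=1))
    assert "mtp.layers.0.hc_head_fn" in names
    assert "mtp.layers.0.e_proj.weight" in names
    assert "mtp.layers.0.h_proj.weight" in names
    assert not any("eh_proj" in name for name in names)


def test_mtp_mappings_absent_without_mtp():
    names = toy_mapping_names(SimpleNamespace(num_nextn_predict_layers=0))
    assert not any(name.startswith("mtp.") for name in names)
    assert "decoder.hc_head_scale" in names


test_mtp_mappings_present()
test_mtp_mappings_absent_without_mtp()
```

Because the mock carries a single attribute, tests of this kind run on CPU in milliseconds and need no model weights.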

- docs/models/llm/deepseek-v4.md: variants table with parity status, architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP), conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model HF<->Megatron round-trip and generation, parameterized via WORKSPACE / MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe

Slurm launch scripts intentionally deferred until the recipe has a finalized training-config name and parallelism layout for Pro / Pro-Base.

Signed-off-by: chcui <chcui@nvidia.com>

DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before attention, one before MLP) of shape hidden -> hidden * num_residual_streams, plus negligible alpha scalars and sinkhorn iterations. Without modeling this overhead, training-throughput TFLOPs/GPU readouts under-report work and understate hardware efficiency.

Add a conditional term in transformer_flops() gated on enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 * num_residual_streams * batch * seq_length FLOPs.

The CSA per-layer attention reduction (sparse top-k context instead of dense O(s^2)) is intentionally not modeled here. That would be a smaller correction in the opposite direction, and leaving it out keeps the throughput estimate on the safe (over-estimating) side.

Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta

Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron roundtrip. It is auto-skipped today because (a) transformers does not yet ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.

Signed-off-by: chcui <chcui@nvidia.com>
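The closed-form term can be written out as a small function. This is a sketch, not the code added to transformer_flops(). The function name is hypothetical, and the reading of the constants (3 for forward plus backward, 2 FLOPs per multiply-accumulate, 2 mapping_proj matmuls per layer) follows the common FLOPs-counting convention and is an assumption.

```python
def hc_mapping_flops(num_layers, hidden_size, num_residual_streams,
                     batch_size, seq_length):
    """Extra training FLOPs from the hyper-connection mapping_proj matmuls."""
    fwd_bwd_factor = 3   # assumed: forward pass plus two backward matmuls
    flops_per_mac = 2    # assumed: one multiply and one add
    projs_per_layer = 2  # one before attention, one before MLP
    return (
        fwd_bwd_factor
        * flops_per_mac
        * num_layers
        * projs_per_layer
        * hidden_size ** 2
        * num_residual_streams
        * batch_size
        * seq_length
    )


# Doubling the number of residual streams doubles the overhead term,
# which is the property test_hc_scales_with_residual_streams checks.
base = hc_mapping_flops(num_layers=4, hidden_size=8, num_residual_streams=2,
                        batch_size=1, seq_length=16)
doubled = hc_mapping_flops(num_layers=4, hidden_size=8, num_residual_streams=4,
                           batch_size=1, seq_length=16)
print(base, doubled, doubled == 2 * base)
```

The term is linear in every argument except hidden_size, so it stays small relative to the quadratic attention and MLP terms unless the stream count is large.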

- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1 in the recipe (override via slurm to EP>=4 for Flash; the frozen base model still has to fit across ranks even though only adapters train), max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings, bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers modelled on examples/models/gpt_oss/, parameterized via env vars (WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS) and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the finalized recipe names and default layouts

Verified that both recipes instantiate cleanly against the local Flash HF config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with LoRA scheme; MTP is None and PEFT-specific assertions hold.

Signed-off-by: chcui <chcui@nvidia.com>

…irect function reference Signed-off-by: weijiac <weijiac@nvidia.com>

Add HSG launchers for DSv4 toy/full training, native Transformers DSv4 config handling, and runtime dependency setup for the PR3562/MCore PR4518 training branch.

copy-pr-bot · 2026-05-19T22:11:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

weijiac0619 and others added 30 commits April 28, 2026 16:31

dsv4 import

803adff

Signed-off-by: weijiac <weijiac@NVIDIA.com>

Merge branch 'main' into weijia_dsv4

1abfaa6

test forward

7b48341

Signed-off-by: weijiac <weijiac@NVIDIA.com>

Guard GlmMoeDsa import for containers with transformers <5.0

3b86c12

Tiny model test: restore all layer types, fix zero embedding init

faedf46

Enable fused RoPE for DSv4 (non-fused path broken for inverse RoPE)

d84f58f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clean

92adb56

fix mscale

3fe2933

Signed-off-by: weijiac <weijiac@nvidia.com>

wip

34a7bd3

Signed-off-by: weijiac <weijiac@nvidia.com>

Update handoff doc: all changes committed and pushed

65a187d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove superseded/broken test scripts

43a2b5f

- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

[dsv4] defer training recipes and launchers

7d7eec2

[docs] simplify DeepSeek V4 examples and parity status

1395c71

docs: move DeepSeek V4 usage details to examples readme

d3eacd2

cleanup DeepSeek V4 PR artifacts

75a1603

Replaces a lambda (can't be serialized into run_config.yaml) with a d…

f1142df

…irect function reference Signed-off-by: weijiac <weijiac@nvidia.com>

Add DeepSeek V4 training smoke harness

c4973bc

Add HSG launchers for DSv4 toy/full training, native Transformers DSv4 config handling, and runtime dependency setup for the PR3562/MCore PR4518 training branch.

Add DSv4 training benchmark handoff

88889d0

Remove vendored DSv4 training dependencies

d6263cb

cuichenx added 2 commits May 14, 2026 09:50

Remove DSv4 CPU-init export experiment

125e1af

Fix distributed HF export for large sharded saves

eb78232

weijiac0619 wants to merge 32 commits into main from chcui/dsv4-train-pr3562-pr4518
