
Chcui/dsv4 train pr3562 pr4518 #3893

Draft
weijiac0619 wants to merge 32 commits into main from chcui/dsv4-train-pr3562-pr4518

Conversation

@weijiac0619
Contributor

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

weijiac0619 and others added 30 commits April 28, 2026 16:31
Signed-off-by: weijiac <weijiac@NVIDIA.com>
Signed-off-by: weijiac <weijiac@NVIDIA.com>
- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors
- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026
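
A minimal sketch of the `compress_ratios`/`compress_rates` handling and the explicit-error behavior listed above, assuming the HF config is available as a plain dict. The helper name `read_compress_ratios` and the trim-to-`num_layers` rule are illustrative assumptions, not the bridge's actual code.

```python
def read_compress_ratios(cfg: dict, num_layers: int) -> list:
    """Return per-layer compression ratios from either accepted key name."""
    # Accept both spellings mentioned in the commit message.
    for key in ("compress_ratios", "compress_rates"):
        if key in cfg:
            values = list(cfg[key])
            break
    else:
        # Explicit error instead of a silent fallback.
        raise KeyError("config defines neither 'compress_ratios' nor 'compress_rates'")
    if len(values) < num_layers:
        raise ValueError(f"expected at least {num_layers} ratios, got {len(values)}")
    # Assumption: extra trailing entries are trimmed to the layer count.
    return values[:num_layers]
```
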
float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code
applied an extra 2^(x-127), producing near-zero scales that zeroed
all expert weights. Also fix tiny model test init and seq_len.
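
The scale bug described above can be illustrated in plain Python, without torch. This is a sketch under the assumption that an E8M0 scale byte `e` decodes to `2^(e-127)` and that the old code fed the already-decoded value back through the same formula; the function names are hypothetical.

```python
def e8m0_decode(byte: int) -> float:
    """Decode one E8M0 scale byte: 8 exponent bits, bias 127, no mantissa."""
    if not 0 <= byte <= 254:
        raise ValueError("valid exponent bytes are 0..254; 255 encodes NaN")
    return 2.0 ** (byte - 127)


def double_applied_scale(byte: int) -> float:
    """Model of the bug: apply 2^(x-127) to a value that is already decoded."""
    decoded = e8m0_decode(byte)      # correct scale, e.g. 0.125 for byte 124
    return 2.0 ** (decoded - 127)    # extra exponentiation collapses it toward zero


correct = e8m0_decode(124)           # 2^-3
broken = double_applied_scale(124)   # roughly 2^-127, effectively zero
```

Multiplying expert weights by a value this small wipes them out, which matches the symptom the commit message reports.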

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: weijiac <weijiac@nvidia.com>
Signed-off-by: weijiac <weijiac@nvidia.com>
5 MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after
MCore PR #4518 changed MTP from concatenated eh_proj to separate e_proj/h_proj.
This guard prevents crash during fresh import. Revert when MTP mappings are
fully implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map the three previously-unmapped per-MTP-layer Hyper-Connection head
parameters (hc_head_fn / hc_head_base / hc_head_scale) using
ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.

Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain
AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with
enable_hyper_connections=True instantiates these as separate
ColumnParallelLinear modules; AutoMapping auto-detects column parallelism
and shards along dim 0. The dead _MTPEHProjMapping class is removed.
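
For readers unfamiliar with column-parallel sharding, the dim-0 split mentioned above can be sketched with nested lists standing in for a weight tensor. The helper `shard_dim0` is hypothetical and only illustrates the slicing; it is not AutoMapping's implementation.

```python
def shard_dim0(weight: list, tp_size: int, rank: int) -> list:
    """Return the contiguous block of rows (dim 0) owned by one tensor-parallel rank."""
    rows = len(weight)
    if rows % tp_size != 0:
        raise ValueError(f"{rows} rows do not divide evenly across {tp_size} ranks")
    per_rank = rows // tp_size
    return weight[rank * per_rank:(rank + 1) * per_rank]


full = [[0, 1], [2, 3], [4, 5], [6, 7]]   # 4 output rows, 2 input columns
rank1 = shard_dim0(full, tp_size=2, rank=1)
```

With `tp_size=1` the shard is the whole weight, so the split is a no-op in that layout.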

Also clarify the bridge module docstring: parity is verified only for
DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture
and quant dispatch but logit parity is unmeasured. Document the two
quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with
F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).

Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly
with all five previously-unmapped MTP params now resolving.

Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>
The two `if task is None or task.megatron_module is None:` guards in
build_conversion_tasks consumers were a temporary stopgap absorbing the
five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj).

With those mappings now in place (preceding commit), no None tasks reach
the load/export loops, so revert to the standard
`if task.megatron_module is None:` check. Missing mappings now fail
loudly again with AttributeError, restoring the safety property the
guard had bypassed.
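
The difference between the two guards can be shown with stand-in task objects. This is a sketch only: the function names and the use of `SimpleNamespace` as a task are assumptions for illustration, not code from model_bridge.py.

```python
from types import SimpleNamespace


def lenient_modules(tasks):
    """Stopgap guard: a None task is skipped silently, hiding a missing mapping."""
    kept = []
    for task in tasks:
        if task is None or task.megatron_module is None:
            continue
        kept.append(task.megatron_module)
    return kept


def strict_modules(tasks):
    """Standard guard: a None task raises AttributeError on attribute access."""
    kept = []
    for task in tasks:
        if task.megatron_module is None:
            continue
        kept.append(task.megatron_module)
    return kept


tasks = [SimpleNamespace(megatron_module="linear"), None]
```

Here `lenient_modules(tasks)` returns only the first module, while `strict_modules(tasks)` raises `AttributeError` at the `None` entry.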

Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with
no None-task interceptions.

Signed-off-by: chcui <chcui@nvidia.com>
Add unit tests that assert the DSv4 bridge's mapping_registry contains:
  - decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
  - mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
  - mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
  - no reference to the deprecated concatenated eh_proj path anywhere

Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings
are absent (regression guard for environments without MTP).

mapping_registry only reads num_nextn_predict_layers from hf_config, so
mocking with SimpleNamespace is sufficient — no fixtures or GPU needed.
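
The testing approach described above might look roughly like this. `mtp_mapping_names` is a hypothetical stand-in for the part of mapping_registry that depends on `num_nextn_predict_layers`; the real registry returns mapping objects, not strings.

```python
from types import SimpleNamespace


def mtp_mapping_names(hf_config) -> list:
    """Hypothetical stand-in: list the MTP-side Megatron parameter names."""
    names = []
    for i in range(hf_config.num_nextn_predict_layers):
        prefix = f"mtp.layers.{i}"
        names += [f"{prefix}.hc_head_{s}" for s in ("fn", "base", "scale")]
        names += [f"{prefix}.{p}_proj.weight" for p in ("e", "h")]
    return names


def test_mtp_mappings():
    # A SimpleNamespace is enough because only one config field is read.
    with_mtp = mtp_mapping_names(SimpleNamespace(num_nextn_predict_layers=1))
    assert "mtp.layers.0.hc_head_fn" in with_mtp
    assert "mtp.layers.0.e_proj.weight" in with_mtp
    assert not any("eh_proj" in name for name in with_mtp)
    # Regression guard: no MTP layers means no MTP-side mappings.
    assert mtp_mapping_names(SimpleNamespace(num_nextn_predict_layers=0)) == []
```
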

Signed-off-by: chcui <chcui@nvidia.com>
- docs/models/llm/deepseek-v4.md: variants table with parity status,
  architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP),
  conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model
  HF<->Megatron round-trip and generation, parameterized via WORKSPACE /
  MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config
  with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer
  with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe

Slurm launch scripts intentionally deferred until the recipe has a finalized
training-config name and parallelism layout for Pro / Pro-Base.

Signed-off-by: chcui <chcui@nvidia.com>
DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before
attention, one before MLP) of shape hidden -> hidden * num_residual_streams,
plus negligible alpha scalars and sinkhorn iterations. Without modeling
this overhead, training-throughput TFLOPs/GPU readouts under-report
work and understate hardware efficiency.

Add a conditional term in transformer_flops() gated on
enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 *
num_residual_streams * batch * seq_length FLOPs. The CSA per-layer
attention reduction (sparse top-k context instead of dense O(s^2)) is
intentionally not modeled here — that would be a smaller correction in
the opposite direction, and leaving it out keeps the throughput estimate
on the safe (over-estimating) side.
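
The closed-form term can be written as a small standalone function. The name `hc_flops` is hypothetical, and the reading of the constants (3 for forward plus backward, 2 FLOPs per multiply-accumulate, 2 mapping_proj matmuls per layer) is an assumption based on the description above rather than on the transformer_flops() source.

```python
def hc_flops(num_layers: int, hidden_size: int, num_residual_streams: int,
             batch: int, seq_length: int) -> int:
    """Hyper-Connection overhead: 3 * 2 * L * 2 * H^2 * streams * batch * seq."""
    return (3 * 2 * num_layers * 2 * hidden_size ** 2
            * num_residual_streams * batch * seq_length)


base = hc_flops(num_layers=1, hidden_size=2, num_residual_streams=1,
                batch=1, seq_length=1)       # 3 * 2 * 1 * 2 * 4 = 48
doubled = hc_flops(num_layers=1, hidden_size=2, num_residual_streams=2,
                   batch=1, seq_length=1)    # linear in streams: 96
```
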

Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta

Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron
roundtrip. It is auto-skipped today because (a) transformers does not yet
ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the
DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.

Signed-off-by: chcui <chcui@nvidia.com>
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash
    (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP
    disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1
    in the recipe (override via slurm to EP>=4 for Flash; the frozen base
    model still has to fit across ranks even though only adapters train),
    max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings,
    bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers
  modelled on examples/models/gpt_oss/, parameterized via env vars
  (WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS)
  and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit
  error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the
  finalized recipe names and default layouts

Verified that both recipes instantiate cleanly against the local Flash HF
config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with
LoRA scheme; MTP is None and PEFT-specific assertions hold.
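
A rough sketch of what a shared fine-tune helper of this kind could do, using a stand-in dataclass instead of the real recipe config. The class, field names, and function below are hypothetical, and raising on TP>1 is an assumption modelled on the launchers' behavior; the actual `_apply_dsv4_finetune_common` may simply set the value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelismSketch:
    """Stand-in for the handful of recipe fields discussed above."""
    tensor_model_parallel_size: int = 1
    expert_model_parallel_size: int = 8
    mtp_num_layers: Optional[int] = 1


def apply_finetune_common(cfg: ParallelismSketch) -> ParallelismSketch:
    # DSv4 only supports TP=1, so refuse anything else with an explicit error.
    if cfg.tensor_model_parallel_size != 1:
        raise ValueError("DSv4 only supports TP=1")
    # MTP is disabled on fine-tune codepaths.
    cfg.mtp_num_layers = None
    return cfg


sft = apply_finetune_common(ParallelismSketch())
peft = apply_finetune_common(ParallelismSketch(expert_model_parallel_size=1))
```
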

Signed-off-by: chcui <chcui@nvidia.com>
…irect function reference

Signed-off-by: weijiac <weijiac@nvidia.com>
Add HSG launchers for DSv4 toy/full training, native Transformers DSv4 config handling, and runtime dependency setup for the PR3562/MCore PR4518 training branch.
@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


2 participants