Chcui/dsv4 train pr3562 pr4518#3893
Draft
weijiac0619 wants to merge 32 commits into
Signed-off-by: weijiac <weijiac@NVIDIA.com>
Signed-off-by: weijiac <weijiac@NVIDIA.com>
- Rename linear_kv_up_proj → linear_kv_proj (MCore PR #4458)
- Restore HC head mappings (hc_head_fn/base/scale) for learned_output_contract (MCore PR #4518)
- Remove task-is-None guards in model_bridge.py (root cause fixed via allow_hf_name_mismatch)
- Add allow_hf_name_mismatch to _HCAlphaSecondaryMapping
- Handle transformers 5.x nested rope_scaling format
- Handle compress_ratios/compress_rates naming + length trim
- Explicit errors for missing config fields instead of silent fallbacks
- AutoConfig.register: re-raise non-"already registered" errors
- Remove old test scripts (dsv4_generate.py, test_dsv4_bridge_smoke.py, test_dsv4_full_import.py)
- Add new validation scripts:
  - dsv4_fresh_import_test.py: import + save ckpt + round-trip (all weights) + cosine sim
  - dsv4_fresh_generate.py: import + greedy generation with answer verification
  - dsv4_tiny_ref_vs_mg.py: tiny model layer-by-layer comparison vs official inference/model.py
- Update copyright year to 2026
float8_e8m0fnu.to(float32) already returns 2^(e-127). The old code applied an extra 2^(x-127), producing near-zero scales that zeroed all expert weights.
Also fix tiny model test init and seq_len.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
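The double-decoding bug described above can be illustrated without any GPU or framework. The following is a minimal sketch in plain Python, assuming that "an extra 2^(x-127)" means the already-decoded scale was fed through the exponent formula a second time; the function names are hypothetical and are not taken from the bridge code.

```python
import math

E8M0_BIAS = 127  # exponent bias of the 8-bit, exponent-only E8M0 format


def e8m0_to_float(byte: int) -> float:
    """Decode one E8M0 byte (0..254) to its value 2**(byte - 127)."""
    if not 0 <= byte <= 254:
        raise ValueError("expected a byte in 0..254 (255 encodes NaN in E8M0)")
    return math.ldexp(1.0, byte - E8M0_BIAS)


def fixed_scale(byte: int) -> float:
    """Correct path: the decoded value is already the scale."""
    return e8m0_to_float(byte)


def buggy_scale(byte: int) -> float:
    """Assumed old path: apply 2**(x - 127) again to the decoded value x.

    Only meaningful for typical small scales; very large decoded values
    would overflow in this toy version.
    """
    decoded = e8m0_to_float(byte)
    return 2.0 ** (decoded - E8M0_BIAS)


if __name__ == "__main__":
    for b in (120, 127):
        print(b, fixed_scale(b), buggy_scale(b))
```

For a stored byte of 127 the correct scale is 1.0, while the double-decoded value is 2^-126, which is small enough to wipe out any weight it multiplies once the result is cast to a low-precision dtype.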
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: weijiac <weijiac@nvidia.com>
5 MTP params (hc_head_fn/base/scale, e_proj, h_proj) are unmapped after MCore PR #4518 changed MTP from concatenated eh_proj to separate e_proj/h_proj. This guard prevents a crash during fresh import. Revert when MTP mappings are fully implemented.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Active test scripts for DSv4 validation:
- dsv4_full_cosine_report.py: per-layer + sub-layer cosine sim report
- dsv4_cosine_analysis.py: capture hidden states from both official and MCore
- dsv4_last_hidden_cmp.py: post-contraction + logit comparison
- dsv4_fresh_import_save.py: fresh HF import with checkpoint save
- dsv4-bridge-handoff.md: handoff doc with results and instructions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- dsv4_fresh_import_test.py: superseded by dsv4_fresh_import_save.py
- dsv4_tiny_ref_vs_mg.py: broken on TE GroupedLinear split, used mocked kernels
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map the three previously-unmapped per-MTP-layer Hyper-Connection head parameters (hc_head_fn / hc_head_base / hc_head_scale) using ReplicatedMapping, mirroring the existing decoder.hc_head_* triplet.
Replace the concatenated _MTPEHProjMapping (legacy eh_proj) with two plain AutoMappings for e_proj and h_proj. After MCore #4518, the MTP layer with enable_hyper_connections=True instantiates these as separate ColumnParallelLinear modules; AutoMapping auto-detects column parallelism and shards along dim 0. The dead _MTPEHProjMapping class is removed.
Also clarify the bridge module docstring: parity is verified only for DeepSeek-V4-Flash; Flash-Base, Pro, and Pro-Base share the architecture and quant dispatch but logit parity is unmeasured. Document the two quant variants explicitly (Flash uses FP8 attn + MXFP4 experts with F8_E8M0 scales; Flash-Base/Pro/Pro-Base use uniform FP8 with F32 scales).
Verified: 4-rank EP=4 import on DeepSeek-V4-Flash-Base completes cleanly with all five previously-unmapped MTP params now resolving.
Signed-off-by: chcui <chcui@nvidia.com>
Signed-off-by: root <root@nvl72160-T13.cm.cluster>
The two `if task is None or task.megatron_module is None:` guards in build_conversion_tasks consumers were a temporary stopgap absorbing the five unmapped DSv4 MTP parameters (hc_head_fn/base/scale, e_proj, h_proj). With those mappings now in place (preceding commit), no None tasks reach the load/export loops, so revert to the standard `if task.megatron_module is None:` check. Missing mappings now fail loudly again with AttributeError, restoring the safety property the guard had bypassed.
Verified: 4-rank EP=4 import on DSv4-Flash-Base completes cleanly with no None-task interceptions.
Signed-off-by: chcui <chcui@nvidia.com>
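The difference between the two guards can be seen in a small standalone sketch. The `Task` dataclass and both loop functions below are hypothetical stand-ins written for illustration; they are not the bridge's actual classes or loops, and only the two `if` conditions are taken from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical stand-in for a conversion task."""
    param_name: str
    megatron_module: Optional[object]


def load_lenient(tasks: List[Optional[Task]]) -> List[str]:
    """Stopgap guard: a None task (unmapped parameter) is silently skipped."""
    loaded = []
    for task in tasks:
        if task is None or task.megatron_module is None:
            continue
        loaded.append(task.param_name)
    return loaded


def load_strict(tasks: List[Optional[Task]]) -> List[str]:
    """Standard guard: a None task raises AttributeError on attribute access."""
    loaded = []
    for task in tasks:
        if task.megatron_module is None:
            continue
        loaded.append(task.param_name)
    return loaded


if __name__ == "__main__":
    tasks = [Task("mapped", object()), None, Task("not_on_this_rank", None)]
    print(load_lenient(tasks))
    try:
        load_strict(tasks)
    except AttributeError as err:
        print("strict loop failed loudly:", err)
```

With the strict form, a parameter that has no mapping surfaces as an immediate error instead of a silently missing weight.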
Add unit tests that assert the DSv4 bridge's mapping_registry contains:
- decoder.hc_head_{fn,base,scale} as ReplicatedMappings (existing)
- mtp.layers.N.hc_head_{fn,base,scale} as ReplicatedMappings (new)
- mtp.layers.N.{e,h}_proj.weight as AutoMappings (new)
- no reference to the deprecated concatenated eh_proj path anywhere
Also asserts that with num_nextn_predict_layers=0 the MTP-side mappings
are absent (regression guard for environments without MTP).
mapping_registry only reads num_nextn_predict_layers from hf_config, so
mocking with SimpleNamespace is sufficient — no fixtures or GPU needed.
Signed-off-by: chcui <chcui@nvidia.com>
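The testing approach described in this commit (mocking hf_config with SimpleNamespace) might look roughly like the sketch below. `mapping_names` is a hypothetical stand-in for the bridge's mapping_registry, written only so the example runs on its own; the real registry returns mapping objects rather than plain strings and may use wildcard patterns instead of concrete layer indices.

```python
from types import SimpleNamespace

HC_HEAD_SUFFIXES = ("fn", "base", "scale")


def mapping_names(hf_config):
    """Hypothetical stand-in: list the Megatron parameter names that get a mapping."""
    names = [f"decoder.hc_head_{s}" for s in HC_HEAD_SUFFIXES]
    # Like the registry described above, only num_nextn_predict_layers is read.
    for i in range(hf_config.num_nextn_predict_layers):
        names += [f"mtp.layers.{i}.hc_head_{s}" for s in HC_HEAD_SUFFIXES]
        names += [f"mtp.layers.{i}.{p}_proj.weight" for p in ("e", "h")]
    return names


def test_mtp_mappings_present():
    names = mapping_names(SimpleNamespace(num_nextn_predict_layers=1))
    assert "mtp.layers.0.hc_head_fn" in names
    assert "mtp.layers.0.e_proj.weight" in names
    assert "mtp.layers.0.h_proj.weight" in names
    # The deprecated concatenated projection must not appear anywhere.
    assert not any("eh_proj" in n for n in names)


def test_mtp_mappings_absent_without_mtp():
    names = mapping_names(SimpleNamespace(num_nextn_predict_layers=0))
    assert not any(n.startswith("mtp.") for n in names)
    assert "decoder.hc_head_scale" in names


if __name__ == "__main__":
    test_mtp_mappings_present()
    test_mtp_mappings_absent_without_mtp()
    print("ok")
```

Because the mock carries a single attribute, tests of this shape run on CPU with no checkpoint or fixture setup.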
- docs/models/llm/deepseek-v4.md: variants table with parity status,
architecture features (HC, CSA/DSA, hybrid attention, hash MoE, MTP),
conversion / inference / parallelism guidance, MCore prerequisite list
- docs/models/llm/index.md: register the new page in the toctree
- examples/models/deepseek_v4/{README,conversion.sh,inference.sh}: real-model
HF<->Megatron round-trip and generation, parameterized via WORKSPACE /
MODEL_VARIANT / EP, defaulting to DeepSeek-V4-Flash-Base on 4xB200
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py: deepseek_v4_pretrain_config
with TP=1 (DSv4 constraint), EP=8 default, MTP=1, precision-aware optimizer
with bf16 moments, alltoall dispatcher; parameterized for Flash / Pro variants
- src/megatron/bridge/recipes/deepseek/__init__.py: re-export the new recipe
Slurm launch scripts intentionally deferred until the recipe has a finalized
training-config name and parallelism layout for Pro / Pro-Base.
Signed-off-by: chcui <chcui@nvidia.com>
DeepSeek-V4 layers add two mapping_proj matmuls per layer (one before attention, one before MLP) of shape hidden -> hidden * num_residual_streams, plus negligible alpha scalars and sinkhorn iterations. Without modeling this overhead, training-throughput TFLOPs/GPU readouts under-report work and understate hardware efficiency.
Add a conditional term in transformer_flops() gated on enable_hyper_connections, contributing 3 * 2 * num_layers * 2 * H^2 * num_residual_streams * batch * seq_length FLOPs.
The CSA per-layer attention reduction (sparse top-k context instead of dense O(s^2)) is intentionally not modeled here; that would be a smaller correction in the opposite direction, and leaving it out keeps the throughput estimate on the safe (over-estimating) side.
Tests:
- test_hc_flops_increase_when_enabled: HC-on > HC-off
- test_hc_exact_overhead: matches the closed-form formula
- test_hc_scales_with_residual_streams: doubling streams doubles delta
Also add a placeholder functional-test scaffold for the toy DSv4 HF<->Megatron roundtrip. It is auto-skipped today because (a) transformers does not yet ship DeepseekV4ForCausalLM, and (b) the bridge's pinned MCore lacks the DSv4 prerequisites. Both skipif conditions auto-clear when prereqs land.
Signed-off-by: chcui <chcui@nvidia.com>
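The closed-form overhead above can be written as a small standalone function. This is a sketch of the formula only, not the actual transformer_flops() code; the function name, argument names, and the reading of each constant factor (3 for forward plus backward, 2 for FLOPs per multiply-accumulate, 2 for the two mapping_proj matmuls per layer) are assumptions based on the usual FLOPs-counting convention.

```python
def hc_flops_overhead(
    num_layers: int,
    hidden_size: int,
    num_residual_streams: int,
    batch_size: int,
    seq_length: int,
) -> int:
    """Extra training FLOPs from hyper-connection mapping_proj matmuls.

    Implements 3 * 2 * num_layers * 2 * H^2 * num_residual_streams
    * batch * seq_length.
    """
    fwd_plus_bwd = 3      # assumed: forward pass + ~2x for backward
    flops_per_mac = 2     # assumed: one multiply and one add
    projs_per_layer = 2   # one before attention, one before MLP
    weight_elems = hidden_size * hidden_size * num_residual_streams
    tokens = batch_size * seq_length
    return (fwd_plus_bwd * flops_per_mac * num_layers
            * projs_per_layer * weight_elems * tokens)


if __name__ == "__main__":
    base = hc_flops_overhead(2, 4, 4, 1, 8)
    doubled = hc_flops_overhead(2, 4, 8, 1, 8)
    print(base, doubled)
```

The term is linear in num_residual_streams, which is the property the test_hc_scales_with_residual_streams test checks.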
- src/megatron/bridge/recipes/deepseek/deepseek_v4.py:
  - deepseek_v4_sft_config: full SFT, defaults to deepseek-ai/DeepSeek-V4-Flash
    (the post-trained variant), TP=1 EP=8 for a 2-node 4xB200 layout, MTP
    disabled for fine-tuning, fp32 master weights, max_lr=5e-6
  - deepseek_v4_peft_config: LoRA/DoRA via default_peft_config, TP=1 EP=1
    in the recipe (override via slurm to EP>=4 for Flash; the frozen base
    model still has to fit across ranks even though only adapters train),
    max_lr=1e-4
  - shared _apply_dsv4_finetune_common helper enforces TP=1, MoE settings,
    bf16, and disables MTP for fine-tune codepaths
- src/megatron/bridge/recipes/deepseek/__init__.py: export the new recipes
- examples/models/deepseek_v4/{slurm_sft.sh, slurm_peft.sh}: Slurm launchers
modelled on examples/models/gpt_oss/, parameterized via env vars
(WORKSPACE / MODEL_VARIANT / PRETRAINED_CHECKPOINT / PARALLELISM_CONFIGS)
and tuned for 4-GPU GB200/B200 nodes. Both refuse TP>1 with an explicit
error since DSv4 only supports TP=1.
- examples/models/deepseek_v4/README.md: updated file table to reflect the
finalized recipe names and default layouts
Verified that both recipes instantiate cleanly against the local Flash HF
config: SFT yields TP=1 EP=8 lr=5e-6, PEFT yields TP=1 EP=1 lr=1e-4 with
LoRA scheme; MTP is None and PEFT-specific assertions hold.
Signed-off-by: chcui <chcui@nvidia.com>
…irect function reference Signed-off-by: weijiac <weijiac@nvidia.com>
Add HSG launchers for DSv4 toy/full training, native Transformers DSv4 config handling, and runtime dependency setup for the PR3562/MCore PR4518 training branch.
What does this PR do?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information