feat(performance): add --mlperf_flavor for MLPerf v6.0 apples-to-appl… by rsalagame-nvidia · Pull Request #3878 · NVIDIA-NeMo/Megatron-Bridge

rsalagame-nvidia · 2026-05-19T02:29:06Z

Adds --mlperf_flavor to scripts/performance/setup_experiment.py for MLPerf v6.0 apples-to-apples Llama3.1 runs on GB200. New utils/mlperf_flavor.py resolves the v6.0 shape per (model_recipe_name, compute_dtype, num_gpus) for Llama3 8B (8/16/32/64/72/128 GPU FP8 + 8 GPU NVFP4) and Llama3.1 405B (256/512 GPU FP8+NVFP4), wires the MLPerf preprocessed C4 dataset, appends container mounts, and triggers gated parity knobs in utils/overrides.py (up to 6 recipe knobs incl. CUDA graphs) + perf_plugins.py (parity env vars matching the v5.1 effective env). Gated to gpu=gb200 since shapes are derived from v6.0 GB200 reference configs; other GPU types not yet validated. Fully opt-in; no behavior change without the flag.


copy-pr-bot · 2026-05-19T02:29:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jepio · 2026-05-19T15:25:29Z

+
+# (model_recipe_name, compute_dtype, num_gpus) -> (TP, PP, VP, CP, MBS, GBS, parity_mode); shapes derived from v5.1 NVIDIA submission configs.
+_MLPERF_V51_SHAPES: Dict[Tuple[str, str, int], Tuple[int, int, int, int, int, int, str]] = {
+    ("llama3_8b",   "fp8_cs",   8):   (1, 1, 1, 1, 1, 8,    "F16_ATTN"),


Suggested change

-    ("llama3_8b",   "fp8_cs",   8):   (1, 1, 1, 1, 1, 8,    "F16_ATTN"),
+    ("llama3_8b",   "fp8_cs",   8):   (1, 1, 1, 1, 2, 16,   "F16_ATTN"),
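One way to read the suggested change is through the batch arithmetic. The sketch below assumes the usual Megatron-style relations, DP = num_gpus / (TP × PP × CP) and micro-batches per step = GBS / (MBS × DP); the helper name `microbatches_per_step` is hypothetical and not part of the PR. Under those assumptions both the original and the suggested 8-GPU shape run a single micro-batch per step, so the suggestion doubles the micro-batch size and the global batch size together.

```python
from typing import Tuple

# (TP, PP, VP, CP, MBS, GBS), in the order given by the table's comment.
Shape = Tuple[int, int, int, int, int, int]


def microbatches_per_step(num_gpus: int, shape: Shape) -> int:
    """Return GBS / (MBS * DP), raising if the shape does not divide evenly.

    Assumption: DP = num_gpus / (TP * PP * CP). VP (virtual pipeline size)
    does not change the number of GPUs a model replica occupies.
    """
    tp, pp, _vp, cp, mbs, gbs = shape
    gpus_per_replica = tp * pp * cp
    if num_gpus % gpus_per_replica != 0:
        raise ValueError(f"{num_gpus} GPUs not divisible by TP*PP*CP={gpus_per_replica}")
    dp = num_gpus // gpus_per_replica
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"GBS={gbs} not divisible by MBS*DP={mbs * dp}")
    return gbs // (mbs * dp)


original = (1, 1, 1, 1, 1, 8)    # shape in the diff
suggested = (1, 1, 1, 1, 2, 16)  # shape in the suggested change

print(microbatches_per_step(8, original))   # 1
print(microbatches_per_step(8, suggested))  # 1
```

A check like this could also run over the whole table at import time to catch entries whose GBS is not a multiple of MBS × DP.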

jepio · 2026-05-19T18:02:11Z

+    ("llama3_8b",   "fp8_cs",   16):  (1, 1, 1, 2, 1, 8,    "F16_ATTN"),
+    ("llama3_8b",   "fp8_cs",   32):  (1, 1, 1, 2, 1, 16,   "F16_ATTN"),
+    ("llama3_8b",   "fp8_cs",   64):  (1, 1, 1, 2, 1, 32,   "F16_ATTN"),
+    ("llama3_8b",   "fp8_cs",   72):  (1, 1, 1, 2, 1, 36,   "F16_ATTN"),
+    ("llama3_8b",   "fp8_cs",   128): (2, 1, 1, 4, 1, 16,   "F16_ATTN"),
+    ("llama3_8b",   "nvfp4",    8):   (1, 1, 1, 1, 2, 16,   "FP4_ATTN"),
+    ("llama31_405b","fp8_cs",   256): (4, 8, 8, 2, 1, 576,  "405B"),
+    ("llama31_405b","fp8_cs",   512): (4, 8, 8, 2, 1, 1152, "405B"),
+    ("llama31_405b","nvfp4",    256): (4, 8, 8, 2, 1, 576,  "405B"),
+    ("llama31_405b","nvfp4",    512): (4, 8, 8, 2, 1, 1152, "405B"),


For maintainability, it would help to have a field at the end of the data structure that encodes the MLPerf tuning source: llama31_8b_2x4, llama31_8b_18x4, llama31_8b_512x4 (or _small, _med, _large). Then, in the functions where we apply the settings, we'd have blocks (or function calls) that configure all overrides for that specific config. This would replace parity_mode.
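A minimal sketch of what this could look like, assuming a small dataclass in place of the bare tuple. Everything here is illustrative: `MlperfShape`, `apply_mlperf_overrides`, the `example_knob` key, and the plain `dict` standing in for the recipe are hypothetical, and the pairing of tuning-source labels with table rows (2x4 for 8 GPUs, 18x4 for 72 GPUs) is an assumption, not something the PR states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class MlperfShape:
    tp: int
    pp: int
    vp: int
    cp: int
    mbs: int
    gbs: int
    tuning_source: str  # takes the place of parity_mode


# Two rows only; shape values copied from the diff and the suggested change.
# The tuning_source assigned to each row is an assumption.
SHAPES: Dict[Tuple[str, str, int], MlperfShape] = {
    ("llama3_8b", "fp8_cs", 8): MlperfShape(1, 1, 1, 1, 2, 16, "llama31_8b_2x4"),
    ("llama3_8b", "fp8_cs", 72): MlperfShape(1, 1, 1, 2, 1, 36, "llama31_8b_18x4"),
}


def _apply_llama31_8b_2x4(recipe: dict) -> dict:
    # Hypothetical knob; the real overrides live in utils/overrides.py.
    return {**recipe, "example_knob": "2x4"}


def _apply_llama31_8b_18x4(recipe: dict) -> dict:
    # Hypothetical knob, as above.
    return {**recipe, "example_knob": "18x4"}


# One function per tuning source, so each config's overrides sit in one place.
_OVERRIDES: Dict[str, Callable[[dict], dict]] = {
    "llama31_8b_2x4": _apply_llama31_8b_2x4,
    "llama31_8b_18x4": _apply_llama31_8b_18x4,
}


def apply_mlperf_overrides(recipe: dict, shape: MlperfShape) -> dict:
    """Dispatch on tuning_source instead of branching on parity_mode flags."""
    try:
        apply = _OVERRIDES[shape.tuning_source]
    except KeyError:
        raise ValueError(f"no MLPerf overrides registered for {shape.tuning_source!r}") from None
    return apply(recipe)
```

With this layout, adding a new reference config means adding one table row and one override function, and an unregistered tuning source fails loudly rather than silently applying no knobs.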

jepio · 2026-05-19T18:05:09Z

+    """Apply MLPerf v5.1 apples-to-apples recipe knobs; gated by MLPERF_PARITY_{F16_ATTN,FP4_ATTN,405B} env vars (set by perf_plugins)."""
+    f16_only = bool(os.environ.get("MLPERF_PARITY_F16_ATTN"))
+    fp4_attn = bool(os.environ.get("MLPERF_PARITY_FP4_ATTN"))
+    parity_405b = bool(os.environ.get("MLPERF_PARITY_405B"))
+    if not (f16_only or fp4_attn or parity_405b):
+        return recipe


Do we have to pass custom env variables between stages? Could we have a get_mlperf_flavor_config instead?
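A rough sketch of that alternative, assuming the flavor can be resolved from arguments already available at override time. Only the name `get_mlperf_flavor_config` comes from the comment above; its signature, the `enabled` boolean, `apply_parity_knobs`, and the plain `dict` standing in for the recipe are all hypothetical.

```python
from typing import Dict, Optional, Tuple

ShapeKey = Tuple[str, str, int]  # (model_recipe_name, compute_dtype, num_gpus)
ShapeValue = Tuple[int, int, int, int, int, int, str]  # (TP, PP, VP, CP, MBS, GBS, parity_mode)

# Subset of the table from the diff, for illustration only.
_SHAPES: Dict[ShapeKey, ShapeValue] = {
    ("llama3_8b", "nvfp4", 8): (1, 1, 1, 1, 2, 16, "FP4_ATTN"),
    ("llama31_405b", "fp8_cs", 256): (4, 8, 8, 2, 1, 576, "405B"),
}


def get_mlperf_flavor_config(
    enabled: bool, gpu: str, model_recipe_name: str, compute_dtype: str, num_gpus: int
) -> Optional[ShapeValue]:
    """Return the MLPerf shape for this run, or None when the flavor is off."""
    if not enabled:
        return None  # opt-in: no behavior change without the flag
    if gpu != "gb200":
        raise ValueError(f"mlperf flavor is only validated on gb200, got {gpu!r}")
    key = (model_recipe_name, compute_dtype, num_gpus)
    if key not in _SHAPES:
        raise ValueError(f"no MLPerf shape for {key}")
    return _SHAPES[key]


def apply_parity_knobs(recipe: dict, flavor: Optional[ShapeValue]) -> dict:
    """Read parity_mode from the resolved config instead of MLPERF_PARITY_* env vars."""
    if flavor is None:
        return recipe
    parity_mode = flavor[-1]
    # Placeholder for the real recipe knobs, which are not shown in this excerpt.
    return {**recipe, "parity_mode": parity_mode}
```

Passing the resolved config explicitly keeps the override stage a pure function of its inputs, which is easier to unit test than code that reads process environment variables.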


feat(performance): add --mlperf_flavor for MLPerf v5.1 apples-to-appl…

43ce3e7

…es Llama3.1 runs (GB200 only) Signed-off-by: Rahul Salagame <rsalagame@login-ptyche01.ptyche.clusters.nvidia.com>

rsalagame-nvidia requested review from bdubauski, nv-mollys and sudostock May 19, 2026 02:29

rsalagame-nvidia self-assigned this May 19, 2026

rsalagame-nvidia requested review from a team, erhoo82 and malay-nagda as code owners May 19, 2026 02:29

yaoyu-33 added area:perf Performance optimizations and benchmarking feature New capabilities, enhancements, or enablement work waiting-on-maintainers Waiting on maintainers to respond labels May 19, 2026

jepio reviewed May 19, 2026


feat(performance): retarget --mlperf_flavor to MLPerf v6.0; make C4 d…

cd018fe

…ataset optional (falls back to mock); fix 8 GPU FP8 shape; add 16/32/64 FP4 shape entries Signed-off-by: Rahul Salagame <rsalagame@login-ptyche01.ptyche.clusters.nvidia.com>

rsalagame-nvidia changed the title ~~feat(performance): add --mlperf_flavor for MLPerf v5.1 apples-to-appl…~~ feat(performance): add --mlperf_flavor for MLPerf v6.0 apples-to-appl… May 20, 2026

rsalagame-nvidia wants to merge 2 commits into llmb-r0.4.0 from feat/mlperf-parity-knobs

3 participants
