feat(performance): add --mlperf_flavor for MLPerf v6.0 apples-to-appl…#3878

Open: rsalagame-nvidia wants to merge 2 commits into llmb-r0.4.0 from feat/mlperf-parity-knobs

rsalagame-nvidia (Contributor) commented May 19, 2026

Adds --mlperf_flavor to scripts/performance/setup_experiment.py for MLPerf v6.0 apples-to-apples Llama3.1 runs on GB200. The new utils/mlperf_flavor.py resolves the v6.0 shape per (model_recipe_name, compute_dtype, num_gpus) for Llama3 8B (8/16/32/64/72/128 GPU FP8 and 8 GPU NVFP4) and Llama3.1 405B (256/512 GPU, FP8 and NVFP4). It also wires in the MLPerf preprocessed C4 dataset, appends container mounts, and triggers the gated parity knobs in utils/overrides.py (up to 6 recipe knobs, including CUDA graphs) and perf_plugins.py (parity env vars matching the v5.1 effective env). The flag is gated to gpu=gb200 because the shapes are derived from v6.0 GB200 reference configs; other GPU types are not yet validated. It is fully opt-in: without the flag, behavior is unchanged.
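The lookup described above can be pictured roughly as follows. This is a minimal sketch, not the code in utils/mlperf_flavor.py: the names `Shape`, `_SHAPES`, and `resolve_shape` are hypothetical, and only two entries from the shape table in this PR are reproduced.

```python
from typing import Dict, NamedTuple, Tuple


class Shape(NamedTuple):
    tp: int   # tensor parallel size
    pp: int   # pipeline parallel size
    vp: int   # virtual pipeline size
    cp: int   # context parallel size
    mbs: int  # micro-batch size
    gbs: int  # global batch size


# Two illustrative entries copied from the shape table in this PR; not the full set.
_SHAPES: Dict[Tuple[str, str, int], Shape] = {
    ("llama3_8b", "nvfp4", 8): Shape(1, 1, 1, 1, 2, 16),
    ("llama31_405b", "fp8_cs", 256): Shape(4, 8, 8, 2, 1, 576),
}


def resolve_shape(gpu: str, model_recipe_name: str, compute_dtype: str, num_gpus: int) -> Shape:
    """Return the shape for a supported key, or raise a descriptive error."""
    # Assumed gate: the description says only gb200 is validated.
    if gpu.lower() != "gb200":
        raise ValueError(f"--mlperf_flavor is only validated on gb200, got {gpu!r}")
    key = (model_recipe_name, compute_dtype, num_gpus)
    try:
        return _SHAPES[key]
    except KeyError:
        raise ValueError(f"no MLPerf shape registered for {key}") from None
```

Raising on an unknown key, rather than silently falling back to the default recipe shape, keeps an opt-in parity run from quietly measuring a different configuration.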

…es Llama3.1 runs (GB200 only)

Signed-off-by: Rahul Salagame <rsalagame@login-ptyche01.ptyche.clusters.nvidia.com>

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 added the area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), and waiting-on-maintainers (Waiting on maintainers to respond) labels May 19, 2026

# (model_recipe_name, compute_dtype, num_gpus) -> (TP, PP, VP, CP, MBS, GBS, parity_mode); shapes derived from v5.1 NVIDIA submission configs.
_MLPERF_V51_SHAPES: Dict[Tuple[str, str, int], Tuple[int, int, int, int, int, int, str]] = {
("llama3_8b", "fp8_cs", 8): (1, 1, 1, 1, 1, 8, "F16_ATTN"),

Suggested change
- ("llama3_8b", "fp8_cs", 8): (1, 1, 1, 1, 1, 8, "F16_ATTN"),
+ ("llama3_8b", "fp8_cs", 8): (1, 1, 1, 1, 2, 16, "F16_ATTN"),
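Both the original and the suggested tuple are internally consistent; they differ only in micro-batch and global batch size. A quick check, assuming the usual convention that the data-parallel size is num_gpus / (TP × PP × CP) and that GBS must be a multiple of MBS × DP (the helper names are hypothetical):

```python
def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int) -> int:
    """GPUs left for data parallelism after model and context parallelism."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(f"{num_gpus} GPUs not divisible by TP*PP*CP={model_parallel}")
    return num_gpus // model_parallel


def check_shape(num_gpus: int, tp: int, pp: int, cp: int, mbs: int, gbs: int) -> int:
    """Return the number of micro-batches each data-parallel rank runs per step."""
    dp = data_parallel_size(num_gpus, tp, pp, cp)
    samples_per_pass = mbs * dp
    if gbs % samples_per_pass != 0:
        raise ValueError(f"GBS={gbs} is not a multiple of MBS*DP={samples_per_pass}")
    return gbs // samples_per_pass


print(check_shape(8, 1, 1, 1, mbs=1, gbs=8))    # original 8 GPU FP8 entry
print(check_shape(8, 1, 1, 1, mbs=2, gbs=16))   # suggested 8 GPU FP8 entry
print(check_shape(256, 4, 8, 2, mbs=1, gbs=576))  # 405B, 256 GPU entry
```

A check like this only validates arithmetic; whether a shape matches the reference submission still has to be confirmed against the reference configs.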

Comment on lines +20 to +29
("llama3_8b", "fp8_cs", 16): (1, 1, 1, 2, 1, 8, "F16_ATTN"),
("llama3_8b", "fp8_cs", 32): (1, 1, 1, 2, 1, 16, "F16_ATTN"),
("llama3_8b", "fp8_cs", 64): (1, 1, 1, 2, 1, 32, "F16_ATTN"),
("llama3_8b", "fp8_cs", 72): (1, 1, 1, 2, 1, 36, "F16_ATTN"),
("llama3_8b", "fp8_cs", 128): (2, 1, 1, 4, 1, 16, "F16_ATTN"),
("llama3_8b", "nvfp4", 8): (1, 1, 1, 1, 2, 16, "FP4_ATTN"),
("llama31_405b", "fp8_cs", 256): (4, 8, 8, 2, 1, 576, "405B"),
("llama31_405b", "fp8_cs", 512): (4, 8, 8, 2, 1, 1152, "405B"),
("llama31_405b", "nvfp4", 256): (4, 8, 8, 2, 1, 576, "405B"),
("llama31_405b", "nvfp4", 512): (4, 8, 8, 2, 1, 1152, "405B"),

For maintainability, it would help to add a field at the end of each entry that encodes the MLPerf tuning source: llama31_8b_2x4, llama31_8b_18x4, llama31_8b_512x4 (or _small, _med, _large). The functions that apply the settings would then have blocks (or function calls) that configure all overrides for that specific config. This field would replace parity_mode.
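One way to read this suggestion, as a rough sketch only: each shape carries a `tuning_source` label, and a registry maps that label to a single function that applies every override for that config. The class, registry, and function names are hypothetical, the pairing of labels with table entries is a guess for illustration, and the override bodies are placeholders rather than real recipe knobs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class MlperfShape:
    tp: int
    pp: int
    vp: int
    cp: int
    mbs: int
    gbs: int
    tuning_source: str  # takes the place of parity_mode


# Shape values are copied from the table above; the tuning_source labels
# attached to them are assumptions made for this example.
SHAPES: Dict[Tuple[str, str, int], MlperfShape] = {
    ("llama3_8b", "nvfp4", 8): MlperfShape(1, 1, 1, 1, 2, 16, "llama31_8b_2x4"),
    ("llama3_8b", "fp8_cs", 72): MlperfShape(1, 1, 1, 2, 1, 36, "llama31_8b_18x4"),
}


def _overrides_8b_2x4(recipe: dict) -> dict:
    # Placeholder: every knob for this tuning source would be set here.
    return {**recipe, "applied_tuning": "llama31_8b_2x4"}


def _overrides_8b_18x4(recipe: dict) -> dict:
    # Placeholder: every knob for this tuning source would be set here.
    return {**recipe, "applied_tuning": "llama31_8b_18x4"}


_OVERRIDES: Dict[str, Callable[[dict], dict]] = {
    "llama31_8b_2x4": _overrides_8b_2x4,
    "llama31_8b_18x4": _overrides_8b_18x4,
}


def apply_mlperf_overrides(recipe: dict, shape: MlperfShape) -> dict:
    """Dispatch to the one function that owns all overrides for this config."""
    try:
        configure = _OVERRIDES[shape.tuning_source]
    except KeyError:
        raise ValueError(f"unknown tuning source {shape.tuning_source!r}") from None
    return configure(recipe)
```

With this layout, adding a new reference config means adding one table entry and one override function, instead of threading another boolean through shared code.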

Comment thread scripts/performance/utils/overrides.py Outdated
Comment on lines +500 to +505
"""Apply MLPerf v5.1 apples-to-apples recipe knobs; gated by MLPERF_PARITY_{F16_ATTN,FP4_ATTN,405B} env vars (set by perf_plugins)."""
f16_only = bool(os.environ.get("MLPERF_PARITY_F16_ATTN"))
fp4_attn = bool(os.environ.get("MLPERF_PARITY_FP4_ATTN"))
parity_405b = bool(os.environ.get("MLPERF_PARITY_405B"))
if not (f16_only or fp4_attn or parity_405b):
    return recipe

Do we have to pass custom environment variables between stages? Could we have a get_mlperf_flavor_config instead?
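A minimal sketch of what such a helper might look like, assuming the parity mode is already known from the shape lookup. Only the function name comes from the comment above; its signature, the `MlperfFlavorConfig` class, and `apply_mlperf_parity` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

_VALID_MODES = {"F16_ATTN", "FP4_ATTN", "405B"}


@dataclass(frozen=True)
class MlperfFlavorConfig:
    f16_attn: bool = False
    fp4_attn: bool = False
    parity_405b: bool = False

    @property
    def enabled(self) -> bool:
        return self.f16_attn or self.fp4_attn or self.parity_405b


def get_mlperf_flavor_config(mlperf_flavor: bool, parity_mode: Optional[str]) -> MlperfFlavorConfig:
    """Build an explicit config object instead of exporting MLPERF_PARITY_* env vars."""
    if not mlperf_flavor or parity_mode is None:
        return MlperfFlavorConfig()  # flag absent: nothing is enabled
    if parity_mode not in _VALID_MODES:
        raise ValueError(f"unknown parity mode {parity_mode!r}")
    return MlperfFlavorConfig(
        f16_attn=parity_mode == "F16_ATTN",
        fp4_attn=parity_mode == "FP4_ATTN",
        parity_405b=parity_mode == "405B",
    )


def apply_mlperf_parity(recipe: dict, cfg: MlperfFlavorConfig) -> dict:
    """Same early return as the reviewed code, driven by the config object."""
    if not cfg.enabled:
        return recipe
    # Placeholder for the real knobs; only records which mode was selected.
    return {**recipe, "mlperf_parity": cfg}
```

Passing the object as an argument makes the dependency visible in the function signature and easier to unit test than process-wide environment state.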

…ataset optional (falls back to mock); fix 8 GPU FP8 shape; add 16/32/64 FP4 shape entries

Signed-off-by: Rahul Salagame <rsalagame@login-ptyche01.ptyche.clusters.nvidia.com>
rsalagame-nvidia changed the title from "feat(performance): add --mlperf_flavor for MLPerf v5.1 apples-to-appl…" to "feat(performance): add --mlperf_flavor for MLPerf v6.0 apples-to-appl…" May 20, 2026
Labels

area:perf (Performance optimizations and benchmarking)
feature (New capabilities, enhancements, or enablement work)
waiting-on-maintainers (Waiting on maintainers to respond)

3 participants