The launcher submits ModelOpt quantization, training, and evaluation jobs to Slurm clusters or runs them locally with Docker.
| File | Role |
|---|---|
| `launch.py` | Public entrypoint — accepts `--yaml` or `pipeline=@` |
| `core.py` | Shared dataclasses, executor builders, run loop, version reporting |
| `slurm_config.py` | `SlurmConfig` dataclass and env-var-driven `slurm_factory` |
| `common/` | Shell scripts and `query.py` packaged to the cluster |
| `modules/Megatron-LM/` | Git submodule |
| `modules/Model-Optimizer` | Symlink to `../..` (auto-created by `launch.py` if missing) |
```shell
# Run locally with Docker
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml hf_local=/mnt/hf-local --yes

# Run on Slurm (set env vars first)
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes

# Dry run — preview resolved config
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --dryrun --yes -v

# Dump resolved config
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --to-yaml resolved.yaml

# Run unit tests
uv pip install pytest
uv run python3 -m pytest tests/ -v
```

The `--yaml` format maps top-level keys to `launch()` function arguments:
```yaml
job_name: Qwen3-8B_NVFP4_DEFAULT_CFG
pipeline:
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron_lm/quantize/quantize.sh
    args:
      - --calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail
    environment:
      - MLM_MODEL_CFG: Qwen/Qwen3-8B
      - HF_MODEL_CKPT: <<global_vars.hf_local>>Qwen/Qwen3-8B
      - TP: 4
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```

Key conventions:
- Scripts go in `common/` (not `services/`)
- `<<global_vars.X>>` interpolation for shared values across tasks
- `_factory_: "slurm_factory"` — resolved via `register_factory()` in `core.py`
- Environment is list-of-single-key-dicts: `- KEY: value`
- CLI overrides: `pipeline.task_0.slurm_config.nodes=2`
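The `<<global_vars.X>>` convention substitutes shared values into task arguments and environment entries before jobs run. As a minimal sketch of how such interpolation could work (the function name and error behavior here are assumptions, not the launcher's actual implementation in `core.py`):

```python
import re

# Hypothetical helper -- illustrates the <<global_vars.X>> convention only.
def interpolate(value: str, global_vars: dict) -> str:
    """Replace <<global_vars.KEY>> markers with values from global_vars."""
    pattern = re.compile(r"<<global_vars\.(\w+)>>")

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in global_vars:
            raise KeyError(f"global_vars has no key {key!r}")
        return str(global_vars[key])

    return pattern.sub(substitute, value)

resolved = interpolate("<<global_vars.hf_local>>Qwen/Qwen3-8B", {"hf_local": "/hf-local/"})
# -> "/hf-local/Qwen/Qwen3-8B"
```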
```
launch.py → imports core.py + slurm_config.py
        ↓
core.run_jobs()
        ↓
build_docker_executor() or build_slurm_executor()
        ↓
nemo_run.Experiment → Docker or Slurm
```
`core.py` exposes:

- `set_slurm_config_type(SlurmConfig)` — patches `SandboxTask` annotation at import time
- `register_factory("slurm_factory", slurm_factory)` — enables YAML `_factory_` resolution
- `report_versions(base_dir)` — prints git commit/branch for launcher + submodules
- `get_default_env(title)` — returns `(slurm_env, local_env)` dicts
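The factory registry is what lets a YAML mapping with `_factory_: "slurm_factory"` turn into a constructed config object. A minimal sketch of the pattern (the registry internals and `resolve` helper here are assumptions for illustration; `core.py`'s real implementation may differ):

```python
from typing import Any, Callable

# Illustrative registry -- not core.py's actual data structure.
_FACTORIES: dict[str, Callable[..., Any]] = {}

def register_factory(name: str, factory: Callable[..., Any]) -> None:
    """Register a factory so YAML `_factory_: "<name>"` can resolve to it."""
    _FACTORIES[name] = factory

def resolve(config: dict) -> Any:
    """If a mapping carries `_factory_`, call the named factory with the remaining keys."""
    kwargs = dict(config)
    name = kwargs.pop("_factory_", None)
    if name is None:
        return config  # plain mapping, nothing to construct
    return _FACTORIES[name](**kwargs)

# Hypothetical stand-in for the env-var-driven slurm_factory.
def slurm_factory(nodes=1, ntasks_per_node=1, gpus_per_node=0):
    return {"nodes": nodes, "ntasks_per_node": ntasks_per_node, "gpus_per_node": gpus_per_node}

register_factory("slurm_factory", slurm_factory)
cfg = resolve({"_factory_": "slurm_factory", "nodes": 1, "ntasks_per_node": 4, "gpus_per_node": 4})
```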
To add a new model:

- Create `examples/<Org>/<Model>/megatron_lm_ptq.yaml` following the format above
- Set `MLM_MODEL_CFG` to the HuggingFace repo ID
- Set `QUANT_CFG` (e.g., `NVFP4_DEFAULT_CFG`, `INT8_DEFAULT_CFG`)
- Set GPU/node counts based on model size
- Test: `uv run launch.py --yaml <path> --dryrun --yes -v`
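Following those steps, a new model's YAML might look like the sketch below. The placeholder org/model names and resource counts are illustrative, not a tested configuration:

```yaml
job_name: <Model>_NVFP4_DEFAULT_CFG        # placeholder name
pipeline:
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron_lm/quantize/quantize.sh
    environment:
      - MLM_MODEL_CFG: <Org>/<Model>       # HuggingFace repo ID
      - HF_MODEL_CKPT: <<global_vars.hf_local>><Org>/<Model>
      - QUANT_CFG: NVFP4_DEFAULT_CFG
      - TP: 4                              # size to the model
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```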
65 unit tests live in `tests/` and run standalone without installing modelopt. From the launcher directory:

```shell
uv run python3 -m pytest tests/ -v
```

Tests cover: core dataclasses, factory registry, global_vars interpolation, YAML formats, Docker/Slurm executor construction (mocked), environment merging, metadata writing, and end-to-end Docker launch via subprocess.
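One of the covered behaviors, environment merging, follows from the list-of-single-key-dicts convention above. A minimal sketch of what such a merge could look like (the function name and later-entry-wins policy are assumptions, not the launcher's actual code):

```python
# Hypothetical helper -- merges the YAML environment format into one dict.
def merge_environment(env_list: list[dict]) -> dict[str, str]:
    """Flatten a list of single-key dicts; later entries win on key collisions.

    Values are stringified, since environment variables are strings.
    """
    merged: dict[str, str] = {}
    for entry in env_list:
        merged.update({key: str(value) for key, value in entry.items()})
    return merged

env = merge_environment([{"MLM_MODEL_CFG": "Qwen/Qwen3-8B"}, {"TP": 4}])
# env == {"MLM_MODEL_CFG": "Qwen/Qwen3-8B", "TP": "4"}
```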
- docs/configuration.md — YAML formats, overrides, hf_local
- docs/architecture.md — Shared core, factory system, typed tasks, mount mechanism
- docs/testing.md — Running tests locally and in CI
- docs/claude_code.md — Claude Code workflows
- docs/contributing.md — Adding models, typed tasks, bug reporting