Refine Nemotron Customizer airgap image flow
- Rename airgap artifacts to use launcher and execution image terminology
- Update runner stages, manifests, README, and config keys to match the new naming
- Keep execution image generation scoped to selected Nemotron Customizer steps
- Preserve external handling for models, datasets, checkpoints, and customer storage paths
- Refresh SFT Megatron Bridge airgap overlay configs
- Update tests for launcher/execution image behavior and staged runner flow

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
rapaul-nv committed May 11, 2026
commit f4b8f50910c616d8fe48842ef2cf2ea3fc4bed1b
@@ -1,11 +1,11 @@
# Derivative task image for Nemotron Customizer airgap.
# Derivative execution image for Nemotron Customizer airgap.
# Built from the real training/runtime image and only adds small missing
# wrapper packages.

ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ARG TASK_REQUIREMENTS
ARG EXECUTION_REQUIREMENTS
ARG REPO_OVERLAYS
ARG REPO_OVERLAYS_DIR
ARG PYTHON_BIN=python
@@ -16,16 +16,16 @@ ENV TRANSFORMERS_OFFLINE=1
ENV HF_DATASETS_OFFLINE=1
ENV WANDB_MODE=offline

COPY ${TASK_REQUIREMENTS} /opt/nemotron-airgap/task-requirements.txt
COPY ${EXECUTION_REQUIREMENTS} /opt/nemotron-airgap/execution-requirements.txt
COPY ${REPO_OVERLAYS} /opt/nemotron-airgap/repo-overlays.json
COPY ${REPO_OVERLAYS_DIR}/ /opt/nemotron-airgap/repo-overlays/

# Build-time installs keep --no-cache-dir so derivative image layers stay small.
RUN if [ -s /opt/nemotron-airgap/task-requirements.txt ]; then \
RUN if [ -s /opt/nemotron-airgap/execution-requirements.txt ]; then \
if [ "${PIP_NO_DEPS}" = "true" ]; then \
${PYTHON_BIN} -m pip install --no-cache-dir --no-deps -r /opt/nemotron-airgap/task-requirements.txt; \
${PYTHON_BIN} -m pip install --no-cache-dir --no-deps -r /opt/nemotron-airgap/execution-requirements.txt; \
else \
${PYTHON_BIN} -m pip install --no-cache-dir -r /opt/nemotron-airgap/task-requirements.txt; \
${PYTHON_BIN} -m pip install --no-cache-dir -r /opt/nemotron-airgap/execution-requirements.txt; \
fi; \
fi && \
${PYTHON_BIN} - <<'PY'
@@ -4,8 +4,8 @@
!deploy/nemotron-customizer
!deploy/nemotron-customizer/airgap
!deploy/nemotron-customizer/airgap/out
!deploy/nemotron-customizer/airgap/out/task-context
!deploy/nemotron-customizer/airgap/out/task-context/**
!deploy/nemotron-customizer/airgap/out/execution-context
!deploy/nemotron-customizer/airgap/out/execution-context/**
!deploy/nemotron-customizer/airgap/out/repo-overlays
!deploy/nemotron-customizer/airgap/out/repo-overlays/**

@@ -1,4 +1,4 @@
# Submitter image for Nemotron Customizer airgap.
# Launcher image for Nemotron Customizer airgap.
# It contains the repo and a uv-synced environment. It does not run training.

ARG BASE_IMAGE=python:3.12-slim
40 changes: 20 additions & 20 deletions deploy/nemotron-customizer/airgap/README.md
@@ -5,22 +5,22 @@ This folder is scoped only to Nemotron Customizer steps under

The flow is intentionally small:

1. Build one **submitter image** with this repo and `uv.lock`.
2. Build one or more **task images** by grouping selected workflow stages by base image.
1. Build one **launcher image** with this repo and `uv.lock`.
2. Build one or more **execution images** by grouping selected workflow stages by base image.
3. Save those images as tarballs for the airgapped side.
4. Keep models, datasets, checkpoints, and customer files on persistent storage.

Edit `airgap.yaml` first:

- `workflow.stages`: the Nemotron Customizer steps the customer wants to run
- `dependencies`: central step dependency map, for example SFT training needs SFT packing
- `step_images`: which task image each step should use
- `task_images`: the base image, output tag, and known/import-probed Python requirements
- `step_execution_images`: which execution image each step should use
- `execution_images`: the base image, output tag, and known/import-probed Python requirements
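
As an illustration of how `workflow.stages` and `dependencies` could interact, here is a minimal sketch of dependency expansion. It is not the runner's actual code: the function name `expand_stages`, the dict-of-lists shape of the dependency map, and the step name `prep/sft_packing` are assumptions made for the example.

```python
def expand_stages(selected, dependencies):
    """Return the selected steps plus their transitive dependencies.

    Dependencies come before the steps that need them. `dependencies`
    is assumed to map a step name to a list of prerequisite step names.
    """
    ordered = []
    seen = set()

    def visit(step, stack=()):
        if step in stack:
            raise ValueError(f"dependency cycle at {step}")
        if step in seen:
            return
        for dep in dependencies.get(step, []):
            visit(dep, stack + (step,))
        seen.add(step)
        ordered.append(step)

    for step in selected:
        visit(step)
    return ordered


# Hypothetical dependency map: SFT training needs SFT packing.
deps = {"sft/megatron_bridge": ["prep/sft_packing"]}
expanded = expand_stages(["sft/megatron_bridge"], deps)
```

Only the steps in `expanded` would then be considered when choosing which images to build.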

Only steps reached from `workflow.stages` are built. Steps are grouped by
`base_image + repo_overlays`; each group gets one derivative image with the
union of its small missing packages. If two selected step families share the
same base image and repo overlays, the runner emits one combined task image for
same base image and repo overlays, the runner emits one combined execution image for
both.
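
The grouping rule above can be sketched roughly as follows. This is a simplified illustration, not the runner's implementation: `group_steps`, the flattened step-to-base-image mapping, and the package names `pkg-a` and `pkg-b` are assumptions for the example.

```python
from collections import defaultdict


def group_steps(steps, base_images, repo_overlays, missing_packages):
    """Group steps by (base image, repo overlays) and union their packages.

    All three mappings are keyed by step name; steps without overlays or
    missing packages may be absent from the last two.
    """
    groups = defaultdict(lambda: {"steps": [], "packages": set()})
    for step in steps:
        key = (base_images[step], tuple(sorted(repo_overlays.get(step, ()))))
        groups[key]["steps"].append(step)
        groups[key]["packages"] |= set(missing_packages.get(step, ()))
    return dict(groups)


nemo = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
groups = group_steps(
    steps=["byob", "eval/model_eval", "curate/nemo_curator"],
    base_images={
        "byob": nemo,
        "eval/model_eval": nemo,
        "curate/nemo_curator": "nvcr.io/nvidia/nemo-curator:25.07",
    },
    repo_overlays={},
    # Hypothetical package names, for illustration only.
    missing_packages={"byob": ["pkg-a"], "eval/model_eval": ["pkg-b"]},
)
```

In this example `byob` and `eval/model_eval` share a base image and have no overlays, so they fall into one group whose package set is the union of both.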

Run from the repo root:
@@ -45,7 +45,7 @@ To run only a few stages:
uv run python deploy/nemotron-customizer/airgap/runner.py \
--config deploy/nemotron-customizer/airgap/airgap.yaml \
--stage validate \
--stage discover-task-deps
--stage discover-execution-deps
```

To override the workflow without editing YAML, pass one or more selected
@@ -63,18 +63,18 @@ uv run python deploy/nemotron-customizer/airgap/runner.py \
Outputs are written under `deploy/nemotron-customizer/airgap/out/` by default:

- `airgap-manifest.yaml`: what was validated and built
- `airgap-progress.yaml`: incomplete execute run state used for resume
- `airgap-complete.yaml`: final execute run state after success
- `requirements-<task-group>.txt`: small missing packages per task image
- `repo-overlays-<task-group>.json`: git auto-mounts discovered from selected step configs
- `submitter-image.tar`
- `task-*.tar`
- `airgap-build-state.yaml`: incomplete execute run state used for resume
- `airgap-build-complete.yaml`: final execute run state after success
- `requirements-<execution-group>.txt`: small missing packages per execution image
- `repo-overlays-<execution-group>.json`: git auto-mounts discovered from selected step configs
- `launcher-image.tar`
- `execution-*.tar`
- SHA256 checksums for saved image tarballs in `airgap-manifest.yaml`
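
On the airgapped side, a saved tarball can be checked against the checksum recorded in the manifest before loading it. The helper below is a minimal sketch using only the standard library; the function name `sha256_of` is an assumption, and reading the expected value out of `airgap-manifest.yaml` is left to the caller because the manifest layout is not shown here.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, read in chunks.

    Chunked reads keep memory use flat for multi-gigabyte image tarballs.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the returned string with the manifest entry for the same tarball, for example `sha256_of("launcher-image.tar")`, and stop if they differ.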

If an execute run fails midway, leave `airgap-progress.yaml` in place and rerun
If an execute run fails midway, leave `airgap-build-state.yaml` in place and rerun
the same command. Completed expensive actions are reused when their artifacts
still exist. If you intentionally change the workflow or image plan before
finishing, move or remove `airgap-progress.yaml` first; the runner will not
finishing, move or remove `airgap-build-state.yaml` first; the runner will not
silently overwrite incomplete state from a different plan.
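
One way such a resume guard could work is to store a fingerprint of the plan in the state file and refuse to continue when it no longer matches. The sketch below illustrates the idea only; the keys `plan_fingerprint` and `completed`, and both function names, are assumptions and may not match what the runner actually writes.

```python
import hashlib
import json


def plan_fingerprint(plan):
    """Hash a plan dict in a key-order-independent way."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def completed_actions(state, plan):
    """Return actions that can be skipped on resume.

    `state` is the parsed state file, or None when no file exists.
    Raises if the saved state was produced by a different plan.
    """
    if state is None:
        return set()
    if state.get("plan_fingerprint") != plan_fingerprint(plan):
        raise RuntimeError(
            "incomplete build state belongs to a different plan; "
            "move or remove it before rerunning"
        )
    return set(state.get("completed", []))
```

A fuller version would also confirm that each completed action's artifact still exists before skipping it; that check is omitted here.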

Runtime dependency probes use Docker volumes named
@@ -88,19 +88,19 @@ executor-visible persistent storage and reference them through config overrides
and `run.env.mounts`.

During dependency discovery, the runner mounts the connected-machine checkout
into each task image only to probe imports. The final task image deliberately
does not bake this repo; the submitter image and the normal nemo-run/nemo-runspec
into each execution image only to probe imports. The final execution image deliberately
does not bake this repo; the launcher image and the normal nemo-run/nemo-runspec
code transport provide the repo to the remote job at submission time.

Repo logistics stay outside `airgap.yaml`. If a selected step config contains
`${auto_mount:git+...}`, the runner treats it as a connected-machine build input:
it fetches that pinned repo and bakes it into the derivative task image at the
it fetches that pinned repo and bakes it into the derivative execution image at the
requested target path. Runtime jobs then use the baked image and do not clone
from GitHub. Site-specific data/model mounts remain in env profiles or step
overrides.

If the connected machine is not the same architecture as the target cluster,
set `platform: linux/amd64` on the submitter or task image entry in
set `platform: linux/amd64` on the `launcher_image` or execution image entry in
`airgap.yaml`. If you need to minimize transfer size for several images that
share layers, `docker save -o all-images.tar tag1 tag2 ...` can be used after
the runner builds the images; a single tar deduplicates shared layers better
@@ -124,8 +124,8 @@ workflow:
When submitting inside the airgap, use the deploy overlay config so those git
auto-mounts are cleared at runtime while persistent storage mounts from the env
profile still apply. Use the image printed by the runner under
`selected step images`, or read it from `out/airgap-manifest.yaml` under
`step_images`.
`selected execution images`, or read it from `out/airgap-manifest.yaml` under
`step_execution_images`.

```bash
uv run nemotron step run sft/megatron_bridge \
42 changes: 21 additions & 21 deletions deploy/nemotron-customizer/airgap/airgap.yaml
@@ -2,7 +2,7 @@
#
# Change workflow.stages to the steps the customer wants. The runner expands
# dependencies, validates those step files/configs, groups selected steps by
# task image, then builds only the images needed for that selection.
# execution image, then builds only the images needed for that selection.

workflow:
name: sft-megatron-bridge
@@ -18,18 +18,18 @@ workflow:

build_stages:
- validate
- discover-task-deps
- build-submitter
- build-task-images
- discover-execution-deps
- build-launcher-image
- build-execution-images
- save-images

paths:
output_dir: deploy/nemotron-customizer/airgap/out

submitter:
launcher_image:
base_image: python:3.12-slim
tag: nemotron-customizer-submit-airgap:latest
tar: submitter-image.tar
tag: nemotron-customizer-launcher-airgap:latest
tar: launcher-image.tar

# Central dependency map. Keep this small and explicit: it is only for steps
# that naturally require a previous Nemotron Customizer step output.
@@ -51,15 +51,15 @@ dependencies:
# SDG can feed SFT or RL prep, but it is not forced as a dependency because
# many customers bring their own JSONL on persistent storage.

# Step -> task-image mapping. The runner only uses entries reached from
# Step -> execution-image mapping. The runner only uses entries reached from
# workflow.stages after dependency expansion.
step_images:
step_execution_images:
byob: nemo-data-designer
convert/hf_to_megatron: nemo-megatron
convert/megatron_to_hf: nemo-megatron
convert/merge_lora: nemo-megatron
curate/nemo_curator: nemo-curator
env/env_toml: submitter-python
env/env_toml: launcher-python
eval/model_eval: nemo-eval
optimize/modelopt/distill: nemo-modelopt
optimize/modelopt/prune: nemo-modelopt
@@ -80,51 +80,51 @@ step_images:
translate/nemo_skills: nemo-curator
translate/translation: nemo-curator

task_images:
submitter-python:
execution_images:
launcher-python:
base_image: python:3.12-slim
tag: nemotron-customizer-python-task-airgap:latest
tar: task-python-image.tar
tag: nemotron-customizer-python-execution-airgap:latest
tar: execution-python-image.tar

nemo-megatron:
base_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
tag: nemotron-customizer-nemo-megatron-airgap:latest
tar: task-nemo-megatron-image.tar
tar: execution-nemo-megatron-image.tar
required_imports: []

nemo-automodel:
base_image: nvcr.io/nvidia/nemo-automodel:26.04
tag: nemotron-customizer-nemo-automodel-airgap:latest
tar: task-nemo-automodel-image.tar
tar: execution-nemo-automodel-image.tar
required_imports: []

nemo-rl:
base_image: nvcr.io/nvidia/nemo-rl:v0.6.0
tag: nemotron-customizer-nemo-rl-airgap:latest
tar: task-nemo-rl-image.tar
tar: execution-nemo-rl-image.tar
required_imports: []

nemo-modelopt:
base_image: nvcr.io/nvidia/nemo:26.02
tag: nemotron-customizer-nemo-modelopt-airgap:latest
tar: task-nemo-modelopt-image.tar
tar: execution-nemo-modelopt-image.tar
required_imports: []

nemo-curator:
base_image: nvcr.io/nvidia/nemo-curator:25.07
tag: nemotron-customizer-nemo-curator-airgap:latest
tar: task-nemo-curator-image.tar
tar: execution-nemo-curator-image.tar
required_imports: []

nemo-data-designer:
base_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
tag: nemotron-customizer-nemo-data-designer-airgap:latest
tar: task-nemo-data-designer-image.tar
tar: execution-nemo-data-designer-image.tar
required_imports:
- data_designer

nemo-eval:
base_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
tag: nemotron-customizer-nemo-eval-airgap:latest
tar: task-nemo-eval-image.tar
tar: execution-nemo-eval-image.tar
required_imports: []
@@ -1,7 +1,7 @@
# Airgap runtime overlay for sft/megatron_bridge:default.
#
# The connected-machine airgap runner bakes the auto_mount repos from the base
# config into the derivative task image. At runtime, clear those git auto-mounts
# config into the derivative execution image. At runtime, clear those git auto-mounts
# so the airgapped job does not clone from GitHub. Env-profile persistent
# storage mounts still append normally.

@@ -1,7 +1,7 @@
# Airgap runtime overlay for sft/megatron_bridge:tiny.
#
# The connected-machine airgap runner bakes the auto_mount repos from the base
# config into the derivative task image. At runtime, clear those git auto-mounts
# config into the derivative execution image. At runtime, clear those git auto-mounts
# so the airgapped job does not clone from GitHub. Env-profile persistent
# storage mounts still append normally.
