Airgap SKILL addition

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
NVIDIA-NeMo · rapaul-nv · May 8, 2026 · May 11, 2026
commit 6332e3b3eb5156e6dbece685c019786cf4633dfc
diff --git a/deploy/nemotron-customizer/airgap/SKILL.md b/deploy/nemotron-customizer/airgap/SKILL.md
@@ -0,0 +1,115 @@
+---
+name: nemotron-customizer-airgap
+description: Prepare, validate, build, and use Nemotron Customizer airgap image bundles for offline clusters. Use when planning airgapped deployments, editing deploy/nemotron-customizer/airgap/airgap.yaml, selecting workflow targets, grouping step execution images, baking repo overlays or wheel additions, resuming airgap runner builds, or submitting `nemotron steps run` jobs inside an airgapped environment.
+---
+
+# Nemotron Customizer Airgap
+
+Use this skill to help an agent build an airgap bundle on a connected machine
+and then submit Nemotron Customizer steps from the airgapped side. Stay grounded
+in the checked-in runner and manifests; do not invent a parallel packaging flow.
+
+## Read First
+
+- `deploy/nemotron-customizer/airgap/README.md` for the operator flow.
+- `deploy/nemotron-customizer/airgap/airgap.yaml` for the current image map.
+- `deploy/nemotron-customizer/airgap/runner.py` when changing behavior.
+- `tests/deploy/test_airgap_runner.py` before editing runner logic.
+- `deploy/nemotron-customizer/airgap/configs/` for runtime overlay configs.
+
+For selected steps, inspect the catalog through the CLI:
+
+```bash
+uv run nemotron steps show <step_id> --json
+```
+
+## Workflow
+
+1. Establish the side of the workflow:
+   - Connected machine: validate, build, save image tarballs.
+   - Airgapped side: load images, set env profiles, run selected steps.
+
+2. Gather the minimum inputs:
+   - Target steps and config names, for example `sft/megatron_bridge:tiny`.
+   - Target architecture or Docker platform, for example `linux/amd64`.
+   - Available base images and whether the connected machine can pull them.
+   - Airgapped env profile name, mounts, model/data/checkpoint locations.
+   - Whether destructive or expensive actions such as `--execute`, Docker build,
+     Docker volume cleanup, or state-file removal are explicitly allowed.
+
+3. Plan with the runner first:
+
+```bash
+uv run python deploy/nemotron-customizer/airgap/runner.py \
+  --config deploy/nemotron-customizer/airgap/airgap.yaml
+```
+
+Use `--target <step_id>:<config>` for one-off selections without editing YAML.
+The runner expands the `dependencies` entries, validates selected step files
+and configs, groups execution images, and prints the selected execution images.
+
+4. Edit `airgap.yaml` only where the runner expects configuration:
+   - `workflow.stages` or CLI `--target` for selected customer steps.
+   - `dependencies` for explicit upstream Nemotron Customizer step outputs.
+   - `step_execution_images` for step-to-image mapping.
+   - `execution_images` for base image, tag, tar, platform, and import probes.
+   - `launcher_image` for the launcher container.
+
+5. Execute only when the user asks for a real build:
+
+```bash
+uv run python deploy/nemotron-customizer/airgap/runner.py \
+  --config deploy/nemotron-customizer/airgap/airgap.yaml \
+  --execute
+```
+
+If a build fails midway, keep `airgap-build-state.yaml` and rerun the same
+command to resume. Remove or move the state file only when changing the plan.
+
+6. On the airgapped side, use the images listed under `step_execution_images`
+in `out/airgap-manifest.yaml`. Submit with the plural `nemotron steps` CLI:
+
+```bash
+uv run nemotron steps run <step_id> \
+  -c <config-or-airgap-overlay> \
+  -b <airgap-profile> \
+  run.env.container_image=<image-from-manifest>
+```
+
+For `sft/megatron_bridge`, prefer the airgap overlay configs under
+`deploy/nemotron-customizer/airgap/configs/`; they clear runtime git auto-mounts
+because the runner bakes those repos into the execution image.
+
+## Guardrails
+
+- Keep models, datasets, checkpoints, secrets, and customer files out of images.
+  Put them on persistent storage and reference them through config overrides and
+  `run.env.mounts`.
+- Treat `${auto_mount:git+...}` as a connected-machine build input. The runner
+  bakes pinned repo overlays into execution images so airgapped jobs do not clone
+  from GitHub.
+- Do not add missing packages blindly. Let `discover-execution-deps` and
+  import probes determine small additions; keep heavyweight framework deps in
+  the base image choice.
+- Preserve offline defaults unless the user has an internal mirror:
+  `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, `HF_DATASETS_OFFLINE=1`,
+  and `WANDB_MODE=offline`.
+- Use `nemotron steps ...`; do not reintroduce `nemotron step ...`.
+
+## Validation
+
+After edits to runner logic, YAML structure, or airgap docs, run:
+
+```bash
+uv run pytest tests/deploy/test_airgap_runner.py -q
+```
+
+For CLI-facing examples, also smoke-test the command shape:
+
+```bash
+uv run nemotron steps --help
+uv run nemotron steps show prep/sft_packing --json
+```
+
+Do not run Docker build/save stages during validation unless the user explicitly
+asked for a real connected-machine bundle build.
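The offline defaults listed under Guardrails could be checked before submitting a job with something like the sketch below. It is not part of this commit: `OFFLINE_DEFAULTS` and `missing_offline_defaults` are hypothetical names, and only the four variable names and values come from the skill text.

```python
# Offline defaults named in the Guardrails section of the skill.
OFFLINE_DEFAULTS = {
    "HF_HUB_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "WANDB_MODE": "offline",
}


def missing_offline_defaults(env):
    """Return {name: expected} for each default that env does not set as expected.

    Hypothetical helper: `env` is any mapping of variable names to values,
    for example dict(os.environ) or the env section of a run profile.
    """
    return {
        name: expected
        for name, expected in OFFLINE_DEFAULTS.items()
        if env.get(name) != expected
    }
```

Calling `missing_offline_defaults(dict(os.environ))` on the airgapped side would return an empty dict when all four defaults are in place. A site with an internal mirror would presumably relax this check, as the guardrail allows.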
diff --git a/skills/nemotron-customize/SKILL.md b/skills/nemotron-customize/SKILL.md
@@ -40,6 +40,7 @@ Concise. Technical. No fluff.
 | Cross-step constraint (tokenizer lock, eval bookends, ...) | `src/nemotron/steps/patterns/<id>.md` |
 | Artifact compatibility / `is_a` / `convert_to` | [src/nemotron/steps/types.toml](../../src/nemotron/steps/types.toml) |
 | GPU memory / parallelism heuristics | [src/nemotron/steps/hardware.md](../../src/nemotron/steps/hardware.md) |
+| Explicit airgap/offline bundle request only | [deploy/nemotron-customizer/airgap/SKILL.md](../../deploy/nemotron-customizer/airgap/SKILL.md) |
 | Library API extracts for code generation | [context/index.toml](context/index.toml) → `context/<pack>.txt` |
 | Project scaffold rules (CLI, pyproject, README, deploy) | [act/PROJECT.md](act/PROJECT.md) |
 | Per-stage code rules (R1–R5, dry-run, W&B) | [act/STAGE.md](act/STAGE.md) |
@@ -144,7 +145,6 @@ Goal: produce a markdown plan the user reviews before any code is written.
 | 6 | RL warm-starts from SFT; rewards validated before scale. | [patterns/rl-validate-rewards-before-scale.md](../../src/nemotron/steps/patterns/rl-validate-rewards-before-scale.md) |
 | 7 | GPU count ≥ chosen model's `min_gpus` (from `[[models]]` block in each `step.toml`). | step.toml + [hardware.md](../../src/nemotron/steps/hardware.md) |
 | 8 | Sovereign / customization patterns checked: `cpt-data-blend-scoping`, `sft-data-blending`, `multilingual-tokenizer-check`, `data-quality-before-quantity`, `sdg-pipeline-versioning`, `byob-benchmark-design`, `pretrain-token-budget-before-scale`, `sft-small-dataset-prefer-lora`, `convert-checkpoint-safety`. | [patterns/](../../src/nemotron/steps/patterns/) |
-
 When a check fails: surface it as a `⚠` warning in the plan and propose a
 fix. When the user can't satisfy it (e.g. hardware), propose alternatives in
 descending preference: smaller model → AutoModel instead of Megatron-Bridge →
@@ -187,6 +187,7 @@ graph LR
 | Resource | Required by | Notes |
 |---|---|---|
 | <resource> | <stage> | <status / question> |
+
 ````
 
 **Step 2.5 — Present the plan and wait.** Don't proceed to Act until the
@@ -356,6 +357,17 @@ catalog-based stage."
 If the same Explorer build keeps appearing across projects, suggest the user
 run `/nemotron-add-step` to land it in the catalog.
 
+### Explicit airgap handoff
+
+Do this only when the user explicitly asks for airgap, offline/no-internet
+execution, image tarballs, or Nemotron Customizer airgap bundle work. Do not
+include it in normal local, Slurm, Lepton, Airflow, or Kubeflow planning.
+
+When triggered, stop the generic project-generation path and load
+[deploy/nemotron-customizer/airgap/SKILL.md](../../deploy/nemotron-customizer/airgap/SKILL.md).
+Use the approved catalog step IDs as airgap runner `--target <step_id>:<config>`
+values, then follow that skill's validate/build/run workflow.
+
 ### Choosing a mode
 
 | User says | Mode |
@@ -367,6 +379,7 @@ run `/nemotron-add-step` to land it in the catalog.
 | "Translate EN → \<lang\>" | Catalog ([translate/nemo_skills](../../src/nemotron/steps/translate/nemo_skills/)) |
 | "Curate web text" | Catalog ([curate/nemo_curator](../../src/nemotron/steps/curate/nemo_curator/)) |
 | "Deploy to TensorRT-LLM" | Explorer (no step yet — derive from upstream library docs and add a `convert/*` step if the path stabilizes) |
+| "Build an airgap bundle", "offline cluster", "no internet", "image tarballs for these steps" | Explicit airgap handoff |
 | "Train with X exotic backend" | Explorer or **ask** |
 | Ambiguous | **Ask** |
 
@@ -437,6 +450,8 @@ configs.
 - Tune parallelism beyond what `hardware.md` and `[[strategies]]` advise.
 - Assume GPU count, type, or interconnect.
 - Generate Slurm/Airflow/Kubeflow wrappers unless requested.
+- Route to airgap for generic deployment requests; require an explicit airgap,
+  offline, no-internet, or image-tar bundle ask.
 - Modify [src/nemotron/steps/](../../src/nemotron/steps/). To extend the catalog, route the user to `/nemotron-add-step`.
 - Restate per-step rules in this skill — link to the step's `SKILL.md` instead.
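The handoff from approved catalog step IDs to runner `--target <step_id>:<config>` values might look like the sketch below, which is not part of this commit. `runner_command` is a hypothetical helper, and repeating `--target` once per step is an assumption about the runner's CLI. The paths and the `--config`, `--target`, and `--execute` flags are taken from the airgap skill.

```python
import shlex

RUNNER = "deploy/nemotron-customizer/airgap/runner.py"
CONFIG = "deploy/nemotron-customizer/airgap/airgap.yaml"


def runner_command(targets, execute=False):
    """Build the runner argv for (step_id, config) pairs; plan-only by default."""
    argv = ["uv", "run", "python", RUNNER, "--config", CONFIG]
    for step_id, config in targets:
        # Assumption: --target may be passed once per selected step.
        argv += ["--target", f"{step_id}:{config}"]
    if execute:
        # Only add --execute when the user explicitly asked for a real build.
        argv.append("--execute")
    return argv


# Print the plan-only command for the example target named in the skill.
print(shlex.join(runner_command([("sft/megatron_bridge", "tiny")])))
```

Defaulting `execute` to `False` mirrors the skill's rule of planning first and building only on explicit request.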