Airgap SKILL addition
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
rapaul-nv committed May 11, 2026
commit 6332e3b3eb5156e6dbece685c019786cf4633dfc
115 changes: 115 additions & 0 deletions deploy/nemotron-customizer/airgap/SKILL.md
@@ -0,0 +1,115 @@
---
name: nemotron-customizer-airgap
description: Prepare, validate, build, and use Nemotron Customizer airgap image bundles for offline clusters. Use when planning airgapped deployments, editing deploy/nemotron-customizer/airgap/airgap.yaml, selecting workflow targets, grouping step execution images, baking repo overlays or wheel additions, resuming airgap runner builds, or submitting `nemotron steps run` jobs inside an airgapped environment.
---

# Nemotron Customizer Airgap

Use this skill to help an agent produce a connected-machine airgap bundle and
then submit Nemotron Customizer steps from the airgapped side. Keep it grounded
in the checked-in runner and manifests; do not invent a parallel packaging flow.

## Read First

- `deploy/nemotron-customizer/airgap/README.md` for the operator flow.
- `deploy/nemotron-customizer/airgap/airgap.yaml` for the current image map.
- `deploy/nemotron-customizer/airgap/runner.py` when changing behavior.
- `tests/deploy/test_airgap_runner.py` before editing runner logic.
- `deploy/nemotron-customizer/airgap/configs/` for runtime overlay configs.

For selected steps, inspect the catalog through the CLI:

```bash
uv run nemotron steps show <step_id> --json
```

## Workflow

1. Establish the side of the workflow:
- Connected machine: validate, build, save image tarballs.
- Airgapped side: load images, set env profiles, run selected steps.

2. Gather the minimum inputs:
- Target steps and config names, for example `sft/megatron_bridge:tiny`.
- Target architecture or Docker platform, for example `linux/amd64`.
- Available base images and whether the connected machine can pull them.
- Airgapped env profile name, mounts, model/data/checkpoint locations.
- Whether destructive or expensive actions such as `--execute`, Docker build,
Docker volume cleanup, or state-file removal are explicitly allowed.

3. Plan with the runner first:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
--config deploy/nemotron-customizer/airgap/airgap.yaml
```

Use `--target <step_id>:<config>` for one-off selections without editing YAML.
The runner expands dependencies from `dependencies`, validates selected step
files/configs, groups execution images, and prints selected execution images.

4. Edit `airgap.yaml` only where the runner expects configuration:
- `workflow.stages` or CLI `--target` for selected customer steps.
- `dependencies` for explicit upstream Nemotron Customizer step outputs.
- `step_execution_images` for step-to-image mapping.
- `execution_images` for base image, tag, tar, platform, and import probes.
- `launcher_image` for the launcher container.

5. Execute only when the user asks for a real build:

```bash
uv run python deploy/nemotron-customizer/airgap/runner.py \
--config deploy/nemotron-customizer/airgap/airgap.yaml \
--execute
```

If a build fails midway, keep `airgap-build-state.yaml` and rerun the same
command. Remove or move that state only when intentionally changing the plan.

6. On the airgapped side, use images from `out/airgap-manifest.yaml` under
`step_execution_images`. Submit with the plural CLI:

```bash
uv run nemotron steps run <step_id> \
-c <config-or-airgap-overlay> \
-b <airgap-profile> \
run.env.container_image=<image-from-manifest>
```

For `sft/megatron_bridge`, prefer the airgap overlay configs under
`deploy/nemotron-customizer/airgap/configs/`; they clear runtime git auto-mounts
because the runner bakes those repos into the execution image.
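
The "load images" part of the airgapped side could be sketched as below. This is a minimal sketch under assumptions, not the checked-in flow: the function name `load_airgap_images`, the `DOCKER` override variable, and the `.tar` extension are hypothetical, and the directory holding the transferred tarballs is whatever the operator chose. Only `docker load -i` is standard Docker CLI.

```shell
# Hypothetical helper: load every image tarball in a directory into Docker.
# Set DOCKER=echo for a dry run that prints the commands instead of running them.
load_airgap_images() {
  dir="$1"
  found=0
  for tarball in "$dir"/*.tar; do
    # When no tarball matches, the glob stays unexpanded; skip it.
    [ -e "$tarball" ] || continue
    found=1
    "${DOCKER:-docker}" load -i "$tarball" || return 1
  done
  if [ "$found" -eq 0 ]; then
    echo "no image tarballs found in $dir" >&2
    return 1
  fi
}
```

Calling `DOCKER=echo load_airgap_images <bundle-dir>` first shows which tarballs would be loaded; the image names to pass as `run.env.container_image` still come from `out/airgap-manifest.yaml`.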

## Guardrails

- Keep models, datasets, checkpoints, secrets, and customer files out of images.
Put them on persistent storage and reference them through config overrides and
`run.env.mounts`.
- Treat `${auto_mount:git+...}` as a connected-machine build input. The runner
bakes pinned repo overlays into execution images so airgapped jobs do not clone
from GitHub.
- Do not add missing packages blindly. Let `discover-execution-deps` and
import probes determine small additions; keep heavyweight framework deps in
the base image choice.
- Preserve offline defaults unless the user has an internal mirror:
`HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`, `HF_DATASETS_OFFLINE=1`,
and `WANDB_MODE=offline`.
- Use `nemotron steps ...`; do not reintroduce `nemotron step ...`.
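
The offline defaults named in the guardrails above can be written as plain shell exports. This is a sketch that assumes the variables are set in the shell that launches the job; where they actually belong (env profile, config, or image) is not specified here and may differ in a real deployment.

```shell
# Offline defaults from the guardrails; values are exactly those listed there.
export HF_HUB_OFFLINE=1          # huggingface_hub: no network calls to the Hub
export TRANSFORMERS_OFFLINE=1    # transformers: use only locally cached files
export HF_DATASETS_OFFLINE=1     # datasets: use only local or cached data
export WANDB_MODE=offline        # wandb: log runs locally, do not sync
```

Leave these unset only when an internal mirror is available, as the guardrail states.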

## Validation

After edits to runner logic, YAML structure, or airgap docs, run:

```bash
uv run pytest tests/deploy/test_airgap_runner.py -q
```

For CLI-facing examples, also smoke-test the command shape:

```bash
uv run nemotron steps --help
uv run nemotron steps show prep/sft_packing --json
```

Do not run Docker build/save stages during validation unless the user explicitly
asked for a real connected-machine bundle build.
17 changes: 16 additions & 1 deletion skills/nemotron-customize/SKILL.md
@@ -40,6 +40,7 @@ Concise. Technical. No fluff.
| Cross-step constraint (tokenizer lock, eval bookends, ...) | `src/nemotron/steps/patterns/<id>.md` |
| Artifact compatibility / `is_a` / `convert_to` | [src/nemotron/steps/types.toml](../../src/nemotron/steps/types.toml) |
| GPU memory / parallelism heuristics | [src/nemotron/steps/hardware.md](../../src/nemotron/steps/hardware.md) |
| Explicit airgap/offline bundle request only | [deploy/nemotron-customizer/airgap/SKILL.md](../../deploy/nemotron-customizer/airgap/SKILL.md) |
| Library API extracts for code generation | [context/index.toml](context/index.toml) → `context/<pack>.txt` |
| Project scaffold rules (CLI, pyproject, README, deploy) | [act/PROJECT.md](act/PROJECT.md) |
| Per-stage code rules (R1–R5, dry-run, W&B) | [act/STAGE.md](act/STAGE.md) |
@@ -144,7 +145,6 @@ Goal: produce a markdown plan the user reviews before any code is written.
| 6 | RL warm-starts from SFT; rewards validated before scale. | [patterns/rl-validate-rewards-before-scale.md](../../src/nemotron/steps/patterns/rl-validate-rewards-before-scale.md) |
| 7 | GPU count ≥ chosen model's `min_gpus` (from `[[models]]` block in each `step.toml`). | step.toml + [hardware.md](../../src/nemotron/steps/hardware.md) |
| 8 | Sovereign / customization patterns checked: `cpt-data-blend-scoping`, `sft-data-blending`, `multilingual-tokenizer-check`, `data-quality-before-quantity`, `sdg-pipeline-versioning`, `byob-benchmark-design`, `pretrain-token-budget-before-scale`, `sft-small-dataset-prefer-lora`, `convert-checkpoint-safety`. | [patterns/](../../src/nemotron/steps/patterns/) |

When a check fails: surface it as a `⚠` warning in the plan and propose a
fix. When the user can't satisfy it (e.g. hardware), propose alternatives in
descending preference: smaller model → AutoModel instead of Megatron-Bridge →
@@ -187,6 +187,7 @@ graph LR
| Resource | Required by | Notes |
|---|---|---|
| <resource> | <stage> | <status / question> |

````

**Step 2.5 — Present the plan and wait.** Don't proceed to Act until the
@@ -356,6 +357,17 @@ catalog-based stage."
If the same Explorer build keeps appearing across projects, suggest the user
run `/nemotron-add-step` to land it in the catalog.

### Explicit airgap handoff

Do this only when the user explicitly asks for airgap, offline/no-internet
execution, image tarballs, or Nemotron Customizer airgap bundle work. Do not
include it in normal local, Slurm, Lepton, Airflow, or Kubeflow planning.

When triggered, stop the generic project-generation path and load
[deploy/nemotron-customizer/airgap/SKILL.md](../../deploy/nemotron-customizer/airgap/SKILL.md).
Use the approved catalog step IDs as airgap runner `--target <step_id>:<config>`
values, then follow that skill's validate/build/run workflow.
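
Turning one approved step ID into a plan-only runner invocation might look like the sketch below. The helper name `airgap_plan_cmd` is hypothetical, and it only prints the command. The runner path, `--config`, and `--target <step_id>:<config>` come from the airgap skill, and `sft/megatron_bridge:tiny` is that skill's own example target.

```shell
# Hypothetical helper: print (do not run) the plan-only runner command
# for a single <step_id>:<config> target.
airgap_plan_cmd() {
  target="$1"
  case "$target" in
    ?*:?*) ;;  # needs a non-empty step ID and config name around the colon
    *) echo "expected <step_id>:<config>, got: $target" >&2; return 1 ;;
  esac
  printf 'uv run python %s --config %s --target %s\n' \
    deploy/nemotron-customizer/airgap/runner.py \
    deploy/nemotron-customizer/airgap/airgap.yaml \
    "$target"
}
```

For example, `airgap_plan_cmd sft/megatron_bridge:tiny` prints the command to review. Printing instead of executing keeps `--execute` and any Docker build behind an explicit user request.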

### Choosing a mode

| User says | Mode |
@@ -367,6 +379,7 @@ run `/nemotron-add-step` to land it in the catalog.
| "Translate EN → \<lang\>" | Catalog ([translate/nemo_skills](../../src/nemotron/steps/translate/nemo_skills/)) |
| "Curate web text" | Catalog ([curate/nemo_curator](../../src/nemotron/steps/curate/nemo_curator/)) |
| "Deploy to TensorRT-LLM" | Explorer (no step yet — derive from upstream library docs and add a `convert/*` step if the path stabilizes) |
| "Build an airgap bundle", "offline cluster", "no internet", "image tarballs for these steps" | Explicit airgap handoff |
| "Train with X exotic backend" | Explorer or **ask** |
| Ambiguous | **Ask** |

@@ -437,6 +450,8 @@ configs.
- Tune parallelism beyond what `hardware.md` and `[[strategies]]` advise.
- Assume GPU count, type, or interconnect.
- Generate Slurm/Airflow/Kubeflow wrappers unless requested.
- Route to airgap for generic deployment requests; require an explicit airgap,
offline, no-internet, or image-tar bundle ask.
- Modify [src/nemotron/steps/](../../src/nemotron/steps/). To extend the catalog, route the user to `/nemotron-add-step`.
- Restate per-step rules in this skill — link to the step's `SKILL.md` instead.
