235 changes: 235 additions & 0 deletions AGENTS.md
# AGENTS.md -- Nemotron Repository Agent Context

## What This Repo Does

Nemotron is NVIDIA's open-source repository for reproducible LLM training pipelines. It provides:

1. **Training recipes** for NVIDIA model families (Nano3, Super3, Embed) -- full pretrain/SFT/RL pipelines
2. **Customization recipes** for adapting models to new languages, domains, and use cases (Sovereign AI Playbook)
3. **Data preparation** infrastructure for tokenization, packing, and format conversion
4. **Evaluation** via NeMo Evaluator with benchmark suites

## Repository Layout

```
Nemotron/
  AGENTS.md                          <-- You are here
  pyproject.toml                     <-- Package config; entry point: nemotron CLI
  src/
    nemo_runspec/                    <-- Config loading, execution, PEP 723 metadata parsing
    nemotron/
      cli/
        bin/nemotron.py              <-- CLI root (Typer app)
        commands/
          nano3/                     <-- Nano3 commands: pretrain, sft, rl, eval, pipe
          super3/                    <-- Super3 commands: pretrain, sft, rl (rlhf/rlvr/swe)
          embed/                     <-- Embedding model commands: sdg, prep, finetune, eval, export, deploy
          customize/                 <-- Customization CLI: translate, data-prep, cpt, sft, sdg, rl, byob, eval, quantize
          kit/                       <-- CLI utilities (app, squash)
      kit/                           <-- Domain toolkit: Artifact types, lineage tracking, W&B, recipe loading
      data_prep/                     <-- Distributed data prep library (bin/idx, packed parquet, JSONL)
      recipes/
        nano3/                       <-- Nano3 recipe scripts + configs
          stage0_pretrain/           <-- train.py, data_prep.py, config/
          stage1_sft/
          stage2_rl/
          stage3_eval/
        super3/                      <-- Super3 recipe scripts + configs
          stage0_pretrain/
          stage1_sft/
          stage2_rl/                 <-- Sub-stages: rlvr, swe1, swe2, rlhf
          stage3_eval/
        embed/                       <-- Embedding model recipes
          stage0_sdg/ .. stage5_deploy/
        data_curation/               <-- NeMo Curator recipes (nemotron-cc)
      customization_recipes/         <-- Sovereign AI customization pipelines
        nemotron/                    <-- Nemotron model customization (7 stages: 0-6)
          SKILL.md                   <-- E2E customization pipeline skill definition
          stage0_data_prep/          <-- Data Preparation & Translation
          stage1_cpt/                <-- Continued Pretraining
          stage2_sft/                <-- Supervised Fine-Tuning + SDG
          stage3_rl/                 <-- Reinforcement Learning (DPO/GRPO)
          stage4_byob/               <-- Build Your Own Benchmark
          stage5_eval/               <-- Evaluation
          stage6_quantization/       <-- Quantization for deployment
        llama/                       <-- Llama model customization (same stage structure)
        qwen/                        <-- Qwen model customization (same stage structure)
        data_prep/                   <-- Shared data prep utilities for customization
  tests/
  docs/
  deploy/                            <-- Deployment configs (Docker, Helm)
  tools/
  usage-cookbook/
  use-case-examples/
```

## Key Infrastructure

### nemotron CLI

Entry point: `nemotron` (defined in `pyproject.toml` as `nemotron.__main__:main`).

```bash
# Pattern: nemotron <model> <stage> [options] [overrides]
nemotron nano3 pretrain -c default # Local execution
nemotron nano3 pretrain -c default --run MY-CLUSTER # Remote via nemo-run (attached)
nemotron nano3 pretrain -c default --batch MY-CLUSTER # Remote via nemo-run (detached)
nemotron nano3 pretrain -c default --dry-run # Preview compiled config
nemotron nano3 sft -c default --run MY-CLUSTER train.train_iters=5000 # Override params
nemotron nano3 pipe --run MY-CLUSTER # Compose pretrain + sft
nemotron nano3 eval --run MY-CLUSTER # Run evaluation suite

# Data prep (run directly, not via CLI)
python src/nemotron/recipes/nano3/stage0_pretrain/data_prep.py --config <yaml>
```

Global options: `-c/--config`, `-r/--run`, `-b/--batch`, `-d/--dry-run`, `--stage`, `--force-squash`.

### nemo_runspec

Module: `src/nemo_runspec/`

Parses PEP 723 `[tool.runspec]` metadata from recipe scripts. Provides:
- `nemo_runspec.parse(script_path)` -- returns `Runspec` with name, image, config_dir, resources
- `nemo_runspec.config` -- OmegaConf YAML loading, job config building, artifact URI resolution
- `nemo_runspec.execution` -- local (torchrun) and remote (Slurm/Lepton/Run:AI/Ray via nemo-run) execution
- `nemo_runspec.packaging` -- SelfContainedPackager for remote code shipping

Config resolution chain: script `[tool.runspec]` -> `config/<name>.yaml` -> `env.toml` profile -> CLI overrides.
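A recipe script declares this metadata inline at the top of the file as a PEP 723 block. A minimal sketch (the field names follow the `Runspec` attributes listed above; the concrete values and the `requires-python` line are illustrative assumptions, not taken from the repo):

```python
# /// script
# requires-python = ">=3.10"
#
# [tool.runspec]
# name = "my-stage-pretrain"                            # hypothetical recipe name
# image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
# config_dir = "config"
# ///
```

`nemo_runspec.parse(script_path)` reads this block and returns the corresponding `Runspec`.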

### nemotron.kit

Module: `src/nemotron/kit/`

Domain-specific toolkit:
- `nemotron.kit.Artifact` -- base class for typed artifacts (pydantic)
- `nemotron.kit.ModelArtifact`, `PretrainDataArtifact`, `SFTDataArtifact` -- typed artifact classes
- `nemotron.kit.init(backend="fsspec"|"wandb", root=...)` -- initialize artifact registry
- `nemotron.kit.recipe_loader` -- `import_recipe_function(target)`, `extract_recipe_config(config)`
- `nemotron.kit.train_script` -- `parse_config_and_overrides()`, `load_omegaconf_yaml()`, `apply_hydra_overrides()`
- `nemotron.kit.wandb_kit` -- W&B initialization, monkey patches, lineage tracking

### nemotron.data_prep

Module: `src/nemotron/data_prep/`

Distributed data prep built on cosmos-xenna pipelines:
- `nemotron.data_prep.api` -- `run_pretrain_pipeline()`, `run_sft_pipeline()`
- Three-phase pattern: `setup_*_run()` -> xenna pipeline stages -> `finalize_*_run()`
- Output formats: bin/idx (pretrain), packed Parquet (SFT), JSONL (RL)
- Stages: PlanStage -> DownloadStage -> terminal stage (BinIdxTokenization / PackedSftParquet / JsonlShard)
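The three-phase pattern can be sketched generically. Every name below is a stand-in for illustration only; the real entry points are `run_pretrain_pipeline()` / `run_sft_pipeline()` and the `setup_*_run()` / `finalize_*_run()` helpers named above:

```python
# Generic setup -> staged pipeline -> finalize sketch; NOT the nemotron.data_prep API.

def setup_run(shards):
    # Phase 1: plan the work (e.g. decide which shards need processing)
    return {"pending": list(shards), "results": []}

def run_stages(state, stages):
    # Phase 2: push every work item through each pipeline stage in order
    for item in state["pending"]:
        for stage in stages:
            item = stage(item)
        state["results"].append(item)
    return state

def finalize_run(state):
    # Phase 3: aggregate outputs (e.g. write a global index over all shards)
    return sorted(state["results"])

state = setup_run(["  Alpha", "BETA  "])
state = run_stages(state, [str.strip, str.lower])
print(finalize_run(state))  # ['alpha', 'beta']
```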

## Task Routing

| Task | Go to |
|------|-------|
| Train Nano3 from scratch | `src/nemotron/recipes/nano3/` |
| Train Super3 from scratch | `src/nemotron/recipes/super3/` |
| Train embedding model | `src/nemotron/recipes/embed/` |
| Curate web data (CommonCrawl) | `src/nemotron/recipes/data_curation/nemotron-cc/` |
| Translate data for customization | `src/nemotron/customization_recipes/nemotron/stage0_data_prep/SKILL.md` |
| Customize Nemotron for a language/domain | `src/nemotron/customization_recipes/nemotron/SKILL.md` |
| Customize Llama for a language/domain | `src/nemotron/customization_recipes/llama/SKILL.md` |
| Customize Qwen for a language/domain | `src/nemotron/customization_recipes/qwen/SKILL.md` |
| Prepare training data (tokenize, pack) | `src/nemotron/data_prep/` |
| Add a new CLI command | `src/nemotron/cli/commands/` + register in `cli/bin/nemotron.py` |
| Add a new recipe | Create `<stage>/train.py` with `[tool.runspec]` + `<stage>/config/default.yaml` |
| Modify execution backend | Edit `_execute_*()` in the relevant CLI command module |
| Evaluate a model | `src/nemotron/recipes/<model>/stage*_eval/` |
| Build custom benchmarks (MCQ) | `src/nemotron/customization_recipes/nemotron/stage4_byob/SKILL.md` |
| Quantize a model | `src/nemotron/customization_recipes/nemotron/stage6_quantization/SKILL.md` |
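For the "Add a new recipe" row, a new stage's `config/default.yaml` might start as small as this sketch (only `train.train_iters` appears elsewhere in this file; the other key names are assumptions, not repo conventions):

```yaml
# Illustrative starting point for <stage>/config/default.yaml.
train:
  train_iters: 1000
data:
  path: ${art:data,path}   # artifact URI, resolved at config load time
```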

## SKILL.md References

| Skill | Path |
|-------|------|
| E2E Nemotron Customization | `src/nemotron/customization_recipes/nemotron/SKILL.md` |
| Stage 0: Data Preparation & Translation | `src/nemotron/customization_recipes/nemotron/stage0_data_prep/SKILL.md` |
| Stage 1: Continued Pretraining | `src/nemotron/customization_recipes/nemotron/stage1_cpt/SKILL.md` |
| Stage 2: SFT + SDG | `src/nemotron/customization_recipes/nemotron/stage2_sft/SKILL.md` |
| Stage 3: RL (DPO/GRPO) | `src/nemotron/customization_recipes/nemotron/stage3_rl/SKILL.md` |
| Stage 4: BYOB Benchmarks | `src/nemotron/customization_recipes/nemotron/stage4_byob/SKILL.md` |
| Stage 5: Evaluation | `src/nemotron/customization_recipes/nemotron/stage5_eval/SKILL.md` |
| Stage 6: Quantization | `src/nemotron/customization_recipes/nemotron/stage6_quantization/SKILL.md` |
| Shared Data Prep | `src/nemotron/customization_recipes/data_prep/SKILL.md` |
| Llama Customization | `src/nemotron/customization_recipes/llama/SKILL.md` |
| Qwen Customization | `src/nemotron/customization_recipes/qwen/SKILL.md` |

## Execution Backends

| Backend | Flag | Infrastructure | Notes |
|---------|------|---------------|-------|
| Local | (default) | torchrun on local GPUs | For dev/debug; single-node |
| Docker | `--run <profile>` | nemo-run + DockerExecutor | Local GPU container execution |
| Slurm (attached) | `--run <profile>` | nemo-run + SlurmExecutor | Logs streamed to terminal |
| Slurm (detached) | `--batch <profile>` | nemo-run + SlurmExecutor | Submit and exit |
| Lepton (DGX Cloud) | `--run <profile>` | nemo-run + LeptonExecutor | DGX Cloud via Lepton API; requires `node_group` |
| Run:AI | `--run <profile>` | nemo-run + KubeflowExecutor | Kubernetes GPU orchestration via Run:AI; requires `cluster` + `project` |
| Ray | (auto for RL) | nemo-run + RayJob | Used by GRPO/RL stages |

Env profiles are stored in `env.toml` at repo root (not checked in). Examples:

```toml
# --- Slurm cluster ---
[MY-CLUSTER]
executor = "slurm"
host = "login.cluster.example.com"
user = "myuser"
account = "myaccount"
partition = "batch"
remote_job_dir = "/lustre/myuser/jobs"
container = "nvcr.io/nvidia/nemo:26.02.super.rc1"
gpus_per_node = 8
nodes = 2

[MY-CLUSTER.wandb]
entity = "my-team"
project = "my-project"

# --- Lepton (DGX Cloud) ---
[lepton-dgx]
executor = "lepton"
container_image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
node_group = "my-dgx-group"
resource_shape = "gpu.8xh100-80gb"
nodes = 2
gpus_per_node = 8

[[lepton-dgx.mounts]]
path = "/shared-storage/data"
mount_path = "/data"

# --- Run:AI (Kubernetes) ---
[runai-cluster]
executor = "runai"
container_image = "nvcr.io/nvidia/nemo:25.11.nemotron_3_nano"
cluster = "my-runai-cluster"
project = "my-team"
nodes = 2
gpus_per_node = 8
node_pool = "h100-pool"

[[runai-cluster.pvc_mounts]]
name = "training-data-pvc"
mount_path = "/data"
```

## Config Resolution Order

1. Recipe script `[tool.runspec]` PEP 723 metadata (name, image, config_dir, default config)
2. YAML config file from `config/` directory (selected via `-c` flag)
3. `env.toml` profile (selected via `--run`/`--batch` flag) -- merged into `run.env`
4. CLI key=value overrides (Hydra-style, e.g., `train.train_iters=5000`)
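The final override step can be illustrated with a minimal stand-in for dotted-key assignment (the real logic lives in `nemotron.kit.train_script.apply_hydra_overrides()`; this sketch only shows the `key=value` semantics):

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Set a nested dict value from a dotted path, e.g. 'train.train_iters'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate tables as needed
    node[leaf] = value

config = {"train": {"train_iters": 1000, "lr": 3e-4}}
apply_override(config, "train.train_iters", 5000)
print(config["train"]["train_iters"])  # 5000
```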

Artifact URIs (`${art:data,path}`, `${art:model,path}`) are resolved at config load time via `nemo_runspec.config.resolvers`.
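An interpolation of this shape can be sketched with a toy resolver (the registry dict and paths below are invented for illustration; actual resolution is performed by `nemo_runspec.config.resolvers`):

```python
import re

# Hypothetical artifact registry: (kind, field) -> resolved value.
REGISTRY = {
    ("data", "path"): "/lustre/datasets/pretrain",
    ("model", "path"): "/lustre/ckpts/base",
}

def resolve_art(text: str) -> str:
    """Expand ${art:<kind>,<field>} placeholders using the registry."""
    def repl(match):
        kind, field = match.group(1), match.group(2)
        return REGISTRY[(kind, field)]
    return re.sub(r"\$\{art:(\w+),(\w+)\}", repl, text)

print(resolve_art("data_dir: ${art:data,path}"))  # data_dir: /lustre/datasets/pretrain
```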

## Container Images

| Model | Stage | Image |
|-------|-------|-------|
| Nano3 | Pretrain/SFT | `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` |
| Nano3 | RL | `nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano` |
| Super3 | Pretrain/SFT | `nvcr.io/nvidia/nemo:26.02.super.rc1` |
| Customization | CPT/SFT | `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` (or model-specific) |
| Customization | SDG | Requires NeMo DataDesigner |
| Customization | Eval | NeMo Evaluator launcher pulls its own containers |
105 changes: 105 additions & 0 deletions deploy/nemotron/customization_recipes/Dockerfile
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# =============================================================================
# Nemotron Orchestrator Container (nemotron-orchestrator)
#
# Lightweight CLI + orchestration container. Routes work to the curator,
# trainer, evaluator, and NIM service containers. Does NOT include heavy
# ML frameworks (NeMo, Megatron, PyTorch) -- those live in dedicated
# service containers.
#
# This is part of the multi-container customization deployment:
# - nemotron-orchestrator (this image) — CLI, orchestration, Docker client
# - nemotron-curator — NeMo Curator, data prep, SDG, BYOB
# - nemotron-trainer — NeMo + Megatron, CPT/SFT/RL training
# - nemotron-evaluator — Model evaluation, benchmarks
# - nemotron-nim — NIM for local LLM inference
#
# Build:
# docker compose build nemotron-orchestrator
#
# Run:
# docker compose run --rm nemotron-orchestrator nemotron customize --help
# =============================================================================

FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

ARG REMOTE_USER=nemotron
ARG REMOTE_UID=1000
ARG REMOTE_GID=1000

# Create the user/group (ignore if they already exist)
RUN groupadd --gid $REMOTE_GID $REMOTE_USER -f && \
    if [ -z "$(id -u $REMOTE_UID 2>/dev/null)" ]; then \
        useradd --uid $REMOTE_UID --gid $REMOTE_GID -m $REMOTE_USER; \
    fi

# System dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        sudo \
        ca-certificates \
        curl \
        git \
        git-lfs \
        wget \
        unzip \
        python3 \
        python3-pip \
        python3-dev \
    && update-ca-certificates \
    && ln -sf /usr/bin/python3 /usr/bin/python \
    && rm -rf /var/lib/apt/lists/*

# Add user to sudoers
RUN REAL_USER=$(id -u -n ${REMOTE_UID} 2>/dev/null || echo $REMOTE_USER) && \
    echo "$REAL_USER ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/$REAL_USER && \
    chmod 0440 /etc/sudoers.d/$REAL_USER

# Install Docker CLI (for orchestrating other containers)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg && \
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
        > /etc/apt/sources.list.d/docker.list && \
    apt-get update && \
    apt-get install -y --no-install-recommends docker-ce-cli docker-compose-plugin && \
    rm -rf /var/lib/apt/lists/*

# Install NGC CLI (required for data-designer persona downloads and model access)
RUN cd /tmp && \
    wget -q -O ngccli_linux.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.4/files/ngccli_linux.zip && \
    unzip -q ngccli_linux.zip && \
    mv ngc-cli /opt/ngc-cli && \
    rm ngccli_linux.zip
ENV PATH="/opt/ngc-cli:${PATH}"

# Copy the Nemotron repo into the container
COPY --chown=$REMOTE_UID:$REMOTE_GID . /workspace/nemotron

WORKDIR /workspace/nemotron

# Install Nemotron CLI (lightweight — no heavy ML deps)
# The [customize] extras pull in orchestration + config deps only;
# heavy training/inference deps are in the trainer/curator containers.
RUN pip install --no-cache-dir -e ".[customize]"

# Mark this container as the orchestrator so the dispatcher knows to route
# commands to sibling containers via docker exec instead of running locally.
ENV NEMOTRON_ORCHESTRATOR=1
ENV NEMOTRON_CONTAINER=orchestrator

# Switch to the user
USER $REMOTE_UID

CMD ["tail", "-f", "/dev/null"]