21 changes: 19 additions & 2 deletions docs/contributor/validator.md
@@ -136,6 +136,7 @@ The validator engine mounts snapshot and recipe data as ConfigMaps:
| `AICR_SNAPSHOT_PATH` | Override snapshot mount path |
| `AICR_RECIPE_PATH` | Override recipe mount path |
| `AICR_VALIDATOR_IMAGE_REGISTRY` | Override image registry prefix (set by user) |
| `AICR_VALIDATOR_IMAGE_TAG` | Override the resolved image tag (e.g. `latest`). Bypasses the default `:v<version>` / `:sha-<commit>` resolution for feature-branch dev builds whose commit has no published image. |
| `AICR_NODE_SELECTOR` | User-provided node selector override for inner workloads (comma-separated `key=value` pairs). Set by the `--node-selector` CLI flag. Use `ctx.NodeSelector` to access the parsed value. |
| `AICR_TOLERATIONS` | User-provided toleration override for inner workloads (comma-separated `key=value:effect` entries). Set by the `--toleration` CLI flag. Use `ctx.Tolerations` to access the parsed value. |

@@ -227,8 +228,24 @@ Each entry in `recipes/validators/catalog.yaml`:
**Image tag resolution** (applied by `catalog.Load`):

1. `:latest` tags are replaced with the CLI version (e.g., `:v0.9.5`) for release builds
-2. Explicit version tags (e.g., `:v1.2.3`) are never modified
-3. `AICR_VALIDATOR_IMAGE_REGISTRY` overrides the registry prefix
2. On non-release dev builds with a valid commit, `:latest` becomes `:sha-<commit>` (matches the tags `on-push.yaml` pushes for merges to `main`)
3. Explicit version tags (e.g., `:v1.2.3`) are not modified by steps 1-2
4. `AICR_VALIDATOR_IMAGE_TAG` overrides the resolved tag on every validator image, including explicit catalog tags. Use this when running `aicr validate` from a feature-branch dev build whose commit has not been merged to `main` (no `:sha-<commit>` image has been published). Typical value: `latest`. Example: `AICR_VALIDATOR_IMAGE_TAG=latest aicr validate --phase performance ...`
5. `AICR_VALIDATOR_IMAGE_REGISTRY` overrides the registry prefix

**Digest-pinned references** (`name@sha256:…`) are not rewritten by step 4. A tag override is meaningless against a content-addressable pin, and naive rewriting would corrupt the digest. Step 5's registry override still applies — only the registry prefix changes, the digest is preserved verbatim.
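
The tag override in step 4 and the digest exception described above can be sketched in a few lines of Go. This is an illustrative, standalone copy of the rule, not the package's code: `overrideTag` is a hypothetical name that mirrors the `replaceTag` helper in this change, and the image names are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// overrideTag is a hypothetical stand-in for the replaceTag helper.
// It forces the tag of image to newTag unless the ref is digest-pinned.
func overrideTag(image, newTag string) string {
	// Digest-pinned refs (name@sha256:...) are left untouched.
	if strings.Contains(image, "@") {
		return image
	}
	// The tag separator is the last ':' that appears after the last '/',
	// so a registry port such as localhost:5001 is not mistaken for a tag.
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		// No tag present: append one.
		return image + ":" + newTag
	}
	return image[:colon] + ":" + newTag
}

func main() {
	// Illustrative image references only.
	images := []string{
		"ghcr.io/example/validator:v1.2.3",
		"localhost:5001/example/validator",
		"ghcr.io/example/validator@sha256:deadbeef",
	}
	for _, img := range images {
		fmt.Printf("%s -> %s\n", img, overrideTag(img, "latest"))
	}
}
```

Under these assumptions, an explicit version tag is replaced, an untagged ref behind a registry port gains a tag, and the digest-pinned ref comes back unchanged.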

**Env-var forwarding to the validator pod:** `AICR_CLI_VERSION`, `AICR_CLI_COMMIT`, `AICR_VALIDATOR_IMAGE_REGISTRY`, and `AICR_VALIDATOR_IMAGE_TAG` are forwarded from the CLI invocation into the validator container so that validators resolving inner workload images at runtime (e.g. `inference-perf`'s AIPerf benchmark Job) apply the same semantics as `catalog.Load`. If you set `AICR_VALIDATOR_IMAGE_TAG=latest` on the CLI, the override reaches both the outer validator Job and the inner benchmark Job — they always travel together.
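
A minimal sketch of the forwarding idea follows. It assumes all four variables are read from the CLI process environment and that unset or empty values are skipped; the real code may instead populate `AICR_CLI_VERSION` and `AICR_CLI_COMMIT` from build metadata. `envVar`, `forwardedNames`, and `forwardedEnv` are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// envVar is a hypothetical stand-in for a container env entry.
type envVar struct {
	Name  string
	Value string
}

// forwardedNames lists the variables the documentation says are forwarded.
var forwardedNames = []string{
	"AICR_CLI_VERSION",
	"AICR_CLI_COMMIT",
	"AICR_VALIDATOR_IMAGE_REGISTRY",
	"AICR_VALIDATOR_IMAGE_TAG",
}

// forwardedEnv collects the forwarded variables that are set and non-empty.
// The lookup function is injected so the helper can be exercised without
// touching the real process environment.
func forwardedEnv(lookup func(string) (string, bool)) []envVar {
	var out []envVar
	for _, name := range forwardedNames {
		if v, ok := lookup(name); ok && v != "" {
			out = append(out, envVar{Name: name, Value: v})
		}
	}
	return out
}

func main() {
	// Print what would be copied into the validator container.
	for _, e := range forwardedEnv(os.LookupEnv) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Keeping the list in one place is one way to make sure the outer validator Job and any inner workload see the same override values.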

**Pull-policy behavior when the override is set:** both the outer validator Job and every inner workload Job it dispatches route through the shared `catalog.ImagePullPolicy(image)` helper (`pkg/validator/catalog/catalog.go`). The rule, in precedence order, is:

1. **Side-loaded refs** (`ko.local/*`, `kind.local/*`) → `Never` (no registry to pull from).
2. **Digest-pinned refs** (`name@sha256:…`) → `IfNotPresent`. Cryptographic immutability means a cached copy is always correct; forcing `Always` here would make kubelet re-contact the registry every run, which breaks disconnected / air-gapped clusters even though the image itself was never overridden.
3. **`AICR_VALIDATOR_IMAGE_TAG` is set** → `Always`. Override values are typically mutable (`latest`, `edge`, `main`, or any tag `on-push.yaml` recreates on every merge), so `IfNotPresent` would let a node's previously cached image win over the tag's current target.
4. **`:latest` suffix** → `Always`. Mutable tag by convention.
5. **Otherwise** → `IfNotPresent`. Versioned tag assumed immutable enough that caching is a win.

Callers in this repo: the outer validator Job's `Deployer.imagePullPolicy()` (`pkg/validator/job/deployer.go`) and the inner AIPerf benchmark pod spec in `buildAIPerfJob` (`validators/performance/inference_perf_constraint.go`). They both delegate to the same helper so their policy can't drift. When adding a new inner workload Job in `validators/<phase>/*`, set `ImagePullPolicy: catalog.ImagePullPolicy(<resolved image>)` on the container to keep the invariant.
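
The precedence rule above can be illustrated with a dependency-free sketch. It deliberately avoids the Kubernetes types: `pullPolicy` and `pullPolicyFor` are hypothetical stand-ins for `corev1.PullPolicy` and `catalog.ImagePullPolicy`, and the override state is passed in as a parameter rather than read from the environment inside the function, which is an assumption made here for testability.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pullPolicy is a hypothetical stand-in for corev1.PullPolicy.
type pullPolicy string

const (
	pullAlways       pullPolicy = "Always"
	pullNever        pullPolicy = "Never"
	pullIfNotPresent pullPolicy = "IfNotPresent"
)

// pullPolicyFor applies the documented precedence; the first match wins.
func pullPolicyFor(image string, tagOverrideSet bool) pullPolicy {
	switch {
	case strings.HasPrefix(image, "ko.local/"), strings.HasPrefix(image, "kind.local/"):
		return pullNever // side-loaded: no registry to pull from
	case strings.Contains(image, "@"):
		return pullIfNotPresent // digest pin: a cached copy is always correct
	case tagOverrideSet:
		return pullAlways // override tags are typically mutable
	case strings.HasSuffix(image, ":latest"):
		return pullAlways // mutable by convention
	default:
		return pullIfNotPresent // versioned tag, caching is fine
	}
}

func main() {
	override := os.Getenv("AICR_VALIDATOR_IMAGE_TAG") != ""
	// Illustrative image references only.
	for _, img := range []string{
		"ko.local/example/validator:latest",
		"ghcr.io/example/validator@sha256:deadbeef",
		"ghcr.io/example/validator:v1.2.3",
	} {
		fmt.Printf("%s -> %s\n", img, pullPolicyFor(img, override))
	}
}
```

Note that the digest check sits above the override check, so setting `AICR_VALIDATOR_IMAGE_TAG` does not force re-pulls of digest-pinned images.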

**Performance phase example — inference perf:**

91 changes: 88 additions & 3 deletions pkg/validator/catalog/catalog.go
@@ -27,6 +27,7 @@ import (
"github.com/NVIDIA/aicr/pkg/errors"
"github.com/NVIDIA/aicr/pkg/recipe"
"gopkg.in/yaml.v3"
corev1 "k8s.io/api/core/v1"
)

const (
@@ -103,9 +104,14 @@ type EnvVar struct {
// the tag is replaced with the CLI version for reproducibility.
// 2. If version is a non-release dev build and commit is a valid short SHA,
// the tag is replaced with :sha-<commit> to match on-push.yaml image tags.
-// 3. If AICR_VALIDATOR_IMAGE_REGISTRY is set, the registry prefix is replaced.
// 3. If AICR_VALIDATOR_IMAGE_TAG is set, the resolved tag is overridden.
// Useful for feature-branch dev builds whose commit SHA has no published
// image (on-push.yaml only pushes SHA tags for commits merged to main).
// Common value: `latest`.
// 4. If AICR_VALIDATOR_IMAGE_REGISTRY is set, the registry prefix is replaced.
//
-// Entries with explicit version tags (e.g., :v1.2.3) are never modified.
// Entries with explicit version tags (e.g., :v1.2.3) are never modified by
// steps 1-2 but are replaced by step 3 if that env var is set.
func Load(version, commit string) (*ValidatorCatalog, error) {
data, err := recipe.GetDataProvider().ReadFile("validators/catalog.yaml")
if err != nil {
@@ -131,7 +137,11 @@ func Load(version, commit string) (*ValidatorCatalog, error) {
//
// 1. :latest tag replacement with version if version is a release (vX.Y.Z).
// 2. If non-release and commit is a valid SHA, :latest → :sha-<commit>.
-// 3. Registry prefix override if AICR_VALIDATOR_IMAGE_REGISTRY is set.
// 3. Tag override if AICR_VALIDATOR_IMAGE_TAG is set (overrides steps 1-2
// AND explicit catalog tags). Intended for feature-branch dev builds
// where no :sha-<commit> image has been published; typical value:
// `latest`.
// 4. Registry prefix override if AICR_VALIDATOR_IMAGE_REGISTRY is set.
//
// Images with explicit version tags are not modified by steps 1-2.
func ResolveImage(image, version, commit string) string {
@@ -141,12 +151,58 @@ func ResolveImage(image, version, commit string) string {
} else if isValidCommit(commit) {
image = replaceLatestWithSHA(image, commit)
}
if tag := os.Getenv("AICR_VALIDATOR_IMAGE_TAG"); tag != "" {
image = replaceTag(image, tag)
}
if override := os.Getenv("AICR_VALIDATOR_IMAGE_REGISTRY"); override != "" {
image = replaceRegistry(image, override)
}
return image
}

// ImagePullPolicy returns the appropriate Kubernetes pull policy for a
// resolved validator image. The caller should pass the image that
// ResolveImage would return (i.e. after any env-var rewriting); callers that
// just installed an image from the catalog can reuse this helper so the
// outer validator Job and any inner workload Jobs (e.g. inference-perf's
// aiperf-bench Job) stay in lockstep.
//
// Precedence (first match wins):
//
// 1. Side-loaded refs (ko.local/*, kind.local/*) → Never. No registry to
// pull from — the image is preloaded via `kind load docker-image`.
// 2. Digest-pinned refs (name@sha256:...) → IfNotPresent. The digest is
// cryptographically immutable, so a cached copy is always correct;
// forcing Always here would break disconnected/private clusters that
// preload images and make kubelet re-contact the registry every run.
// 3. AICR_VALIDATOR_IMAGE_TAG is set → Always. The override is intended
// for mutable published tags (e.g. `latest`, `edge`, `main` — tags
// on-push.yaml recreates on every merge); re-pulling prevents
// node-local caches from serving stale images.
// 4. `:latest` suffix → Always. Mutable tag by convention.
// 5. Otherwise → IfNotPresent. Versioned tag assumed immutable enough
// that caching is a win.
func ImagePullPolicy(image string) corev1.PullPolicy {
// Trailing slash anchors the match to the full registry segment so a
// real registry like `ko.localhost:5000/...` is not mistaken for a
// side-loaded `ko.local/...` ref and wrongly forced to PullNever.
if strings.HasPrefix(image, "ko.local/") || strings.HasPrefix(image, "kind.local/") {
return corev1.PullNever
}
if strings.Contains(image, "@") {
// Digest pin — immutable by construction. Caching is safe and
// also required for disconnected/air-gapped deployments.
return corev1.PullIfNotPresent
}
if os.Getenv("AICR_VALIDATOR_IMAGE_TAG") != "" {
return corev1.PullAlways
}
if strings.HasSuffix(image, ":latest") {
return corev1.PullAlways
}
return corev1.PullIfNotPresent
}

// releaseVersionPattern matches strict semantic versions: vX.Y.Z or X.Y.Z
// with no pre-release suffix. This ensures snapshot strings like
// v0.0.0-12-gabc1234 or pre-release tags like v1.0.0-rc1 are not treated
@@ -192,6 +248,35 @@ func isValidCommit(commit string) bool {
return true
}

// replaceTag forces the image's tag to newTag, regardless of what tag (if
// any) the image currently carries. Unlike replaceLatestTag / replaceLatestWithSHA,
// which only rewrite :latest, this helper supports the AICR_VALIDATOR_IMAGE_TAG
// env-var escape hatch: a user running a feature-branch dev build (where no
// :sha-<commit> image was published by on-push.yaml) can set the env var
// to `latest` and force every validator image to a published tag.
//
// Digest-pinned references (`name@sha256:…`) are cryptographic pins and are
// intentionally left untouched — a tag override is meaningless against a
// content-addressable ref, and naively rewriting would corrupt the digest.
// For non-digest refs, the tag separator is found as the last ':' that sits
// after the last '/' to avoid colliding with the registry port (`:5001` in
// `localhost:5001/...`).
func replaceTag(image, newTag string) string {
if strings.Contains(image, "@") {
// Digest-pinned ref (e.g. ghcr.io/foo/bar@sha256:deadbeef, or the
// mixed form name:tag@sha256:…). The digest is the authoritative
// pin; preserve it verbatim.
return image
}
slash := strings.LastIndex(image, "/")
colon := strings.LastIndex(image, ":")
if colon <= slash {
// No tag on the image (just an image reference) — append one.
return image + ":" + newTag
}
return image[:colon] + ":" + newTag
}

// replaceLatestWithSHA replaces :latest with :sha-<commit> to match the
// image tags pushed by the on-push CI workflow.
// Images with explicit version tags are not modified.