Skip to content

chore(recipes): bump 6 components to upstream latest (phase 1)#715

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:chore/recipes-bump-minor
Apr 30, 2026
Merged

chore(recipes): bump 6 components to upstream latest (phase 1)#715
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:chore/recipes-bump-minor

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented Apr 29, 2026

Summary

Bump 6 Helm components in recipes/registry.yaml and overlay/mixin pins to upstream latest. No values schema changes. Pure recipe-pin diff (7 files; example recipes refreshed; orphan-manifest reference in aks-training cleaned up).

Motivation / Context

Phase 1 of the version refresh campaign tracked in #698. Low-risk minor/patch bumps batched together; major bumps (network-operator, gpu-operator) follow in subsequent phases.

Component Current New
aws-ebs-csi-driver 2.55.0 2.59.0
cert-manager v1.17.2 v1.20.2
kube-prometheus-stack 82.8.0 84.4.0
kueue 0.17.0 0.17.1
nodewright-operator (skyhook) v0.14.0 v0.15.1
nvsentinel v1.1.0 v1.3.0

Excluded from this PR (deferred to follow-ups):

  • kgateway / kgateway-crds (v2.0.0 → v2.2.3) — v2.2.3 silently drops the inferenceExtension.enabled value. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so the silent feature regression would break inference bundles. Migration requires a values + RBAC rework.
  • aws-efa (v0.5.3 → v0.5.26) — 23 minor bumps; requires both image.tag cleanup and a securityContext cleanup (chart now defaults to privileged: true, conflicting with our hardened allowPrivilegeEscalation: false). Deferred for proper EKS / security review.
  • kai-scheduler (v0.13.0 → v0.14.1) — repo transferred from NVIDIA/ to kai-scheduler/ org; chart publishing moved to ghcr.io/kai-scheduler/kai-scheduler. Migration + bump belongs in a small standalone follow-up.
  • kubeflow-trainer (2.1.0 → 2.2.0) — coupled to a Go change in validators/performance/trainer_lifecycle.go. Belong together in a follow-up to keep this PR pure config / no Go changes.

Example recipes: examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml refreshed to match the bumped registry defaults (cert-manager, nodewright-operator, kube-prometheus-stack, nvsentinel). Matches the convention from prior bump PRs (#283, #336, #450).

Orphan-manifest fix in aks-training.yaml (supersedes #716): also removes a stale manifestFiles: reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since #415. The workaround source file was deleted in #309. aicr bundle examples/recipes/aks-training.yaml was failing on main; this fix restores it.

Fixes: aks-training example bundling failure on main
Related: #698, supersedes #716

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

Testing

make tidy    # no-op (clean)
make lint    # 0 issues
make test    # all packages pass
go test -count=1 ./pkg/recipe/...    # ok
helm pull oci://...                  # verified all 6 bumped charts pullable
aicr bundle -r examples/recipes/aks-training.yaml -o /tmp/bundle  # succeeds

End-to-end on real EKS H100 clusters (post-rebase):

  • Deployed and tested inference performance on aicr2 (eks/h100/inference/ubuntu/dynamo): bundle 20/20 components healthy, aicr validate --phase performance passed — throughput 38,771 tok/s, TTFT p99 133 ms.
  • Deployed and tested pytorch-mnist on aicr1 (eks/h100/training/ubuntu/kubeflow): bundle 16/16 components healthy, demo TrainJob from demos/cuj1-eks.md ran end-to-end on 1× H100 (accuracy 0.74). The pytorch job requires the fix in fix(recipes): drop hook-succeeded from torch-distributed runtime #719 (pre-existing main bug, unrelated to this PR — surfaced when exercising the kubeflow path).

Risk Assessment

  • Low — All bumps are minor/patch within the same major; no values schema changes; isolated to recipe pins.

Rollout notes: No migration steps. Bundles regenerated from this branch will pull the new chart versions; existing installations are unaffected until re-bundled.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — version bumps only)
  • I updated docs if user-facing behavior changed (N/A — pins only)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 added enhancement New feature or request area/recipes labels Apr 29, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request updates Helm chart versions across multiple YAML recipe files and adjusts Go module dependencies. The changes include version bumps for 15 Helm components across four recipe configuration files (platform-inference, base overlay, EKS overlay, and registry), and two modifications to the Go module: promoting golang.org/x/term v0.42.0 from an indirect to a direct dependency, and updating github.com/go-openapi/strfmt from v0.26.1 to v0.26.2.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title "chore(recipes): bump 6 components to upstream latest (phase 1)" is partially related to the changeset—it refers to component bumping, which is the main type of change, but understates the scope (the PR actually bumps multiple components and includes vendor/go.mod fixes beyond the 6 mentioned).
Description check ✅ Passed The PR description comprehensively documents version bumps across multiple Helm components, test results, and risk assessment aligned with the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch 3 times, most recently from 1969a93 to a27aca5 Compare April 29, 2026 21:42
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@go.mod`:
- Line 20: You modified go.mod to pin golang.org/x/term v0.42.0 but did not
include the resulting module metadata and vendored files; run the repository's
make tidy (which runs go mod tidy and go mod vendor) to regenerate go.sum and
vendor/, verify the changes, and commit go.mod, go.sum, and the vendor/
directory together so module metadata and vendored dependencies remain
reproducible.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8e229029-dfe9-494c-ab4e-800520ab21ef

📥 Commits

Reviewing files that changed from the base of the PR and between 1969a93 and a27aca5.

📒 Files selected for processing (6)
  • go.mod
  • recipes/components/aws-efa/values.yaml
  • recipes/mixins/platform-inference.yaml
  • recipes/overlays/base.yaml
  • recipes/overlays/eks.yaml
  • recipes/registry.yaml

Comment thread go.mod
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from a27aca5 to c6e2d8f Compare April 29, 2026 22:09
@github-actions github-actions Bot added size/L and removed size/S labels Apr 29, 2026
@yuanchen8911 yuanchen8911 changed the title chore(recipes): bump 10 components to upstream latest (phase 1) chore(recipes): bump 9 components to upstream latest (phase 1) Apr 29, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/overlays/base.yaml`:
- Line 64: Replace the unavailable version string "v0.15.1" for the
nodewright-operator Helm chart with a published release (e.g., "v0.14.0");
locate the YAML key "version: v0.15.1" in the overlay that references the
nodewright-operator / skyhook-operator chart and update that value to an
available tag such as v0.14.0, then save and re-run your Helm/CI checks to
confirm the chart resolves.

In `@recipes/registry.yaml`:
- Line 146: The kube-prometheus-stack entry uses a non-existent version
(defaultVersion: 84.4.0) and the nvsentinel chart points to the wrong repo;
update the kube-prometheus-stack defaultVersion to 84.3.0 (or the intended
available release) and change the nvsentinel defaultRepository to the correct
OCI registry URL (oci://ghcr.io/nvidia/nvsentinel) in recipes/registry.yaml by
editing the kube-prometheus-stack defaultVersion key and the nvsentinel
defaultRepository key accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 95d71aa7-3a52-4d6b-a8df-1383329f6b3b

📥 Commits

Reviewing files that changed from the base of the PR and between a27aca5 and c6e2d8f.

⛔ Files ignored due to path filters (6)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/go-openapi/strfmt/.gitignore is excluded by !vendor/**
  • vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.md is excluded by !vendor/**
  • vendor/github.com/go-openapi/strfmt/README.md is excluded by !vendor/**
  • vendor/github.com/go-openapi/strfmt/duration.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (5)
  • go.mod
  • recipes/mixins/platform-inference.yaml
  • recipes/overlays/base.yaml
  • recipes/overlays/eks.yaml
  • recipes/registry.yaml

Comment thread recipes/overlays/base.yaml
Comment thread recipes/registry.yaml
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from c6e2d8f to fda302d Compare April 29, 2026 22:18
@github-actions github-actions Bot added size/S and removed size/L labels Apr 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from fda302d to 5336725 Compare April 29, 2026 22:24
@github-actions github-actions Bot added size/M and removed size/S labels Apr 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from 5336725 to 68896a7 Compare April 29, 2026 22:34
@github-actions github-actions Bot added size/S and removed size/M labels Apr 29, 2026
@yuanchen8911 yuanchen8911 changed the title chore(recipes): bump 9 components to upstream latest (phase 1) chore(recipes): bump 7 components to upstream latest (phase 1) Apr 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from 68896a7 to 98153a5 Compare April 29, 2026 22:46
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from 98153a5 to c12b5b3 Compare April 29, 2026 22:59
@github-actions github-actions Bot added size/M and removed size/S labels Apr 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from c12b5b3 to 212aed8 Compare April 29, 2026 23:12
@yuanchen8911 yuanchen8911 changed the title chore(recipes): bump 7 components to upstream latest (phase 1) chore(recipes): bump 6 components to upstream latest (phase 1) Apr 29, 2026
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch 5 times, most recently from a5e8ca9 to 25aa20f Compare April 30, 2026 00:05
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from 25aa20f to a5e8ca9 Compare April 30, 2026 00:06
@github-actions github-actions Bot removed the area/ci label Apr 30, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:

    - name: gke-nccl-tcpxo
      ...
      manifest:
        defaultNamespace: kube-system

`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:

    helm:
      defaultRepository: ""
      defaultNamespace: kube-system

Bug surfacing timeline:

- Pre-NVIDIA#706, manifest-only components were installed by the root
  `deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
  carry inline `metadata.namespace: kube-system`, so the empty registry
  default was harmless; `kubectl apply` did not need
  `ComponentRef.Namespace` for routing.

- NVIDIA#706 (`feat(bundler)\!: uniform NNN-folder bundle layout via
  localformat`) wraps every component — manifest-only included — as a
  local Helm chart. The generated `install.sh` now always emits
  `helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
  which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
  block, that field is empty, producing:

      helm upgrade --install gke-nccl-tcpxo ./ \
        --namespace  --create-namespace \

  Shell argument collapsing makes Helm parse the literal
  `--create-namespace` as the namespace name and fails with:

      Error: create: failed to create:
      namespaces "--create-namespace" not found

- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
  used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
  PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
  H100 GKE-COS training jobs (its registry/base.yaml changes
  auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
  failure.

Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:

    $ aicr recipe --service gke --accelerator h100 \
                  --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
        helm upgrade --install gke-nccl-tcpxo ./ \
          --namespace kube-system --create-namespace \

Refs: NVIDIA#706, NVIDIA#715
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch 2 times, most recently from 102fb8f to 5c9dc23 Compare April 30, 2026 01:31
@github-actions github-actions Bot added size/S and removed size/M labels Apr 30, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 30, 2026 15:43
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 30, 2026 15:43
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

Refs: NVIDIA#698
Closes: NVIDIA#716
@yuanchen8911 yuanchen8911 force-pushed the chore/recipes-bump-minor branch from 5c9dc23 to fb83549 Compare April 30, 2026 15:50
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 30, 2026 16:46
@yuanchen8911 yuanchen8911 merged commit fa5c02b into NVIDIA:main Apr 30, 2026
85 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants