chore(recipes): bump 6 components to upstream latest (phase 1)#715
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request updates Helm chart versions across multiple YAML recipe files and adjusts Go module dependencies. The changes include version bumps for 15 Helm components across four recipe configuration files (platform-inference, base overlay, EKS overlay, and registry), and two modifications to the Go module: promoting Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
1969a93 to
a27aca5
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 20: You modified go.mod to pin golang.org/x/term v0.42.0 but did not
include the resulting module metadata and vendored files; run the repository's
make tidy (which runs go mod tidy and go mod vendor) to regenerate go.sum and
vendor/, verify the changes, and commit go.mod, go.sum, and the vendor/
directory together so module metadata and vendored dependencies remain
reproducible.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8e229029-dfe9-494c-ab4e-800520ab21ef
📒 Files selected for processing (6)
go.modrecipes/components/aws-efa/values.yamlrecipes/mixins/platform-inference.yamlrecipes/overlays/base.yamlrecipes/overlays/eks.yamlrecipes/registry.yaml
a27aca5 to
c6e2d8f
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/overlays/base.yaml`:
- Line 64: Replace the unavailable version string "v0.15.1" for the
nodewright-operator Helm chart with a published release (e.g., "v0.14.0");
locate the YAML key "version: v0.15.1" in the overlay that references the
nodewright-operator / skyhook-operator chart and update that value to an
available tag such as v0.14.0, then save and re-run your Helm/CI checks to
confirm the chart resolves.
In `@recipes/registry.yaml`:
- Line 146: The kube-prometheus-stack entry uses a non-existent version
(defaultVersion: 84.4.0) and the nvsentinel chart points to the wrong repo;
update the kube-prometheus-stack defaultVersion to 84.3.0 (or the intended
available release) and change the nvsentinel defaultRepository to the correct
OCI registry URL (oci://ghcr.io/nvidia/nvsentinel) in recipes/registry.yaml by
editing the kube-prometheus-stack defaultVersion key and the nvsentinel
defaultRepository key accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 95d71aa7-3a52-4d6b-a8df-1383329f6b3b
⛔ Files ignored due to path filters (6)
go.sumis excluded by!**/*.sumvendor/github.com/go-openapi/strfmt/.gitignoreis excluded by!vendor/**vendor/github.com/go-openapi/strfmt/CONTRIBUTORS.mdis excluded by!vendor/**vendor/github.com/go-openapi/strfmt/README.mdis excluded by!vendor/**vendor/github.com/go-openapi/strfmt/duration.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**
📒 Files selected for processing (5)
go.modrecipes/mixins/platform-inference.yamlrecipes/overlays/base.yamlrecipes/overlays/eks.yamlrecipes/registry.yaml
c6e2d8f to
fda302d
Compare
fda302d to
5336725
Compare
5336725 to
68896a7
Compare
68896a7 to
98153a5
Compare
98153a5 to
c12b5b3
Compare
c12b5b3 to
212aed8
Compare
a5e8ca9 to
25aa20f
Compare
25aa20f to
a5e8ca9
Compare
The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:
- name: gke-nccl-tcpxo
...
manifest:
defaultNamespace: kube-system
`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:
helm:
defaultRepository: ""
defaultNamespace: kube-system
Bug surfacing timeline:
- Pre-NVIDIA#706, manifest-only components were installed by the root
`deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
carry inline `metadata.namespace: kube-system`, so the empty registry
default was harmless; `kubectl apply` did not need
`ComponentRef.Namespace` for routing.
- NVIDIA#706 (`feat(bundler)\!: uniform NNN-folder bundle layout via
localformat`) wraps every component — manifest-only included — as a
local Helm chart. The generated `install.sh` now always emits
`helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
block, that field is empty, producing:
helm upgrade --install gke-nccl-tcpxo ./ \
--namespace --create-namespace \
Shell argument collapsing makes Helm parse the literal
`--create-namespace` as the namespace name and fails with:
Error: create: failed to create:
namespaces "--create-namespace" not found
- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
H100 GKE-COS training jobs (its registry/base.yaml changes
auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
failure.
Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:
$ aicr recipe --service gke --accelerator h100 \
--intent training --os cos -o /tmp/recipe.yaml
$ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
$ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
helm upgrade --install gke-nccl-tcpxo ./ \
--namespace kube-system --create-namespace \
Refs: NVIDIA#706, NVIDIA#715
102fb8f to
5c9dc23
Compare
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Refs: NVIDIA#698 Closes: NVIDIA#716
5c9dc23 to
fb83549
Compare
Summary
Bump 6 Helm components in
recipes/registry.yamland overlay/mixin pins to upstream latest. No values schema changes. Pure recipe-pin diff (7 files; example recipes refreshed; orphan-manifest reference in aks-training cleaned up).Motivation / Context
Phase 1 of the version refresh campaign tracked in #698. Low-risk minor/patch bumps batched together; major bumps (network-operator, gpu-operator) follow in subsequent phases.
2.55.02.59.0v1.17.2v1.20.282.8.084.4.00.17.00.17.1v0.14.0v0.15.1v1.1.0v1.3.0Excluded from this PR (deferred to follow-ups):
kgateway/kgateway-crds(v2.0.0 → v2.2.3) —v2.2.3silently drops theinferenceExtension.enabledvalue. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so the silent feature regression would break inference bundles. Migration requires a values + RBAC rework.aws-efa(v0.5.3 → v0.5.26) — 23 minor bumps; requires bothimage.tagcleanup and asecurityContextcleanup (chart now defaults toprivileged: true, conflicting with our hardenedallowPrivilegeEscalation: false). Deferred for proper EKS / security review.kai-scheduler(v0.13.0 → v0.14.1) — repo transferred fromNVIDIA/tokai-scheduler/org; chart publishing moved toghcr.io/kai-scheduler/kai-scheduler. Migration + bump belongs in a small standalone follow-up.kubeflow-trainer(2.1.0 → 2.2.0) — coupled to a Go change invalidators/performance/trainer_lifecycle.go. Belong together in a follow-up to keep this PR pure config / no Go changes.Example recipes:
examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yamlrefreshed to match the bumped registry defaults (cert-manager, nodewright-operator, kube-prometheus-stack, nvsentinel). Matches the convention from prior bump PRs (#283, #336, #450).Orphan-manifest fix in
aks-training.yaml(supersedes #716): also removes a stalemanifestFiles:reference tocomponents/nvsentinel/manifests/allow-intra-namespace.yamlthat has been broken since #415. The workaround source file was deleted in #309.aicr bundle examples/recipes/aks-training.yamlwas failing onmain; this fix restores it.Fixes: aks-training example bundling failure on
mainRelated: #698, supersedes #716
Type of Change
Component(s) Affected
pkg/recipe)Implementation Notes
cert-managerspans 3 minors (1.17 → 1.20); spot-checked changelog — no breaking API changes affecting our values surface.kube-prometheus-stack82.8 → 84.4: chart-level bumps; values used in this repo unaffected.helm pull.Testing
End-to-end on real EKS H100 clusters (post-rebase):
aicr2(eks/h100/inference/ubuntu/dynamo): bundle 20/20 components healthy,aicr validate --phase performancepassed — throughput 38,771 tok/s, TTFT p99 133 ms.aicr1(eks/h100/training/ubuntu/kubeflow): bundle 16/16 components healthy, demo TrainJob fromdemos/cuj1-eks.mdran end-to-end on 1× H100 (accuracy 0.74). The pytorch job requires the fix in fix(recipes): drop hook-succeeded from torch-distributed runtime #719 (pre-existingmainbug, unrelated to this PR — surfaced when exercising the kubeflow path).Risk Assessment
Rollout notes: No migration steps. Bundles regenerated from this branch will pull the new chart versions; existing installations are unaffected until re-bundled.
Checklist
make testwith-race)make lint)git commit -S)