fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo #718
The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:
    - name: gke-nccl-tcpxo
      ...
      manifest:
        defaultNamespace: kube-system
`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:
    helm:
      defaultRepository: ""
      defaultNamespace: kube-system
Bug surfacing timeline:
- Pre-NVIDIA#706, manifest-only components were installed by the root
`deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
carry inline `metadata.namespace: kube-system`, so the empty registry
default was harmless; `kubectl apply` did not need
`ComponentRef.Namespace` for routing.
- NVIDIA#706 (`feat(bundler)!: uniform NNN-folder bundle layout via
localformat`) wraps every component — manifest-only included — as a
local Helm chart. The generated `install.sh` now always emits
`helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
block, that field is empty, producing:
    helm upgrade --install gke-nccl-tcpxo ./ \
      --namespace --create-namespace \
Because the namespace value is empty, the shell passes nothing between
the two flags, so Helm takes the literal `--create-namespace` as the
namespace name and fails with:
    Error: create: failed to create:
      namespaces "--create-namespace" not found
- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
H100 GKE-COS training jobs (its registry/base.yaml changes
auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
failure.
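
The argument collapsing described in the timeline above can be reproduced without Helm or a cluster. The sketch below is illustrative only: `value_after_namespace` is a hypothetical helper that mimics a simple flag parser taking the next token as the flag's value, and `ns` stands in for an empty `ComponentRef.Namespace`. The generated `install.sh` is assumed to carry the empty value as literal text rather than a variable, but the shell treats both cases the same way.

```shell
#!/bin/sh
# Print whatever token follows --namespace, the way a simple flag parser would.
# Hypothetical helper for illustration; Helm's real flag parsing is more involved.
value_after_namespace() {
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--namespace" ]; then
      printf '%s\n' "${2-}"
      return 0
    fi
    shift
  done
  return 1
}

ns=""  # assumption: stands in for an empty ComponentRef.Namespace

# Unquoted: the empty expansion disappears, so the next flag becomes the value.
unquoted=$(value_after_namespace helm upgrade --install demo ./ --namespace $ns --create-namespace)

# Quoted: the empty string survives as its own (still invalid) argument.
quoted=$(value_after_namespace helm upgrade --install demo ./ --namespace "$ns" --create-namespace)

printf 'unquoted=[%s] quoted=[%s]\n' "$unquoted" "$quoted"
```

Quoting would only change the failure mode (an empty namespace instead of a bogus one), which is why the fix belongs in the registry data rather than in the generated script.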
Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:
    $ aicr recipe --service gke --accelerator h100 \
        --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
    helm upgrade --install gke-nccl-tcpxo ./ \
      --namespace kube-system --create-namespace \
Refs: NVIDIA#706, NVIDIA#715
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required.

    aws-ebs-csi-driver      2.55.0  -> 2.59.0
    cert-manager            v1.17.2 -> v1.20.2
    kube-prometheus-stack   82.8.0  -> 84.4.0
    kueue                   0.17.0  -> 0.17.1
    nodewright-operator     v0.14.0 -> v0.15.1
    nvsentinel              v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3): v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (a ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus the KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework, so it is deferred.
- aws-efa (v0.5.3 -> v0.5.26): 23 releases require values cleanup, including a real security-posture change (the chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1): KAI-Scheduler was transferred from the NVIDIA/ org to the kai-scheduler/ org, and chart publishing moved with it. The new OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump: coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0): the chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator and chart bumps belong together in a follow-up PR, to keep this PR pure config with no Go changes.

Companion changes:
- examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).
- examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415. The workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference. Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it.
- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry entry to use the established manifest-only Helm pattern (empty `helm.defaultRepository` plus `defaultNamespace: kube-system`) instead of the unparsed `manifest:` block. The `manifest:` field is not on the ComponentConfig struct, so its `defaultNamespace` was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only components were installed via raw `kubectl apply`, which routed via inline `metadata.namespace`). After NVIDIA#706 wraps every component as a local Helm chart, the generated install.sh emits `--namespace --create-namespace` (empty) and Helm fails. This blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including this PR's CI, which auto-promotes the GKE-COS Tier-2 matrix when registry.yaml or base.yaml change. Switches to the same pattern used by `nodewright-customizations`. Verified that the bundled install.sh now contains `--namespace kube-system`.

Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
Absorbing this fix into #715 — same 3-line registry change now lands as part of the phase-1 PR (commit |
Summary
The registry entry for `gke-nccl-tcpxo` uses a top-level `manifest:` block that is not parsed by the `ComponentConfig` struct, so `manifest.defaultNamespace: kube-system` is silently ignored. After #706 wrapped manifest-only components as local Helm charts, the generated `install.sh` emits `helm upgrade --install ... --namespace --create-namespace` (empty namespace value), which fails on every post-#706 GKE-COS H100 training run.

Fix: declare `gke-nccl-tcpxo` with the established manifest-only Helm pattern (empty `defaultRepository`), matching `nodewright-customizations`.

Motivation / Context
PR #715's CI failed both `Tier 2: h100-gke-cos-training` and `Tier 2: h100-gke-cos-training-kubeflow` with:

    Error: create: failed to create:
      namespaces "--create-namespace" not found

Root cause:
- `recipes/registry.yaml` declares the namespace under `manifest:`, but the registry's Go schema only parses `helm:` and `kustomize:`. `manifest:` is dead config.
- Pre-#706, manifest-only components were installed by the root `deploy.sh` via raw `kubectl apply -f .../manifests/`. The manifest YAML carries inline `metadata.namespace: kube-system`, so kubectl routed correctly without needing `ComponentRef.Namespace`.
- #706 (`feat(bundler)!: uniform NNN-folder bundle layout via localformat`) wraps every component, manifest-only included, as a local Helm chart. The generated `install.sh` now always uses `helm upgrade --install ... --namespace <ns> --create-namespace`, which requires `ComponentRef.Namespace`. With the unparsed `manifest:` block, that field is empty.

The result on a post-#706 GKE-COS bundle is:

    helm upgrade --install gke-nccl-tcpxo ./ \
      --namespace --create-namespace \

Shell argument collapsing makes Helm parse `--create-namespace` as the namespace name, so the install fails.

`gke-nccl-tcpxo` is the only component still using the unparsed `manifest:` form. `nodewright-customizations` already uses the correct pattern.

Change
       - name: gke-nccl-tcpxo
         displayName: gke-nccl-tcpxo
         valueOverrideKeys:
           - gkenccltcpxo
    -    manifest:
    +    helm:
    +      # Manifest-only component - no external Helm chart, uses manifestFiles
    +      defaultRepository: ""
           defaultNamespace: kube-system

Net change: 1 file, +3/-1 lines. No other component is affected.
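
One rough way to spot-check that no other entry still uses the unparsed form is to scan the registry file for bare `manifest:` keys. This is a sketch under assumptions: `find_manifest_blocks` is a hypothetical helper, and it assumes each key sits on its own line, as in the diff above. It does not parse YAML, so treat a hit as a prompt to look, not as proof.

```shell
#!/bin/sh
# Hypothetical helper: print "line:text" for every bare `manifest:` key in a file.
# The anchored pattern deliberately does not match `manifestFiles:`.
find_manifest_blocks() {
  grep -nE '^[[:space:]]*manifest:[[:space:]]*$' "$1"
}

# Usage (path taken from the PR; adjust to your checkout):
#   find_manifest_blocks recipes/registry.yaml || echo "no bare manifest: keys"
```

`grep` exits non-zero when nothing matches, so the helper can double as a CI guard if inverted.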
Fixes: post-#706 GKE-COS H100 training bundle generation
Related: #706 (the deploy-path migration that exposed the bug), #715 (the first PR whose CI completed far enough to surface it)
Type of Change
Component(s) Affected
- `pkg/recipe`: registry data only
- `pkg/bundler`, `pkg/component/*`: fixes generated `install.sh`

Implementation Notes
Same pattern as `nodewright-customizations` (also a manifest-only component): declare under `helm:` with an empty `defaultRepository`. This routes correctly through the post-#706 local-Helm wrapper without changing the manifest content itself.

Testing
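End-to-end repro and fix verification:

    $ aicr recipe --service gke --accelerator h100 \
        --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
    helm upgrade --install gke-nccl-tcpxo ./ \
      --namespace kube-system --create-namespace \

Pre-fix, the same command produced `--namespace --create-namespace` (empty).

Sanity-checked that the AKS-training bundle still generates correctly (no regression on non-GKE recipes).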
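
The single-component check above could be widened to every `install.sh` in a bundle. The following is a minimal sketch, not part of the PR: `find_empty_namespace` is a hypothetical helper, and it only catches the case where `--namespace` is immediately followed by another `--flag` on the same line, which is the shape of the failure reported here.

```shell
#!/bin/sh
# Hypothetical helper: list install.sh files under a bundle directory in which
# --namespace is directly followed by another flag, i.e. has no value.
find_empty_namespace() {
  find "$1" -type f -name install.sh \
    -exec grep -lE -e '--namespace[[:space:]]+--' {} +
}

# Usage (bundle path as in the verification commands):
#   find_empty_namespace /tmp/bundle
```

An empty result means no script matched the pattern; any path printed points at a component whose namespace did not resolve.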
Risk Assessment
Switches `gke-nccl-tcpxo` from a never-parsed `manifest:` block to the well-established manifest-only `helm:` pattern (same as `nodewright-customizations`). No code, no values, no overlays touched.

Rollout notes: Bundles regenerated post-merge for any GKE-COS recipe will produce a working `install.sh` for `gke-nccl-tcpxo`. Existing installations are unaffected until re-bundled.

Checklist
- (`make test` with `-race`)
- (`make lint`)
- (`--namespace` for manifest-only `install.sh`)
- (`nodewright-customizations`)
- (`git commit -S`)