fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo by yuanchen8911 · Pull Request #718 · NVIDIA/aicr

yuanchen8911 · 2026-04-30T00:42:30Z

Summary

The registry entry for gke-nccl-tcpxo uses a top-level manifest: block that is not parsed by the ComponentConfig struct, so manifest.defaultNamespace: kube-system is silently ignored. After #706 wrapped manifest-only components as local Helm charts, the generated install.sh emits helm upgrade --install ... --namespace --create-namespace (empty namespace value) → bundler fails on every post-#706 GKE-COS H100 training run.

Fix: declare gke-nccl-tcpxo with the established manifest-only Helm pattern (empty defaultRepository), matching nodewright-customizations.

Motivation / Context

PR #715's CI failed both Tier 2: h100-gke-cos-training and Tier 2: h100-gke-cos-training-kubeflow with:

Error: create: failed to create: namespaces "--create-namespace" not found

Root-cause:

Latent metadata bug. recipes/registry.yaml declares the namespace under manifest:, but the registry's Go schema only parses helm: and kustomize:. manifest: is dead config.
Exposed by deploy-path migration. Pre-feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706, manifest-only components were installed via raw kubectl apply -f .../manifests/. The manifest YAML carries inline metadata.namespace: kube-system, so kubectl routed correctly without needing ComponentRef.Namespace.
feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706 (feat(bundler)\!: uniform NNN-folder bundle layout via localformat) wraps every component — manifest-only included — as a local Helm chart. The generated install.sh now always uses helm upgrade --install ... --namespace <ns> --create-namespace, which requires ComponentRef.Namespace. With the unparsed manifest: block, that field is empty.

The result on a post-#706 GKE-COS bundle is:

helm upgrade --install gke-nccl-tcpxo ./ \
  --namespace  --create-namespace \      # ← empty namespace

Shell argument collapsing makes Helm parse --create-namespace as the namespace name → fails. gke-nccl-tcpxo is the only component still using the unparsed manifest: form. nodewright-customizations already uses the correct pattern.

Change

   - name: gke-nccl-tcpxo
     displayName: gke-nccl-tcpxo
     valueOverrideKeys:
       - gkenccltcpxo
-    manifest:
+    helm:
+      # Manifest-only component - no external Helm chart, uses manifestFiles
+      defaultRepository: ""
       defaultNamespace: kube-system

Net change: 1 file, +3/-1 lines. No other component is affected.

Fixes: post-#706 GKE-COS H100 training bundle generation
Related: #706 (the deploy-path migration that exposed the bug), #715 (the first PR whose CI completed far enough to surface it)

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Recipe engine / data (pkg/recipe) — registry data only
Bundlers (pkg/bundler, pkg/component/*) — fixes generated install.sh

Implementation Notes

The fix uses the same pattern already established for nodewright-customizations (also a manifest-only component): declare under helm: with an empty defaultRepository. This routes correctly through the post-feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706 local-Helm wrapper without changing the manifest content itself.
No values schema or recipe changes; no overlay touched.

Testing

End-to-end repro and fix verification:

$ aicr recipe --service gke --accelerator h100 --intent training --os cos -o /tmp/recipe.yaml
$ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
$ grep -A1 'helm upgrade' /tmp/bundle/*-gke-nccl-tcpxo/install.sh
helm upgrade --install gke-nccl-tcpxo ./ \
  --namespace kube-system --create-namespace \

Pre-fix the same command produced --namespace --create-namespace (empty).

make lint                        # 0 issues
go test ./pkg/recipe/... ./pkg/bundler/...   # all pass

Sanity-checked AKS-training bundle still generates correctly (no regression on non-GKE recipes).

Risk Assessment

Low — Single-file, 3-line registry data change. Switches gke-nccl-tcpxo from a never-parsed manifest: block to the well-established manifest-only helm: pattern (same as nodewright-customizations). No code, no values, no overlays touched.

Rollout notes: Bundles regenerated post-merge for any GKE-COS recipe will produce a working install.sh for gke-nccl-tcpxo. Existing installations are unaffected until re-bundled.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (N/A — registry data fix; consider follow-up regression test asserting non-empty --namespace for manifest-only install.sh)
I updated docs if user-facing behavior changed (N/A — internal data fix)
Changes follow existing patterns in the codebase (matches nodewright-customizations)
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-04-30T00:44:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: adbfddba-887f-44a1-9b0f-09abe2b86c4c

📥 Commits

Reviewing files that changed from the base of the PR and between 4746872 and afc1026.

📒 Files selected for processing (1)

recipes/registry.yaml

📝 Walkthrough

Walkthrough

The recipes/registry.yaml entry for component gke-nccl-tcpxo is modified: the top-level manifest block is replaced with a helm block marked as a manifest-only component. The helm block adds defaultRepository: "" and preserves defaultNamespace: kube-system. Total lines changed: +3/-1.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: converting gke-nccl-tcpxo from manifest to Helm manifest-only pattern.
Description check	✅ Passed	The description thoroughly explains the bug, root cause, fix, testing, and impact, directly relating to the changeset in recipes/registry.yaml.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Review rate limit: 9/10 reviews remaining, refill in 6 minutes.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

The registry entry for `gke-nccl-tcpxo` declared its namespace under a top-level `manifest:` block: - name: gke-nccl-tcpxo ... manifest: defaultNamespace: kube-system `manifest:` is **not** a parsed field on the registry's `ComponentConfig` struct — only `helm:` and `kustomize:` are recognized — so `manifest.defaultNamespace` was silently ignored. The established manifest-only Helm-wrapper pattern (used today by `nodewright-customizations`) is to declare the component as `helm:` with an empty `defaultRepository`: helm: defaultRepository: "" defaultNamespace: kube-system Bug surfacing timeline: - Pre-NVIDIA#706, manifest-only components were installed by the root `deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests carry inline `metadata.namespace: kube-system`, so the empty registry default was harmless; `kubectl apply` did not need `ComponentRef.Namespace` for routing. - NVIDIA#706 (`feat(bundler)\!: uniform NNN-folder bundle layout via localformat`) wraps every component — manifest-only included — as a local Helm chart. The generated `install.sh` now always emits `helm upgrade --install <name> ./ --namespace <ns> --create-namespace`, which requires `ComponentRef.Namespace`. With the unparsed `manifest:` block, that field is empty, producing: helm upgrade --install gke-nccl-tcpxo ./ \ --namespace --create-namespace \ Shell argument collapsing makes Helm parse the literal `--create-namespace` as the namespace name and fails with: Error: create: failed to create: namespaces "--create-namespace" not found - The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs used the pre-NVIDIA#706 deployer path where the empty namespace was inert. PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the H100 GKE-COS training jobs (its registry/base.yaml changes auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the failure. Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm pattern, matching `nodewright-customizations`. Verified locally: $ aicr recipe --service gke --accelerator h100 \ --intent training --os cos -o /tmp/recipe.yaml $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh helm upgrade --install gke-nccl-tcpxo ./ \ --namespace kube-system --create-namespace \ Refs: NVIDIA#706, NVIDIA#715

Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required. aws-ebs-csi-driver 2.55.0 -> 2.59.0 cert-manager v1.17.2 -> v1.20.2 kube-prometheus-stack 82.8.0 -> 84.4.0 kueue 0.17.0 -> 0.17.1 nodewright-operator v0.14.0 -> v0.15.1 nvsentinel v1.1.0 -> v1.3.0 Excluded from this PR: - kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred. - aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review. - kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here. - kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes. Companion changes: - examples/recipes/{kind,eks-training,aks-training,eks-gb200- ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450). - examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. - recipes/registry.yaml: also fix the gke-nccl-tcpxo registry entry to use the established manifest-only Helm pattern (empty `helm.defaultRepository` plus `defaultNamespace: kube-system`) instead of the unparsed `manifest:` block. The `manifest:` field is not on the ComponentConfig struct, so its `defaultNamespace` was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only components were installed via raw `kubectl apply`, which routed via inline `metadata.namespace`). After NVIDIA#706 wraps every component as a local Helm chart, the generated install.sh emits `--namespace --create-namespace` (empty) and Helm fails. This blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when registry.yaml or base.yaml change. Switches to the same pattern used by `nodewright-customizations`. Verified bundled install.sh now contains `--namespace kube-system`. Supersedes NVIDIA#718. Refs: NVIDIA#698 Closes: NVIDIA#716, NVIDIA#718

yuanchen8911 · 2026-04-30T00:53:21Z

Absorbing this fix into #715 — same 3-line registry change now lands as part of the phase-1 PR (commit 102fb8fa in #715, with Closes: #718 in the trailer). Same diff, same verification; folding it in to unblock #715's CI without a sequencing dependency. Bug history preserved in the #715 commit body.

yuanchen8911 requested a review from a team as a code owner April 30, 2026 00:42

yuanchen8911 added bug area/recipes area/bundler labels Apr 30, 2026

github-actions Bot added size/XS and removed area/bundler labels Apr 30, 2026

mchmarny approved these changes Apr 30, 2026

View reviewed changes

mchmarny assigned yuanchen8911 Apr 30, 2026

yuanchen8911 force-pushed the fix/registry-gke-nccl-tcpxo-namespace branch from 4746872 to afc1026 Compare April 30, 2026 00:50

yuanchen8911 mentioned this pull request Apr 30, 2026

chore(recipes): bump 6 components to upstream latest (phase 1) #715

Merged

11 tasks

yuanchen8911 closed this Apr 30, 2026

yuanchen8911 deleted the fix/registry-gke-nccl-tcpxo-namespace branch April 30, 2026 00:53

yuanchen8911 reopened this Apr 30, 2026

yuanchen8911 merged commit 2038255 into NVIDIA:main Apr 30, 2026
215 of 222 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo#718

fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo#718
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/registry-gke-nccl-tcpxo-namespace

yuanchen8911 commented Apr 30, 2026

Uh oh!

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Uh oh!

yuanchen8911 commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yuanchen8911 commented Apr 30, 2026

Summary

Motivation / Context

Change

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

yuanchen8911 commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading