Skip to content

fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo#718

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/registry-gke-nccl-tcpxo-namespace
Apr 30, 2026
Merged

fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo#718
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/registry-gke-nccl-tcpxo-namespace

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

The registry entry for gke-nccl-tcpxo uses a top-level manifest: block that is not parsed by the ComponentConfig struct, so manifest.defaultNamespace: kube-system is silently ignored. After #706 wrapped manifest-only components as local Helm charts, the generated install.sh emits helm upgrade --install ... --namespace --create-namespace (empty namespace value) → bundler fails on every post-#706 GKE-COS H100 training run.

Fix: declare gke-nccl-tcpxo with the established manifest-only Helm pattern (empty defaultRepository), matching nodewright-customizations.

Motivation / Context

PR #715's CI failed both Tier 2: h100-gke-cos-training and Tier 2: h100-gke-cos-training-kubeflow with:

Error: create: failed to create: namespaces "--create-namespace" not found

Root-cause:

  1. Latent metadata bug. recipes/registry.yaml declares the namespace under manifest:, but the registry's Go schema only parses helm: and kustomize:. manifest: is dead config.
  2. Exposed by deploy-path migration. Pre-feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706, manifest-only components were installed via raw kubectl apply -f .../manifests/. The manifest YAML carries inline metadata.namespace: kube-system, so kubectl routed correctly without needing ComponentRef.Namespace.
  3. feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706 (feat(bundler)\!: uniform NNN-folder bundle layout via localformat) wraps every component — manifest-only included — as a local Helm chart. The generated install.sh now always uses helm upgrade --install ... --namespace <ns> --create-namespace, which requires ComponentRef.Namespace. With the unparsed manifest: block, that field is empty.

The result on a post-#706 GKE-COS bundle is:

helm upgrade --install gke-nccl-tcpxo ./ \
  --namespace  --create-namespace \      # ← empty namespace

Shell argument collapsing makes Helm parse --create-namespace as the namespace name → fails. gke-nccl-tcpxo is the only component still using the unparsed manifest: form. nodewright-customizations already uses the correct pattern.

Change

   - name: gke-nccl-tcpxo
     displayName: gke-nccl-tcpxo
     valueOverrideKeys:
       - gkenccltcpxo
-    manifest:
+    helm:
+      # Manifest-only component - no external Helm chart, uses manifestFiles
+      defaultRepository: ""
       defaultNamespace: kube-system

Net change: 1 file, +3/-1 lines. No other component is affected.

Fixes: post-#706 GKE-COS H100 training bundle generation
Related: #706 (the deploy-path migration that exposed the bug), #715 (the first PR whose CI completed far enough to surface it)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — registry data only
  • Bundlers (pkg/bundler, pkg/component/*) — fixes generated install.sh

Implementation Notes

  • The fix uses the same pattern already established for nodewright-customizations (also a manifest-only component): declare under helm: with an empty defaultRepository. This routes correctly through the post-feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706 local-Helm wrapper without changing the manifest content itself.
  • No values schema or recipe changes; no overlay touched.

Testing

End-to-end repro and fix verification:

$ aicr recipe --service gke --accelerator h100 --intent training --os cos -o /tmp/recipe.yaml
$ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
$ grep -A1 'helm upgrade' /tmp/bundle/*-gke-nccl-tcpxo/install.sh
helm upgrade --install gke-nccl-tcpxo ./ \
  --namespace kube-system --create-namespace \

Pre-fix the same command produced --namespace --create-namespace (empty).

make lint                        # 0 issues
go test ./pkg/recipe/... ./pkg/bundler/...   # all pass

Sanity-checked AKS-training bundle still generates correctly (no regression on non-GKE recipes).

Risk Assessment

  • Low — Single-file, 3-line registry data change. Switches gke-nccl-tcpxo from a never-parsed manifest: block to the well-established manifest-only helm: pattern (same as nodewright-customizations). No code, no values, no overlays touched.

Rollout notes: Bundles regenerated post-merge for any GKE-COS recipe will produce a working install.sh for gke-nccl-tcpxo. Existing installations are unaffected until re-bundled.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — registry data fix; consider follow-up regression test asserting non-empty --namespace for manifest-only install.sh)
  • I updated docs if user-facing behavior changed (N/A — internal data fix)
  • Changes follow existing patterns in the codebase (matches nodewright-customizations)
  • Commits are cryptographically signed (git commit -S)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: adbfddba-887f-44a1-9b0f-09abe2b86c4c

📥 Commits

Reviewing files that changed from the base of the PR and between 4746872 and afc1026.

📒 Files selected for processing (1)
  • recipes/registry.yaml

📝 Walkthrough

Walkthrough

The recipes/registry.yaml entry for component gke-nccl-tcpxo is modified: the top-level manifest block is replaced with a helm block marked as a manifest-only component. The helm block adds defaultRepository: "" and preserves defaultNamespace: kube-system. Total lines changed: +3/-1.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: converting gke-nccl-tcpxo from manifest to Helm manifest-only pattern.
Description check ✅ Passed The description thoroughly explains the bug, root cause, fix, testing, and impact, directly relating to the changeset in recipes/registry.yaml.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:

    - name: gke-nccl-tcpxo
      ...
      manifest:
        defaultNamespace: kube-system

`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:

    helm:
      defaultRepository: ""
      defaultNamespace: kube-system

Bug surfacing timeline:

- Pre-NVIDIA#706, manifest-only components were installed by the root
  `deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
  carry inline `metadata.namespace: kube-system`, so the empty registry
  default was harmless; `kubectl apply` did not need
  `ComponentRef.Namespace` for routing.

- NVIDIA#706 (`feat(bundler)\!: uniform NNN-folder bundle layout via
  localformat`) wraps every component — manifest-only included — as a
  local Helm chart. The generated `install.sh` now always emits
  `helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
  which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
  block, that field is empty, producing:

      helm upgrade --install gke-nccl-tcpxo ./ \
        --namespace  --create-namespace \

  Shell argument collapsing makes Helm parse the literal
  `--create-namespace` as the namespace name and fails with:

      Error: create: failed to create:
      namespaces "--create-namespace" not found

- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
  used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
  PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
  H100 GKE-COS training jobs (its registry/base.yaml changes
  auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
  failure.

Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:

    $ aicr recipe --service gke --accelerator h100 \
                  --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
        helm upgrade --install gke-nccl-tcpxo ./ \
          --namespace kube-system --create-namespace \

Refs: NVIDIA#706, NVIDIA#715
@yuanchen8911 yuanchen8911 force-pushed the fix/registry-gke-nccl-tcpxo-namespace branch from 4746872 to afc1026 Compare April 30, 2026 00:50
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry
  entry to use the established manifest-only Helm pattern (empty
  `helm.defaultRepository` plus `defaultNamespace: kube-system`)
  instead of the unparsed `manifest:` block. The `manifest:` field
  is not on the ComponentConfig struct, so its `defaultNamespace`
  was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only
  components were installed via raw `kubectl apply`, which routed
  via inline `metadata.namespace`). After NVIDIA#706 wraps every
  component as a local Helm chart, the generated install.sh emits
  `--namespace  --create-namespace` (empty) and Helm fails. This
  blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including
  this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when
  registry.yaml or base.yaml change. Switches to the same pattern
  used by `nodewright-customizations`. Verified bundled install.sh
  now contains `--namespace kube-system`. Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
@yuanchen8911
Copy link
Copy Markdown
Contributor Author

Absorbing this fix into #715 — same 3-line registry change now lands as part of the phase-1 PR (commit 102fb8fa in #715, with Closes: #718 in the trailer). Same diff, same verification; folding it in to unblock #715's CI without a sequencing dependency. Bug history preserved in the #715 commit body.

@yuanchen8911 yuanchen8911 deleted the fix/registry-gke-nccl-tcpxo-namespace branch April 30, 2026 00:53
@yuanchen8911 yuanchen8911 reopened this Apr 30, 2026
@yuanchen8911 yuanchen8911 merged commit 2038255 into NVIDIA:main Apr 30, 2026
215 of 222 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants