
feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706

Merged
lockwobr merged 5 commits into main from feat/generic-helm on Apr 29, 2026

Conversation

@lockwobr
Contributor

Summary

Introduce pkg/bundler/deployer/localformat, a deployer-agnostic writer producing a uniform NNN-<component>/ bundle layout, and rewire --deployer helm to consume it. deploy.sh becomes a generic, branch-free install loop with name-matched special-case blocks around it.

Motivation / Context

Today's --deployer helm deploy.sh branches every component on three kinds (Helm / Kustomize / raw manifests) with distinct install mechanisms. The complexity makes the script fragile, hard to share with the planned helmfile deployer (#632) or a future Flux deployer, and conflates classification with execution. A uniform on-disk layout lets every deployer consume one shape and eliminates component-type branching from the install path.

This is scope A of #516 — pure format refactor. Vendoring upstream Helm chart bytes lands separately.

Fixes: #662
Related: #516, #632

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: chainsaw e2e fixtures (tests/chainsaw/**)

Implementation Notes

New package: pkg/bundler/deployer/localformat

Writes per-folder content (Chart.yaml, values.yaml, cluster-values.yaml, install.sh, templates/, upstream.env) for every deployer. Two folder kinds, distinguished by Chart.yaml presence:

  • upstream-helm — no Chart.yaml; upstream.env carries {CHART, REPO, VERSION}; install.sh installs the upstream chart via helm upgrade --install ... --repo ${REPO}.
  • local-helm — Chart.yaml + templates/ present; install.sh installs ./ as a local chart.
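As a sketch, a consumer can apply the same Chart.yaml-presence rule to classify folders; the folder names and the upstream.env values below are illustrative fixtures, not generated output:

```shell
#!/usr/bin/env bash
# Sketch: classify NNN-* bundle folders by Chart.yaml presence.
# Fixture names and chart/version values are illustrative only.
set -euo pipefail

bundle="$(mktemp -d)"
mkdir -p "${bundle}/001-cert-manager" "${bundle}/002-skyhook-customizations/templates"
printf 'CHART=cert-manager\nREPO=https://charts.jetstack.io\nVERSION=v1.14.0\n' \
  > "${bundle}/001-cert-manager/upstream.env"           # upstream-helm: no Chart.yaml
touch "${bundle}/002-skyhook-customizations/Chart.yaml" # local-helm: Chart.yaml present

for dir in "${bundle}"/[0-9][0-9][0-9]-*/; do
  name="$(basename "${dir}")"
  if [ -f "${dir}/Chart.yaml" ]; then
    echo "${name}: local-helm"
  else
    echo "${name}: upstream-helm"
  fi
done
# prints:
#   001-cert-manager: upstream-helm
#   002-skyhook-customizations: local-helm
```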

Mixed components (Helm chart + raw manifests, e.g., gpu-operator with dcgm-exporter) emit two adjacent folders: a primary upstream-helm NNN-<name>/ plus an injected (NNN+1)-<name>-post/ local-helm wrapper for the manifests. The "mixed" concept lives only at the bundle layer — recipe types don't know -post exists. Subsequent components shift by 1.
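The mixed_gpu_operator golden fixture in this PR shows the resulting shape:

```
001-gpu-operator/              # upstream-helm (no Chart.yaml)
├── cluster-values.yaml
├── install.sh
├── upstream.env
└── values.yaml
002-gpu-operator-post/         # injected local-helm wrapper for the raw manifests
├── Chart.yaml
├── cluster-values.yaml
├── install.sh
├── templates/
│   └── dcgm-exporter.yaml
└── values.yaml
```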

Kustomize-typed components run kustomize build at bundle time (via the already-vendored sigs.k8s.io/kustomize/api), producing a single templates/manifest.yaml inside a local-helm folder.
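Shown as the equivalent CLI form for illustration (the bundler calls the vendored Go API directly, not the kustomize binary):

```
kustomize build <component-src>/ > NNN-<component>/templates/manifest.yaml
```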

Helm deployer rewire

Generator.Generate delegates per-component content to localformat.Write and owns only top-level orchestration files (README.md, deploy.sh, undeploy.sh, checksums). Deletes ~195 lines of dead code: generateComponentDirectories, buildComponentDataList (full version), writeClusterValuesFile. ComponentData shrinks: HasChart, HasManifests, IsKustomize removed.

deploy.sh.tmpl install loop replaces ~90 lines of {{ if .IsKustomize/.HasChart/.HasManifests }} branching with a generic loop that delegates to each folder's rendered install.sh. The ~50-line CRD-race retry block is removed entirely — its trigger (pre-apply raw manifests) is structurally eliminated; mixed components now apply manifests post-install via the injected -post wrapped chart, where Helm owns CRD ordering natively.
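The branch-free loop can be sketched roughly as follows; the fixture folders here are illustrative, and the real deploy.sh template adds flags, waits, and the name-matched special-case blocks around the loop:

```shell
#!/usr/bin/env bash
# Sketch of the generic install loop: every folder ships its own
# rendered install.sh, so no Helm/Kustomize/manifest branching is needed.
set -euo pipefail

bundle="$(mktemp -d)"
for c in 001-cert-manager 002-gpu-operator 003-gpu-operator-post; do
  mkdir -p "${bundle}/${c}"
  printf '#!/bin/sh\necho "installed %s"\n' "${c#[0-9][0-9][0-9]-}" \
    > "${bundle}/${c}/install.sh"
done

# Lexicographic glob order preserves the NNN ordering.
for dir in "${bundle}"/[0-9][0-9][0-9]-*/; do
  bash "${dir}/install.sh"
done
# prints:
#   installed cert-manager
#   installed gpu-operator
#   installed gpu-operator-post
```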

undeploy.sh.tmpl similarly replaces three-way kind branching with a generic reverse loop. Name-matched blocks around the loops (skyhook taint cleanup, kai-scheduler async timeout/diagnostics, DRA kubelet-plugin restart, orphaned-CRD scan) are preserved; their migration into per-component install.sh hooks is deferred to a follow-up.
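A minimal sketch of the generic reverse loop (illustrative folder names; the real undeploy.sh wraps this with the preserved name-matched cleanup blocks):

```shell
#!/usr/bin/env bash
# Sketch: walk NNN-* folders in reverse order and derive the release
# name by stripping the numeric prefix.
set -euo pipefail

bundle="$(mktemp -d)"
mkdir -p "${bundle}"/001-cert-manager "${bundle}"/002-gpu-operator "${bundle}"/003-gpu-operator-post

for dir in $(ls -d "${bundle}"/[0-9][0-9][0-9]-*/ | sort -r); do
  name="$(basename "${dir}")"
  echo "uninstall ${name#[0-9][0-9][0-9]-}"   # strip NNN- to get the release name
done
# prints:
#   uninstall gpu-operator-post
#   uninstall gpu-operator
#   uninstall cert-manager
```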

Hard-rule invariants documented in pkg/bundler/deployer/localformat/doc.go:

  1. localformat never writes deployer-specific files (deploy.sh, helmfile.yaml, Application CRs, etc. stay with their respective deployers).
  2. install.sh is never name-customized — rendered from exactly two templates parameterized only by data. Name-keyed quirks stay in deploy.sh.
  3. Write is deterministic and idempotent.

Test reorganization

  • New pkg/bundler/deployer/localformat/writer_test.go — 9 tests covering upstream-helm, local-helm-manifest-only, mixed -post injection, ordering, kustomize, kustomize+manifests rejection, determinism, path containment, ctx cancellation. 75.4% package coverage.
  • pkg/bundler/deployer/helm/helm_test.go rewritten to use golden-file harness (assertBundleGolden) covering 5 representative bundle scenarios; goldens under testdata/{upstream_helm_only,manifest_only,mixed_gpu_operator,kai_scheduler_present,skyhook_present}/ double as reference examples. 11 redundant tests deleted; net: 48 → 37 tests, 3408 → 2130 lines (−37%), coverage 87.6% → 87.9%.
  • pkg/bundler/bundler_test.go, pkg/bundler/handler_test.go, and 6 chainsaw e2e fixtures updated for the new NNN-<component>/ paths.

Testing

make qualify:
  • make test-coverage — total 72.7% (threshold 70%); localformat 75.4%; combined deployer family 79.7%.
  • make lint — go vet, golangci-lint, license headers, AGENTS.md sync, docs sidebar all clean.
  • make e2e — 17/17 chainsaw tests pass (5 skipped, attestation-related, unchanged).
  • make scan — no new vulnerabilities introduced; 3 pre-existing minor findings (postcss/pygments/uuid in node/python deps).
  • make license-check — pass.
  • ✅ CodeRabbit review run — 23 findings triaged; 3 real bugs fixed (kai-resource-reservation namespace deletion now name-conditional; README.md.tmpl no longer hardcodes 001-gpu-operator; chainsaw grep -v patterns dropped trailing slashes).

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Breaking changes for --deployer helm consumers (called out in docs/user/cli-reference.md):

  • Bundle directory layout: <component>/ → NNN-<component>/. Tooling that parses bundle paths by bare component name must account for the NNN- prefix.
  • Mixed components emit two folders (primary + -post).
  • Raw manifests for mixed components now apply post-install only, via the injected -post wrapped chart. The earlier pre-apply mechanism with CRD-race retry is gone — Helm now owns CRD ordering natively.
  • Per-component README.md files no longer generated; the top-level bundle README.md remains.
  • --deployer argocd and --deployer argocd-helm outputs are unchanged in this PR.
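Tooling that previously keyed on the bare component name can locate the numbered folder without hardcoding its prefix, following the same glob-then-guard pattern used by the updated chainsaw fixtures (folder name below is illustrative):

```shell
#!/usr/bin/env bash
# Illustrative: find a component's folder under the new NNN- layout
# without assuming which number it was assigned.
set -euo pipefail

bundle="$(mktemp -d)"
mkdir -p "${bundle}/003-gpu-operator"

# Old: "${bundle}/gpu-operator"; new: glob the numeric prefix and guard
# against the folder being absent.
dir="$(ls -d "${bundle}"/[0-9][0-9][0-9]-gpu-operator 2>/dev/null | head -1)"
[ -n "${dir}" ] || { echo "gpu-operator folder not found" >&2; exit 1; }
echo "found $(basename "${dir}")"
# prints: found 003-gpu-operator
```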

No data migration; users regenerate bundles on every run anyway.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@ayuskauskas
Contributor

The whole testdata folder under deployer seems like it would be very brittle. Was that supposed to be committed, or are those files generated on the fly during testing?

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a deterministic numbered per-component bundle layout (NNN-<component>/) and a new localformat writer package under pkg/bundler/deployer/localformat/ that emits those folders and exposes types (Folder, FolderKind, Component) and Write(ctx, opts). Raw-manifest and Kustomize components are wrapped as local Helm charts; mixed upstream+manifest components produce a primary upstream folder plus an adjacent -post folder. The Helm deployer is refactored to use localformat.Write, ComponentData is simplified, deploy/undeploy templates are rewritten to iterate NNN-* directories, and tests, fixtures, templates, and docs are updated to the new layout.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: 4 passed

  • Title check — ✅ Passed: The PR title accurately summarizes the main change: introducing a uniform NNN-folder bundle layout via the new localformat package, which is the central focus of this changeset.
  • Description check — ✅ Passed: The PR description is comprehensive and directly related to the changeset, explaining the motivation, implementation, testing, and breaking changes for the new localformat package and helm deployer rewiring.
  • Linked Issues check — ✅ Passed: The PR fully addresses the objectives from #662: establishes the Chart.yaml-presence rule, wraps manifest-only/Kustomize components as local charts, emits a two-folder layout for mixed components, provides a deterministic localformat writer, and rewires --deployer helm to simplify deploy.sh to branch only on Chart.yaml presence while preserving special-case blocks.
  • Out of Scope Changes check — ✅ Passed: All changes are in scope: new localformat package with writer/folder types, helm deployer rewiring to consume localformat, test reorganization with golden files, and documentation/chainsaw e2e updates. No unrelated refactoring or scope creep detected.



@coderabbitai Bot left a comment

Actionable comments posted: 26

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

418-425: ⚠️ Potential issue | 🟠 Major

Include injected -post releases in the undeploy safety checks.

localformat.Write() can add *-post releases for mixed components, but these loops still walk .ComponentsReversed, which only contains the original recipe components. That means the wrapper release is skipped by the pre-flight finalizer scan and the post-flight CRD audit, so undeploy can still tear down the controller while CRs from the -post release are stuck. Build these checks from the actual generated release list, or synthesize -post entries before rendering the template.

Also applies to: 620-638

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 418 - 425,
The template currently iterates .ComponentsReversed which omits injected "-post"
wrapper releases added by localformat.Write, so update the undeploy safety loops
(the block using .ComponentsReversed and the similar block at 620-638) to
iterate the actual generated release list or a synthesized list that includes
"-post" variants before rendering; specifically, replace uses of
.ComponentsReversed with the generated releases collection (or generate extra
entries named "{{ .Name }}-post" for mixed components) so that
skip_preflight_for_release and check_release_for_stuck_crds are invoked for
those "-post" releases as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1152: Replace the phrase "local-helm wrapped chart" with the
compound-modifier form "local-helm-wrapped chart" in the docs text so the
adjective correctly modifies "chart"; locate the occurrence of the string
"local-helm wrapped chart" (in docs/user/cli-reference.md) and update it to
"local-helm-wrapped chart".

In `@Makefile`:
- Line 150: The makefile currently ignores '**/testdata/**' which broadly
disables addlicense checks; update the Makefile rule that sets the ignore
patterns for addlicense (remove or replace the '**/testdata/**' entry) and
instead list only explicit non-source/generated patterns inside testdata (e.g.,
generated, vendor, binary, or fixture extensions) so addlicense still runs on
real source-like fixtures and scripts; ensure the updated ignore list is used by
the addlicense invocation in the same Makefile target.

In `@pkg/bundler/bundler_test.go`:
- Around line 423-429: The test's inclusion check is too permissive because it
treats the legacy flat directory as acceptable; update the assertion in
bundler_test.go so inclusion is determined only by the numbered folder. Replace
the current logic that computes included from statErr1 and statErr2 with a
single check based solely on statErr1 (the numbered path returned by os.Stat on
filepath.Join(tmpDir, "002-aws-ebs-csi-driver")) and remove or ignore the
flat-path check (statErr2); this ensures the test fails if the numbered layout
is missing.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 338-352: The retry loop currently calls
dump_kai_scheduler_helm_diagnostics and cleanup_helm_hooks with only the release
name (variable name), causing namespace mismatches; update the calls inside the
while loop to pass the component namespace as the second argument (e.g.,
dump_kai_scheduler_helm_diagnostics "${name}" "${namespace}" and
cleanup_helm_hooks "${name}" "${namespace}") so the helpers operate on the
correct namespace (or use whatever component namespace variable exists in the
template, e.g., "${component_namespace}"); ensure the helper functions
(dump_kai_scheduler_helm_diagnostics, cleanup_helm_hooks) accept and use the
namespace parameter if they do not already.

In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh`:
- Around line 20-26: The helm upgrade/install invocation uses "${CHART}" which
is invalid for OCI refs; update the IsOCI branch in the
install-upstream-helm.sh.tmpl template so that when REPO starts with "oci://"
the helm command uses "${REPO}" (the full OCI chart reference) instead of
"${CHART}", and keep using "--repo ${REPO}" with "${CHART}" for non-OCI; change
the conditional branch that renders the helm upgrade --install line to
substitute ${REPO} for OCI installs while preserving the other flags
(${COMPONENT_WAIT_ARGS}, ${DRY_RUN_FLAG}, ${KUBECONFIG_FLAG},
${HELM_DEBUG_FLAG}).

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README for the kai_scheduler_present fixture contains stale
GPU/network/cert-manager checks (notably the paragraph around the deploy script
behavior and the block at lines 89–110); update those sections to be
kai-scheduler specific or generic component-agnostic checks: remove or rewrite
GPU/operator/cert-manager verification steps and replace them with kai-scheduler
health/status checks (e.g., checking kai-scheduler pods, logs, readiness probes,
and relevant controller events) or a short generic “verify component readiness
and check logs” paragraph; ensure the deploy script note and subsequent
diagnostics reference only kai-scheduler identifiers and commands so the fixture
output is internally consistent.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 472-475: The generated undeploy.sh contains a stale comment block
starting with "Remove skyhook node taints that persist after operator removal."
(mentioning runtimeRequiredTaint and skyhook.nvidia.com) that is incorrect for
the kai-scheduler fixture; remove that entire comment block from the undeploy.sh
output so the script no longer misstates it removes skyhook taints for the
kai-scheduler-only fixture.
- Around line 43-45: The parser currently stores the raw value into HELM_TIMEOUT
in the --timeout case and later forces an "s" suffix which turns e.g. "5m" into
"5ms" or accepts invalid tokens; update the --timeout handling in undeploy.sh
(the case branch that sets HELM_TIMEOUT) to validate the token before appending
a suffix: either enforce a pure integer using a numeric regex (e.g. match
^[0-9]+$) and reject non-integers with an error, or accept and preserve
user-provided units by not forcing an "s" downstream; implement the chosen
approach in the --timeout branch so HELM_TIMEOUT contains a safe, validated
value.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml`:
- Around line 1-2: Add the standard addlicense header to cluster-values.yaml
(above the existing generator banner) so the file contains the required license
block used across the repo (including SPDX/copyright and year) and keep the
existing "---" generator banner after the header; run the addlicense tool or
insert the exact header used in other *.yaml fixtures to ensure consistency and
valid YAML.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml`:
- Line 1: Replace the single-line copyright banner ("# Copyright (c) 2026,
NVIDIA CORPORATION.  All rights reserved.") with the full addlicense header used
across the repo: run the addlicense tool or insert the standard multi-line
license header at the top of customization.yaml (the template file containing
the current single-line banner) so the file complies with the repo-wide YAML
license-header requirement.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 247-253: The skip_preflight_for_release() case only matches
"skyhook-operator" and "kgateway" so calling skip_preflight_for_release with
"skyhook-customizations" never returns 0; update the case pattern inside
skip_preflight_for_release to include "skyhook-customizations" (e.g., add it to
the matching alternatives or use a pattern that covers all three names) so that
skip_preflight_for_release "skyhook-customizations" returns 0 and the preflight
is correctly skipped where echoed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml`:
- Around line 1-2: Add the repository-standard addlicense header to the YAML
fixture (cluster-values.yaml) so it matches other sources; run the addlicense
tool with the repository's license template (or insert the exact header used in
other .yaml/.yml files) at the top of the file before the existing front-matter
(the leading "# Generated by Cloud Native Stack" and "---") so the file now
begins with the standard license header followed by the comment and YAML
separator.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml`:
- Around line 1-4: This YAML file (values.yaml) is missing the required
addlicense header; add the standard Apache license header used across the repo
at the top of this file (or add an explicit addlicense exclusion comment if the
file should be exempt), ensuring the header is formatted exactly like other YAML
sources in this PR so the addlicense tool passes — update the file that contains
the top-level "driver" key accordingly.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 1-6: Add the repository's full addlicense YAML header at the top
of this Kubernetes manifest and make the Service resource installable by adding
a spec.ports section; specifically, update the dcgm-exporter Service
(metadata.name: dcgm-exporter) to include a valid spec with at least one port
entry (name, port, targetPort, protocol) so the rendered Service is valid and
installable, and ensure the license header matches the project's addlicense
output exactly.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml`:
- Around line 1-2: The file values.yaml is missing the required addlicense
header; update
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
by adding the project's standard license header at the top (use the same header
used across other YAML files) or run the addlicense tool to insert/update the
header so the generated preamble is preceded by the required license banner.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 87-110: Update the "Troubleshooting" section so it reflects only
the components present in this test bundle (gpu-operator) or clearly marks
commands as generic examples: edit the header and the grep/log/get-nodes
commands that currently reference 'gpu-operator|network-operator|cert-manager'
and the kubectl logs reference to ensure they only target gpu-operator (or
prepend a note like "Example commands for common components — adjust as
needed"); update the References list similarly by removing or annotating Network
Operator and cert-manager links so readers aren't confused by components not
included in this fixture.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 559-568: The current global sweep over all
mutating/validatingwebhookconfigurations (the loop that lists .items[] and
deletes webhooks when their service or namespace is missing) must be removed
because it can delete unrelated tenants' webhooks; instead, call the existing
helper delete_orphaned_webhooks_for_ns for each bundle namespace to limit
cleanup to this bundle's namespaces—replace the global jq/read/sort loop with
iterating over the bundle's namespace list and invoking
delete_orphaned_webhooks_for_ns <namespace> (referencing the
delete_orphaned_webhooks_for_ns helper) so only webhooks tied to those
namespaces are removed.
- Around line 527-537: The loop is force-clearing finalizers for every
Terminating namespace; change it to only act on namespaces this bundle created
by recording created namespaces (e.g., a CREATED_NAMESPACES or BUNDLE_NAMESPACES
array when you create them) and then, inside the loop where TERMINATING is
computed, filter TERMINATING to the intersection with that recorded list (or to
namespaces matching the bundle-specific label) before calling
force_clear_namespace_finalizers and kubectl delete; update the loop that uses
TERMINATING and the calls to force_clear_namespace_finalizers/kubectl delete to
use the filtered list so unrelated terminating namespaces are not mutated.

In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml`:
- Around line 1-2: This YAML test fixture (cluster-values.yaml) is missing the
required addlicense header; run the repo's addlicense tool or manually prepend
the standard license banner to the top of cluster-values.yaml so the file
contains the full license header followed by the existing generator banner
(---), ensuring the header exactly matches the project's canonical addlicense
text.

In `@pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh`:
- Around line 326-340: The retry loop calls cleanup_helm_hooks with the
component name variable (${name}) but the function requires the component
namespace; update the call inside the while-true retry block to pass the
namespace variable (e.g., ${namespace}) instead of ${name} so
cleanup_helm_hooks("${namespace}") removes hook Jobs from the correct namespace;
adjust the call near cleanup_helm_hooks in the loop that also calls
dump_kai_scheduler_helm_diagnostics and helm_failed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 87-110: The Troubleshooting section in README.md references
gpu-operator and network-operator but the component table only lists
cert-manager; update the troubleshooting commands and references to match
cert-manager only (replace kubectl commands that grep for
gpu-operator|network-operator|cert-manager with cert-manager-specific checks,
e.g., kubectl -n cert-manager get pods and kubectl -n cert-manager logs -l
app=cert-manager) and remove or replace GPU/network operator references and
links with a cert-manager reference (e.g., cert-manager docs) so the README’s
component set and troubleshooting steps are consistent.

In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 56-58: The writeFile call that writes the upstream.env file
currently uses mode 0o600; change it to 0o644 so the generated upstream.env
(containing CHART/REPO/VERSION metadata) is world-readable but owner-writable
for consistency with other bundle files — locate the call using
writeFile(envPath, []byte(envContent), 0o600) and update the file mode to 0o644,
keeping envPath/envContent/unrelated logic unchanged.

In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 117-119: The failing test dereferences folders[0].Kind while
formatting a t.Fatalf message, causing a panic if folders is empty; change the
assertions in writer_test.go to first check len(folders) and call t.Fatalf (or
t.Fatalf with a message that only uses len) if it's 0 or unexpected length, and
only after that check folders[0].Kind in a separate assertion; apply the same
split-check fix to the other occurrence around the later assertion (the block
currently at lines ~309-311) so you never access folders[0] unless the length
check passed.

In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 81-101: The loop that builds folder names uses
fmt.Sprintf("%03d-%s", idx, c.Name) and will produce 4+ digit prefixes once idx
>= 1000 which breaks downstream glob/parse logic; modify the loop in the writer
that iterates over opts.Components (the block using idx, folders, classify, dir)
to fail fast when idx > 999 by returning a clear
errors.New/errors.ErrCodeInvalidRequest error (e.g., "too many components; max
999 supported") before formatting the dir, so no folder with more than three
digits is emitted; alternatively, if you prefer to support more components,
change the format to wider padding consistently with deploy/undeploy parsing,
but ensure both the dir creation (fmt.Sprintf("%03d-%s", ...)) and any
parsing/globs are updated together.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The test currently uses grep -q 'kind: Gateway' which also
matches 'kind: GatewayParameters' and can produce false positives; update the
assertion that checks for the Gateway kind to match the full line exactly (e.g.,
change the grep invocation that uses "${MANIFEST}" to use an
anchored/word-boundary pattern such as grep -q '^kind:[[:space:]]*Gateway$' or
grep -x 'kind: Gateway' so it only passes when a literal "kind: Gateway" line
exists in the MANIFEST).

In `@tests/chainsaw/bundle-templates/skyhook-customizations/chainsaw-test.yaml`:
- Line 69: The command using --resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-skyhook-customizations/templates/tuning.yaml
2>/dev/null | head -1)" can yield an empty string; assign that output to a
variable (e.g., MANIFEST=$(ls ... | head -1)), then add an explicit not-found
guard [ -n "${MANIFEST}" ] || exit 1 before using it, and replace the inline
subshell with --resource "${MANIFEST}"; apply the same change for the other
occurrences of the identical pattern (the lines corresponding to the other
tuning.yaml lookups).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 060ea888-280a-471c-85e5-ff0c65d23be9

📥 Commits

Reviewing files that changed from the base of the PR and between 8ffea23 and a79ff85.

📒 Files selected for processing (76)
  • Makefile
  • docs/contributor/component.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler_test.go
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/component-README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/README.md
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/skyhook_present/README.md
  • pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
  • pkg/bundler/deployer/localformat/doc.go
  • pkg/bundler/deployer/localformat/folder.go
  • pkg/bundler/deployer/localformat/kustomize.go
  • pkg/bundler/deployer/localformat/local_helm.go
  • pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
  • pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
  • pkg/bundler/deployer/localformat/upstream_helm.go
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/handler_test.go
  • tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/skyhook-customizations/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
  • tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
💤 Files with no reviewable changes (1)
  • pkg/bundler/deployer/helm/templates/component-README.md.tmpl

Comment threads:
  • docs/user/cli-reference.md
  • Makefile
  • pkg/bundler/bundler_test.go
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (outdated)
  • pkg/bundler/deployer/localformat/upstream_helm.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/deployer/localformat/writer.go
  • tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/skyhook-customizations/chainsaw-test.yaml (outdated)
@mchmarny — this comment was marked as resolved.

@lockwobr (Contributor, Author):

> The whole test_data folder under deployer seems like it would be very brittle. Was that supposed to be committed, or are they generated on the fly from testing?

@ayuskauskas Nope, this was a big change I made. How is it brittle? `bundle` should output the same thing with the same inputs. IMHO, this also helps us humans understand what a change made by AI will do to the bundle output.

Comment thread (outdated): pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
@ayuskauskas (Contributor):

> > The whole test_data folder under deployer seems like it would be very brittle. Was that supposed to be committed, or are they generated on the fly from testing?
>
> @ayuskauskas Nope, this was a big change I made. How is it brittle? `bundle` should output the same thing with the same inputs. IMHO, this also helps us humans understand what a change made by AI will do to the bundle output.

It is unclear to me what the expected and actual content in the test_data is. I thought it was brittle because there are real component names, and I assumed it was referencing actual values we have in recipes. It is more obviously fake data once you look closer, but I think a README would help clarify what is going on and how to update or add to these tests.

@github-actions Bot (Contributor) commented Apr 28, 2026:

Coverage Report ✅

| Metric    | Value |
| --------- | ----- |
| Coverage  | 74.8% |
| Threshold | 70%   |
| Status    | Pass  |

Coverage badge:

![Coverage](https://img.shields.io/badge/coverage-74.8%25-green)

Merging this branch will increase overall coverage.

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/bundler | 61.34% (ø) |  |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/helm | 87.85% (+0.22%) | 👍 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat | 67.42% (+67.42%) | 🌟 |

Coverage by file

Changed files (no unit tests):

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/helm/doc.go | 0.00% (ø) | 0 | 0 | 0 |  |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/helm/helm.go | 87.85% (+0.22%) | 107 (-79) | 94 (-69) | 13 (-10) | 👍 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/doc.go | 0.00% (ø) | 0 | 0 | 0 |  |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/folder.go | 75.00% (+75.00%) | 4 (+4) | 3 (+3) | 1 (+1) | 🌟 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/kustomize.go | 72.73% (+72.73%) | 11 (+11) | 8 (+8) | 3 (+3) | 🌟 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/local_helm.go | 73.08% (+73.08%) | 52 (+52) | 38 (+38) | 14 (+14) | 🌟 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/upstream_helm.go | 63.64% (+63.64%) | 22 (+22) | 14 (+14) | 8 (+8) | 🌟 |
| github.com/NVIDIA/aicr/pkg/bundler/deployer/localformat/writer.go | 65.15% (+65.15%) | 132 (+132) | 86 (+86) | 46 (+46) | 🌟 |
| github.com/NVIDIA/aicr/pkg/bundler/doc.go | 0.00% (ø) | 0 | 0 | 0 |  |
| github.com/NVIDIA/aicr/pkg/bundler/handler.go | 75.20% (ø) | 125 | 94 | 31 |  |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@coderabbitai Bot left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml (1)

1-5: ⚠️ Potential issue | 🟡 Minor

Add Apache license header for consistency with other testdata fixtures.

While this is a testdata fixture, other testdata YAML files in this PR (e.g., pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml and pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml) include the Apache 2.0 license header. For consistency and to satisfy the addlicense tool, add the standard header here.

As per coding guidelines, "License headers must be present and up-to-date on all source files via addlicense tool" for **/*.{yaml,yml}.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml`
around lines 1 - 5, Add the standard Apache 2.0 license header to this YAML test
fixture so it matches other testdata files and satisfies the addlicense tool;
update the file containing the YAML keys "crds:" and "enabled: true" by
prepending the canonical Apache 2.0 header comment block at the top of
values.yaml (above the existing '---' marker) so the license is present and
up-to-date.
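For reference, an addlicense-style Apache 2.0 header for a YAML fixture looks like the following sketch. The copyright holder and year are placeholders, not taken from this repository — match whatever the repo's addlicense invocation actually emits:

```yaml
# Copyright (c) 2026, EXAMPLE HOLDER.  # placeholder holder/year
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
crds:
  enabled: true
```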
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Add Apache license header for consistency.

Like other testdata YAML fixtures in this PR, this file should include the Apache 2.0 license header to satisfy the addlicense tool and maintain consistency across the testdata directory structure.

As per coding guidelines, "License headers must be present and up-to-date on all source files via addlicense tool" for **/*.{yaml,yml}.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml`
around lines 1 - 3, Add the standard Apache 2.0 license header to the top of the
values.yaml test fixture (the same header used in other testdata YAML files) so
it satisfies the addlicense tool and matches repository conventions; update the
file named values.yaml under the skyhook operator testdata to include the
multi-line Apache License, Version 2.0 comment block at the very top before the
existing YAML document marker.
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)

266-287: ⚠️ Potential issue | 🔴 Critical

Variable name mismatch will cause silent taint removal failure.

Line 266 sets SKYHOOK_TAINT_KEY, but lines 280, 282, and 285 reference the undefined NODEWRIGHT_TAINT_KEY. This will cause the taint discovery and removal to fail silently since ${NODEWRIGHT_TAINT_KEY} will be empty.

🐛 Fix variable name references
```diff
-  stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true)
+  stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${SKYHOOK_TAINT_KEY}" | awk '{print $1}' || true)
   if [[ -n "${stale_nodewright}" ]]; then
-    echo "WARNING: nodes with stale ${NODEWRIGHT_TAINT_KEY} taints (no running nodewright-operator): ${stale_nodewright}"
+    echo "WARNING: nodes with stale ${SKYHOOK_TAINT_KEY} taints (no running nodewright-operator): ${stale_nodewright}"
     echo "  Removing stale taints to unblock scheduling..."
     for node in ${stale_nodewright}; do
-      kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true
+      kubectl taint node "${node}" "${SKYHOOK_TAINT_KEY}-" 2>/dev/null || true
     done
   fi
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 266 - 287,
The script sets SKYHOOK_TAINT_KEY but later uses NODEWRIGHT_TAINT_KEY, causing
empty-variable behavior; update all usages of NODEWRIGHT_TAINT_KEY in the
kubectl jsonpath/grep/echo/kubectl taint commands to use SKYHOOK_TAINT_KEY so
the stale taint discovery and removal logic (variables: SKYHOOK_TAINT_KEY,
stale_nodewright, kubectl taint node) operates on the intended taint key.
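The failure mode is worth spelling out: an unset variable expands to the empty string, and `grep` with an empty pattern matches every line, so the node discovery over-matches rather than erroring (the subsequent `kubectl taint` call with a malformed key is then swallowed by `|| true`). A minimal sketch with illustrative node names:

```shell
#!/usr/bin/env bash
# When the taint-key variable is empty, grep's pattern is the empty
# string, which matches every input line — so every node looks "stale".
UNSET_TAINT_KEY=""
matched=$(printf 'node-a nvidia.com/gpu\nnode-b other/taint\n' \
  | grep -c "${UNSET_TAINT_KEY}")
echo "matched ${matched} of 2 lines"   # matched 2 of 2 lines
```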
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

482-500: ⚠️ Potential issue | 🔴 Critical

Variable name mismatch will cause silent taint removal failure.

Line 482 sets SKYHOOK_TAINT_KEY, but lines 497, 498, and 499 reference the undefined NODEWRIGHT_TAINT_KEY. This is the same bug as in deploy.sh.tmpl.

🐛 Fix variable name references
```diff
-echo "Removing ${NODEWRIGHT_TAINT_KEY} node taints..."
+echo "Removing ${SKYHOOK_TAINT_KEY} node taints..."
 for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
-  kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true
+  kubectl taint node "${node}" "${SKYHOOK_TAINT_KEY}-" 2>/dev/null || true
 done
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 482 - 500,
The script sets and possibly overrides SKYHOOK_TAINT_KEY but later echoes and
uses NODEWRIGHT_TAINT_KEY, causing silent failures; update the references in the
echo and kubectl taint loop to use SKYHOOK_TAINT_KEY (or make the initial
variable NODEWRIGHT_TAINT_KEY and keep it consistent) so the taint removal uses
the same variable (check occurrences around the echo "Removing ..." and the
kubectl taint node "${node}" ... lines).
♻️ Duplicate comments (19)
Makefile (1)

150-150: ⚠️ Potential issue | 🟠 Major

Narrow the testdata exclusion to non-source patterns only.

This blanket **/testdata/** ignore bypasses license enforcement on all testdata content, including shell scripts, Go files, and other source-like fixtures. As flagged in the previous review, this violates the requirement that license headers must be present on all source files.

🔍 Verification: Identify source files in testdata that should retain license checks
```bash
#!/bin/bash
# Description: Find source-like files in testdata that would lose license enforcement

echo "=== Shell scripts in testdata ==="
fd -e sh -t f . --search-path . -g '**/testdata/**' | head -20

echo ""
echo "=== Go files in testdata ==="
fd -e go -t f . --search-path . -g '**/testdata/**' | head -20

echo ""
echo "=== YAML files in testdata ==="
fd -e yaml -e yml -t f . --search-path . -g '**/testdata/**' | head -20

echo ""
echo "=== Python files in testdata ==="
fd -e py -t f . --search-path . -g '**/testdata/**' | head -10
```
♻️ Recommended fix: Replace blanket ignore with specific non-source patterns

Instead of ignoring the entire testdata tree, ignore only generated/binary/fixture extensions:

```diff
 LICENSE_IGNORES = \
 	-ignore '.git/**' \
 	-ignore '.venv/**' \
 	-ignore '**/__pycache__/**' \
 	-ignore '**/.venv/**' \
 	-ignore '*/.venv/**' \
 	-ignore '**/.idea/**' \
 	-ignore '**/*.csv' \
 	-ignore '**/*.pyc' \
 	-ignore '**/*.xml' \
 	-ignore '**/*lock.hcl' \
 	-ignore '**/*pb2*' \
 	-ignore 'bundles/**' \
 	-ignore 'dist/**' \
 	-ignore 'vendor/**' \
-	-ignore '**/testdata/**' \
+	-ignore '**/testdata/**/*.json' \
+	-ignore '**/testdata/**/*.txt' \
+	-ignore '**/testdata/**/*.md' \
+	-ignore '**/testdata/**/manifest.yaml' \
+	-ignore '**/testdata/**/values.yaml' \
 	-ignore 'site/public/**' \
 	-ignore 'site/resources/**' \
 	-ignore 'site/node_modules/**'
```

This ensures that source files (.sh, .go, etc.) in testdata still receive license headers while excluding only true fixture data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 150, The Makefile currently ignores '**/testdata/**' which
skips license enforcement for all files under testdata; update the ignore rule
to remove the blanket '**/testdata/**' entry and instead add explicit
non-source-only patterns (e.g., generated binaries, compressed fixtures, large
data files, and known fixture extensions) so that source-like files (e.g., .sh,
.go, .py, .yml/.yaml) in testdata continue to be checked; locate the ignore line
with the '**/testdata/**' token and replace it with a curated list of non-source
patterns to exclude only true fixture/binary files.
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml (1)

1-2: ⚠️ Potential issue | 🟠 Major

Add the standard addlicense header above the generated banner.

This fixture still contains only the generated comment and YAML separator. Prepend the canonical license block so it matches the repo’s required YAML source header.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml`
around lines 1 - 2, This file currently only has the generated banner "#
Generated by Cloud Native Stack" and the YAML separator "---"; prepend the
canonical addlicense header block above that banner so the file contains the
standard license comment (as required for "*.{go,yaml,yml,mk}") before the
existing "# Generated by Cloud Native Stack" line, ensuring the license text is
formatted as a comment and preserved as the first content in the file.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml (1)

1-2: ⚠️ Potential issue | 🟠 Major

Add the repository-standard addlicense header.

This fixture still starts with only the generated banner and YAML separator. It needs the standard addlicense block prepended above them.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml`
around lines 1 - 2, Prepend the repository-standard addlicense header block to
this YAML fixture so the file starts with the license header before the existing
generated banner and YAML separator; update the top of
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
by inserting the standard addlicense header (as used across
**/*.{go,yaml,yml,mk} in the repo) above the existing "# Generated by Cloud
Native Stack" line, and verify with the addlicense tool to ensure the header is
present and up-to-date.
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)

87-110: ⚠️ Potential issue | 🟡 Minor

Make the troubleshooting and references match this fixture’s component set.

This README documents only cert-manager, but these sections still point readers to gpu-operator and network-operator checks. Replace them with cert-manager-specific status/log commands and references so the fixture stays accurate.

As per coding guidelines, "When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines
87 - 110, Update the Troubleshooting and References to only cover cert-manager:
replace the pod/status check that greps for
'gpu-operator|network-operator|cert-manager' with a cert-manager-specific status
command (e.g., kubectl get pods -n cert-manager and kubectl get deployments -n
cert-manager), change the logs command to target the cert-manager namespace and
labels (instead of gpu-operator), and swap the GPU/Network verification command
and the two external links for cert-manager-relevant checks and documentation
links (e.g., cert-manager docs and common troubleshooting pages). Ensure the
README sections "Troubleshooting", "Check deployment status", "View component
logs", "Verify GPU access" (rename that subsection appropriately) and
"References" consistently reference cert-manager only.
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml (1)

1-1: ⚠️ Potential issue | 🟠 Major

Replace the single-line banner with the full addlicense header.

This template still has only the one-line copyright notice. It needs the repository-standard multi-line license header at the top.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml`
at line 1, The file customization.yaml has only a single-line copyright banner;
replace it with the repository-standard multi-line addlicense header at the very
top of the file (the same full header used across other YAML files), ensuring
the copyright year and owner are correct, and then run the addlicense tool to
update/validate headers; target the template name "customization.yaml" and
update the top-of-file comment block accordingly.
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml (1)

1-2: ⚠️ Potential issue | 🟠 Major

License header is still missing in this YAML source.

Line 1 includes only the generated banner; the required addlicense header is not present.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml`
around lines 1 - 2, This YAML file is missing the required addlicense header at
the top; insert the standard project license header (the addlicense-generated
comment block) as the very first lines before the existing "# Generated by Cloud
Native Stack" banner, then re-run the addlicense tool to ensure the header is
up-to-date and correctly formatted for YAML files (so the file starting with
'---' remains unchanged after the header).
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)

1-6: ⚠️ Potential issue | 🟠 Major

Make this fixture compliant and installable (header + Service spec).

Line 1 is not the full addlicense header, and Lines 3-6 define a Service without a spec, which makes the fixture non-installable as a realistic golden output.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`
around lines 1 - 6, The file is missing the full addlicense header and defines a
Service (metadata.name: dcgm-exporter) without a spec, making the fixture
non-installable; update the top of dcgm-exporter.yaml to include the proper
addlicense header used across the repo, then provide a realistic Service spec
for the Service named "dcgm-exporter" (include at minimum spec.selector matching
the DCGM exporter Deployment/Pod labels, and a spec.ports entry with port and
targetPort and protocol, and optionally type: ClusterIP) so the manifest is
valid and installable.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml (1)

1-2: ⚠️ Potential issue | 🟠 Major

Still missing the required addlicense header.

Line 1 includes only the generated banner; the repository-standard license block is still absent.

As per coding guidelines, "**/*.{go,yaml,yml,mk}: License headers must be present and up-to-date on all source files via addlicense tool."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml`
around lines 1 - 2, This YAML file currently only contains the generated banner
"# Generated by Cloud Native Stack" and is missing the repository-standard
addlicense header; run the addlicense tool (or insert the standard license
block) so the file that contains the generated banner and YAML document start
("---") includes the required license header at the top, matching the project's
license template used for **/*.go, *.yaml, *.yml, *.mk files.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml (1)

1-5: ⚠️ Potential issue | 🟡 Minor

Add Apache license header (duplicate of earlier review).

This issue was already identified in a previous review round. The file is missing the required license header that other testdata YAML files include.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml`
around lines 1 - 5, Add the standard Apache 2.0 license header as a YAML comment
block at the top of this values.yaml (above the existing document start '---')
to match other testdata files; ensure the header text and year/owner match the
project’s existing license header style and keep the rest of the YAML (the
top-level "driver" key and its enabled: true) unchanged.
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)

71-72: ⚠️ Potential issue | 🟠 Major

Tighten kind grep to avoid false positives.

grep -q 'kind: Gateway' can pass on kind: GatewayParameters, so this assertion can miss a missing Gateway object.

Suggested fix
```diff
-              grep -q 'kind: GatewayParameters' "${MANIFEST}"
-              grep -q 'kind: Gateway' "${MANIFEST}"
+              grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+              grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71
- 72, The current grep for 'kind: Gateway' can match 'kind: GatewayParameters'
and yield false positives; update the assertion that checks MANIFEST so the grep
for the Gateway kind only matches the exact kind value (e.g., use a regex that
anchors the line and allows optional spaces such as '^kind:[[:space:]]*Gateway$'
or use grep -w/word-boundary matching) while keeping the existing 'kind:
GatewayParameters' check; apply this change to the second grep invocation that
currently searches for 'kind: Gateway'.
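The difference is easy to demonstrate outside the chainsaw test: an unanchored grep for `kind: Gateway` also matches `kind: GatewayParameters`, while the anchored form does not. A quick sketch over hypothetical manifest contents:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Manifest with GatewayParameters but NO Gateway object.
manifest='apiVersion: gateway.networking.k8s.io/v1
kind: GatewayParameters'

# Unanchored check: false positive — passes despite the missing Gateway.
if printf '%s\n' "${manifest}" | grep -q 'kind: Gateway'; then
  echo "unanchored: matched"
fi

# Anchored check: correctly reports no Gateway object.
if ! printf '%s\n' "${manifest}" | grep -q '^kind:[[:space:]]*Gateway$'; then
  echo "anchored: no match"
fi
```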
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Troubleshooting/readiness text is inconsistent with the fixture’s component set.

This fixture lists only kai-scheduler, but these sections still reference GPU/network/cert-manager checks and docs. Please make this section kai-scheduler-specific (or explicitly generic examples) so the fixture remains internally consistent.

As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”

Also applies to: 89-110

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line
41, The README's troubleshooting/readiness text mentions
GPU/network/cert-manager checks that don't apply to this fixture which only
contains kai-scheduler; update the README.md sections (including the paragraph
currently referencing the deploy script behavior and the later block around
lines 89–110) to either (a) be kai-scheduler-specific — describing only
kai-scheduler readiness indicators and relevant logs/commands — or (b) replace
the examples with clearly labeled generic placeholders and link to the global
deploy docs; ensure all references to GPU, network, and cert-manager checks are
removed or annotated as not-applicable for the kai-scheduler-only fixture.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)

87-110: ⚠️ Potential issue | 🟡 Minor

Troubleshooting references components not present in this fixture.

This fixture contains gpu-operator, but these commands/links also reference network-operator and cert-manager. Please scope troubleshooting to gpu-operator (or clearly label as generic examples).

As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines
87 - 110, The troubleshooting section currently references components not
present in this fixture (network-operator, cert-manager); update the README.md
Troubleshooting block so it only targets the gpu-operator (e.g., keep the
kubectl pod check, logs command, and GPU access verification but remove or
rename any grep/filter for network-operator and cert-manager), or explicitly
prefix the section as "Generic examples" if you want to keep other commands;
ensure headings like "Troubleshooting" and commands such as "kubectl logs -n
gpu-operator -l app=gpu-operator" and the GPU verification jsonpath are accurate
and scoped to gpu-operator.
docs/user/cli-reference.md (1)

1152-1152: ⚠️ Potential issue | 🟡 Minor

Use the compound modifier form: local-helm-wrapped chart.

Please hyphenate this phrase for correct modifier grammar.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` at line 1152, Replace the un-hyphenated modifier
"local-helm wrapped chart" with the hyphenated compound modifier
"local-helm-wrapped chart" wherever that phrase appears (e.g., in the sentence
starting "Manifest-only components... become a single local-helm wrapped
chart"); update the documented phrase to "local-helm-wrapped chart" to fix
modifier grammar.
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh (2)

472-475: ⚠️ Potential issue | 🟡 Minor

Remove the stale skyhook cleanup comment.

This fixture only covers kai-scheduler. The comment block about removing skyhook node taints (lines 472-475) is irrelevant and misstates what the script cleans up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around
lines 472 - 475, The comment block in undeploy.sh that says "Remove skyhook node
taints that persist after operator removal" is incorrect for this fixture (only
covers kai-scheduler); remove or replace that stale comment with a brief
accurate comment about cleaning up kai-scheduler resources. Locate the comment
text "Remove skyhook node taints..." in undeploy.sh and either delete those
lines or change them to a one-line note describing kai-scheduler-specific
cleanup so the script comments match its actual behavior.

43-45: ⚠️ Potential issue | 🟠 Major

Validate --timeout before appending s.

--timeout 5m becomes 5ms downstream, and non-numeric tokens defer failure to helm/kubectl. Reject non-integer input in the parser.

🛡️ Proposed fix
```diff
     --timeout)
       if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
+      if [[ ! "$2" =~ ^[0-9]+$ ]]; then
+        echo "Error: --timeout must be an integer number of seconds"
+        exit 1
+      fi
       HELM_TIMEOUT="$2"; shift 2 ;;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around
lines 43 - 45, The --timeout branch stores HELM_TIMEOUT from the parser but
later appends an "s", so inputs like "5m" become "5ms" and non-numeric tokens
get passed to helm/kubectl; change the parser for the "--timeout" case to
validate that $2 is a positive integer (e.g., using a regex like ^[0-9]+$)
before assigning HELM_TIMEOUT, and if validation fails print an error and exit
non-zero; reference the "--timeout" case and the HELM_TIMEOUT variable in
undeploy.sh when implementing this check.
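A standalone sketch of the proposed validation — the function name here is illustrative, not from the generated script:

```shell
#!/usr/bin/env bash
# Accept only a positive integer number of seconds for --timeout;
# reject unit-suffixed values like "5m" that would become "5ms" once
# the script appends its own "s".
parse_timeout() {
  local value="$1"
  if [[ ! "${value}" =~ ^[0-9]+$ ]]; then
    echo "Error: --timeout must be an integer number of seconds" >&2
    return 1
  fi
  echo "${value}s"
}

parse_timeout 300                        # prints 300s
parse_timeout 5m || echo "rejected: 5m"  # prints rejected: 5m
```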
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (1)

247-253: ⚠️ Potential issue | 🟠 Major

skyhook-customizations preflight skip is unreachable.

The skip_preflight_for_release() case pattern only matches nodewright-operator|kgateway, but line 418 calls it with skyhook-customizations. The condition at line 418 always evaluates to false, so the check at line 422 always runs instead of being exempted.

🐛 Proposed fix: Add skyhook-customizations to the case pattern
 skip_preflight_for_release() {
   case "$1" in
-    nodewright-operator|kgateway) return 0 ;;
+    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
     *) return 1 ;;
   esac
 }

Also applies to: 418-423

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines
247 - 253, The helper function skip_preflight_for_release currently only matches
"nodewright-operator" and "kgateway", so calls passing "skyhook-customizations"
never hit the early return; update the case pattern in
skip_preflight_for_release to include "skyhook-customizations" (e.g., add it
alongside nodewright-operator|kgateway) so the function returns 0 for that
release name, and also update any duplicate/other occurrences of the same case
pattern to keep behavior consistent.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh (2)

527-537: ⚠️ Potential issue | 🔴 Critical

Limit force-finalization to bundle-managed namespaces.

After 60 seconds this loop force-clears finalizers on every namespace in Terminating, not just namespaces this bundle deleted. On a shared cluster, this mutates unrelated workloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around
lines 527 - 537, The loop is force-clearing finalizers on every Terminating
namespace; restrict that to only namespaces owned/managed by this bundle by
filtering TERMINATING before forcing deletion. Change how TERMINATING is
computed (or post-filter it) so it only contains namespaces with the
bundle-specific identity (e.g., a label like bundle=<name> or by intersecting
with a previously recorded list of namespaces this script deleted), then call
force_clear_namespace_finalizers and kubectl delete namespace only for that
filtered set; update references to TERMINATING,
force_clear_namespace_finalizers, and the kubectl delete namespace --wait=false
call accordingly.
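One way to scope the sweep: intersect the Terminating list with namespaces the bundle recorded as its own before touching finalizers. A sketch with mocked input (the `BUNDLE_NAMESPACES` array and `filter_to_bundle` helper are hypothetical names, not from the script):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only force-clear finalizers on namespaces this
# bundle previously recorded as deleted, not every Terminating namespace.
BUNDLE_NAMESPACES=("gpu-operator" "network-operator")

filter_to_bundle() {
  local ns owned
  for ns in "$@"; do
    for owned in "${BUNDLE_NAMESPACES[@]}"; do
      [[ "$ns" == "$owned" ]] && echo "$ns"
    done
  done
  return 0
}

# TERMINATING would normally come from kubectl get ns; mocked here.
TERMINATING="gpu-operator kube-system cert-manager"
filter_to_bundle $TERMINATING   # word-splitting intended
```

Only the filtered set would then be passed to force_clear_namespace_finalizers and kubectl delete namespace.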

559-568: ⚠️ Potential issue | 🟠 Major

Do not sweep stale webhooks cluster-wide here.

This pass deletes any webhook whose backing service or namespace is missing anywhere in the cluster, which can remove another tenant’s admission webhooks during unrelated cleanup. Reuse the namespace-scoped helper for bundle namespaces instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around
lines 559 - 568, The current loop scans all
mutatingwebhookconfiguration/validatingwebhookconfiguration cluster-wide and
deletes any webhook whose backing service/namespace is missing (using variables
wh_name, svc_ns, svc_name), which can remove other tenants' webhooks; change
this to only check bundle namespaces by reusing the existing namespace-scoped
helper (instead of the cluster-wide `kubectl get ... | jq ...`); specifically,
iterate only over your bundle namespaces (or call the helper like
cleanup_webhooks_in_namespace or equivalent) and for each namespace run a scoped
check (e.g., list webhooks that reference services in that namespace and verify
those services/namespaces) and delete only those matching webhooks
(mutatingwebhookconfiguration/validatingwebhookconfiguration) when the service
in that same namespace is missing.
pkg/bundler/deployer/localformat/writer.go (1)

100-101: ⚠️ Potential issue | 🟠 Major

Still emitting 4-digit folder names will desync the generated scripts.

Once idx reaches 1000, this produces 1000-*, but the generated deploy/undeploy loops only match [0-9][0-9][0-9]-*/, so those folders are silently skipped. Fail fast here or widen the script glob/parsing in the same change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 100 - 101, The code
uses fmt.Sprintf("%03d-%s", idx, c.Name) which will produce 4+ digit prefixes
once idx >= 1000 and desync the generated scripts; add a bounds check around the
existing dir formation (check the idx variable before dir := fmt.Sprintf(...))
and return a clear error (or panic) if idx >= 1000 so we fail fast instead of
emitting a mismatched folder, referencing the existing idx and dir creation (dir
:= fmt.Sprintf("%03d-%s", idx, c.Name)); alternatively, if you prefer to support
larger counts, replace "%03d" with "%04d" and ensure the matching
deploy/undeploy glob/parsing is updated accordingly.
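The desync is easy to reproduce in shell: a 4-digit prefix is invisible to the 3-digit glob the generated scripts use.

```shell
#!/usr/bin/env bash
# A 4-digit folder prefix is silently skipped by the generated scripts'
# [0-9][0-9][0-9]-*/ glob, which matches exactly three leading digits.
set -euo pipefail
tmp=$(mktemp -d)
mkdir -p "$tmp/001-cert-manager" "$tmp/1000-extra"
cd "$tmp"
shopt -s nullglob
for d in [0-9][0-9][0-9]-*/; do
  echo "would install: $d"
done
# prints only: would install: 001-cert-manager/
```

Hence the suggestion to bound idx at 999 in the writer, or widen both the format verb and the glob together.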
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/contributor/component.md`:
- Line 21: Rewrite the sentence to clearly separate current behavior from future
intent: state which deployers currently consume pkg/bundler/deployer/localformat
(and that deployers call localformat.Write()) and then add a separate clause or
sentence describing planned or intended support (e.g., future helmfile per issue
`#632`) as prospective, not implemented; ensure you remove phrasing like “consumed
by every deployer” and instead use explicit wording such as “currently used by
X, Y” and “intended/future support for Z,” while keeping references to symbols
localformat.Write() and the per-folder contents (Chart.yaml, values.yaml, etc.)
for clarity.

In `@pkg/bundler/bundler_test.go`:
- Around line 1199-1212: The test redundantly checks for cluster-values.yaml
inside directories already asserted to not exist; remove the inner os.Stat check
that tests filepath.Join(tmpDir, dir, "cluster-values.yaml") within the loop
that iterates over []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver",
"002-aws-ebs-csi-driver"} in bundler_test.go (the tmpDir loop), leaving only the
os.Stat check that ensures the directory itself does not exist.

In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml`:
- Around line 1-2: Add the standard repository license header (the
addlicense-generated YAML header) to the top of cluster-values.yaml above the
existing generator banner line "# Generated by Cloud Native Stack"; ensure the
header matches the project's addlicense output for YAML files and is up-to-date,
then save the file so the license block is the first lines followed by the
existing "---" and generator comment.

In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml`:
- Around line 1-2: This YAML fixture is missing the required addlicense header;
insert the repository's standard license comment block (SPDX or the
addlicense-produced header) as YAML comments at the very top of the file, above
the existing "# Generated by Cloud Native Stack" banner and before the document
marker "---", or run the addlicense tool to automatically add/update the header
so the file has the proper license header format and metadata.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml`:
- Around line 1-2: The YAML fixture values.yaml is missing the standard
addlicense header; prepend the repository's standard addlicense license block at
the very top of this file (before the generated banner line "---") so the file
contains the required license header format used for "*.go, *.yaml, *.yml, *.mk"
files; ensure the header matches the project's canonical addlicense text
exactly.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md`:
- Line 21: The table row for the manifest-only component shows empty Version and
a Source rendered as "skyhook-customizations ()"; update the README table entry
(the row containing "skyhook-customizations") so that Version is "N/A" and
Source is "local" (e.g., change the row to use "N/A" for Version and "local" for
Source) to avoid ambiguous empty placeholders in manifest-only component
documentation.
- Around line 87-110: The Troubleshooting section in README.md currently
contains GPU/network operator-specific commands and references; replace or
annotate them so they target the skyhook-customizations fixture instead. Update
the kubectl commands (the pod listing grep, the logs command targeting
namespace/label, and the node GPU jsonpath check) to use skyhook component
names/namespaces/labels (e.g., replace
'gpu-operator|network-operator|cert-manager' and '-n gpu-operator -l
app=gpu-operator' with the appropriate skyhook pod selectors), and swap the two
NVIDIA docs with skyhook-relevant links OR add a clear note that the examples
are generic and not specific to this fixture; ensure the Troubleshooting header
and examples reflect skyhook operational checks.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml`:
- Around line 1-2: This YAML fixture is missing the repository standard
multi-line license header; prepend the standard addlicense header (the same
header used across repo YAML files) above the existing generator banner ("#
Generated by Cloud Native Stack") and document separator ("---") so the file
complies with the repo-wide YAML source policy; run the addlicense tool or copy
the canonical header and place it at the very top of the file so the header
appears before any other content.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 85-93: The helm_force_uninstall function currently swallows
failures (calls helm uninstall with --no-hooks || true) so the script proceeds
even if uninstall fails; modify helm_force_uninstall to detect a non-zero exit
from the retry and propagate it (e.g., capture the exit status of the second
helm uninstall, log a clear error including release/ns and HELM_TIMEOUT, and
exit non-zero or return the failure code) instead of using "|| true" so later
cleanup only runs when the release was actually removed.

In `@pkg/bundler/deployer/helm/testdata/skyhook_present/README.md`:
- Around line 87-110: The troubleshooting section in README.md contains
GPU/network-operator-specific commands and links (e.g., the pod grep line
matching 'gpu-operator|network-operator|cert-manager', the kubectl logs command
targeting namespace 'gpu-operator' and label 'app=gpu-operator', and the NVIDIA
docs links) which do not match the skyhook-operator fixture; update this section
to use skyhook-relevant diagnostics (replace the grep/filter to match
skyhook-operator services, change the kubectl logs namespace/label to the
skyhook deployment/service names, and swap the reference links to skyhook
operator docs) or explicitly mark the examples as generic/placeholder so the
fixture reads consistently.

In `@pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh`:
- Around line 490-493: The script uses an undefined variable
NODEWRIGHT_TAINT_KEY in the node taint removal loop which will fail under set
-u; update the loop to reference the existing SKYHOOK_TAINT_KEY (or ensure
NODEWRIGHT_TAINT_KEY is defined earlier) so the variable name matches the
defined constant (see the assignment of SKYHOOK_TAINT_KEY around line ~475 and
the taint removal loop using NODEWRIGHT_TAINT_KEY).

In
`@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml`:
- Around line 1-2: This YAML fixture is missing the repository-standard
addlicense header; update the file that begins with the generator comment and
the '---' document marker so it includes the required addlicense YAML license
header block at the top of the file (above the existing comment and '---'),
matching the repository's standard license text used for all *.yaml/*.yml files.

In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 60-66: The envContent creation currently writes CHART, REPO, and
VERSION unquoted which allows shell metacharacters to execute when sourcing
upstream.env; update the envContent assembly in upstream_helm.go (where
envContent is created before calling writeFile) to safely quote each value
(CHART, REPO, c.Version) with shell-safe quoting—e.g., wrap values in single
quotes and escape any embedded single quotes or use a proper shell-quoting
utility—so the file written by writeFile("upstream.env") contains quoted,
escaped values.
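The single-quote escaping described above can be sketched in shell terms; the writer would emit the same quoted form from Go. (`shell_quote` is a hypothetical helper, and the sentinel/version values are illustrative.)

```shell
#!/usr/bin/env bash
# Hypothetical shell_quote: single-quote a value, escaping embedded single
# quotes as '\'' so CHART/REPO/VERSION stay inert when upstream.env is sourced.
shell_quote() {
  local s=$1
  s=${s//\'/\'\\\'\'}
  printf "'%s'" "$s"
}

sentinel=$(mktemp -u)   # file that must NOT be created by sourcing
version="1.2.3\$(touch $sentinel); echo 'oops'"
env_file=$(mktemp)
printf 'VERSION=%s\n' "$(shell_quote "$version")" > "$env_file"
# shellcheck disable=SC1090
source "$env_file"
[[ "$VERSION" == "$version" ]] && echo "round-trip ok"
[[ -e "$sentinel" ]] || echo "no command execution"
```

The value round-trips byte-for-byte and the embedded command substitution never runs.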

In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-102: The grep exclude pattern in the INSTALL assignment is
inconsistent: it uses grep -v -- "-post/" while other occurrences use grep -v --
'-gpu-operator-post'; update the grep in the line that sets INSTALL (the ls |
grep -v ... | head -1 pipeline) to use the same exclude pattern
('-gpu-operator-post') so it consistently filters the same post-release files
when finding the install.sh, and keep the rest of the pipeline (ls, head, and
the subsequent grep -q) unchanged.

---

Outside diff comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 266-287: The script sets SKYHOOK_TAINT_KEY but later uses
NODEWRIGHT_TAINT_KEY, causing empty-variable behavior; update all usages of
NODEWRIGHT_TAINT_KEY in the kubectl jsonpath/grep/echo/kubectl taint commands to
use SKYHOOK_TAINT_KEY so the stale taint discovery and removal logic (variables:
SKYHOOK_TAINT_KEY, stale_nodewright, kubectl taint node) operates on the
intended taint key.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 482-500: The script sets and possibly overrides SKYHOOK_TAINT_KEY
but later echoes and uses NODEWRIGHT_TAINT_KEY, causing silent failures; update
the references in the echo and kubectl taint loop to use SKYHOOK_TAINT_KEY (or
make the initial variable NODEWRIGHT_TAINT_KEY and keep it consistent) so the
taint removal uses the same variable (check occurrences around the echo
"Removing ..." and the kubectl taint node "${node}" ... lines).

In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml`:
- Around line 1-3: Add the standard Apache 2.0 license header to the top of the
values.yaml test fixture (the same header used in other testdata YAML files) so
it satisfies the addlicense tool and matches repository conventions; update the
file named values.yaml under the skyhook operator testdata to include the
multi-line Apache License, Version 2.0 comment block at the very top before the
existing YAML document marker.

In
`@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml`:
- Around line 1-5: Add the standard Apache 2.0 license header to this YAML test
fixture so it matches other testdata files and satisfies the addlicense tool;
update the file containing the YAML keys "crds:" and "enabled: true" by
prepending the canonical Apache 2.0 header comment block at the top of
values.yaml (above the existing '---' marker) so the license is present and
up-to-date.

---

Duplicate comments:
In `@docs/user/cli-reference.md`:
- Line 1152: Replace the un-hyphenated modifier "local-helm wrapped chart" with
the hyphenated compound modifier "local-helm-wrapped chart" wherever that phrase
appears (e.g., in the sentence starting "Manifest-only components... become a
single local-helm wrapped chart"); update the documented phrase to
"local-helm-wrapped chart" to fix modifier grammar.

In `@Makefile`:
- Line 150: The Makefile currently ignores '**/testdata/**' which skips license
enforcement for all files under testdata; update the ignore rule to remove the
blanket '**/testdata/**' entry and instead add explicit non-source-only patterns
(e.g., generated binaries, compressed fixtures, large data files, and known
fixture extensions) so that source-like files (e.g., .sh, .go, .py, .yml/.yaml)
in testdata continue to be checked; locate the ignore line with the
'**/testdata/**' token and replace it with a curated list of non-source patterns
to exclude only true fixture/binary files.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README's troubleshooting/readiness text mentions
GPU/network/cert-manager checks that don't apply to this fixture which only
contains kai-scheduler; update the README.md sections (including the paragraph
currently referencing the deploy script behavior and the later block around
lines 89–110) to either (a) be kai-scheduler-specific — describing only
kai-scheduler readiness indicators and relevant logs/commands — or (b) replace
the examples with clearly labeled generic placeholders and link to the global
deploy docs; ensure all references to GPU, network, and cert-manager checks are
removed or annotated as not-applicable for the kai-scheduler-only fixture.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 472-475: The comment block in undeploy.sh that says "Remove
skyhook node taints that persist after operator removal" is incorrect for this
fixture (only covers kai-scheduler); remove or replace that stale comment with a
brief accurate comment about cleaning up kai-scheduler resources. Locate the
comment text "Remove skyhook node taints..." in undeploy.sh and either delete
those lines or change them to a one-line note describing kai-scheduler-specific
cleanup so the script comments match its actual behavior.
- Around line 43-45: The --timeout branch stores HELM_TIMEOUT from the parser
but later appends an "s", so inputs like "5m" become "5ms" and non-numeric
tokens get passed to helm/kubectl; change the parser for the "--timeout" case to
validate that $2 is a positive integer (e.g., using a regex like ^[0-9]+$)
before assigning HELM_TIMEOUT, and if validation fails print an error and exit
non-zero; reference the "--timeout" case and the HELM_TIMEOUT variable in
undeploy.sh when implementing this check.

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml`:
- Around line 1-2: This YAML file is missing the required addlicense header at
the top; insert the standard project license header (the addlicense-generated
comment block) as the very first lines before the existing "# Generated by Cloud
Native Stack" banner, then re-run the addlicense tool to ensure the header is
up-to-date and correctly formatted for YAML files (so the file starting with
'---' remains unchanged after the header).

In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml`:
- Line 1: The file customization.yaml has only a single-line copyright banner;
replace it with the repository-standard multi-line addlicense header at the very
top of the file (the same full header used across other YAML files), ensuring
the copyright year and owner are correct, and then run the addlicense tool to
update/validate headers; target the template name "customization.yaml" and
update the top-of-file comment block accordingly.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 247-253: The helper function skip_preflight_for_release currently
only matches "nodewright-operator" and "kgateway", so calls passing
"skyhook-customizations" never hit the early return; update the case pattern in
skip_preflight_for_release to include "skyhook-customizations" (e.g., add it
alongside nodewright-operator|kgateway) so the function returns 0 for that
release name, and also update any duplicate/other occurrences of the same case
pattern to keep behavior consistent.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml`:
- Around line 1-2: Prepend the repository-standard addlicense header block to
this YAML fixture so the file starts with the license header before the existing
generated banner and YAML separator; update the top of
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
by inserting the standard addlicense header (as used across
**/*.{go,yaml,yml,mk} in the repo) above the existing "# Generated by Cloud
Native Stack" line, and verify with the addlicense tool to ensure the header is
present and up-to-date.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml`:
- Around line 1-5: Add the standard Apache 2.0 license header as a YAML comment
block at the top of this values.yaml (above the existing document start '---')
to match other testdata files; ensure the header text and year/owner match the
project’s existing license header style and keep the rest of the YAML (the
top-level "driver" key and its enabled: true) unchanged.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 1-6: The file is missing the full addlicense header and defines a
Service (metadata.name: dcgm-exporter) without a spec, making the fixture
non-installable; update the top of dcgm-exporter.yaml to include the proper
addlicense header used across the repo, then provide a realistic Service spec
for the Service named "dcgm-exporter" (include at minimum spec.selector matching
the DCGM exporter Deployment/Pod labels, and a spec.ports entry with port and
targetPort and protocol, and optionally type: ClusterIP) so the manifest is
valid and installable.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml`:
- Around line 1-2: This YAML file currently only contains the generated banner
"# Generated by Cloud Native Stack" and is missing the repository-standard
addlicense header; run the addlicense tool (or insert the standard license
block) so the file that contains the generated banner and YAML document start
("---") includes the required license header at the top, matching the project's
license template used for **/*.go, *.yaml, *.yml, *.mk files.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 87-110: The troubleshooting section currently references
components not present in this fixture (network-operator, cert-manager); update
the README.md Troubleshooting block so it only targets the gpu-operator (e.g.,
keep the kubectl pod check, logs command, and GPU access verification but remove
or rename any grep/filter for network-operator and cert-manager), or explicitly
prefix the section as "Generic examples" if you want to keep other commands;
ensure headings like "Troubleshooting" and commands such as "kubectl logs -n
gpu-operator -l app=gpu-operator" and the GPU verification jsonpath are accurate
and scoped to gpu-operator.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 527-537: The loop is force-clearing finalizers on every
Terminating namespace; restrict that to only namespaces owned/managed by this
bundle by filtering TERMINATING before forcing deletion. Change how TERMINATING
is computed (or post-filter it) so it only contains namespaces with the
bundle-specific identity (e.g., a label like bundle=<name> or by intersecting
with a previously recorded list of namespaces this script deleted), then call
force_clear_namespace_finalizers and kubectl delete namespace only for that
filtered set; update references to TERMINATING,
force_clear_namespace_finalizers, and the kubectl delete namespace --wait=false
call accordingly.
- Around line 559-568: The current loop scans all
mutatingwebhookconfiguration/validatingwebhookconfiguration cluster-wide and
deletes any webhook whose backing service/namespace is missing (using variables
wh_name, svc_ns, svc_name), which can remove other tenants' webhooks; change
this to only check bundle namespaces by reusing the existing namespace-scoped
helper (instead of the cluster-wide `kubectl get ... | jq ...`); specifically,
iterate only over your bundle namespaces (or call the helper like
cleanup_webhooks_in_namespace or equivalent) and for each namespace run a scoped
check (e.g., list webhooks that reference services in that namespace and verify
those services/namespaces) and delete only those matching webhooks
(mutatingwebhookconfiguration/validatingwebhookconfiguration) when the service
in that same namespace is missing.

In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml`:
- Around line 1-2: This file currently only has the generated banner "#
Generated by Cloud Native Stack" and the YAML separator "---"; prepend the
canonical addlicense header block above that banner so the file contains the
standard license comment (as required for "*.{go,yaml,yml,mk}") before the
existing "# Generated by Cloud Native Stack" line, ensuring the license text is
formatted as a comment and preserved as the first content in the file.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 87-110: Update the Troubleshooting and References to only cover
cert-manager: replace the pod/status check that greps for
'gpu-operator|network-operator|cert-manager' with a cert-manager-specific status
command (e.g., kubectl get pods -n cert-manager and kubectl get deployments -n
cert-manager), change the logs command to target the cert-manager namespace and
labels (instead of gpu-operator), and swap the GPU/Network verification command
and the two external links for cert-manager-relevant checks and documentation
links (e.g., cert-manager docs and common troubleshooting pages). Ensure the
README sections "Troubleshooting", "Check deployment status", "View component
logs", "Verify GPU access" (rename that subsection appropriately) and
"References" consistently reference cert-manager only.

In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 100-101: The code uses fmt.Sprintf("%03d-%s", idx, c.Name) which
will produce 4+ digit prefixes once idx >= 1000 and desync the generated
scripts; add a bounds check around the existing dir formation (check the idx
variable before dir := fmt.Sprintf(...)) and return a clear error (or panic) if
idx >= 1000 so we fail fast instead of emitting a mismatched folder, referencing
the existing idx and dir creation (dir := fmt.Sprintf("%03d-%s", idx, c.Name));
alternatively, if you prefer to support larger counts, replace "%03d" with
"%04d" and ensure the matching deploy/undeploy glob/parsing is updated
accordingly.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The current grep for 'kind: Gateway' can match 'kind:
GatewayParameters' and yield false positives; update the assertion that checks
MANIFEST so the grep for the Gateway kind only matches the exact kind value
(e.g., use a regex that anchors the line and allows optional spaces such as
'^kind:[[:space:]]*Gateway$' or use grep -w/word-boundary matching) while
keeping the existing 'kind: GatewayParameters' check; apply this change to the
second grep invocation that currently searches for 'kind: Gateway'.
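The anchored pattern cleanly separates the two kinds, as a quick check shows:

```shell
#!/usr/bin/env bash
# Anchored grep distinguishes 'kind: Gateway' from 'kind: GatewayParameters'.
manifest=$'kind: GatewayParameters\nkind: Gateway\nkind: HTTPRoute'
printf '%s\n' "$manifest" | grep -c '^kind:[[:space:]]*Gateway$'   # → 1
printf '%s\n' "$manifest" | grep -c '^kind: Gateway'               # → 2 (false positive)
```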
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d9deff84-4ec2-416e-98ce-029175ef15ab

📥 Commits

Reviewing files that changed from the base of the PR and between a79ff85 and f9bafb4.

📒 Files selected for processing (77)
  • Makefile
  • docs/contributor/component.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler_test.go
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/README.md
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/skyhook_present/README.md
  • pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
  • pkg/bundler/deployer/localformat/doc.go
  • pkg/bundler/deployer/localformat/folder.go
  • pkg/bundler/deployer/localformat/kustomize.go
  • pkg/bundler/deployer/localformat/local_helm.go
  • pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
  • pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/testdata/README.md
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
  • pkg/bundler/deployer/localformat/upstream_helm.go
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/handler_test.go
  • tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
  • tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml

Comment thread docs/contributor/component.md
Comment thread pkg/bundler/bundler_test.go Outdated
Comment thread pkg/bundler/deployer/helm/testdata/nodewright_present/README.md Outdated
Comment thread pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
Comment thread pkg/bundler/deployer/localformat/upstream_helm.go Outdated
Comment thread tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml Outdated
fi
fi
fi
stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true)
Contributor

[critical] Keep the nodewright taint key variable and folder name consistent.

This block initializes SKYHOOK_TAINT_KEY, but the grep, warning, and kubectl taint calls expand ${NODEWRIGHT_TAINT_KEY} under set -u, so any bundle containing nodewright-operator aborts before installation on a fresh cluster. The directory lookup above also searches for NNN-skyhook-operator, but the registry component and generated folder are NNN-nodewright-operator, so custom runtimeRequiredTaint values are never read. Use one variable name consistently and look up the nodewright-operator folder.

Contributor Author

Fixed in the latest push. The variable is now NODEWRIGHT_TAINT_KEY consistently throughout the block (declaration, default assignment, parse, grep, warning, and kubectl taint), and the directory lookup uses the registry-correct [0-9][0-9][0-9]-nodewright-operator glob. The taint key default value is still skyhook.nvidia.com because the operator's taint domain didn't change in #642; only the component/repo name did.

Reference: pkg/bundler/deployer/helm/templates/deploy.sh.tmpl:266-285.

# Handle both key=value:effect and key:effect formats
NODEWRIGHT_TAINT_KEY="${taint_value%%=*}"
NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}"
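The two-step suffix stripping quoted above can be exercised in isolation; this is a standalone sketch, and `parse_taint_key` is an illustrative helper name, not part of the generated script:

```shell
# Recover the taint key from either key=value:effect or key:effect, as in
# the deploy.sh.tmpl snippet above. Illustrative sketch only.
parse_taint_key() {
  local taint_value="$1" key
  key="${taint_value%%=*}"   # strip =value:effect, if present
  key="${key%%:*}"           # strip :effect, if present
  printf '%s' "${key}"
}
```

Bare keys, key:effect, and key=value:effect all reduce to the same key, which is what lets one variable serve both taint formats.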
{{- if eq .Name "skyhook-operator" }}
Contributor

[critical] This nodewright taint cleanup block never renders for real recipes.

Current recipes use the component name nodewright-operator; there is no registry component named skyhook-operator. Because this condition never fires, undeploy removes the operator but leaves the skyhook.nvidia.com runtime taint on nodes. The block also initializes SKYHOOK_TAINT_KEY but later uses ${NODEWRIGHT_TAINT_KEY} under set -u, so a rendered skyhook fixture would fail as well. Gate on nodewright-operator and use one taint-key variable consistently.

Contributor Author

Fixed in the latest push. The block now gates on nodewright-operator ({{- if eq .Name "nodewright-operator" }} at undeploy.sh.tmpl:494) and uses NODEWRIGHT_TAINT_KEY consistently throughout. Same fix applied to the deploy.sh.tmpl counterpart (Thread 5).

@yuanchen8911
Contributor

Please address coderabbit's comments as needed too.

@lockwobr force-pushed the feat/generic-helm branch from f9bafb4 to e281666 on April 28, 2026 20:40
@coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/bundler/bundler_test.go (1)

469-485: ⚠️ Potential issue | 🟡 Minor

Check the YAML key instead of searching for enabled: true.

This assertion is too loose for the behavior it’s trying to verify: it can miss a leaked top-level enabled: false, and it can also fail on unrelated nested enabled: true values.

♻️ Proposed fix
-			valuesStr := string(valuesData)
-			if strings.Contains(valuesStr, "enabled: true") {
-				t.Errorf("enabled key leaked into Helm values:\n%s", valuesStr)
-			}
+			var values map[string]any
+			if err := yaml.Unmarshal(valuesData, &values); err != nil {
+				t.Fatalf("failed to parse values.yaml: %v", err)
+			}
+			if _, ok := values["enabled"]; ok {
+				t.Errorf("enabled key leaked into Helm values:\n%s", valuesData)
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/bundler_test.go` around lines 469 - 485, The current string
search for "enabled: true" is too brittle; instead parse values.yaml into a YAML
object and assert the top-level "enabled" key is not present or not set to true
by unmarshalling valuesData into a map[string]interface{} and checking
map["enabled"]; update the check that uses valuesStr to use the parsed map so it
fails only on a leaked top-level enabled, and keep the existing override
assertion by verifying the nested controller.replicaCount key exists in the
parsed structure (e.g., assert
parsed["controller"].(map[string]interface{})["replicaCount"] is present).
♻️ Duplicate comments (27)
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)

3-7: ⚠️ Potential issue | 🟠 Major

Service manifest is incomplete (spec/ports missing).

This fixture currently defines a Service that is not installable as-is. Please add a minimal spec with at least one port (and selector if appropriate).

Proposed fix
 apiVersion: v1
 kind: Service
 metadata:
   name: dcgm-exporter
+spec:
+  selector:
+    app: dcgm-exporter
+  ports:
+    - name: metrics
+      port: 9400
+      targetPort: 9400
+      protocol: TCP
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`
around lines 3 - 7, The Service manifest for kind: Service named "dcgm-exporter"
is missing a spec (ports and optional selector); update the dcgm-exporter
Service resource to include a minimal spec block that declares at least one port
(port and targetPort and protocol) and, if appropriate for this fixture, a
selector matching the Deployment/Pod labels (e.g., app/dcgm-exporter) so the
Service is selectable and installable; ensure the resource remains valid YAML
and aligns with the Deployment/Pod labels in this testdata set.
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)

71-72: ⚠️ Potential issue | 🟠 Major

Tighten the Gateway assertion.

grep -q 'kind: Gateway' still matches GatewayParameters, so this test can pass even when the Gateway object is missing. Use an anchored full-line match here.

Suggested fix
-              grep -q 'kind: Gateway' "${MANIFEST}"
+              grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71
- 72, The test currently uses grep -q 'kind: Gateway' which also matches
GatewayParameters; update the test to assert an exact full-line match for the
Gateway kind (e.g., replace the grep -q 'kind: Gateway' invocation with a grep
that matches the entire line "kind: Gateway" using either line anchors or grep
-x) so the MANIFEST check detects the Gateway object specifically while keeping
the existing GatewayParameters check intact.
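The over-match described above is easy to reproduce outside the test; this standalone sketch uses a made-up one-line manifest to show the unanchored pattern's false positive next to the anchored pattern from the suggested fix:

```shell
# 'kind: Gateway' is a substring of 'kind: GatewayParameters', so the
# unanchored grep matches; the anchored form only matches a standalone
# Gateway kind line.
manifest='kind: GatewayParameters'
if printf '%s\n' "$manifest" | grep -q 'kind: Gateway'; then
  unanchored=yes
else
  unanchored=no
fi
if printf '%s\n' "$manifest" | grep -q '^kind:[[:space:]]*Gateway$'; then
  anchored=yes
else
  anchored=no
fi
echo "unanchored matched: $unanchored"
echo "anchored matched: $anchored"
```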
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)

91-110: ⚠️ Potential issue | 🟡 Minor

Fix the cert-manager fixture’s troubleshooting section.

This golden still points at GPU/network-operator commands and references, so the cert-manager-only bundle reads incorrectly.

✏️ Proposed fix
- kubectl get pods -A | grep -E 'gpu-operator|network-operator|cert-manager'
+ kubectl get pods -n cert-manager

- kubectl logs -n gpu-operator -l app=gpu-operator
+ kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines
91 - 110, The README's troubleshooting section currently shows
GPU/network-operator commands and references; update the "kubectl get pods -A |
grep -E 'gpu-operator|network-operator|cert-manager'" command, the "View
component logs" kubectl logs command, the "Verify GPU access" jsonpath check,
and the References list to be cert-manager specific. Replace the pod
listing/grep with a cert-manager namespace/label lookup (e.g., namespace
cert-manager), change the logs command to target cert-manager pods (use -n
cert-manager and appropriate label), remove the GPU allocatable jsonpath check
and instead include a basic cert-manager health/status check (kubectl get
deployments, pods, or certificate statuses), and update the References to point
to cert-manager docs/installation pages so the cert-manager-only bundle reads
correctly.
docs/user/cli-reference.md (1)

1152-1152: ⚠️ Potential issue | 🟡 Minor

Use compound-modifier hyphenation here.

Please change “local-helm wrapped chart” to “local-helm-wrapped chart” for clarity and consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` at line 1152, Update the wording in the docs
where the phrase "local-helm wrapped chart" appears (specifically the sentence
"Manifest-only components (no upstream Helm chart, just raw manifests) become a
single local-helm wrapped chart.") to use the compound-modifier hyphenation:
change it to "local-helm-wrapped chart" so the modifier is clear and consistent.
pkg/bundler/deployer/helm/testdata/nodewright_present/README.md (1)

87-110: ⚠️ Potential issue | 🟡 Minor

Troubleshooting section still targets GPU/Network Operator, not nodewright.

The commands and links in this section are inconsistent with the nodewright-operator fixture (skyhook namespace) and should be replaced with nodewright-relevant checks/references (or explicitly labeled as generic examples).

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md` around lines
87 - 110, The Troubleshooting section incorrectly targets GPU/Network operators;
replace the GPU-operator and network-operator checks/logs/links with
nodewright-specific commands and references: update the kubectl pod check to
filter for nodewright-operator (e.g., grep for 'nodewright-operator' or
namespace 'skyhook'), change the logs command to kubectl logs -n skyhook -l
app=nodewright-operator (or the actual label used by the fixture), replace the
GPU node allocatable check with a relevant nodewright health or CR status check
(e.g., kubectl get nodes or kubectl get <NodewrightCR> -n skyhook -o
yaml/jsonpath), and swap the external links to the nodewright-operator
documentation or the project's README; alternatively label these entries
explicitly as generic examples if you intend to keep them
non-nodewright-specific.
pkg/bundler/deployer/localformat/doc.go (1)

15-17: ⚠️ Potential issue | 🟠 Major

Package doc should not imply unimplemented deployer adoption.

Please distinguish current consumers from future targets (e.g., “currently used by X; intended for Y/Z”) instead of listing all as present-tense consumers.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/doc.go` around lines 15 - 17, Update the
package doc comment for package localformat in
pkg/bundler/deployer/localformat/doc.go to avoid implying unimplemented deployer
adoption: explicitly state which deployers currently consume the local-chart
bundle layout (e.g., "currently used by X and Y") and mark others as intended or
future targets (e.g., "planned for/compatible with Z in the future"), removing
any present-tense claim that Helmfile, ArgoCD, or Flux already consume it if
they do not; ensure the text references the package name localformat and clearly
separates current consumers from intended future targets.
tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml (1)

99-100: 🧹 Nitpick | 🔵 Trivial

Unify the post-folder exclusion pattern for readability.

This exclusion uses "-post/" while the rest of the file generally uses -gpu-operator-post. Aligning to one pattern avoids ambiguity when maintaining these assertions.

♻️ Suggested consistency tweak
-              INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
-                | grep -v -- "-post/" | head -1)
+              INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
+                | grep -v -- "-gpu-operator-post" | head -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml` around lines 99 - 100,
The exclusion pattern for determining INSTALL is inconsistent: replace the grep
exclusion string "-post/" with the unified "-gpu-operator-post" so the command
that sets INSTALL (the shell pipeline producing INSTALL=$(ls ... | grep -v --
"...") ) uses the same "-gpu-operator-post" filter used elsewhere; update the
grep -v argument accordingly to grep -v -- "-gpu-operator-post".
docs/contributor/component.md (1)

21-21: ⚠️ Potential issue | 🟠 Major

Separate implemented deployers from future intent.

This sentence still reads as if all listed deployers currently consume localformat.Write(). Please split into “currently used by …” vs “planned/future …” to avoid overstating shipped behavior.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/contributor/component.md` at line 21, The sentence in docs referencing
pkg/bundler/deployer/localformat and the list of deployers conflates current and
future behavior; update the paragraph to explicitly separate which deployers
currently consume localformat.Write() (e.g., helm, argocd, argocd-helm if they
truly do) from planned or proposed ones (e.g., future helmfile per issue `#632`),
by rewriting the line to state "currently used by ..." vs "planned/future ..."
and ensure the symbol localformat.Write() and the deployer names (`helm`,
`helmfile`, `argocd`, `argocd-helm`) are mentioned so readers can locate the
relevant implementation and roadmap note.
pkg/bundler/deployer/helm/testdata/manifest_only/README.md (2)

87-110: ⚠️ Potential issue | 🟡 Minor

Troubleshooting content is not aligned with this fixture.

These commands/links target GPU/network operators, not skyhook-customizations. Replace with skyhook-specific checks (or explicitly label as generic examples).

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` around lines 87 -
110, The Troubleshooting section in README.md currently lists GPU/network
operator commands and links that don't apply to this manifest_only fixture;
update the "Troubleshooting" block under the "Check deployment status", "View
component logs", and "Verify GPU access" headings to include
skyhook-customizations-specific commands or explicitly label the existing
examples as generic/non-applicable, e.g., replace the kubectl commands and
NVIDIA links with checks that target skyhook-customizations (or add a clear note
like "Generic examples — not specific to skyhook-customizations") so the content
accurately reflects the fixture and operational guidance.

21-21: ⚠️ Potential issue | 🟡 Minor

Use explicit metadata values for manifest-only components.

The row leaves Version empty and renders Source as skyhook-customizations (). Please use explicit values like N/A and local to avoid ambiguous placeholders.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` at line 21, The
table row for the manifest-only component "skyhook-customizations" leaves
Version blank and shows Source as "skyhook-customizations ()"; update the README
table so the Version cell explicitly reads "N/A" and the Source cell reads
"local" (i.e., replace the empty Version and the "()"-style placeholder with
explicit values for the "skyhook-customizations" row).
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)

92-110: ⚠️ Potential issue | 🟡 Minor

Troubleshooting and references include components outside this bundle.

This fixture bundles gpu-operator, but these sections still include network/cert-manager content. Please scope checks/links to gpu-operator or clearly mark them as generic examples.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines
92 - 110, The README mixes components outside this bundle (network-operator and
cert-manager) with the bundled gpu-operator; restrict commands and links to GPU
Operator only or label the examples as generic. Update the grep in the "Get
pods" example to only search for gpu-operator, remove or annotate the
network-operator/cert-manager references in the troubleshooting text, ensure the
"View component logs" and "Verify GPU access" sections explicitly reference the
gpu-operator namespace and resource (keep kubectl logs -n gpu-operator -l
app=gpu-operator and the nvidia.com/gpu jsonpath) and replace the Network
Operator link with either a note that it’s external or remove it so the
"References" list only contains GPU Operator docs (or clearly mark others as
external examples).
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Fixture docs still reference unrelated components.

This README is for kai-scheduler, but these sections still reference GPU/network/cert-manager checks and docs. Please replace with kai-scheduler-specific diagnostics (or clearly mark examples as generic) so the fixture remains internally consistent.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”

Also applies to: 89-110

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line
41, The README's “Note” paragraph and other sections (around the shown Note and
lines 89–110) reference GPU/network/cert-manager checks and external GPU docs
which are unrelated to the kai-scheduler fixture; update those sections to
contain kai-scheduler-specific diagnostics (for example: expected
Deployment/DaemonSet status, scheduler logs, CR status, webhook health, and how
to run kubectl describe/logs for kai-scheduler) or explicitly mark those
examples as generic/templated, and replace or annotate the AICR CLI link/notes
to point to kai-scheduler behavior docs or a clearly labeled generic deploy
behavior section so the fixture README consistently documents kai-scheduler
operational checks.
pkg/bundler/deployer/localformat/upstream_helm.go (1)

60-66: ⚠️ Potential issue | 🔴 Critical

Shell-quote upstream.env values before writing.

CHART, REPO, and VERSION are written unquoted and later sourced by shell. Metacharacters in values can execute during source.

Suggested fix
 import (
 	"embed"
 	"fmt"
 	"os"
 	"path/filepath"
 	"strings"
 	"text/template"
@@
-	envContent := fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", chart, repo, c.Version)
+	envContent := fmt.Sprintf(
+		"CHART=%s\nREPO=%s\nVERSION=%s\n",
+		shellSingleQuote(chart),
+		shellSingleQuote(repo),
+		shellSingleQuote(c.Version),
+	)
@@
 }
+
+func shellSingleQuote(v string) string {
+	return "'" + strings.ReplaceAll(v, "'", `'"'"'`) + "'"
+}
#!/bin/bash
set -euo pipefail

# Verify unquoted serialization in upstream_helm.go
rg -n 'envContent := fmt.Sprintf\("CHART=%s\\nREPO=%s\\nVERSION=%s\\n"' pkg/bundler/deployer/localformat/upstream_helm.go

# Verify upstream.env is sourced/consumed by install template(s)
fd 'install-upstream-helm.sh.tmpl' pkg --type f -x rg -n 'upstream\.env|(^|\s)(source|\.)\s+.*upstream\.env' {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/upstream_helm.go` around lines 60 - 66, The
env file is written unquoted which allows shell metacharacters to be executed
when sourced; update the envContent construction in upstream_helm.go to
shell-quote CHART, REPO and VERSION before writing (e.g. escape any single
quotes and wrap each value in single quotes or use a dedicated shell-escape
helper), replace the current envContent :=
fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", ...) with a version that produces
CHART='...'\nREPO='...'\nVERSION='...'\n, and keep using deployer.SafeJoin and
writeFile as-is so the file path/permissions are unchanged.
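For intuition on the quoting the fix proposes, here is a shell-side counterpart of the suggested shellSingleQuote helper; it is a sketch only (the real bundle writes the quoting from Go), showing the round-trip that makes sourcing safe:

```shell
# Wrap a value in single quotes, replacing each embedded single quote with
# the '"'"' escape sequence, so the value survives being sourced verbatim.
shell_single_quote() {
  local v="$1"
  local q="'\"'\"'"              # the '"'"' escape sequence
  printf "'%s'" "${v//\'/${q}}"
}
```

Metacharacters like $, ;, and backticks are inert inside single quotes, so a sourced `CHART='...'` line can no longer execute anything.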
pkg/bundler/deployer/helm/helm.go (1)

124-149: ⚠️ Potential issue | 🟠 Major

Reused output directories can install stale components.

Generate writes new folders but does not prune old NNN-* directories. Since deploy.sh installs by filesystem glob, stale folders can be deployed on subsequent runs with the same output path.

Suggested fix
 import (
 	"context"
 	_ "embed"
 	"fmt"
 	"log/slog"
 	"os"
+	"path/filepath"
 	"strings"
 	"time"
@@
 	// Create output directory
 	if err := os.MkdirAll(outputDir, 0755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal,
 			"failed to create output directory", err)
 	}
+
+	// Remove stale numbered component folders from previous bundle generations.
+	staleDirs, globErr := filepath.Glob(filepath.Join(outputDir, "[0-9][0-9][0-9]-*"))
+	if globErr != nil {
+		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to enumerate stale component folders", globErr)
+	}
+	for _, d := range staleDirs {
+		if rmErr := os.RemoveAll(d); rmErr != nil {
+			return nil, errors.Wrap(errors.ErrCodeInternal, fmt.Sprintf("failed to remove stale folder %s", d), rmErr)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/helm.go` around lines 124 - 149, Generate currently
writes new component folders but doesn't remove previous NNN-* directories,
allowing stale components to be deployed; update Generate to prune old staged
directories before writing (or immediately after localformat.Write but before
appending folder files) by listing entries in outputDir and removing any
directory matching the NNN-* pattern that is not part of the newly returned
folders; reference the localformat.Write result variable folders, the outputDir
variable, and ensure use of deployer.SafeJoin when resolving paths so pruning
enforces containment and avoids deleting outside paths.
pkg/bundler/deployer/localformat/writer.go (3)

154-156: ⚠️ Potential issue | 🟡 Minor

Only append ?ref= when a tag is provided.

For git-sourced kustomize targets with empty Tag, the current code emits repo//path?ref=. Keep repo//path unless Tag is set.

Suggested fix
 				target := c.Path
 				if c.Repository != "" {
-					target = fmt.Sprintf("%s//%s?ref=%s", c.Repository, c.Path, c.Tag)
+					target = fmt.Sprintf("%s//%s", c.Repository, c.Path)
+					if c.Tag != "" {
+						target = fmt.Sprintf("%s?ref=%s", target, c.Tag)
+					}
 				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 154 - 156, The
target string construction currently always appends "?ref=" even when c.Tag is
empty; update the code that builds target (the branch where c.Repository != ""
that sets target) to only append "?ref=%s" when c.Tag is non-empty (use c.Tag to
decide), e.g. construct base := fmt.Sprintf("%s//%s", c.Repository, c.Path) and
then set target = base only, or target = base + "?ref="+c.Tag when c.Tag != "".
Ensure you modify the block that references c.Repository, c.Path, c.Tag and
assigns target.

81-102: ⚠️ Potential issue | 🔴 Critical

Fail fast before emitting folder indices beyond 999.

Folder names are %03d-*, but deploy/undeploy scripts glob only NNN-*. Once idx reaches 1000, generated folders are silently skipped by orchestration.

Suggested fix
 	for _, c := range opts.Components {
+		if idx > 999 {
+			return nil, errors.New(errors.ErrCodeInvalidRequest,
+				"too many bundle folders; max 999 supported by NNN-* layout")
+		}
 		if err := ctx.Err(); err != nil {
 			return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 		}
@@
 			if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
+				if idx > 999 {
+					return nil, errors.New(errors.ErrCodeInvalidRequest,
+						"too many bundle folders; max 999 supported by NNN-* layout")
+				}
 				postName := c.Name + "-post"

Also applies to: 119-130

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 81 - 102, The loop
that builds folder names uses idx with fmt.Sprintf("%03d-%s", idx, c.Name) so
once idx >= 1000 the glob "NNN-*" used by orchestration will skip folders; add a
fail-fast guard that checks idx >= 1000 (or > 999) before formatting dir and
return an errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("too many
components: index %d would overflow 3-digit folder prefix", idx)); apply the
same guard in the other similar block around where dir is created (the other
place referenced in the comment) so the routine returns a clear error instead of
silently producing unmatchable folder names.

83-90: ⚠️ Potential issue | 🟠 Major

Reject mixed-component <name>-post collisions up front.

A real component named x-post can collide with the synthetic folder emitted for mixed component x, causing ambiguous release/folder mapping.

Suggested fix
 func Write(ctx context.Context, opts Options) ([]Folder, error) {
+	declared := make(map[string]struct{}, len(opts.Components))
+	for _, c := range opts.Components {
+		if _, exists := declared[c.Name]; exists {
+			return nil, errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("duplicate component name %q", c.Name))
+		}
+		declared[c.Name] = struct{}{}
+	}
+
 	if err := os.MkdirAll(opts.OutputDir, 0o755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal, "create output dir", err)
 	}
@@
 			if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
 				postName := c.Name + "-post"
+				if _, exists := declared[postName]; exists {
+					return nil, errors.New(errors.ErrCodeInvalidRequest,
+						fmt.Sprintf("component %q injects %q which collides with a declared component", c.Name, postName))
+				}

Also applies to: 117-124

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 83 - 90, When
validating opts.Components in the writer (the loop over opts.Components that
calls deployer.IsSafePathComponent), add a check to reject name collisions where
a real component named "<base>-post" would conflict with the synthetic folder
emitted for a mixed component "<base>": build a map of component names (and
track which components are mixed if there is a field like Mixed/IsMixed) and,
before accepting a component, if its Name has suffix "-post" and the map already
contains a component with the same base name (Name without "-post") that is
mixed, return an errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("invalid
component name %q: reserved for mixed-component synthetic folder", c.Name));
apply the same collision check in the second validation block (lines 117-124) so
both validation sites reject these mixed-component "-post" collisions up front
while preserving the IsSafePathComponent check.
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (4)

43-46: ⚠️ Potential issue | 🟠 Major

Validate --timeout before appending s.

HELM_TIMEOUT accepts arbitrary strings, but later commands force an s suffix. Inputs like --timeout 5m become 5ms, and invalid tokens fail later in Helm/kubectl calls.

Suggested fix
     --timeout)
       if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
-      HELM_TIMEOUT="$2"; shift 2 ;;
+      if ! [[ "$2" =~ ^[0-9]+$ ]]; then
+        echo "Error: --timeout must be an integer number of seconds"
+        exit 1
+      fi
+      HELM_TIMEOUT="$2"; shift 2 ;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 43 - 46,
The CLI parsing currently blindly sets HELM_TIMEOUT="$2" and later forces a
trailing "s", which mangles values like "5m" into "5ms"; modify the --timeout
handling in undeploy.sh.tmpl so that after assigning HELM_TIMEOUT="$2" you
validate the token (e.g., if it matches a pure number /^[0-9]+$/ then append an
"s", if it already contains a valid time unit like s|m|ms|h accept as-is,
otherwise reject with "Error: invalid --timeout value"). Update the branch that
handles --timeout to perform this regex check and either normalize numeric-only
inputs by appending "s" or exit with an error for unrecognized formats.
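The normalize-or-reject behavior the prompt describes can be sketched as a small helper; `normalize_timeout` and the accepted unit set are assumptions for illustration, not the template's actual code:

```shell
# Append "s" to bare-number timeouts, pass through values that already carry
# a recognized unit, and reject anything else. Sketch only.
normalize_timeout() {
  local t="$1"
  if [[ "$t" =~ ^[0-9]+$ ]]; then
    printf '%ss' "$t"            # bare seconds: append the unit
  elif [[ "$t" =~ ^[0-9]+(ms|s|m|h)$ ]]; then
    printf '%s' "$t"             # already carries a recognized unit
  else
    echo "Error: invalid --timeout value: $t" >&2
    return 1
  fi
}
```

With this shape, --timeout 5m stays 5m instead of being mangled into 5ms, and garbage values fail at parse time rather than deep inside a Helm call.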

452-475: ⚠️ Potential issue | 🔴 Critical

Delete hook-created resources before uninstalling wrapper releases.

For wrapper charts that use Helm hooks, helm uninstall does not remove hook-created objects. Without an explicit pre-uninstall sweep, resources/finalizers can remain after controller uninstall.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 452 - 475,
Add a pre-uninstall sweep for hook-created resources by invoking a cleanup
function before calling helm_force_uninstall inside the per-component loop: call
pre_uninstall_hook_cleanup "${name}" "${ns}" (or another appropriately named
function) immediately before helm_force_uninstall "${name}" "${ns}"; implement
that function to locate and delete Helm hook-created objects and remove
finalizers for resources owned by the wrapper/chart so they don't remain after
controller uninstall (use component name/namespace to scope deletions). Ensure
delete_release_cluster_resources and delete_orphaned_webhooks_for_ns remain
after the uninstall to handle cluster-wide cleanup.

418-425: ⚠️ Potential issue | 🟡 Minor

Safety scans still omit injected -post releases.

The pre-flight and post-flight CRD scans iterate .ComponentsReversed, but uninstall iterates actual NNN-* folders (which include injected -post wrappers). This leaves wrapper releases outside the same safety surface.

Also applies to: 506-515, 620-638

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 418 - 425,
The pre-flight/post-flight CRD checks currently iterate .ComponentsReversed,
which misses injected "-post" wrapper releases and therefore doesn't cover the
same set the uninstall loop (which iterates the actual NNN-* folders) processes;
update each pre/post-flight range (the blocks that call
skip_preflight_for_release and check_release_for_stuck_crds) to iterate the same
collection the uninstall code uses (the list of actual installed release
folders) instead of .ComponentsReversed so wrapper "-post" releases are
included—locate the template ranges that call skip_preflight_for_release and
check_release_for_stuck_crds and replace the range source with the
unified/installed-releases collection used by the uninstall loop (use the same
variable name used by uninstall to ensure parity).

477-480: ⚠️ Potential issue | 🟡 Minor

Move Nodewright taint comments inside the conditional block.

These comments render even when nodewright-operator is not in the bundle, which makes generated scripts misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 477 - 480,
The comments about Nodewright taints are placed before the components loop and
thus appear in generated scripts even when nodewright-operator isn't present;
move those comment lines so they are inside the conditional that handles the
nodewright component (e.g., inside the {{- range .ComponentsReversed }} block
and specifically within the branch that checks for the nodewright-operator
component), and keep the reference to runtimeRequiredTaint (defaults to
skyhook.nvidia.com) next to the taint-removal logic so the comment only renders
when nodewright-related code is generated.
pkg/bundler/deployer/localformat/writer_test.go (2)

117-119: ⚠️ Potential issue | 🟡 Minor

Avoid dereferencing folders[0] in the fatal path.

If Write returns zero folders, these t.Fatalf calls panic while formatting folders[0].Kind instead of reporting the real assertion failure.

Suggested fix
- if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
- 	t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
- }
+ if len(folders) != 1 {
+ 	t.Fatalf("want 1 local-helm folder, got %d", len(folders))
+ }
+ if folders[0].Kind != localformat.KindLocalHelm {
+ 	t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
+ }

Also applies to: 310-312

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The
test currently dereferences folders[0] inside t.Fatalf which will panic when
folders is empty; change the assertion in the test to guard the index first
(check len(folders) == 0 and fail with a clear message) before inspecting
folders[0].Kind (or split into two checks: length check then kind check) for the
occurrences around the assertion referencing folders, localformat.KindLocalHelm,
and t.Fatalf (also fix the duplicate at the later occurrence).

325-364: ⚠️ Potential issue | 🟡 Minor

Include a kustomize component in the determinism assertion.

The determinism test currently exercises upstream and mixed-manifest paths, but not the kustomize render path (templates/manifest.yaml), which is part of the new layout contract.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh (3)

85-93: ⚠️ Potential issue | 🟠 Major

helm_force_uninstall still suppresses terminal uninstall failures.

If both uninstall attempts fail, execution continues and reports completion despite a potentially still-installed release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around
lines 85 - 93, The helper helm_force_uninstall currently swallows failures (it
redirects stderr on first attempt and uses "|| true" on the second), so if both
uninstalls fail the script still continues; modify helm_force_uninstall to
propagate errors instead: remove the stderr discard on the first helm uninstall
and remove the "|| true" on the second uninstall (or explicitly check both
command exit codes), and if both attempts fail print an error message naming the
release and namespace and then exit or return a non-zero status (e.g., return 1)
so callers know the uninstall failed; the function name to update is
helm_force_uninstall.
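A minimal sketch of the error-propagating variant (flag usage follows the description above; HELM_TIMEOUT is assumed to be set by the caller):

```shell
# Try a normal uninstall, fall back to --no-hooks, and propagate failure
# instead of masking it with "|| true".
helm_force_uninstall() {
  local name="$1" ns="$2"
  if helm uninstall "${name}" -n "${ns}" --timeout "${HELM_TIMEOUT}s"; then
    return 0
  fi
  echo "Retrying uninstall of ${name} with --no-hooks" >&2
  if ! helm uninstall "${name}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s"; then
    echo "Error: failed to uninstall release ${name} in namespace ${ns}" >&2
    return 1
  fi
}
```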

527-537: ⚠️ Potential issue | 🔴 Critical

Namespace force-finalization is still applied to all terminating namespaces.

This remains unsafe on shared clusters because unrelated namespaces can be patched/deleted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around
lines 527 - 537, The current loop force-clears finalizers for every terminating
namespace (TERMINATING) which is unsafe on shared clusters; change the logic to
first restrict TERMINATING to only the known/test namespaces you own (e.g.,
compute a filtered list like OWN_TERMINATING by matching names against a
whitelist, label/annotation, or the specific expected namespaces for this
mixed_gpu_operator test), then iterate over that filtered list when calling
force_clear_namespace_finalizers and kubectl delete; ensure you reference the
existing TERMINATING variable and the force_clear_namespace_finalizers function
so the code only patches/deletes namespaces that match the allowed set and
explicitly skip system or unrelated namespaces (e.g., kube-system, kube-public).
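A sketch of the scoped variant; the label selector is an assumption, so substitute whatever label or name prefix the bundle actually stamps on its namespaces:

```shell
# Restrict the finalizer sweep to namespaces this bundle owns, instead of
# every namespace stuck in Terminating.
list_own_terminating() {
  kubectl get ns -l "app.kubernetes.io/managed-by=bundler" --no-headers 2>/dev/null |
    awk '$2 == "Terminating" { print $1 }'
}

# for ns in $(list_own_terminating); do
#   force_clear_namespace_finalizers "${ns}"
#   kubectl delete namespace "${ns}" --ignore-not-found --wait=false
# done
```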

559-568: ⚠️ Potential issue | 🟠 Major

Global stale-webhook sweep is still deleting outside bundle scope.

This can remove webhook configurations owned by other tenants/workloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around
lines 559 - 568, The sweep is deleting webhook configurations globally without
verifying ownership; update the while loop that reads wh_name, svc_ns, svc_name
to only delete webhooks that are owned by this bundle (e.g., check
metadata.ownerReferences or a bundle-specific label) and skip others.
Concretely, before the kubectl delete
mutatingwebhookconfiguration/validatingwebhookconfiguration calls, query the
webhook JSON for an identifying ownerReference or label (using kubectl get
mutatingwebhookconfiguration/validatingwebhookconfiguration "$wh_name" -o
jsonpath=...) and only perform the delete when that ownerReference or label
matches the bundle's expected value; otherwise echo a skip message and continue.
Ensure checks use wh_name, svc_ns, svc_name and guard both delete commands so
unrelated tenants are never removed.
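An ownership guard along those lines could look like this sketch; the label key and expected value are assumptions, so match whatever identifier the generator actually applies:

```shell
# Only delete a webhook configuration when it carries this bundle's
# identifying label.
webhook_owned_by_bundle() {
  local kind="$1" name="$2" expected="$3"
  local owner
  owner="$(kubectl get "${kind}" "${name}" \
    -o jsonpath='{.metadata.labels.app\.kubernetes\.io/part-of}' 2>/dev/null)"
  [ "${owner}" = "${expected}" ]
}

# if webhook_owned_by_bundle mutatingwebhookconfiguration "${wh_name}" my-bundle; then
#   kubectl delete mutatingwebhookconfiguration "${wh_name}" --ignore-not-found
# else
#   echo "Skipping ${wh_name}: not owned by this bundle"
# fi
```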
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (1)

249-253: ⚠️ Potential issue | 🟠 Major

skip_preflight_for_release does not match skyhook-customizations, so the skip branch is unreachable.

At Line 418 you call skip_preflight_for_release "skyhook-customizations", but this function never returns success for that value.

Suggested fix
 skip_preflight_for_release() {
   case "$1" in
-    nodewright-operator|kgateway) return 0 ;;
+    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
     *) return 1 ;;
   esac
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines
249 - 253, The skip_preflight_for_release function never returns success for
"skyhook-customizations", making the call skip_preflight_for_release
"skyhook-customizations" unreachable; update the function
(skip_preflight_for_release) to include "skyhook-customizations" in the
successful case pattern (e.g., add it alongside nodewright-operator and
kgateway) so the function returns 0 for that release, or alternatively adjust
the caller to use a release name that the function recognizes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 559-568: The deletion loop currently inspects all
mutatingwebhookconfigurations/validatingwebhookconfigurations cluster-wide and
may remove unrelated webhooks; restrict scope by filtering the jq pipeline to
only consider webhooks whose .clientConfig.service.namespace equals the target
release namespace (e.g., an environment variable like RELEASE_NAMESPACE or a
provided NAMESPACE variable) before attempting deletion, and use the existing
variables (wh_name, svc_ns, svc_name) to only delete when svc_ns matches that
target namespace; update the kubectl/jq command that produces the list and the
if-condition in the while loop to enforce svc_ns == "$TARGET_NAMESPACE" so only
webhooks tied to the release namespace are removed.
- Around line 85-93: The helm_force_uninstall function currently swallows
failures from the second uninstall (it uses "|| true"), so if both attempts fail
the script continues and masks a partial undeploy; change helm_force_uninstall
to detect the exit status of the second helm uninstall (the one with --no-hooks)
and propagate a non-zero failure (e.g., remove the "|| true" and return the
command's exit code or explicitly log an error and return non-zero) so callers
see the undeploy failure; reference the helm_force_uninstall function and the
helm uninstall invocation that uses --no-hooks and HELM_TIMEOUT to locate and
update the behavior.
- Around line 527-537: The loop currently collects all Terminating namespaces
into TERMINATING and force-clears finalizers for every one; restrict this to
only bundle-managed namespaces by filtering the kubectl query to bundle
namespaces (e.g. use a label selector like kubectl get ns -l
"app.kubernetes.io/managed-by=bundler" --no-headers or a name-prefix filter
using a BUNDLE_PREFIX variable) instead of the global get; update the
TERMINATING assignment and the subsequent for-loop to only iterate over those
filtered names, and keep using force_clear_namespace_finalizers and kubectl
delete namespace "${ns}" --ignore-not-found --wait=false as before.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Line 41: The README note is too broad for this nodewright-only fixture; update
the note in README.md to either (a) scope it to nodewright-specific post-install
convergence (mention Nodewright node tuning and that convergence continues
asynchronously) and remove references to GPU Operator operand rollout and DRA
kubelet plugin registration, or (b) make the note explicitly generic by
prefacing with “If deploying GPU workloads or additional components…” and link
to the CLI reference; locate and edit the paragraph that begins “Note: The
deploy script's final status…” and adjust the text to only mention
nodewright-related async actions or clearly state it applies to broader
deployments.

---

Outside diff comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 469-485: The current string search for "enabled: true" is too
brittle; instead parse values.yaml into a YAML object and assert the top-level
"enabled" key is not present or not set to true by unmarshalling valuesData into
a map[string]interface{} and checking map["enabled"]; update the check that uses
valuesStr to use the parsed map so it fails only on a leaked top-level enabled,
and keep the existing override assertion by verifying the nested
controller.replicaCount key exists in the parsed structure (e.g., assert
parsed["controller"].(map[string]interface{})["replicaCount"] is present).

---

Duplicate comments:
In `@docs/contributor/component.md`:
- Line 21: The sentence in docs referencing pkg/bundler/deployer/localformat and
the list of deployers conflates current and future behavior; update the
paragraph to explicitly separate which deployers currently consume
localformat.Write() (e.g., helm, argocd, argocd-helm if they truly do) from
planned or proposed ones (e.g., future helmfile per issue `#632`), by rewriting
the line to state "currently used by ..." vs "planned/future ..." and ensure the
symbol localformat.Write() and the deployer names (`helm`, `helmfile`, `argocd`,
`argocd-helm`) are mentioned so readers can locate the relevant implementation
and roadmap note.

In `@docs/user/cli-reference.md`:
- Line 1152: Update the wording in the docs where the phrase "local-helm wrapped
chart" appears (specifically the sentence "Manifest-only components (no upstream
Helm chart, just raw manifests) become a single local-helm wrapped chart.") to
use the compound-modifier hyphenation: change it to "local-helm-wrapped chart"
so the modifier is clear and consistent.

In `@pkg/bundler/deployer/helm/helm.go`:
- Around line 124-149: Generate currently writes new component folders but
doesn't remove previous NNN-* directories, allowing stale components to be
deployed; update Generate to prune old staged directories before writing (or
immediately after localformat.Write but before appending folder files) by
listing entries in outputDir and removing any directory matching the NNN-*
pattern that are not part of the newly returned folders; reference the
localformat.Write result variable folders, the outputDir variable, and ensure
use of deployer.SafeJoin when resolving paths so pruning enforces containment
and avoids deleting outside paths.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README's “Note” paragraph and other sections (around the shown
Note and lines 89–110) reference GPU/network/cert-manager checks and external
GPU docs which are unrelated to the kai-scheduler fixture; update those sections
to contain kai-scheduler-specific diagnostics (for example: expected
Deployment/DaemonSet status, scheduler logs, CR status, webhook health, and how
to run kubeectl describe/logs for kai-scheduler) or explicitly mark those
examples as generic/templated, and replace or annotate the AICR CLI link/notes
to point to kai-scheduler behavior docs or a clearly labeled generic deploy
behavior section so the fixture README consistently documents kai-scheduler
operational checks.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md`:
- Around line 87-110: The Troubleshooting section in README.md currently lists
GPU/network operator commands and links that don't apply to this manifest_only
fixture; update the "Troubleshooting" block under the "Check deployment status",
"View component logs", and "Verify GPU access" headings to include
skyhook-customizations-specific commands or explicitly label the existing
examples as generic/non-applicable, e.g., replace the kubectl commands and
NVIDIA links with checks that target skyhook-customizations (or add a clear note
like "Generic examples — not specific to skyhook-customizations") so the content
accurately reflects the fixture and operational guidance.
- Line 21: The table row for the manifest-only component
"skyhook-customizations" leaves Version blank and shows Source as
"skyhook-customizations ()"; update the README table so the Version cell
explicitly reads "N/A" and the Source cell reads "local" (i.e., replace the
empty Version and the "()"-style placeholder with explicit values for the
"skyhook-customizations" row).

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 3-7: The Service manifest for kind: Service named "dcgm-exporter"
is missing a spec (ports and optional selector); update the dcgm-exporter
Service resource to include a minimal spec block that declares at least one port
(port and targetPort and protocol) and, if appropriate for this fixture, a
selector matching the Deployment/Pod labels (e.g., app/dcgm-exporter) so the
Service is selectable and installable; ensure the resource remains valid YAML
and aligns with the Deployment/Pod labels in this testdata set.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 92-110: The README mixes components outside this bundle
(network-operator and cert-manager) with the bundled gpu-operator; restrict
commands and links to GPU Operator only or label the examples as generic. Update
the grep in the "Get pods" example to only search for gpu-operator, remove or
annotate the network-operator/cert-manager references in the troubleshooting
text, ensure the "View component logs" and "Verify GPU access" sections
explicitly reference the gpu-operator namespace and resource (keep kubectl logs
-n gpu-operator -l app=gpu-operator and the nvidia.com/gpu jsonpath) and replace
the Network Operator link with either a note that it’s external or remove it so
the "References" list only contains GPU Operator docs (or clearly mark others as
external examples).

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Around line 87-110: The Troubleshooting section incorrectly targets
GPU/Network operators; replace the GPU-operator and network-operator
checks/logs/links with nodewright-specific commands and references: update the
kubectl pod check to filter for nodewright-operator (e.g., grep for
'nodewright-operator' or namespace 'skyhook'), change the logs command to
kubectl logs -n skyhook -l app=nodewright-operator (or the actual label used by
the fixture), replace the GPU node allocatable check with a relevant nodewright
health or CR status check (e.g., kubectl get nodes or kubectl get <NodewrightCR>
-n skyhook -o yaml/jsonpath), and swap the external links to the
nodewright-operator documentation or the project's README; alternatively label
these entries explicitly as generic examples if you intend to keep them
non-nodewright-specific.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 91-110: The README's troubleshooting section currently shows
GPU/network-operator commands and references; update the "kubectl get pods -A |
grep -E 'gpu-operator|network-operator|cert-manager'" command, the "View
component logs" kubectl logs command, the "Verify GPU access" jsonpath check,
and the References list to be cert-manager specific. Replace the pod
listing/grep with a cert-manager namespace/label lookup (e.g., namespace
cert-manager), change the logs command to target cert-manager pods (use -n
cert-manager and appropriate label), remove the GPU allocatable jsonpath check
and instead include a basic cert-manager health/status check (kubectl get
deployments, pods, or certificate statuses), and update the References to point
to cert-manager docs/installation pages so the cert-manager-only bundle reads
correctly.

In `@pkg/bundler/deployer/localformat/doc.go`:
- Around line 15-17: Update the package doc comment for package localformat in
pkg/bundler/deployer/localformat/doc.go to avoid implying unimplemented deployer
adoption: explicitly state which deployers currently consume the local-chart
bundle layout (e.g., "currently used by X and Y") and mark others as intended or
future targets (e.g., "planned for/compatible with Z in the future"), removing
any present-tense claim that Helmfile, ArgoCD, or Flux already consume it if
they do not; ensure the text references the package name localformat and clearly
separates current consumers from intended future targets.

In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 60-66: The env file is written unquoted which allows shell
metacharacters to be executed when sourced; update the envContent construction
in upstream_helm.go to shell-quote CHART, REPO and VERSION before writing (e.g.
escape any single quotes and wrap each value in single quotes or use a dedicated
shell-escape helper), replace the current envContent :=
fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", ...) with a version that produces
CHART='...'\nREPO='...'\nVERSION='...'\n, and keep using deployer.SafeJoin and
writeFile as-is so the file path/permissions are unchanged.

In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 154-156: The target string construction currently always appends
"?ref=" even when c.Tag is empty; update the code that builds target (the branch
where c.Repository != "" that sets target) to only append "?ref=%s" when c.Tag
is non-empty (use c.Tag to decide), e.g. construct base := fmt.Sprintf("%s//%s",
c.Repository, c.Path) and then set target = base only, or target = base +
"?ref="+c.Tag when c.Tag != "". Ensure you modify the block that references
c.Repository, c.Path, c.Tag and assigns target.
- Around line 81-102: The loop that builds folder names uses idx with
fmt.Sprintf("%03d-%s", idx, c.Name) so once idx >= 1000 the glob "NNN-*" used by
orchestration will skip folders; add a fail-fast guard that checks idx >= 1000
(or > 999) before formatting dir and return an
errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("too many components: index
%d would overflow 3-digit folder prefix", idx)); apply the same guard in the
other similar block around where dir is created (the other place referenced in
the comment) so the routine returns a clear error instead of silently producing
unmatchable folder names.
- Around line 83-90: When validating opts.Components in the writer (the loop
over opts.Components that calls deployer.IsSafePathComponent), add a check to
reject name collisions where a real component named "<base>-post" would conflict
with the synthetic folder emitted for a mixed component "<base>": build a map of
component names (and track which components are mixed if there is a field like
Mixed/IsMixed) and, before accepting a component, if its Name has suffix "-post"
and the map already contains a component with the same base name (Name without
"-post") that is mixed, return an errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("invalid component name %q: reserved for mixed-component synthetic
folder", c.Name)); apply the same collision check in the second validation block
(lines 117-124) so both validation sites reject these mixed-component "-post"
collisions up front while preserving the IsSafePathComponent check.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The test currently uses grep -q 'kind: Gateway' which also
matches GatewayParameters; update the test to assert an exact full-line match
for the Gateway kind (e.g., replace the grep -q 'kind: Gateway' invocation with
a grep that matches the entire line "kind: Gateway" using either line anchors or
grep -x) so the MANIFEST check detects the Gateway object specifically while
keeping the existing GatewayParameters check intact.

In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-100: The exclusion pattern for determining INSTALL is
inconsistent: replace the grep exclusion string "-post/" with the unified
"-gpu-operator-post" so the command that sets INSTALL (the shell pipeline
producing INSTALL=$(ls ... | grep -v -- "...") ) uses the same
"-gpu-operator-post" filter used elsewhere; update the grep -v argument
accordingly to grep -v -- "-gpu-operator-post".
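To illustrate the exact-line match suggested for the kgateway test (the sample manifest content below is made up for the demo):

```shell
MANIFEST="$(mktemp)"
printf 'kind: GatewayParameters\nkind: Gateway\n' > "${MANIFEST}"

# grep -q 'kind: Gateway' would also match GatewayParameters; -x anchors
# the match to the whole line, so only the Gateway object satisfies it.
grep -qx 'kind: Gateway' "${MANIFEST}" && echo "Gateway object found"
rm -f "${MANIFEST}"
```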
@lockwobr lockwobr force-pushed the feat/generic-helm branch from e281666 to 811ecac Compare April 28, 2026 21:23
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (21)
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)

3-6: ⚠️ Potential issue | 🟠 Major

Service manifest is incomplete and not installable.

Line 4 declares kind: Service, but Lines 3-6 omit spec entirely. A non-ExternalName Service requires spec (including ports), otherwise rendered output is invalid.

Suggested fix
 apiVersion: v1
 kind: Service
 metadata:
   name: dcgm-exporter
+spec:
+  selector:
+    app.kubernetes.io/name: dcgm-exporter
+  ports:
+    - name: metrics
+      port: 9400
+      targetPort: 9400
+      protocol: TCP
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`
around lines 3 - 6, The Service manifest for "metadata.name: dcgm-exporter"
declares "kind: Service" but lacks a required "spec" block; add a proper spec
for a non-ExternalName Service including a selector that matches the exporter
pods (or appropriate labels) and at least one ports entry (e.g., port,
targetPort, protocol) so the rendered Service is valid and installable. Ensure
the manifest contains "spec:" with "selector:" referencing the pod labels and
"ports:" with the correct port definitions for dcgm-exporter.
docs/contributor/component.md (1)

21-21: ⚠️ Potential issue | 🟠 Major

Separate current support from future intent in this sentence.

“Consumed by every deployer” conflicts with “future helmfile per #632” and overstates what is implemented today. Please rewrite to explicitly state currently supported deployers and then a separate future-intent sentence.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/contributor/component.md` at line 21, The sentence overstates current
support by saying the `pkg/bundler/deployer/localformat` layout is "consumed by
every deployer" while also mentioning a future `helmfile` support; update the
text to separate current behavior from future intent: explicitly state that
`localformat.Write()` is currently used by the listed deployers (e.g., `helm`,
`argocd`, `argocd-helm`) and then add a distinct, clearly marked sentence
indicating that support for `helmfile` is planned per issue `#632` (or similar
future intent), ensuring you keep `localformat` ownership and the fact that
deployers add top-level orchestration files (deploy.sh, helmfile.yaml,
Application CRs) unchanged.
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)

71-72: ⚠️ Potential issue | 🟠 Major

Harden kind assertions to avoid false positives.

Line 72 can pass when only GatewayParameters exists, because grep -q 'kind: Gateway' matches that substring too. Anchor both kind checks to full lines.

✅ Tighten assertions
- grep -q 'kind: GatewayParameters' "${MANIFEST}"
- grep -q 'kind: Gateway' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71
- 72, The grep checks for manifest kinds are too loose and can false-match
(e.g., "kind: GatewayParameters" satisfies the "kind: Gateway" check); update
the two assertions that run grep -q 'kind: GatewayParameters' "${MANIFEST}" and
grep -q 'kind: Gateway' "${MANIFEST}" to anchor the pattern to the full line
(e.g., use start/end anchors or grep -x) so they only match exact lines "kind:
GatewayParameters" and "kind: Gateway" in the ${MANIFEST}.
pkg/bundler/bundler_test.go (1)

1199-1206: 🧹 Nitpick | 🔵 Trivial

Remove redundant nested file checks under already-missing directories.

Once the directory non-existence assertion passes, checking cluster-values.yaml beneath that same directory adds no extra signal.

♻️ Suggested simplification
 // Disabled component should NOT have a directory at all (under any numbering)
 for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
 	if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
 		t.Errorf("expected %s directory to NOT be created (component is disabled)", dir)
 	}
-	if _, statErr := os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")); !os.IsNotExist(statErr) {
-		t.Errorf("expected %s/cluster-values.yaml to NOT exist (component is disabled)", dir)
-	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/bundler_test.go` around lines 1199 - 1206, The test is
redundantly checking for cluster-values.yaml under directories already asserted
absent; remove the nested file existence checks and only assert the directory
non-existence for each of the dirs (e.g., "aws-ebs-csi-driver",
"001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver") using os.Stat on
filepath.Join(tmpDir, dir) in the test in pkg/bundler/bundler_test.go so the
loop that references tmpDir and dir only contains the directory non-existence
assertion and not the additional filepath.Join(tmpDir, dir,
"cluster-values.yaml") checks.
tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml (1)

99-100: ⚠️ Potential issue | 🟡 Minor

Normalize the post-folder exclusion pattern in the install script lookup.

Line 100 uses -post/ while the rest of this file uses -gpu-operator-post; keeping one pattern avoids confusion and drift.

Suggested patch
-              INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
-                | grep -v -- "-post/" | head -1)
+              INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
+                | grep -v -- "-gpu-operator-post" | head -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml` around lines 99 - 100,
The install script lookup uses a mismatched exclusion pattern "-post/" which is
inconsistent with the rest of the file; update the grep pattern in the INSTALL
assignment (the line that builds INSTALL) to exclude "-gpu-operator-post"
instead of "-post/" so the filter matches the project's existing post-folder
naming convention (update the grep -v argument used when selecting from
"${WORK}/helm-dynamic/[0-9]*-gpu-operator/install.sh").
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Align readiness note and references with the cert-manager-only fixture.

Line 41 and Lines 106-107 describe unrelated GPU/network topics; please scope these to cert-manager or mark them as generic guidance.

As per coding guidelines: “When writing documentation... Ensure accuracy... and document operational clarity.”

Also applies to: 106-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` at line 41,
The README contains a "Note:" paragraph that references GPU/node convergence and
network topics unrelated to the cert-manager-only fixture; update that paragraph
(the "**Note:**" block) and the other similar guidance later in the file so they
either (a) explicitly scope the content to GPU/network deployments (e.g., prefix
with "For GPU/network deployments:") or (b) reword to generic deployment
readiness guidance applicable to cert-manager-only fixtures, and align the
references to the AICR CLI docs accordingly (ensure the cert-manager-only
fixture README does not mention Nodewright/GPU operator/DRA kubelet plugin
unless scoped).
docs/user/cli-reference.md (1)

1152-1152: ⚠️ Potential issue | 🟡 Minor

Hyphenate the compound modifier in Line 1152.

Use local-helm-wrapped chart for consistent grammar.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` at line 1152, The sentence "Manifest-only
components (no upstream Helm chart, just raw manifests) become a single
local-helm wrapped chart." uses an unhyphenated compound modifier; update that
phrase to "local-helm-wrapped chart" so the compound modifier is hyphenated
correctly (search for the exact sentence "Manifest-only components (no upstream
Helm chart, just raw manifests) become a single local-helm wrapped chart." and
replace the modifier).
pkg/bundler/deployer/helm/testdata/nodewright_present/README.md (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Scope this fixture README to nodewright-only behavior and references.

Line 41 and Lines 106-107 currently describe broader GPU stack concerns/docs, but this fixture documents only nodewright-operator. Please make the note and references component-accurate (or explicitly mark them as generic).

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy... and document operational clarity.”

Also applies to: 106-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md` at line 41,
The README note currently references broad GPU-stack convergence and the AICR
CLI entry about deploy script behavior but this fixture documents only
nodewright-operator; update the text around the deploy script note (line
referencing the deploy script) and the references at lines 106-107 to be
component-accurate by limiting scope to nodewright-operator behavior (e.g.,
mention Nodewright node tuning and any nodewright-specific async actions) or
explicitly label the statements as “generic GPU-stack behavior” if you intend to
keep the broader content; ensure references to the AICR CLI or GPU operator are
either removed or replaced with nodewright-specific docs/links and that
occurrences of “GPU operator”/“DRA kubelet plugin” are not asserted as part of
this fixture’s expected state unless you mark them as generic.
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)

41-41: ⚠️ Potential issue | 🟡 Minor

Update note and references to be kai-scheduler-specific (or explicitly generic).

Line 41 and Lines 106-107 are currently GPU-stack oriented and don’t match this fixture’s component set.

As per coding guidelines: “When writing documentation... Ensure accuracy, use neutral tone... and document operational clarity.”

Also applies to: 106-107

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line
41, The README note paragraph that begins with "**Note:** The deploy script's
final status..." and the later GPU-specific sentences (referencing
`--best-effort`, "GPU workloads", "fresh GPU nodes", "Nodewright node tuning",
"GPU operator operand rollout", "DRA kubelet plugin registration") must be
rewritten to either be kai-scheduler-specific or explicitly generic: remove or
replace GPU-stack terms with kai-scheduler component names or neutral wording
about asynchronous convergence of cluster components after the script exits,
retain mention of `--best-effort` behavior and the link to the AICR CLI
Reference but adjust its context to match kai-scheduler (or state it's a generic
deploy behavior link), and ensure the revised text clarifies that final script
exit does not guarantee full operational readiness and users should check
warnings and logs.
pkg/bundler/deployer/helm/templates/README.md.tmpl (2)

131-134: ⚠️ Potential issue | 🟡 Minor

Only emit reference links that apply to the generated bundle.

This template always appends GPU Operator and Network Operator docs, so manifest-only bundles and gpu-operator-only bundles end up with unrelated references. Please gate these links on component membership, or clearly label them as generic examples. As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 131 - 134,
The README.md.tmpl currently unconditionally emits GPU Operator and Network
Operator links in the "References" section; update the template (README.md.tmpl)
to only include each operator link when the corresponding component is present
(e.g., use template conditionals like has_gpu_operator / has_network_operator or
the bundle/component membership variables available in the templating context),
or alternatively add a clear header such as "Example operator references (may
not apply to all bundles)" to avoid implying they apply to manifest-only or
single-operator bundles; adjust the "References" block generation accordingly so
links are gated by component membership or clearly labeled as generic examples.

20-27: ⚠️ Potential issue | 🟡 Minor

Render local/mixed bundle metadata explicitly in the components section.

For manifest-only/local-chart bundles this currently generates blank Version fields and name () sources, and for mixed bundles the intro still implies a 1:1 component-to-folder mapping even though the bundle can emit an injected NNN+1-<name>-post/ folder. Please make the local case explicit (N/A / local) and mention the synthetic -post folder behavior so the generated README matches the actual layout. As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 20 - 27, The
README template currently renders empty Version and source fields for
manifest-only/local-chart bundles; update the template logic around the
Components loop (referencing .Components, .Name, .Version, .ChartName,
.Repository, .Namespace) to print "N/A" for missing .Version and to use a clear
local source label such as "local" or "manifest-only" when
.ChartName/.Repository are blank, and also update the intro text to explicitly
state that mixed bundles may emit an extra synthetic folder named
"NNN+1-<name>-post/" so the numbering and component-to-folder mapping in the
generated README accurately reflect the actual layout.
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)

309-321: ⚠️ Potential issue | 🟠 Major

Don't install arbitrary stale NNN-* folders from a reused output directory.

This loop executes whatever numbered folders happen to exist on disk. If a user regenerates into an existing --output path, removed or renumbered components from the previous bundle are still picked up and installed. Please either clean the output directory before writing or render an explicit generated-folder manifest and iterate that instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 309 - 321,
The current install loop in deploy.sh.tmpl iterates whatever NNN-* directories
exist under "${SCRIPT_DIR}", which can install stale items from a reused output
dir; change it to either (A) purge/clean the output directory before writing
generated folders so the for loop over "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ only
sees fresh folders, or (B) stop globbing the filesystem and instead read an
explicit generated-folder manifest (e.g., a file written at bundle render time
like "${SCRIPT_DIR}/generated-manifest.txt") and iterate that list to set
dir/base/name/namespace (keep using the same variable names dir, base, name,
namespace and the awk extraction for namespace) so only intended folders are
installed. Ensure whichever approach you pick is implemented where the loop
currently starts (the for dir in ... block) and that the manifest is produced by
the renderer or the cleanup occurs before this loop runs.
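The manifest-driven variant (option B above) could be sketched roughly as follows; the manifest filename bundle-folders.txt and the install_from_manifest helper are hypothetical stand-ins, and the echo is a placeholder for the real helm/kubectl install call:

```shell
install_from_manifest() {
  script_dir="$1"
  manifest="${script_dir}/bundle-folders.txt"
  # Fail fast if the renderer did not produce the manifest.
  if [ ! -f "${manifest}" ]; then
    echo "Error: ${manifest} not found; regenerate the bundle" >&2
    return 1
  fi
  while IFS= read -r base; do
    # Skip blank lines in the manifest.
    [ -z "${base}" ] && continue
    if [ ! -d "${script_dir}/${base}" ]; then
      echo "Error: listed folder ${base} missing on disk" >&2
      return 1
    fi
    # Placeholder for the real helm/kubectl install of ${base}.
    echo "installing ${base}"
  done < "${manifest}"
}
```

Because only listed folders are touched, a stale 003-removed-component/ left over from a previous render is simply ignored.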
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh (2)

485-488: ⚠️ Potential issue | 🟡 Minor

Stale comment about nodewright taints in kai-scheduler fixture.

This comment block mentions "nodewright node taints" and "runtimeRequiredTaint (defaults to skyhook.nvidia.com)" but this is the kai-scheduler-only test fixture. The comment is misleading since no taint removal code follows. Consider removing or updating the comment to match the fixture's scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around
lines 485 - 488, Remove or update the stale comment block that references
"nodewright node taints" and "runtimeRequiredTaint (defaults to
skyhook.nvidia.com)" in the undeploy.sh fixture so it accurately reflects
kai-scheduler's scope; specifically either delete the comment or replace it with
a short note that no node taint removal occurs in this kai-scheduler-only
fixture (search for the comment text "nodewright" or "runtimeRequiredTaint" to
find the exact spot).

43-45: ⚠️ Potential issue | 🟡 Minor

Validate --timeout to reject non-integer values.

The --timeout parser accepts any string but later appends an s suffix (e.g., line 88: "${HELM_TIMEOUT}s"). If a user passes --timeout 5m, it becomes 5ms, causing silent misbehavior.

🛡️ Proposed fix
     --timeout)
       if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
+      if ! [[ "$2" =~ ^[0-9]+$ ]]; then
+        echo "Error: --timeout must be an integer number of seconds"
+        exit 1
+      fi
       HELM_TIMEOUT="$2"; shift 2 ;;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around
lines 43 - 45, The --timeout parsing block assigns HELM_TIMEOUT from the next
arg but doesn't validate it, causing suffix concatenation like
"${HELM_TIMEOUT}s" to misinterpret values like "5m"; update the parsing case
that sets HELM_TIMEOUT to validate that the supplied value is a positive integer
(e.g., match against a numeric regex such as ^[0-9]+$) and if it fails print a
clear error and exit non‑zero; place this check immediately after
HELM_TIMEOUT="$2" in the same case handling so invalid values are rejected
before shift 2 and subsequent use.
pkg/bundler/deployer/localformat/writer_test.go (2)

310-312: ⚠️ Potential issue | 🟡 Minor

Same panic guard needed here.

Same issue as above—if folders is empty, the t.Fatalf format string panics when evaluating folders[0].Kind.

🐛 Proposed fix
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder (kustomize wrapped), got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder (kustomize wrapped), got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder (kustomize wrapped), got kind=%v", folders[0].Kind)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 310 - 312, Test
panics if folders is empty because folders[0].Kind is evaluated inside t.Fatalf;
guard the access by first checking len(folders) and handling the zero-length
case separately (e.g., if len(folders) == 0 then fail with a message that 0
folders were returned), otherwise proceed to assert Kind on folders[0]; update
the assertion around folders and t.Fatalf usage in writer_test.go to avoid
indexing when folders is empty.

117-119: ⚠️ Potential issue | 🟡 Minor

Guard against panic when folders is empty.

If Write returns zero folders, the t.Fatalf format string evaluates folders[0].Kind, causing a panic before the assertion failure is reported. Split the length check from the kind check.

🐛 Proposed fix
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder, got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The
test currently indexes folders[0].Kind which panics if folders is empty; change
the assertions in the test to first assert len(folders) == 1 (using
t.Fatalf/t.Fatalf message that only references len(folders)), and only after
that check folders[0].Kind == localformat.KindLocalHelm (with a separate
t.Fatalf message if the kind mismatches). Ensure you update the test that
references the folders slice so the length guard executes before any indexing
(look for the folders variable and localformat.KindLocalHelm usage in
writer_test.go).
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (4)

249-253: ⚠️ Potential issue | 🟠 Major

Add skyhook-customizations to the preflight skip list.

The generated call site later checks skip_preflight_for_release "skyhook-customizations", but this case statement can never return success for that release. That makes the advertised skip path unreachable and can block undeploy on CRs the bundle deletes before uninstall.

Suggested fix
 skip_preflight_for_release() {
   case "$1" in
-    nodewright-operator|kgateway) return 0 ;;
+    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
     *) return 1 ;;
   esac
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines
249 - 253, The case in the skip_preflight_for_release function currently only
returns success for "nodewright-operator" and "kgateway", but the code later
calls skip_preflight_for_release "skyhook-customizations", so add
"skyhook-customizations" to the case list so the function returns 0 for that
release; update the case pattern in skip_preflight_for_release to include
skyhook-customizations alongside nodewright-operator and kgateway so the
preflight skip path becomes reachable.

572-581: ⚠️ Potential issue | 🟠 Major

Do not sweep stale webhooks cluster-wide.

This post-flight pass deletes any webhook whose service or namespace is missing anywhere in the cluster. That reaches outside the bundle and can remove another workload’s admission webhooks. Reuse delete_orphaned_webhooks_for_ns "skyhook" here instead of the global scan.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines
572 - 581, The script currently performs a cluster-wide scan deleting any
webhook whose backing service/namespace is missing; replace that global sweep
with a call to the existing helper delete_orphaned_webhooks_for_ns with the
target namespace "skyhook" so only webhooks tied to this bundle are removed.
Locate the block that runs kubectl/jq to enumerate webhooks and remove it, and
instead invoke delete_orphaned_webhooks_for_ns "skyhook" (ensuring the function
is defined/imported earlier in the script) so deletion is scoped to the skyhook
namespace only.

540-550: ⚠️ Potential issue | 🔴 Critical

Only force-finalize namespaces owned by this bundle.

After 60 seconds this loop strips finalizers from every Terminating namespace in the cluster, not just skyhook. On a shared cluster that can mutate unrelated tenants’ teardown flows. Filter TERMINATING to the bundle namespace set before calling force_clear_namespace_finalizers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines
540 - 550, The loop that sets TERMINATING currently collects all Terminating
namespaces cluster-wide and later force-clears finalizers for each, which can
affect unrelated tenants; modify the selection so TERMINATING is filtered to
only include the bundle's namespaces (e.g., the skyhook namespace set) before
iterating. Specifically, update the logic that computes TERMINATING (used in the
loop and in the call sites to force_clear_namespace_finalizers and kubectl
delete) to intersect/grep against the bundle namespace list (the set you use
elsewhere for this deployment), so force_clear_namespace_finalizers and kubectl
delete are only invoked for those matching namespaces rather than every
Terminating namespace. Ensure references to TERMINATING and
force_clear_namespace_finalizers are used consistently after filtering.
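The intersection step described above can be sketched as a small pure-shell helper; the function name and the skyhook usage mirror the comment's wording, not the generated script verbatim:

```shell
# Keep only the Terminating namespaces that belong to this bundle.
# $1: newline-separated Terminating namespaces (e.g. from kubectl get ns);
# remaining args: namespaces owned by this bundle.
filter_to_bundle_namespaces() {
  terminating="$1"
  shift
  printf '%s\n' "${terminating}" | while IFS= read -r ns; do
    [ -z "$ns" ] && continue
    for bundle_ns in "$@"; do
      if [ "$ns" = "$bundle_ns" ]; then
        printf '%s\n' "$ns"
      fi
    done
  done
}

# Hypothetical usage in undeploy.sh:
#   TERMINATING=$(filter_to_bundle_namespaces "${TERMINATING}" skyhook)
#   for ns in ${TERMINATING}; do force_clear_namespace_finalizers "$ns"; done
```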

85-93: ⚠️ Potential issue | 🟠 Major

Propagate the fallback helm uninstall failure.

If the --no-hooks retry also fails, the script keeps deleting cluster resources and still ends with "Undeployment complete.", which hides a partial teardown. Return non-zero from helm_force_uninstall so the undeploy stops on an actual uninstall failure. This same generated block appears in the other new undeploy goldens as well.

Suggested fix
 helm_force_uninstall() {
   local release="$1"
   local ns="$2"
   if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then
     return
   fi
   echo "  Retrying ${release} removal with --no-hooks..."
-  helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true
+  if ! helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found; then
+    echo "ERROR: failed to uninstall ${release} from ${ns}" >&2
+    return 1
+  fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 85
- 93, The helm_force_uninstall function currently swallows failures of the
fallback uninstall (it calls helm uninstall ... --no-hooks ... || true) so
undeploy proceeds even on real uninstall failure; change helm_force_uninstall to
propagate the fallback command's exit status: run the second helm uninstall
without "|| true", capture its exit code (or simply let it return), and return a
non-zero value when that uninstall fails so callers see the failure (i.e.,
ensure helm_force_uninstall returns non-zero on failure and zero on success).
pkg/bundler/deployer/localformat/writer.go (1)

113-163: ⚠️ Potential issue | 🟠 Major

Reject bundles that exceed the three-digit folder contract.

The writer formats %03d-..., but once idx reaches 1000 it emits 1000-* while the generated deploy/undeploy scripts only match [0-9][0-9][0-9]-*/. Mixed components consume two indices, so this can happen before 1000 declared components. Fail fast before formatting both the primary and injected -post folder names.

Suggested fix
 	folders := make([]Folder, 0, len(opts.Components))
 	idx := 1
 	for _, c := range opts.Components {
+		if idx > 999 {
+			return nil, errors.New(
+				errors.ErrCodeInvalidRequest,
+				"too many bundle folders; max 999 supported by the current NNN-* layout",
+			)
+		}
 		if err := ctx.Err(); err != nil {
 			return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 		}
@@
 			if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
+				if idx > 999 {
+					return nil, errors.New(
+						errors.ErrCodeInvalidRequest,
+						"too many bundle folders; max 999 supported by the current NNN-* layout",
+					)
+				}
 				postName := c.Name + "-post"
 				postDir := fmt.Sprintf("%03d-%s", idx, postName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 113 - 163, The loop
uses idx to format folder names with "%03d-%s" but never checks overflow, so
when idx >= 1000 (or idx+1 for mixed components) the produced names break the
three-digit contract; add a guard before creating primary and injected "-post"
folders: if idx > 999 return a clear errors.New(errors.ErrCodeInvalidRequest,
...) about too many components, and for the mixed-case (KindUpstreamHelm branch)
also check if idx+1 > 999 and fail before calling writeUpstreamHelmFolder or
writeLocalHelmFolder; reference the loop’s idx,
opts.Components/opts.ComponentManifests, the KindUpstreamHelm branch, and the
two helper functions writeUpstreamHelmFolder and writeLocalHelmFolder when
adding the checks and error returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 121-129: The template's use of trimming (the {{- ... -}} markers)
around the gpu conditional removes the blank line between the preceding code
fence and the "### Verify GPU access" heading causing Markdown lint errors;
update the template by removing the trailing/leading hyphen(s) from the
conditional delimiters (e.g., change {{- if $hasGPU }} / {{- end }} to {{ if
$hasGPU }} / {{ end }}) or explicitly emit a blank line before the heading so
that the block guarded by $hasGPU (set in the range .Components logic) preserves
the separator in the generated README.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh`:
- Around line 43-45: The --timeout option currently assigns any string to
HELM_TIMEOUT (in the --timeout case) and later uses it as "${HELM_TIMEOUT}s",
which corrupts values like "5m"; fix this by validating the value is an integer
before accepting it: in the --timeout case that sets HELM_TIMEOUT, check the
supplied value against an integer regex (e.g. digits-only) and emit an
error/exit if it fails, otherwise assign and shift; reference the --timeout
case, the HELM_TIMEOUT variable, and the later use of "${HELM_TIMEOUT}s" when
applying the validation.

In `@pkg/bundler/deployer/localformat/doc.go`:
- Around line 16-17: Update the package doc comment in
pkg/bundler/deployer/localformat/doc.go to separate current supported deployers
from planned ones: replace the single line that mixes both with two clear
statements such as one saying "Currently consumed by deployers: helm, helmfile."
and a separate sentence prefaced with "Planned/intent: for future use by argocd
and Flux" (or similar wording) so the godoc clearly distinguishes shipped
behavior from intended future support.

In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 82-88: The Write function currently performs filesystem operations
before honoring context cancellation; check the provided ctx for cancellation
before mutating the output tree by calling ctx.Err() (or select on ctx.Done())
at the top of Write, and return an appropriate error when canceled instead of
proceeding to os.MkdirAll or pruneStaleFolders; ensure subsequent long-running
loops or calls like pruneStaleFolders also observe ctx (propagate ctx to
pruneStaleFolders or add checks there) so pre-canceled contexts cause an
immediate return from Write rather than creating directories or returning
success.

In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml`:
- Around line 64-68: The glob used to find the manifest uses a permissive prefix
"[0-9]*-gpu-operator-post" which can match non-NNN prefixes; change the glob to
require exactly three digits (e.g., "[0-9][0-9][0-9]-gpu-operator-post")
wherever it's used (the MANIFEST assignment and the similar pattern around lines
97-98) so the folder contract is enforced; update both occurrences to use the
three-digit glob.

In
`@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml`:
- Line 69: The inline substitution --resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml
2>/dev/null | head -1)" can yield an empty string and cause a non-actionable
failure. Before each chainsaw assert invocation, capture the ls output into a
variable (e.g., template_path=$(ls ... | head -1)), verify it is non-empty (if
[ -z "$template_path" ]; then echo "ERROR: tuning template not found" >&2; exit
1; fi), and then pass --resource "$template_path" to chainsaw assert. Apply the
same guard to the other three assertion steps that use similar --resource
"$(ls ... | head -1)" calls.
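The capture-validate-use pattern the comment describes looks roughly like this; the temp-dir layout is a stand-in for the real bundle-defaults tree:

```shell
#!/bin/sh
# Guard pattern: capture the glob result, fail loudly if empty, then use it.
set -e
work=$(mktemp -d)
mkdir -p "$work/001-nodewright-customizations/templates"
echo "kind: ConfigMap" > "$work/001-nodewright-customizations/templates/tuning.yaml"

template_path=$(ls "$work"/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)
if [ -z "$template_path" ]; then
  echo "ERROR: tuning template not found" >&2
  exit 1
fi
echo "found: $template_path"
rm -rf "$work"
```

If the glob matches nothing, the script now exits with a precise error instead of handing chainsaw an empty --resource argument.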

In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml`:
- Around line 52-55: Both shell globs are too permissive; tighten them to
require exactly three leading digits. Replace the first pattern
basic/[0-9]*/values.yaml with basic/[0-9][0-9][0-9]*/values.yaml (or
basic/[0-9][0-9][0-9]-*/values.yaml if the directory includes a dash) and
replace the gpu-operator check basic/[0-9]*-gpu-operator with
basic/[0-9][0-9][0-9]-gpu-operator so the tests only match the deterministic
NNN-<component> layout.

---

Duplicate comments:
In `@docs/contributor/component.md`:
- Line 21: The sentence overstates current support by saying the
`pkg/bundler/deployer/localformat` layout is "consumed by every deployer" while
also mentioning a future `helmfile` support; update the text to separate current
behavior from future intent: explicitly state that `localformat.Write()` is
currently used by the listed deployers (e.g., `helm`, `argocd`, `argocd-helm`)
and then add a distinct, clearly marked sentence indicating that support for
`helmfile` is planned per issue `#632` (or similar future intent), ensuring you
keep `localformat` ownership and the fact that deployers add top-level
orchestration files (deploy.sh, helmfile.yaml, Application CRs) unchanged.

In `@docs/user/cli-reference.md`:
- Line 1152: The sentence "Manifest-only components (no upstream Helm chart,
just raw manifests) become a single local-helm wrapped chart." uses an
unhyphenated compound modifier; update that phrase to "local-helm-wrapped chart"
so the compound modifier is hyphenated correctly (search for the exact sentence
"Manifest-only components (no upstream Helm chart, just raw manifests) become a
single local-helm wrapped chart." and replace the modifier).

In `@pkg/bundler/bundler_test.go`:
- Around line 1199-1206: The test is redundantly checking for
cluster-values.yaml under directories already asserted absent; remove the nested
file existence checks and only assert the directory non-existence for each of
the dirs (e.g., "aws-ebs-csi-driver", "001-aws-ebs-csi-driver",
"002-aws-ebs-csi-driver") using os.Stat on filepath.Join(tmpDir, dir) in the
test in pkg/bundler/bundler_test.go so the loop that references tmpDir and dir
only contains the directory non-existence assertion and not the additional
filepath.Join(tmpDir, dir, "cluster-values.yaml") checks.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 309-321: The install loop in deploy.sh.tmpl iterates over
whatever NNN-* directories exist under "${SCRIPT_DIR}", which can install stale
items from a reused output dir. Fix it one of two ways: (A) purge/clean the
output directory before writing generated folders, so the loop over
"${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ only sees fresh folders; or (B) stop
globbing the filesystem and read an explicit generated-folder manifest written
at bundle render time (e.g., "${SCRIPT_DIR}/generated-manifest.txt"), iterating
that list to set dir/base/name/namespace (keep the same variable names and the
awk extraction for namespace) so only intended folders are installed. Implement
whichever approach you pick where the for dir in ... block currently starts,
and ensure the manifest is produced by the renderer or the cleanup runs before
this loop.
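Option (B) can be sketched like this; the manifest filename and folder names are illustrative, not the template's actual contract:

```shell
#!/bin/sh
# Iterate an explicit manifest written at render time instead of globbing
# NNN-*/, which can pick up stale folders from a reused output dir.
set -e
SCRIPT_DIR=$(mktemp -d)
mkdir -p "$SCRIPT_DIR/001-cert-manager" "$SCRIPT_DIR/999-stale-leftover"
printf '001-cert-manager\n' > "$SCRIPT_DIR/generated-manifest.txt"

installed=""
while IFS= read -r base; do
  dir="$SCRIPT_DIR/$base"
  [ -d "$dir" ] || { echo "missing generated folder: $base" >&2; exit 1; }
  # Strip the NNN- prefix to recover the component name.
  name=${base#[0-9][0-9][0-9]-}
  installed="$installed $name"
done < "$SCRIPT_DIR/generated-manifest.txt"

echo "installed:$installed"
rm -rf "$SCRIPT_DIR"
```

The stale `999-stale-leftover` folder is never visited because only entries listed in the manifest are processed.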

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 131-134: The README.md.tmpl currently unconditionally emits GPU
Operator and Network Operator links in the "References" section; update the
template (README.md.tmpl) to only include each operator link when the
corresponding component is present (e.g., use template conditionals like
has_gpu_operator / has_network_operator or the bundle/component membership
variables available in the templating context), or alternatively add a clear
header such as "Example operator references (may not apply to all bundles)" to
avoid implying they apply to manifest-only or single-operator bundles; adjust
the "References" block generation accordingly so links are gated by component
membership or clearly labeled as generic examples.
- Around line 20-27: The README template currently renders empty Version and
source fields for manifest-only/local-chart bundles; update the template logic
around the Components loop (referencing .Components, .Name, .Version,
.ChartName, .Repository, .Namespace) to print "N/A" for missing .Version and to
use a clear local source label such as "local" or "manifest-only" when
.ChartName/.Repository are blank, and also update the intro text to explicitly
state that mixed bundles may emit an extra synthetic folder named
"NNN+1-<name>-post/" so the numbering and component-to-folder mapping in the
generated README accurately reflect the actual layout.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README note paragraph that begins with "**Note:** The deploy
script's final status..." and the later GPU-specific sentences (referencing
`--best-effort`, "GPU workloads", "fresh GPU nodes", "Nodewright node tuning",
"GPU operator operand rollout", "DRA kubelet plugin registration") must be
rewritten to either be kai-scheduler-specific or explicitly generic: remove or
replace GPU-stack terms with kai-scheduler component names or neutral wording
about asynchronous convergence of cluster components after the script exits,
retain mention of `--best-effort` behavior and the link to the AICR CLI
Reference but adjust its context to match kai-scheduler (or state it's a generic
deploy behavior link), and ensure the revised text clarifies that final script
exit does not guarantee full operational readiness and users should check
warnings and logs.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 485-488: Remove or update the stale comment block that references
"nodewright node taints" and "runtimeRequiredTaint (defaults to
skyhook.nvidia.com)" in the undeploy.sh fixture so it accurately reflects
kai-scheduler's scope; specifically either delete the comment or replace it with
a short note that no node taint removal occurs in this kai-scheduler-only
fixture (search for the comment text "nodewright" or "runtimeRequiredTaint" to
find the exact spot).
- Around line 43-45: The --timeout parsing block assigns HELM_TIMEOUT from the
next arg without validating it, so suffix concatenation like
"${HELM_TIMEOUT}s" misinterprets values such as "5m". In the case branch that
sets HELM_TIMEOUT, validate that the supplied value is a positive integer
(e.g., match it against a numeric regex such as ^[0-9]+$); on failure, print a
clear error and exit non-zero. Place the check immediately after
HELM_TIMEOUT="$2", before shift 2, so invalid values are rejected before use.
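A POSIX-sh sketch of that validation, using a case pattern (equivalent in effect to matching ^[0-9]+$); the function name is made up:

```shell
#!/bin/sh
# Reject non-integer --timeout values before a "s" suffix is appended,
# so "5m" cannot silently become "5ms".
set -e
validate_timeout() {
  case "$1" in
    ''|*[!0-9]*)
      echo "ERROR: --timeout expects a positive integer (seconds), got '$1'" >&2
      return 1
      ;;
  esac
  return 0
}

validate_timeout 600 && echo "600 ok"
validate_timeout 5m || echo "5m rejected"
```

The case pattern `''|*[!0-9]*` catches empty values and anything containing a non-digit, which covers both missing arguments and duration-style strings.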

In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 249-253: The case in the skip_preflight_for_release function
currently only returns success for "nodewright-operator" and "kgateway", but the
code later calls skip_preflight_for_release "skyhook-customizations", so add
"skyhook-customizations" to the case list so the function returns 0 for that
release; update the case pattern in skip_preflight_for_release to include
skyhook-customizations alongside nodewright-operator and kgateway so the
preflight skip path becomes reachable.
- Around line 572-581: The script currently performs a cluster-wide scan
deleting any webhook whose backing service/namespace is missing; replace that
global sweep with a call to the existing helper delete_orphaned_webhooks_for_ns
with the target namespace "skyhook" so only webhooks tied to this bundle are
removed. Locate the block that runs kubectl/jq to enumerate webhooks and remove
it, and instead invoke delete_orphaned_webhooks_for_ns "skyhook" (ensuring the
function is defined/imported earlier in the script) so deletion is scoped to the
skyhook namespace only.
- Around line 540-550: The loop that sets TERMINATING currently collects all
Terminating namespaces cluster-wide and later force-clears finalizers for each,
which can affect unrelated tenants; modify the selection so TERMINATING is
filtered to only include the bundle's namespaces (e.g., the skyhook namespace
set) before iterating. Specifically, update the logic that computes TERMINATING
(used in the loop and in the call sites to force_clear_namespace_finalizers and
kubectl delete) to intersect/grep against the bundle namespace list (the set you
use elsewhere for this deployment), so force_clear_namespace_finalizers and
kubectl delete are only invoked for those matching namespaces rather than every
Terminating namespace. Ensure references to TERMINATING and
force_clear_namespace_finalizers are used consistently after filtering.
- Around line 85-93: helm_force_uninstall currently swallows failures of the
fallback uninstall (it runs helm uninstall ... --no-hooks ... || true), so
undeploy proceeds even when the uninstall genuinely fails. Drop the "|| true"
on the second helm uninstall and let its exit status propagate (capture it or
simply let the function return it), so helm_force_uninstall returns non-zero on
failure and zero on success and callers see the failure.
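The propagation fix can be sketched with a stand-in for the real helm uninstall calls; `force_uninstall` and `run_uninstall` are hypothetical names, not the script's actual functions:

```shell
#!/bin/sh
# Sketch of propagating the fallback's exit status instead of masking it
# with "|| true". run_uninstall stands in for the real helm uninstall.
force_uninstall() {
  release=$1
  # First attempt (normal uninstall); tolerate failure and fall back.
  run_uninstall "$release" && return 0
  # Fallback with --no-hooks: let its status propagate to the caller.
  run_uninstall "$release" --no-hooks
}

# Stand-in that always fails, to show the non-zero status reaching the caller.
run_uninstall() { return 1; }

if force_uninstall demo-release; then
  echo "unexpected success"
else
  echo "failure propagated"
fi
```

Because the fallback is the function's last command, its exit status becomes the function's return value with no explicit `return` needed.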

In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 3-6: The Service manifest for "metadata.name: dcgm-exporter"
declares "kind: Service" but lacks a required "spec" block; add a proper spec
for a non-ExternalName Service including a selector that matches the exporter
pods (or appropriate labels) and at least one ports entry (e.g., port,
targetPort, protocol) so the rendered Service is valid and installable. Ensure
the manifest contains "spec:" with "selector:" referencing the pod labels and
"ports:" with the correct port definitions for dcgm-exporter.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Line 41: The README note currently references broad GPU-stack convergence and
the AICR CLI entry about deploy script behavior but this fixture documents only
nodewright-operator; update the text around the deploy script note (line
referencing the deploy script) and the references at lines 106-107 to be
component-accurate by limiting scope to nodewright-operator behavior (e.g.,
mention Nodewright node tuning and any nodewright-specific async actions) or
explicitly label the statements as “generic GPU-stack behavior” if you intend to
keep the broader content; ensure references to the AICR CLI or GPU operator are
either removed or replaced with nodewright-specific docs/links and that
occurrences of “GPU operator”/“DRA kubelet plugin” are not asserted as part of
this fixture’s expected state unless you mark them as generic.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Line 41: The README contains a "Note:" paragraph that references GPU/node
convergence and network topics unrelated to the cert-manager-only fixture;
update that paragraph (the "**Note:**" block) and the other similar guidance
later in the file so they either (a) explicitly scope the content to GPU/network
deployments (e.g., prefix with "For GPU/network deployments:") or (b) reword to
generic deployment readiness guidance applicable to cert-manager-only fixtures,
and align the references to the AICR CLI docs accordingly (ensure the
cert-manager-only fixture README does not mention Nodewright/GPU operator/DRA
kubelet plugin unless scoped).

In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 310-312: Test panics if folders is empty because folders[0].Kind
is evaluated inside t.Fatalf; guard the access by first checking len(folders)
and handling the zero-length case separately (e.g., if len(folders) == 0 then
fail with a message that 0 folders were returned), otherwise proceed to assert
Kind on folders[0]; update the assertion around folders and t.Fatalf usage in
writer_test.go to avoid indexing when folders is empty.
- Around line 117-119: The test currently indexes folders[0].Kind which panics
if folders is empty; change the assertions in the test to first assert
len(folders) == 1 (using t.Fatalf/t.Fatalf message that only references
len(folders)), and only after that check folders[0].Kind ==
localformat.KindLocalHelm (with a separate t.Fatalf message if the kind
mismatches). Ensure you update the test that references the folders slice so the
length guard executes before any indexing (look for the folders variable and
localformat.KindLocalHelm usage in writer_test.go).
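The guard-before-index pattern both comments ask for reduces to something like this; the folder type and helper are simplified stand-ins for the writer_test fixtures:

```go
package main

import "fmt"

type folder struct{ Kind string }

// firstKind guards slice access so callers can report "0 folders returned"
// instead of panicking on folders[0] inside a t.Fatalf argument.
func firstKind(folders []folder) (string, error) {
	if len(folders) == 0 {
		return "", fmt.Errorf("expected 1 folder, got 0")
	}
	return folders[0].Kind, nil
}

func main() {
	if _, err := firstKind(nil); err != nil {
		fmt.Println("guarded:", err)
	}
	kind, _ := firstKind([]folder{{Kind: "local-helm"}})
	fmt.Println(kind)
}
```

In the actual tests the same shape becomes a t.Fatalf on the length check first, and only then an assertion on folders[0].Kind.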

In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 113-163: The loop formats folder names with "%03d-%s" from idx
but never checks for overflow, so when idx >= 1000 (or idx+1 for mixed
components) the produced names break the three-digit contract. Add a guard
before creating the primary and injected "-post" folders: if idx > 999, return
a clear errors.New(errors.ErrCodeInvalidRequest, ...) about too many
components; in the mixed-case KindUpstreamHelm branch, also check idx+1 > 999
and fail before calling writeUpstreamHelmFolder or writeLocalHelmFolder. The
checks should reference the loop's idx and opts.Components/opts.ComponentManifests.
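A minimal sketch of the overflow guard, with the error simplified to fmt.Errorf (the real code would use the package's errors.New with ErrCodeInvalidRequest, and the loop variables differ):

```go
package main

import "fmt"

// folderName sketches the three-digit contract guard before formatting.
func folderName(idx int, component string) (string, error) {
	if idx > 999 {
		return "", fmt.Errorf("too many components: index %d breaks the NNN- folder contract", idx)
	}
	return fmt.Sprintf("%03d-%s", idx, component), nil
}

func main() {
	name, _ := folderName(1, "gpu-operator")
	fmt.Println(name) // 001-gpu-operator
	// A mixed component also consumes idx+1 for its injected "-post" folder,
	// so guard both indices before writing anything to disk.
	if _, err := folderName(1000, "overflow"); err != nil {
		fmt.Println("rejected")
	}
}
```

Failing before any folder is written keeps a partially materialized bundle from ever containing a four-digit prefix.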

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The grep checks for manifest kinds are too loose and can
false-match (e.g., "kind: GatewayParameters" satisfies the "kind: Gateway"
check). Update the two assertions that run grep -q 'kind: GatewayParameters'
"${MANIFEST}" and grep -q 'kind: Gateway' "${MANIFEST}" to anchor the pattern
to the full line (start/end anchors or grep -x) so they only match the exact
lines "kind: GatewayParameters" and "kind: Gateway" in the ${MANIFEST}.
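The difference between the loose and anchored patterns is easy to demonstrate; the manifest file here is a throwaway:

```shell
#!/bin/sh
# Anchored grep avoids "kind: Gateway" matching "kind: GatewayParameters".
set -e
manifest=$(mktemp)
printf 'kind: GatewayParameters\n' > "$manifest"

grep -q 'kind: Gateway' "$manifest" && echo "loose: false positive"
grep -q '^kind: Gateway$' "$manifest" || echo "anchored: no match"

printf 'kind: Gateway\n' >> "$manifest"
grep -q '^kind: Gateway$' "$manifest" && echo "anchored: real match"
rm -f "$manifest"
```

`grep -x 'kind: Gateway'` is equivalent to the `^...$` anchors and may read more clearly in the test YAML.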

In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-100: The install script lookup uses a mismatched exclusion
pattern "-post/" which is inconsistent with the rest of the file; update the
grep pattern in the INSTALL assignment (the line that builds INSTALL) to exclude
"-gpu-operator-post" instead of "-post/" so the filter matches the project's
existing post-folder naming convention (update the grep -v argument used when
selecting from "${WORK}/helm-dynamic/[0-9]*-gpu-operator/install.sh").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7784bcaa-fe35-49c3-9754-e17d419df09b

📥 Commits

Reviewing files that changed from the base of the PR and between e281666 and 811ecac.

📒 Files selected for processing (78)
  • Makefile
  • docs/contributor/component.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler_test.go
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/README.md
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/nodewright_present/README.md
  • pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
  • pkg/bundler/deployer/localformat/doc.go
  • pkg/bundler/deployer/localformat/folder.go
  • pkg/bundler/deployer/localformat/kustomize.go
  • pkg/bundler/deployer/localformat/local_helm.go
  • pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
  • pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/testdata/README.md
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
  • pkg/bundler/deployer/localformat/upstream_helm.go
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/handler_test.go
  • tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
  • tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml

Comment thread pkg/bundler/deployer/helm/templates/README.md.tmpl Outdated
Comment thread pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
Comment thread pkg/bundler/deployer/localformat/doc.go Outdated
Comment thread pkg/bundler/deployer/localformat/writer.go
Comment thread tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
Comment thread tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml Outdated
Comment thread tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
@lockwobr lockwobr force-pushed the feat/generic-helm branch from 811ecac to 24374b4 Compare April 28, 2026 23:03

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (20)
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)

71-72: ⚠️ Potential issue | 🟠 Major

Anchor kind assertions to avoid false positives.

At Line 72, grep -q 'kind: Gateway' can still pass on kind: GatewayParameters, so this assertion can mask regressions.

Proposed fix
-              grep -q 'kind: GatewayParameters' "${MANIFEST}"
-              grep -q 'kind: Gateway' "${MANIFEST}"
+              grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+              grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71
- 72, The grep asserting 'kind: Gateway' can be satisfied by 'kind:
GatewayParameters' and thus give false positives; update the check that reads
from MANIFEST to anchor the kind assertion so it matches only the exact line
(e.g., match start/end and optional leading whitespace) for 'kind: Gateway'
rather than a substring, leaving the existing 'kind: GatewayParameters'
assertion intact; target the two grep assertions that reference MANIFEST and
replace the loose substring match for 'kind: Gateway' with an anchored
exact-line match.
tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml (1)

52-55: ⚠️ Potential issue | 🟡 Minor

Tighten glob patterns to enforce exact NNN- layout.

Line 52 and Line 54 still accept any digit-prefixed folder, so this test can pass even when the folder naming contract is violated.

🔧 Suggested patch
-              ls "${WORK}"/basic/[0-9]*/values.yaml >/dev/null 2>&1
+              ls "${WORK}"/basic/[0-9][0-9][0-9]-*/values.yaml >/dev/null 2>&1
@@
-              ls -d "${WORK}"/basic/[0-9]*-gpu-operator >/dev/null 2>&1 \
+              ls -d "${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator >/dev/null 2>&1 \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml` around lines 52 - 55,
The glob patterns in the two shell checks (ls "${WORK}"/basic/[0-9]*/values.yaml
and ls -d "${WORK}"/basic/[0-9]*-gpu-operator) are too permissive; tighten them
to require exactly three digits for the NNN prefix by replacing [0-9]* with
[0-9][0-9][0-9] (so use ls "${WORK}"/basic/[0-9][0-9][0-9]/values.yaml and ls -d
"${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator).
pkg/bundler/bundler_test.go (1)

1204-1206: 🧹 Nitpick | 🔵 Trivial

Remove redundant cluster-values.yaml checks under non-existent directories.

These checks are unreachable once the parent directory non-existence assertion passes, so they add noise without additional signal.

Suggested patch
 	for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
 		if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
 			t.Errorf("expected %s directory to NOT be created (component is disabled)", dir)
 		}
-		if _, statErr := os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")); !os.IsNotExist(statErr) {
-			t.Errorf("expected %s/cluster-values.yaml to NOT exist (component is disabled)", dir)
-		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/bundler_test.go` around lines 1204 - 1206, The test contains a
redundant unreachable check using os.Stat(filepath.Join(tmpDir, dir,
"cluster-values.yaml")) after already asserting the parent directory does not
exist; remove this extra cluster-values.yaml existence assertion (the os.Stat +
os.IsNotExist branch) so the test only asserts the parent directory
non-existence using tmpDir and dir, and does not perform the unreachable
filepath.Join("cluster-values.yaml") check.
tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml (1)

69-69: ⚠️ Potential issue | 🟡 Minor

Add explicit non-empty path guards before chainsaw assert.

Each inline --resource "$(ls ... | head -1)" can resolve to an empty string and produce a non-actionable failure. Capture into a variable, validate, then assert.

Suggested pattern (apply to all four assertions)
-              chainsaw assert \
-                --resource "$(ls "${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)" \
+              TUNING_PATH=$(ls "${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)
+              [ -n "${TUNING_PATH}" ] || { echo "tuning.yaml not found in bundle-defaults" >&2; exit 1; }
+              chainsaw assert \
+                --resource "${TUNING_PATH}" \
                 --file ./assert-tuning-defaults.yaml \
                 --no-color --timeout 5s

Also applies to: 95-95, 121-121, 149-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml`
at line 69, Replace each inline --resource "$(ls ... | head -1)" usage with a
two-step guard: capture the command output into a variable, test that the
variable is non-empty and points to an existing file, then pass that variable to
the chainsaw assert invocation; specifically update occurrences of the pattern
--resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml
2>/dev/null | head -1)" so you first set something like resource_path=$(ls ... |
head -1) (unique symbol: resource_path), validate [ -n "$resource_path" ] && [
-f "$resource_path" ] (or exit with a clear error) and only then call chainsaw
assert --resource "$resource_path"; apply the same change to the three other
occurrences noted.
tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml (1)

67-67: ⚠️ Potential issue | 🟡 Minor

Enforce exact NNN- prefix in both glob checks.

Using [0-9]*-... is too permissive and can mask regressions away from the numbered-folder contract. Please require exactly three digits in both places.

Suggested patch
-              MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9]*-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | head -1)
+              MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | head -1)
@@
-              MATCHES=$(ls "${WORK}"/bundle-dcgm-disabled/[0-9]*-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | wc -l)
+              MATCHES=$(ls "${WORK}"/bundle-dcgm-disabled/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | wc -l)

Also applies to: 97-97

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml` at line 67,
Update the loose glob used when assigning MANIFEST to require exactly three
leading digits: replace occurrences of the pattern
"[0-9]*-gpu-operator-post/..." with an exact-three-digit pattern
"[0-9][0-9][0-9]-gpu-operator-post/..." (the line assigning MANIFEST and the
other occurrence around line 97); ensure both glob checks use [0-9][0-9][0-9]-
so only folders with a 3-digit prefix match.
pkg/bundler/deployer/helm/templates/README.md.tmpl (1)

121-129: ⚠️ Potential issue | 🟡 Minor

Preserve the blank line before the GPU heading.

The trim markers around the $hasGPU conditional can collapse spacing after the code fence; remove trim on the if/end delimiters to keep markdown structure stable.

🧩 Proposed template fix
-{{- if $hasGPU }}
+{{ if $hasGPU }}
 ### Verify GPU access
 
 ```bash
 kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]'
 ```

-{{- end }}
+{{ end }}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @pkg/bundler/deployer/helm/templates/README.md.tmpl around lines 121 - 129,
The template's trim markers on the $hasGPU conditional are collapsing the blank
line before the "### Verify GPU access" heading; update the Helm template by
removing the leading/trailing dash trim on the conditional delimiters so the
block uses {{ if $hasGPU }} ... {{ end }} (keeping the inner code fence and
content unchanged) so the markdown blank line and code fence are preserved;
locate the $hasGPU declaration and the conditional around the "### Verify GPU
access" heading and change {{- if $hasGPU }} / {{- end }} to {{ if $hasGPU }} /
{{ end }}.


docs/user/cli-reference.md (1)

1152-1152: ⚠️ Potential issue | 🟡 Minor

Hyphenate the compound modifier for clarity.

Use `local-helm-wrapped chart` instead of `local-helm wrapped chart`.

 
✏️ Proposed doc fix
-- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
+- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm-wrapped chart.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` at line 1152, Replace the unhyphenated phrase
"local-helm wrapped chart" with the hyphenated compound modifier
"local-helm-wrapped chart" in the sentence "Manifest-only components (no
upstream Helm chart, just raw manifests) become a single local-helm wrapped
chart." so the compound modifier is clear and grammatically correct; locate and
update that exact phrase in docs/user/cli-reference.md.
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)

109-112: ⚠️ Potential issue | 🟡 Minor

References include a component not present in this fixture.

Line 112 links Network Operator docs in a gpu-operator-only fixture. Please remove it or explicitly mark the references list as generic.

As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines
109 - 112, The README references a component not present in this fixture: remove
or mark the Network Operator link ("Network Operator Documentation" /
https://docs.nvidia.com/networking/display/cokan10/network+operator) so the
references reflect this GPU-operator-only fixture; either delete that list item
or replace it with a generic note such as "Additional NVIDIA operator docs (not
included in this fixture)" to make the references accurate and avoid implying
the Network Operator is part of this testdata.
docs/contributor/component.md (1)

21-21: ⚠️ Potential issue | 🟠 Major

Separate implemented behavior from roadmap language.

Line 21 still reads as if all deployers currently consume localformat.Write(). Please split this into: (1) deployers that use it today, and (2) future intent (e.g., helmfile) as prospective.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/contributor/component.md` at line 21, Split the sentence to clearly
separate current implemented behavior from future intent: state that
pkg/bundler/deployer/localformat produces the NNN-<component>/ layout and that
current deployers (list the ones actually using it today, e.g., helm, argocd,
argocd-helm if true) call localformat.Write(), then add a second clarifying
sentence that mentions helmfile (and any other roadmap items) only as
prospective future consumers; keep the reference to localformat.Write() and the
per-folder contents (Chart.yaml, values.yaml, cluster-values.yaml, install.sh,
templates/, upstream.env) intact while removing any language implying that all
listed deployers already use localformat.Write().
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)

104-107: ⚠️ Potential issue | 🟡 Minor

Reference links are not aligned with the kai-scheduler fixture scope.

Lines 106–107 point to GPU/Network Operator docs in a kai_scheduler_present README. Please replace them with kai-scheduler-relevant references or label these links as generic.

As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` around
lines 104 - 107, The References section in the kai_scheduler_present README
currently lists GPU/Network Operator links that are out of scope for the
kai-scheduler fixture; update the "References" under the "## References" header
so it only contains kai-scheduler-relevant links or change those two entries
into a clearly labeled "Generic operator references" note. Locate the README's
"References" block (the lines containing the NVIDIA GPU Operator and Network
Operator URLs) and either replace them with kai-scheduler documentation links or
prepend a label like "Generic operator docs (not directly related to
kai-scheduler):" to make the scope clear.
pkg/bundler/deployer/localformat/doc.go (1)

15-16: ⚠️ Potential issue | 🟠 Major

Separate current consumers from future intent in the package doc.

This wording reads as if helmfile/argocd/Flux already consume localformat, but this PR only wires Helm to it. Please split this into “currently used by …” vs “intended for …” so the godoc does not overstate shipped support.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/doc.go` around lines 15 - 16, Package doc
comment in pkg/bundler/deployer/localformat/doc.go overstates consumers by
implying helmfile/argocd/Flux already use localformat; update the package
comment to clearly separate current consumers from intended future consumers.
Edit the top-level godoc for package localformat to state which tools currently
consume the format (e.g., Helm) and then add a distinct sentence like "intended
for" or "planned for" listing helmfile, argocd, Flux as future targets, ensuring
no language implies they already consume it; keep the package name localformat
and any existing descriptive text otherwise unchanged.
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)

104-107: ⚠️ Potential issue | 🟡 Minor

Replace the unrelated operator references.

This fixture only documents cert-manager, so linking GPU Operator and Network Operator docs here makes the README inaccurate. Point this section at cert-manager docs or generic bundle docs instead.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor... Ensure accuracy... and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines
104 - 107, Update the References section in the README.md by removing the
unrelated "GPU Operator Documentation" and "Network Operator Documentation"
links and replacing them with authoritative cert-manager documentation (e.g.,
cert-manager installation and usage pages) or a generic bundle docs link; target
the "## References" header and the two existing bullet link entries so the
fixture only documents cert-manager and stays accurate.
pkg/bundler/deployer/helm/testdata/manifest_only/README.md (1)

104-107: ⚠️ Potential issue | 🟡 Minor

Align the references with skyhook-customizations.

These links are still GPU/network-operator specific, but this README documents only skyhook-customizations. Replace them with skyhook-relevant docs or generic bundle documentation so the fixture stays accurate.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor... Ensure accuracy... and document operational clarity.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` around lines 104
- 107, The References section currently under the "## References" heading
contains GPU/operator-specific links; update that section in
pkg/bundler/deployer/helm/testdata/manifest_only/README.md to reference
skyhook-relevant resources (e.g., the skyhook-customizations documentation or a
generic bundle/operator overview) instead of the NVIDIA GPU Operator and Network
Operator links; specifically replace the two links currently pointing to "GPU
Operator Documentation" and "Network Operator Documentation" with appropriate
skyhook-customizations doc URLs or canonical bundle docs and adjust link labels
accordingly so the fixture accurately documents skyhook-customizations.
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)

309-321: ⚠️ Potential issue | 🟠 Major

Don't drive installs from whatever NNN-* folders happen to be on disk.

This loop will run every numbered directory under the bundle root. If regeneration reuses an existing --output path and leaves stale folders behind, removed or renumbered components are installed again. Prefer rendering an explicit generated-folder manifest and iterating over it, or clearing old numbered folders before writing.

🔎 Verification script
#!/bin/bash
set -euo pipefail

echo "=== localformat.Write call sites ==="
rg -n -C3 'localformat\.Write' pkg --type go

echo
echo "=== output-dir cleanup around bundle generation ==="
rg -n -C3 'RemoveAll|os\.RemoveAll|ReadDir|MkdirAll|NNN-|\[0-9\]\[0-9\]\[0-9\]-' pkg/bundler --type go

echo
echo "=== filesystem-driven folder enumeration in templates ==="
rg -n -C2 '\[0-9\]\[0-9\]\[0-9\]-\*/|NNN-\*' pkg/bundler/deployer/helm/templates
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 309 - 321,
The current install loop in deploy.sh.tmpl (the for dir in
"${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; ... block that reads namespace from
"${dir}/install.sh") is unsafe because it installs whatever numbered folders
happen to exist on disk; change it to iterate only over an explicit generated
manifest or clean the output before generation: either (A) have the bundler emit
a deterministic list file (e.g. GENERATED_MANIFEST or
"${HELM_WORKDIR}/components.list") and replace the glob loop with reading that
list and using each listed path to read install.sh and extract --namespace, or
(B) ensure the bundler removes/cleans all existing
"${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ folders before writing new output so the
existing glob cannot pick up stale dirs; update the code that writes the
numbered folders (bundle generation) or add a pre-install cleanup step in
deploy.sh.tmpl to call out/RemoveAll on those dirs accordingly so only intended
components are installed.
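Option (A) above can be sketched as a small, self-contained shell loop. Note that `components.list` and the folder names are illustrative assumptions for this demo, not files the bundler emits today:

```shell
#!/usr/bin/env bash
# Sketch of option (A): drive installs from an explicit manifest instead of
# globbing NNN-*/ folders. components.list is a hypothetical file name --
# the bundler does not emit it today.
set -euo pipefail

SCRIPT_DIR="$(mktemp -d)"   # stand-in for the bundle root

# Simulate a regenerated bundle: one live folder plus a stale leftover
# from a previous run that a glob-based loop would wrongly install.
mkdir -p "${SCRIPT_DIR}/001-gpu-operator" "${SCRIPT_DIR}/002-stale-component"
printf '001-gpu-operator\n' > "${SCRIPT_DIR}/components.list"

installed=""
while IFS= read -r dir; do
  [ -d "${SCRIPT_DIR}/${dir}" ] || { echo "missing ${dir}" >&2; exit 1; }
  installed="${installed}${dir} "
done < "${SCRIPT_DIR}/components.list"

echo "would install: ${installed}"
```

Because the stale `002-stale-component` folder never appears in the manifest, it is skipped even though it exists on disk.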
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)

463-466: ⚠️ Potential issue | 🟠 Major

Include injected -post releases in the CRD/finalizer safety scans.

The uninstall loop now treats <name>-post as a real Helm release, but the pre-flight check_release_for_stuck_crds, orphan-CRD cleanup, and post-flight orphaned-CRD scan still render only from .ComponentsReversed. Any CRDs or finalizer-bearing CRs owned by the wrapper release bypass those safeguards.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 463 - 466,
The pre/post-flight CRD and finalizer safety scans currently iterate only
.ComponentsReversed and therefore miss injected "-post" releases; update the
scans (check_release_for_stuck_crds, the orphan-CRD cleanup loop, and the
post-flight orphaned-CRD scan) to also consider the injected "<name>-post"
releases by either: (a) when looping .ComponentsReversed, treat both the
component.Name and component.Name+"-post" (using the same component.Namespace
mapping set earlier), or (b) build a combined list that appends "{{ .Name
}}-post" for any component that has injected mixed-component post folders and
iterate that combined list for all CRD/finalizer checks; ensure you reference
the same namespace resolution logic used in the existing injected "-post"
if-block so those releases are included in check_release_for_stuck_crds and the
orphan-CRD scans.
pkg/bundler/deployer/localformat/writer.go (2)

82-88: ⚠️ Potential issue | 🟠 Major

Check context cancellation before filesystem I/O.

The function performs os.MkdirAll and pruneStaleFolders before the first ctx.Err() check at line 116. A pre-canceled context will still create/prune directories before returning an error, and with zero components it returns success instead of a timeout error.

🐛 Proposed fix: add context check at function entry
 func Write(ctx context.Context, opts Options) ([]Folder, error) {
+	if err := ctx.Err(); err != nil {
+		return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
+	}
 	if err := os.MkdirAll(opts.OutputDir, 0o755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal, "create output dir", err)
 	}

As per coding guidelines: "Always check ctx.Done() in long-running operations and loops to respect context cancellation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 82 - 88, The Write
function performs filesystem operations (os.MkdirAll and pruneStaleFolders)
before honoring context cancellation; add an early check for ctx cancellation at
the start of Write (use ctx.Err() or select on ctx.Done()) and return an
appropriate wrapped cancellation error instead of proceeding with
os.MkdirAll/pruneStaleFolders so that a pre-canceled context does not
create/prune directories or return success.

113-134: ⚠️ Potential issue | 🟠 Major

Fail fast when folder index exceeds 999.

The folder name format %03d- produces 4+ digit prefixes once idx >= 1000, but deploy.sh/undeploy.sh only match [0-9][0-9][0-9]-*/. Components beyond 999 would be silently skipped during deployment.

🐛 Proposed fix: add a cap check before formatting
 	folders := make([]Folder, 0, len(opts.Components))
 	idx := 1
 	for _, c := range opts.Components {
+		if idx > 999 {
+			return nil, errors.New(errors.ErrCodeInvalidRequest,
+				"too many components; max 999 supported by NNN-* folder layout")
+		}
 		if err := ctx.Err(); err != nil {
 			return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer.go` around lines 113 - 134, The loop
that builds directory names with dir := fmt.Sprintf("%03d-%s", idx, c.Name) can
produce 4+ digit prefixes when idx >= 1000 which break deploy.sh pattern
matching; add a guard before formatting (e.g., if idx >= 1000) to return an
appropriate error (use the same errors.New signature and ErrCodeInvalidRequest
used elsewhere) so components beyond 999 fail fast; update the loop around idx,
fmt.Sprintf and classify(...) to perform this cap check and return early with a
clear message like "too many components; maximum supported is 999".
pkg/bundler/deployer/localformat/writer_test.go (2)

117-119: ⚠️ Potential issue | 🟡 Minor

Guard against panic when formatting error message with empty folders.

The condition len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm uses short-circuit evaluation, but the t.Fatalf message still formats folders[0].Kind unconditionally. If Write returns zero folders, this panics while trying to report the failure.

🐛 Proposed fix: separate length check from kind check
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder, got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The
current failure message can panic because it always formats folders[0].Kind even
when folders is empty; change the checks in the test to first assert
len(folders) == 1 and call t.Fatalf with only len(folders) if that fails, then
separately check folders[0].Kind != localformat.KindLocalHelm and call t.Fatalf
including folders[0].Kind only in that branch—look for the test using the local
variable folders and the comparison to localformat.KindLocalHelm and split the
length check from the kind check to avoid indexing an empty slice.

310-312: ⚠️ Potential issue | 🟡 Minor

Same panic risk as above — separate length check from kind check.

If Write returns zero folders, folders[0].Kind in the error message will panic.

🐛 Proposed fix
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder (kustomize wrapped), got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder (kustomize wrapped), got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder (kustomize wrapped), got kind=%v", folders[0].Kind)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 310 - 312, The
test currently accesses folders[0].Kind inside the same conditional as the
length check, which can panic when len(folders) == 0; change the logic to first
assert len(folders) == 1 (and call t.Fatalf or t.Fatalf with a message that does
not index folders) and only after that check folders[0].Kind against
localformat.KindLocalHelm, using separate t.Fatalf calls for the length vs kind
failures; refer to the variables/symbols folders, localformat.KindLocalHelm and
the t.Fatalf usages in writer_test.go to locate and update the checks.
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh (1)

485-487: ⚠️ Potential issue | 🟡 Minor

Drop the unconditional nodewright cleanup banner from non-nodewright outputs.

This fixture never emits the corresponding taint-removal commands, so the generated script claims cleanup it does not perform. Please gate this comment block on the same condition as the actual nodewright taint-removal section in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh` around
lines 485 - 487, The fixture's undeploy.sh testdata emits a banner about
removing nodewright taints even when no taint-removal commands are generated;
update the test fixture to gate that comment block behind the same condition
used in the template (the nodewright taint-removal section) so the banner only
appears when nodewright taints are actually removed—match the condition used in
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (the
runtimeRequiredTaint/nodewright presence check) and ensure the comment block in
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh is wrapped
with that same conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 102-103: The markdown has no blank line between a fenced code
block and the following heading "### Verify GPU access", which triggers
MD031/MD022; add a single blank line after the closing triple-backtick fence so
there is an empty line before the "### Verify GPU access" heading to satisfy the
linter.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Around line 106-107: The README.md in the nodewright_present test fixture
currently includes links titled "GPU Operator Documentation" and "Network
Operator Documentation" that are not specific to nodewright-operator; update the
fixture's README.md so those two entries either (a) are replaced with
nodewright-operator–relevant references (e.g., official nodewright operator
docs, repo, API reference) or (b) are clearly annotated as generic/example
operator docs (prefix with "Generic example:" or similar) to avoid mis-scoping;
ensure the change is applied within the nodewright_present README.md where those
link texts appear.

In `@pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh`:
- Around line 489-499: The grep that sets custom_taint_line is too permissive
and matches commented examples; update the extraction so it only matches actual
YAML keys by anchoring to line start and allowing leading whitespace (but not
'#'). Replace the grep call used to set custom_taint_line (the one reading
"${nodewright_dir}/values.yaml") with a regex like
'^[[:space:]]*runtimeRequiredTaint:' (e.g. grep -E -m1
'^[[:space:]]*runtimeRequiredTaint:' "${nodewright_dir}/values.yaml") so only
real key lines are captured; leave subsequent taint_value and
NODEWRIGHT_TAINT_KEY processing unchanged.
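The difference between the permissive and anchored patterns can be demonstrated against a throwaway values file; the taint values here are made-up examples:

```shell
#!/usr/bin/env bash
# Demo of the anchored pattern: the loose grep matches a commented-out
# sample line first, while the anchored variant only matches a real YAML
# key (optionally indented, never behind '#').
set -euo pipefail

values="$(mktemp)"
cat > "${values}" <<'EOF'
# runtimeRequiredTaint: example/key=value:NoSchedule  (commented sample)
nodewright:
  runtimeRequiredTaint: dedicated/gpu=true:NoSchedule
EOF

loose="$(grep -m1 'runtimeRequiredTaint:' "${values}")"
anchored="$(grep -E -m1 '^[[:space:]]*runtimeRequiredTaint:' "${values}")"

echo "loose:    ${loose}"
echo "anchored: ${anchored}"
```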

In `@pkg/bundler/deployer/localformat/kustomize.go`:
- Around line 33-47: The current buildKustomize checks ctx.Err() only before
calling k.Run, so cancellation during the long-running k.Run is ignored; change
buildKustomize to run k.Run(fs, path) in a goroutine and send its result (rm,
err) over a channel, then select on ctx.Done() and that channel so you can
return an errors.ErrCodeTimeout wrapped error if the context is cancelled while
k.Run is still executing; after receiving the k.Run result handle the existing
error wrapping for "kustomize build failed" and the rm.AsYaml() flow (rm and out
variables) as before.

In `@tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml`:
- Around line 78-79: The glob in the assignment to VAL is too permissive (it
uses [0-9]*); tighten it to the fixed three-digit layout by matching exactly
three digits and the trailing dash (e.g., use [0-9][0-9][0-9]-) and remove the
now-unnecessary grep -v filter; update the command that sets VAL (the ls
"${WORK}"/bundle/.../values.yaml invocation) to use the stricter pattern so only
NNN-gpu-operator folders are matched.

---

Duplicate comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 1204-1206: The test contains a redundant unreachable check using
os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")) after already
asserting the parent directory does not exist; remove this extra
cluster-values.yaml existence assertion (the os.Stat + os.IsNotExist branch) so
the test only asserts the parent directory non-existence using tmpDir and dir,
and does not perform the unreachable filepath.Join("cluster-values.yaml") check.

In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml`:
- Line 67: Update the loose glob used when assigning MANIFEST to require exactly
three leading digits: replace occurrences of the pattern
"[0-9]*-gpu-operator-post/..." with an exact-three-digit pattern
"[0-9][0-9][0-9]-gpu-operator-post/..." (the line assigning MANIFEST and the
other occurrence around line 97); ensure both glob checks use [0-9][0-9][0-9]-
so only folders with a 3-digit prefix match.

In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around lines 71-72: The grep asserting 'kind: Gateway' can be satisfied by
'kind: GatewayParameters' and thus give false positives. Update the check that
reads from MANIFEST to anchor the kind assertion so it matches only the exact
line (start/end anchors with optional leading whitespace) for 'kind: Gateway'
rather than a substring, leaving the existing 'kind: GatewayParameters'
assertion intact; target the two grep assertions that reference MANIFEST and
replace the loose substring match for 'kind: Gateway' with an anchored
exact-line match.
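A minimal reproduction of the false positive (the two-line manifest fragment is hypothetical):

```shell
# Substring match counts both kinds; the anchored pattern counts only the exact line.
manifest=$(printf 'kind: GatewayParameters\n  kind: Gateway\n')
printf '%s\n' "$manifest" | grep -c 'kind: Gateway'                 # 2: GatewayParameters also matches
printf '%s\n' "$manifest" | grep -cE '^[[:space:]]*kind: Gateway$'  # 1: exact line only
```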

In
`@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml`:
- Line 69: Replace each inline --resource "$(ls ... | head -1)" usage with a
two-step guard: capture the command output into a variable, test that the
variable is non-empty and points to an existing file, then pass that variable
to the chainsaw assert invocation. Specifically, for occurrences of the pattern
--resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml
2>/dev/null | head -1)", first set something like resource_path=$(ls ... |
head -1) (unique symbol: resource_path), validate [ -n "$resource_path" ] &&
[ -f "$resource_path" ] (or exit with a clear error), and only then call
chainsaw assert --resource "$resource_path"; apply the same change to the
three other occurrences noted.
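The guard can be factored into a small helper — a sketch only; resolve_resource is a hypothetical name, the chainsaw call is shown as a comment, and paths are assumed to contain no spaces:

```shell
# Resolve a glob to exactly one existing file, or fail loudly.
# $1 is passed quoted by the caller and expanded (unquoted) here.
resolve_resource() {
  path=$(ls $1 2>/dev/null | head -1)
  if [ -z "$path" ] || [ ! -f "$path" ]; then
    echo "ERROR: no file matches: $1" >&2
    return 1
  fi
  printf '%s\n' "$path"
}

# Usage sketch against the fixture path:
#   resource_path=$(resolve_resource "${WORK}/bundle-defaults/[0-9][0-9][0-9]-nodewright-customizations/templates/tuning.yaml") || exit 1
#   chainsaw assert --resource "$resource_path"
```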

In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml`:
- Around line 52-55: The glob patterns in the two shell checks (ls
"${WORK}"/basic/[0-9]*/values.yaml and ls -d
"${WORK}"/basic/[0-9]*-gpu-operator) are too permissive; tighten them to require
exactly three digits for the NNN prefix by replacing [0-9]* with [0-9][0-9][0-9]
(so use ls "${WORK}"/basic/[0-9][0-9][0-9]/values.yaml and ls -d
"${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: db0724c0-8d24-4bf4-a339-2116ce4900af

📥 Commits

Reviewing files that changed from the base of the PR and between 811ecac and 24374b4.

📒 Files selected for processing (78)
  • Makefile
  • docs/contributor/component.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler_test.go
  • pkg/bundler/deployer/helm/helm.go
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
  • pkg/bundler/deployer/helm/testdata/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
  • pkg/bundler/deployer/helm/testdata/manifest_only/README.md
  • pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
  • pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env
  • pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml
  • pkg/bundler/deployer/helm/testdata/nodewright_present/README.md
  • pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
  • pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
  • pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
  • pkg/bundler/deployer/localformat/doc.go
  • pkg/bundler/deployer/localformat/folder.go
  • pkg/bundler/deployer/localformat/kustomize.go
  • pkg/bundler/deployer/localformat/local_helm.go
  • pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
  • pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
  • pkg/bundler/deployer/localformat/testdata/README.md
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
  • pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
  • pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
  • pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
  • pkg/bundler/deployer/localformat/upstream_helm.go
  • pkg/bundler/deployer/localformat/writer.go
  • pkg/bundler/deployer/localformat/writer_test.go
  • pkg/bundler/handler_test.go
  • tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
  • tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
  • tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml

Comment thread pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
Comment thread pkg/bundler/deployer/helm/testdata/nodewright_present/README.md Outdated
Comment thread pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
Comment thread pkg/bundler/deployer/localformat/kustomize.go
Comment thread tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml Outdated
@lockwobr lockwobr force-pushed the feat/generic-helm branch from 24374b4 to 42ab5e1 on April 29, 2026 at 17:49
mchmarny
mchmarny previously approved these changes Apr 29, 2026
Member

@mchmarny mchmarny left a comment

/LGTM

Contributor

@yuanchen8911 yuanchen8911 left a comment

Re-checked current head 42ab5e1. Several earlier inline issues are addressed, but the five blockers from my latest changes-requested summary are still unresolved.


- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
- Raw manifests for mixed components now apply **post-install only**, via the injected `-post` wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone — Helm now owns CRD ordering for mixed components natively.
- Tooling that parsed bundle paths by bare component name must account for the `NNN-` prefix.
Contributor

This path-format change still has an in-tree consumer using the old layout: kwok/scripts/validate-scheduling.sh:384 still points at ${WORK_DIR}/bundle/kube-prometheus-stack/values.yaml. Under the new layout it should glob ${WORK_DIR}/bundle/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml; otherwise the KWOK storage patch is silently skipped.

Contributor Author

Fixed in the latest push. kwok/scripts/validate-scheduling.sh:381-394 now globs the NNN-prefixed folder:

prom_values=$(ls -1 "${WORK_DIR}/bundle"/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml 2>/dev/null | head -1)
if [[ -n "$prom_values" && -f "$prom_values" ]] && yq eval ...

This restores the KWOK storage patch under the new layout.


// Generate creates a per-component Helm bundle from the configured generator fields.
// Per-component folder content (Chart.yaml, values.yaml, install.sh, templates/*)
// is delegated to pkg/bundler/deployer/localformat. The helm deployer owns only
Contributor

This code comment is updated, but the public godoc still describes the legacy Helm layout. Please update pkg/bundler/handler.go, pkg/bundler/doc.go, and pkg/bundler/deployer/helm/doc.go, which still mention <component>/values.yaml, <component>/README.md, and <component>/manifests/.

Contributor Author

Fixed in the latest push. Updated all three godoc files to describe the NNN-prefixed local-chart layout:

  • pkg/bundler/handler.go:58-66 — API zip-archive description now lists NNN-<component>/install.sh, values.yaml, cluster-values.yaml, plus undeploy.sh.
  • pkg/bundler/doc.go:55-65 — Helm output format now lists the per-folder layout (install.sh, values.yaml, cluster-values.yaml, upstream.env, or Chart.yaml + templates/).
  • pkg/bundler/deployer/helm/doc.go:15-30 — package-level godoc now distinguishes per-folder content (owned by localformat) from top-level files (owned by this deployer).


Previous releases used a flat `<component>/` layout with `manifests/` siblings and a `--deployer helm` script that branched on component kind. The new format is uniform:

- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
Contributor

This new layout section correctly says folders carry install.sh, but the later deployment note still says each component subdirectory contains a README.md with the manual Helm command. That is no longer true in the NNN-* layout; please update the note around line 1317 to point at per-folder install.sh instead.

Contributor Author

Fixed in the latest push. docs/user/cli-reference.md:1300-1317 now describes the NNN-prefixed layout and points at per-folder install.sh:

# Navigate to bundle
cd bundles

# Review root README and a component's values
cat README.md
cat 001-gpu-operator/values.yaml
...

Note: deploy.sh and undeploy.sh are convenience scripts — not the only deployment path. Each NNN-<component>/ folder contains a rendered install.sh that runs the exact helm upgrade --install command for manual or pipeline-driven deployment.
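That uniform shape is what makes the branch-free loop possible. A minimal sketch of the idea — the real deploy.sh adds retry, wait args, and the name-matched special-case blocks, so this is an illustration, not the rendered script:

```shell
# Branch-free install loop over the NNN-<component>/ layout.
# Every folder carries install.sh, so no per-component kind checks are needed.
for dir in [0-9][0-9][0-9]-*/; do
  [ -x "${dir}install.sh" ] || continue    # skip non-matches (literal glob) and stray dirs
  component=${dir%/}
  component=${component#[0-9][0-9][0-9]-}  # strip the NNN- prefix for logging
  echo "==> installing ${component}"
  (cd "$dir" && ./install.sh)
done
```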

// Generate creates a per-component Helm bundle from the configured generator fields.
// Per-component folder content (Chart.yaml, values.yaml, install.sh, templates/*)
// is delegated to pkg/bundler/deployer/localformat. The helm deployer owns only
// the top-level orchestration: README.md, deploy.sh, undeploy.sh, and checksums.
Contributor

With the helm deployer now owning only top-level files and no longer embedding/rendering component-README.md.tmpl, pkg/bundler/deployer/helm/templates/component-README.md.tmpl appears orphaned. Please delete it or restore a real use.

Contributor Author

Fixed in the latest push. pkg/bundler/deployer/helm/templates/component-README.md.tmpl had no remaining call sites and is now removed. Per-folder content (including any user-facing per-component documentation) is owned by pkg/bundler/deployer/localformat; the helm deployer's templates dir now only carries top-level orchestration templates (README.md.tmpl, deploy.sh.tmpl, undeploy.sh.tmpl).

fi
export COMPONENT_WAIT_ARGS

# Invoke the per-folder install.sh with retry. On failure, dump kai-scheduler
Contributor

This loop now implements retry inline, but the old retry() helper is still defined earlier in this template and has no call sites. Please remove the dead helper to avoid future drift.

Contributor Author

Fixed in the latest push. The retry() helper at the top of deploy.sh.tmpl had no remaining call sites (the per-component install loop at line ~340 implements its own retry inline) and is now removed. backoff_seconds() is kept since the inline loop still uses it. Updated the corresponding TestGenerate_DeployScriptExecutable assertion (helm_test.go:179-193) to drop "retry()" from the expected structural markers.

@lockwobr lockwobr force-pushed the feat/generic-helm branch 2 times, most recently from f4d774c to 8c40e99 on April 29, 2026 at 20:18
yuanchen8911
yuanchen8911 previously approved these changes Apr 29, 2026
Contributor

@yuanchen8911 yuanchen8911 left a comment

All five blockers resolved on 8c40e99 — KWOK glob, godoc layout in 3 files, cli-reference note, orphan template deletion, and dead retry() removal. LGTM.

@lockwobr lockwobr enabled auto-merge (squash) April 29, 2026 21:44
lockwobr and others added 5 commits April 29, 2026 15:08
Introduce pkg/bundler/deployer/localformat — a deployer-agnostic writer
that produces the NNN-<component>/ numbered folder layout. Each folder
is one of two kinds: upstream-helm (no Chart.yaml; upstream.env carries
{CHART,REPO,VERSION}) or local-helm (Chart.yaml + templates/). Mixed
components (Helm + raw manifests) emit two adjacent folders: a primary
upstream-helm + an injected -post local-helm wrapper. Each folder
carries a rendered install.sh; deploy.sh is now a generic loop with
no per-component branching, only name-matched special-case blocks.

Rewire --deployer helm to consume localformat. Generator.Generate
delegates per-component content (values.yaml, cluster-values.yaml,
Chart.yaml, templates/, install.sh) to localformat.Write and owns only
top-level orchestration files (README.md, deploy.sh, undeploy.sh,
checksums). Deletes ~195 lines of dead code: generateComponentDirectories,
buildComponentDataList (full version), writeClusterValuesFile,
hasYAMLObjects (relocated to localformat). Deletes ComponentData fields
HasChart, HasManifests, IsKustomize.

Rewrite deploy.sh.tmpl: generic install loop replaces ~90 lines of
{{ if .IsKustomize/.HasChart/.HasManifests }} branching. Delete the
~50-line CRD-race retry block — its trigger (pre-apply raw manifests)
is structurally eliminated; mixed components now apply manifests
post-install via the injected -post wrapped chart, where Helm owns
CRD ordering natively. Rewrite undeploy.sh.tmpl to a generic reverse
loop. Preserve name-matched blocks (skyhook taint cleanup, kai-scheduler
async timeout/diagnostics, DRA kubelet-plugin restart, orphaned-CRD
scan) around the loops; their simplification is deferred.

Refactor helm_test.go: introduce assertBundleGolden helper (mirroring
localformat's writer_test) + 5 golden-file scenarios under testdata/
exercising upstream-helm-only, manifest-only, mixed gpu-operator,
kai-scheduler-present, skyhook-present bundles. Goldens double as
reference examples of rendered output. Delete 11 redundant tests
covering removed code or duplicating golden coverage; add focused
error-path tests. Net: 48 → 37 tests, 3408 → 2130 lines (−37%),
coverage 87.6% → 87.9%.

Update bundler_test.go, handler_test.go, and 6 chainsaw e2e fixtures
to assert on NNN-prefixed paths.

Breaking changes for --deployer helm consumers:
- Bundle directory layout: <component>/ → NNN-<component>/
- Mixed components emit two folders (primary + -post)
- Raw manifests for mixed components apply post-install only
- CRD-race retry block removed; pre-apply mechanism gone
- No more per-component README.md (top-level README still present)

Docs: docs/user/cli-reference.md updated with new layout, folder kinds,
mixed-component -post injection, and breaking-change callout.
docs/contributor/component.md describes localformat as the shared
base format and the load-bearing invariants (localformat never writes
deployer-specific files; install.sh never name-customized; Write is
deterministic).
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>
@lockwobr lockwobr merged commit cf3cd33 into main Apr 29, 2026
61 of 78 checks passed
@lockwobr lockwobr deleted the feat/generic-helm branch April 29, 2026 23:59
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:

    - name: gke-nccl-tcpxo
      ...
      manifest:
        defaultNamespace: kube-system

`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:

    helm:
      defaultRepository: ""
      defaultNamespace: kube-system

Bug surfacing timeline:

- Pre-NVIDIA#706, manifest-only components were installed by the root
  `deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
  carry inline `metadata.namespace: kube-system`, so the empty registry
  default was harmless; `kubectl apply` did not need
  `ComponentRef.Namespace` for routing.

- NVIDIA#706 (`feat(bundler)!: uniform NNN-folder bundle layout via
  localformat`) wraps every component — manifest-only included — as a
  local Helm chart. The generated `install.sh` now always emits
  `helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
  which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
  block, that field is empty, producing:

      helm upgrade --install gke-nccl-tcpxo ./ \
        --namespace  --create-namespace \

  Shell argument collapsing makes Helm parse the literal
  `--create-namespace` as the namespace name and fails with:

      Error: create: failed to create:
      namespaces "--create-namespace" not found

- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
  used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
  PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
  H100 GKE-COS training jobs (its registry/base.yaml changes
  auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
  failure.
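The collapsing behavior itself is plain POSIX word splitting, reproducible outside Helm — a minimal sketch using set -- to stand in for the generated command line:

```shell
# An empty, unquoted variable expansion produces no word at all,
# so the flag that follows slides into the value position.
ns=""                                   # what the unparsed manifest: block yields
set -- helm upgrade --install demo ./ --namespace ${ns} --create-namespace
echo "argc=$#"                          # 7, not 8: the empty expansion vanished
echo "after --namespace: $7"            # --create-namespace becomes the namespace value
```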

Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:

    $ aicr recipe --service gke --accelerator h100 \
                  --intent training --os cos -o /tmp/recipe.yaml
    $ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
    $ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
        helm upgrade --install gke-nccl-tcpxo ./ \
          --namespace kube-system --create-namespace \

Refs: NVIDIA#706, NVIDIA#715
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry
  entry to use the established manifest-only Helm pattern (empty
  `helm.defaultRepository` plus `defaultNamespace: kube-system`)
  instead of the unparsed `manifest:` block. The `manifest:` field
  is not on the ComponentConfig struct, so its `defaultNamespace`
  was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only
  components were installed via raw `kubectl apply`, which routed
  via inline `metadata.namespace`). After NVIDIA#706 wraps every
  component as a local Helm chart, the generated install.sh emits
  `--namespace  --create-namespace` (empty) and Helm fails. This
  blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including
  this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when
  registry.yaml or base.yaml change. Switches to the same pattern
  used by `nodewright-customizations`. Verified bundled install.sh
  now contains `--namespace kube-system`. Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 30, 2026
The assert-bundle-topology-nfd chainsaw step hardcoded
${WORK}/bundle/nfd/values.yaml, which doesn't exist after the
NNN-folder bundle layout introduced by NVIDIA#706. CI fails with
"sed: can't read .../bundle/nfd/values.yaml".

Resolve the path via the same glob the sibling assert-bundle-scheduling-nfd
step uses (`bundle/[0-9][0-9][0-9]-nfd/values.yaml`).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Uniform numbered local-chart bundle layout + --deployer helm adoption

4 participants