feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) #706
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming it with comments from an influx of new commits. This behavior is configurable.
📝 Walkthrough
Adds a deterministic numbered per-component bundle layout.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (4 passed)
Actionable comments posted: 26
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
418-425: ⚠️ Potential issue | 🟠 Major: Include injected -post releases in the undeploy safety checks.
localformat.Write() can add *-post releases for mixed components, but these loops still walk .ComponentsReversed, which only contains the original recipe components. That means the wrapper release is skipped by the pre-flight finalizer scan and the post-flight CRD audit, so undeploy can still tear down the controller while CRs from the -post release are stuck. Build these checks from the actual generated release list, or synthesize -post entries before rendering the template. Also applies to: 620-638
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 418 - 425, The template currently iterates .ComponentsReversed which omits injected "-post" wrapper releases added by localformat.Write, so update the undeploy safety loops (the block using .ComponentsReversed and the similar block at 620-638) to iterate the actual generated release list or a synthesized list that includes "-post" variants before rendering; specifically, replace uses of .ComponentsReversed with the generated releases collection (or generate extra entries named "{{ .Name }}-post" for mixed components) so that skip_preflight_for_release and check_release_for_stuck_crds are invoked for those "-post" releases as well.
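The suggested synthesis can be illustrated with a small shell sketch. The entry format and the function name below are hypothetical stand-ins for the template's actual data model, not the real implementation:

```shell
# Hypothetical sketch: build the release list the undeploy safety checks
# should walk, injecting a "<name>-post" wrapper entry for each mixed
# component. Entries are illustrative "name:kind" pairs.
releases_with_post() {
  local entry name kind
  for entry in "$@"; do
    name="${entry%%:*}"
    kind="${entry##*:}"
    echo "${name}"
    if [ "${kind}" = "mixed" ]; then
      echo "${name}-post"   # wrapper release added by localformat.Write
    fi
  done
  return 0
}
```

Rendering the undeploy loops from such a list (rather than from the raw recipe components) means the -post wrapper gets the same finalizer and CRD treatment as every other release.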
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1152: Replace the phrase "local-helm wrapped chart" with the
compound-modifier form "local-helm-wrapped chart" in the docs text so the
adjective correctly modifies "chart"; locate the occurrence of the string
"local-helm wrapped chart" (in docs/user/cli-reference.md) and update it to
"local-helm-wrapped chart".
In `@Makefile`:
- Line 150: The makefile currently ignores '**/testdata/**' which broadly
disables addlicense checks; update the Makefile rule that sets the ignore
patterns for addlicense (remove or replace the '**/testdata/**' entry) and
instead list only explicit non-source/generated patterns inside testdata (e.g.,
generated, vendor, binary, or fixture extensions) so addlicense still runs on
real source-like fixtures and scripts; ensure the updated ignore list is used by
the addlicense invocation in the same Makefile target.
In `@pkg/bundler/bundler_test.go`:
- Around line 423-429: The test's inclusion check is too permissive because it
treats the legacy flat directory as acceptable; update the assertion in
bundler_test.go so inclusion is determined only by the numbered folder. Replace
the current logic that computes included from statErr1 and statErr2 with a
single check based solely on statErr1 (the numbered path returned by os.Stat on
filepath.Join(tmpDir, "002-aws-ebs-csi-driver")) and remove or ignore the
flat-path check (statErr2); this ensures the test fails if the numbered layout
is missing.
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 338-352: The retry loop currently calls
dump_kai_scheduler_helm_diagnostics and cleanup_helm_hooks with only the release
name (variable name), causing namespace mismatches; update the calls inside the
while loop to pass the component namespace as the second argument (e.g.,
dump_kai_scheduler_helm_diagnostics "${name}" "${namespace}" and
cleanup_helm_hooks "${name}" "${namespace}") so the helpers operate on the
correct namespace (or use whatever component namespace variable exists in the
template, e.g., "${component_namespace}"); ensure the helper functions
(dump_kai_scheduler_helm_diagnostics, cleanup_helm_hooks) accept and use the
namespace parameter if they do not already.
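A minimal sketch of the corrected call shape, using stub helpers in place of the real template functions; the `${namespace}` variable name is an assumption about the template's context:

```shell
# Stubs standing in for the real helpers; they only echo their arguments
# to show that both the release name and the namespace now reach the helper.
dump_kai_scheduler_helm_diagnostics() {
  echo "diagnostics for release $1 in namespace $2"
}
cleanup_helm_hooks() {
  echo "deleting hook jobs for release $1 in namespace $2"
}

name="kai-scheduler"
namespace="kai-scheduler-ns"   # hypothetical component namespace variable

# Inside the retry loop, pass the namespace as the second argument:
dump_kai_scheduler_helm_diagnostics "${name}" "${namespace}"
cleanup_helm_hooks "${name}" "${namespace}"
```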
In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh`:
- Around line 20-26: The helm upgrade/install invocation uses "${CHART}" which
is invalid for OCI refs; update the IsOCI branch in the
install-upstream-helm.sh.tmpl template so that when REPO starts with "oci://"
the helm command uses "${REPO}" (the full OCI chart reference) instead of
"${CHART}", and keep using "--repo ${REPO}" with "${CHART}" for non-OCI; change
the conditional branch that renders the helm upgrade --install line to
substitute ${REPO} for OCI installs while preserving the other flags
(${COMPONENT_WAIT_ARGS}, ${DRY_RUN_FLAG}, ${KUBECONFIG_FLAG},
${HELM_DEBUG_FLAG}).
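The branching can be sketched as follows. This is a simplified stand-in for the template, showing only how the chart reference should differ between OCI and classic repos (helm accepts a full `oci://` reference as the chart argument and does not take `--repo` for it):

```shell
# Returns the chart-reference portion of the helm command line:
# for oci:// repos, the full reference and no --repo flag;
# for classic repos, the chart name plus --repo.
chart_ref_args() {
  local repo="$1" chart="$2"
  case "${repo}" in
    oci://*) echo "${repo}" ;;
    *)       echo "${chart} --repo ${repo}" ;;
  esac
}

# e.g. helm upgrade --install "${NAME}" $(chart_ref_args "${REPO}" "${CHART}") ...
```

The unquoted command substitution in the usage comment relies on word splitting and is illustrative only.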
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README for the kai_scheduler_present fixture contains stale
GPU/network/cert-manager checks (notably the paragraph around the deploy script
behavior and the block at lines 89–110); update those sections to be
kai-scheduler specific or generic component-agnostic checks: remove or rewrite
GPU/operator/cert-manager verification steps and replace them with kai-scheduler
health/status checks (e.g., checking kai-scheduler pods, logs, readiness probes,
and relevant controller events) or a short generic “verify component readiness
and check logs” paragraph; ensure the deploy script note and subsequent
diagnostics reference only kai-scheduler identifiers and commands so the fixture
output is internally consistent.
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 472-475: The generated undeploy.sh contains a stale comment block
starting with "Remove skyhook node taints that persist after operator removal."
(mentioning runtimeRequiredTaint and skyhook.nvidia.com) that is incorrect for
the kai-scheduler fixture; remove that entire comment block from the undeploy.sh
output so the script no longer misstates it removes skyhook taints for the
kai-scheduler-only fixture.
- Around line 43-45: The parser currently stores the raw value into HELM_TIMEOUT
in the --timeout case and later forces an "s" suffix which turns e.g. "5m" into
"5ms" or accepts invalid tokens; update the --timeout handling in undeploy.sh
(the case branch that sets HELM_TIMEOUT) to validate the token before appending
a suffix: either enforce a pure integer using a numeric regex (e.g. match
^[0-9]+$) and reject non-integers with an error, or accept and preserve
user-provided units by not forcing an "s" downstream; implement the chosen
approach in the --timeout branch so HELM_TIMEOUT contains a safe, validated
value.
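A sketch of the validate-before-suffixing approach (the function name is hypothetical; the generated script inlines this in the `--timeout` case branch):

```shell
# Validate the --timeout token instead of blindly appending "s":
# a bare integer gets the suffix, a value with a unit is preserved,
# anything else is rejected (so "5m" never becomes "5ms").
parse_timeout() {
  local raw="$1"
  if printf '%s' "${raw}" | grep -Eq '^[0-9]+$'; then
    echo "${raw}s"
  elif printf '%s' "${raw}" | grep -Eq '^[0-9]+(s|m|h)$'; then
    echo "${raw}"
  else
    echo "error: invalid --timeout value: ${raw}" >&2
    return 1
  fi
}
```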
In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml`:
- Around line 1-2: Add the standard addlicense header to cluster-values.yaml
(above the existing generator banner) so the file contains the required license
block used across the repo (including SPDX/copyright and year) and keep the
existing "---" generator banner after the header; run the addlicense tool or
insert the exact header used in other *.yaml fixtures to ensure consistency and
valid YAML.
In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml`:
- Line 1: Replace the single-line copyright banner ("# Copyright (c) 2026,
NVIDIA CORPORATION. All rights reserved.") with the full addlicense header used
across the repo: run the addlicense tool or insert the standard multi-line
license header at the top of customization.yaml (the template file containing
the current single-line banner) so the file complies with the repo-wide YAML
license-header requirement.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 247-253: The skip_preflight_for_release() case only matches
"skyhook-operator" and "kgateway" so calling skip_preflight_for_release with
"skyhook-customizations" never returns 0; update the case pattern inside
skip_preflight_for_release to include "skyhook-customizations" (e.g., add it to
the matching alternatives or use a pattern that covers all three names) so that
skip_preflight_for_release "skyhook-customizations" returns 0 and the preflight
is correctly skipped where echoed.
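The fixed case pattern is a one-line change; a self-contained sketch:

```shell
# All three generated releases now match the skip list.
skip_preflight_for_release() {
  case "$1" in
    skyhook-operator|skyhook-customizations|kgateway)
      return 0
      ;;
  esac
  return 1
}
```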
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml`:
- Around line 1-2: Add the repository-standard addlicense header to the YAML
fixture (cluster-values.yaml) so it matches other sources; run the addlicense
tool with the repository's license template (or insert the exact header used in
other .yaml/.yml files) at the top of the file before the existing front-matter
(the leading "# Generated by Cloud Native Stack" and "---") so the file now
begins with the standard license header followed by the comment and YAML
separator.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml`:
- Around line 1-4: This YAML file (values.yaml) is missing the required
addlicense header; add the standard Apache license header used across the repo
at the top of this file (or add an explicit addlicense exclusion comment if the
file should be exempt), ensuring the header is formatted exactly like other YAML
sources in this PR so the addlicense tool passes — update the file that contains
the top-level "driver" key accordingly.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 1-6: Add the repository's full addlicense YAML header at the top
of this Kubernetes manifest and make the Service resource installable by adding
a spec.ports section; specifically, update the dcgm-exporter Service
(metadata.name: dcgm-exporter) to include a valid spec with at least one port
entry (name, port, targetPort, protocol) so the rendered Service is valid and
installable, and ensure the license header matches the project's addlicense
output exactly.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml`:
- Around line 1-2: The file values.yaml is missing the required addlicense
header; update
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
by adding the project's standard license header at the top (use the same header
used across other YAML files) or run the addlicense tool to insert/update the
header so the generated preamble is preceded by the required license banner.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 87-110: Update the "Troubleshooting" section so it reflects only
the components present in this test bundle (gpu-operator) or clearly marks
commands as generic examples: edit the header and the grep/log/get-nodes
commands that currently reference 'gpu-operator|network-operator|cert-manager'
and the kubectl logs reference to ensure they only target gpu-operator (or
prepend a note like "Example commands for common components — adjust as
needed"); update the References list similarly by removing or annotating Network
Operator and cert-manager links so readers aren't confused by components not
included in this fixture.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 559-568: The current global sweep over all
mutating/validatingwebhookconfigurations (the loop that lists .items[] and
deletes webhooks when their service or namespace is missing) must be removed
because it can delete unrelated tenants' webhooks; instead, call the existing
helper delete_orphaned_webhooks_for_ns for each bundle namespace to limit
cleanup to this bundle's namespaces—replace the global jq/read/sort loop with
iterating over the bundle's namespace list and invoking
delete_orphaned_webhooks_for_ns <namespace> (referencing the
delete_orphaned_webhooks_for_ns helper) so only webhooks tied to those
namespaces are removed.
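The scoped replacement can be sketched without a cluster by stubbing the kubectl-backed helper; `cleanup_bundle_webhooks` is a hypothetical wrapper name:

```shell
# Stub for the real kubectl-backed helper, so the scoping logic can be
# shown offline; the real helper deletes webhooks tied to one namespace.
delete_orphaned_webhooks_for_ns() {
  echo "cleaning orphaned webhooks in namespace: $1"
}

# Replace the global jq/read/sort sweep with a loop over this bundle's
# namespaces only, so other tenants' webhooks are never touched.
cleanup_bundle_webhooks() {
  local ns
  for ns in "$@"; do
    delete_orphaned_webhooks_for_ns "${ns}"
  done
}
```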
- Around line 527-537: The loop is force-clearing finalizers for every
Terminating namespace; change it to only act on namespaces this bundle created
by recording created namespaces (e.g., a CREATED_NAMESPACES or BUNDLE_NAMESPACES
array when you create them) and then, inside the loop where TERMINATING is
computed, filter TERMINATING to the intersection with that recorded list (or to
namespaces matching the bundle-specific label) before calling
force_clear_namespace_finalizers and kubectl delete; update the loop that uses
TERMINATING and the calls to force_clear_namespace_finalizers/kubectl delete to
use the filtered list so unrelated terminating namespaces are not mutated.
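The intersection step can be sketched as a pure shell function; the recorded-namespaces variable name is a hypothetical example:

```shell
# Keep only the terminating namespaces that this bundle created, so
# force_clear_namespace_finalizers never mutates unrelated tenants.
# The "created" list would be recorded at namespace-creation time
# (e.g., in a hypothetical BUNDLE_NAMESPACES variable).
filter_terminating_to_bundle() {
  local terminating="$1" created="$2" ns own out=""
  for ns in ${terminating}; do
    for own in ${created}; do
      if [ "${ns}" = "${own}" ]; then
        out="${out} ${ns}"
      fi
    done
  done
  echo "${out# }"
}
```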
In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml`:
- Around line 1-2: This YAML test fixture (cluster-values.yaml) is missing the
required addlicense header; run the repo's addlicense tool or manually prepend
the standard license banner to the top of cluster-values.yaml so the file
contains the full license header followed by the existing generator banner
(---), ensuring the header exactly matches the project's canonical addlicense
text.
In `@pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh`:
- Around line 326-340: The retry loop calls cleanup_helm_hooks with the
component name variable (${name}) but the function requires the component
namespace; update the call inside the while-true retry block to pass the
namespace variable (e.g., ${namespace}) instead of ${name} so
cleanup_helm_hooks("${namespace}") removes hook Jobs from the correct namespace;
adjust the call near cleanup_helm_hooks in the loop that also calls
dump_kai_scheduler_helm_diagnostics and helm_failed.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 87-110: The Troubleshooting section in README.md references
gpu-operator and network-operator but the component table only lists
cert-manager; update the troubleshooting commands and references to match
cert-manager only (replace kubectl commands that grep for
gpu-operator|network-operator|cert-manager with cert-manager-specific checks,
e.g., kubectl -n cert-manager get pods and kubectl -n cert-manager logs -l
app=cert-manager) and remove or replace GPU/network operator references and
links with a cert-manager reference (e.g., cert-manager docs) so the README’s
component set and troubleshooting steps are consistent.
In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 56-58: The writeFile call that writes the upstream.env file
currently uses mode 0o600; change it to 0o644 so the generated upstream.env
(containing CHART/REPO/VERSION metadata) is world-readable but owner-writable
for consistency with other bundle files — locate the call using
writeFile(envPath, []byte(envContent), 0o600) and update the file mode to 0o644,
keeping envPath/envContent/unrelated logic unchanged.
In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 117-119: The failing test dereferences folders[0].Kind while
formatting a t.Fatalf message, causing a panic if folders is empty; change the
assertions in writer_test.go to first check len(folders) and call t.Fatalf (or
t.Fatalf with a message that only uses len) if it's 0 or unexpected length, and
only after that check folders[0].Kind in a separate assertion; apply the same
split-check fix to the other occurrence around the later assertion (the block
currently at lines ~309-311) so you never access folders[0] unless the length
check passed.
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 81-101: The loop that builds folder names uses
fmt.Sprintf("%03d-%s", idx, c.Name) and will produce 4+ digit prefixes once idx
>= 1000 which breaks downstream glob/parse logic; modify the loop in the writer
that iterates over opts.Components (the block using idx, folders, classify, dir)
to fail fast when idx > 999 by returning a clear
errors.New/errors.ErrCodeInvalidRequest error (e.g., "too many components; max
999 supported") before formatting the dir, so no folder with more than three
digits is emitted; alternatively, if you prefer to support more components,
change the format to wider padding consistently with deploy/undeploy parsing,
but ensure both the dir creation (fmt.Sprintf("%03d-%s", ...)) and any
parsing/globs are updated together.
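The fail-fast guard is shown below as a shell sketch of the same logic (the Go writer would return an error instead of printing to stderr):

```shell
# Guarded folder naming: fail fast past 999 so the %03d prefix never
# widens and downstream NNN-* globs and parsers stay consistent.
numbered_dir() {
  local idx="$1" name="$2"
  if [ "${idx}" -gt 999 ]; then
    echo "error: too many components; max 999 supported" >&2
    return 1
  fi
  printf '%03d-%s\n' "${idx}" "${name}"
}
```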
In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The test currently uses grep -q 'kind: Gateway' which also
matches 'kind: GatewayParameters' and can produce false positives; update the
assertion that checks for the Gateway kind to match the full line exactly (e.g.,
change the grep invocation that uses "${MANIFEST}" to use an
anchored/word-boundary pattern such as grep -q '^kind:[[:space:]]*Gateway$' or
grep -x 'kind: Gateway' so it only passes when a literal "kind: Gateway" line
exists in the MANIFEST).
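A small demonstration of the false positive and the anchored fix; the helper name and temp fixture are illustrative:

```shell
# Anchored check: passes only on a literal "kind: Gateway" line.
has_gateway_kind() {
  grep -q '^kind:[[:space:]]*Gateway$' "$1"
}

# Show the false positive the unanchored pattern allowed:
fixture=$(mktemp)
printf 'kind: GatewayParameters\n' > "${fixture}"
grep -q 'kind: Gateway' "${fixture}" && echo "unanchored: false positive"
has_gateway_kind "${fixture}" || echo "anchored: correctly no match"
rm -f "${fixture}"
```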
In `@tests/chainsaw/bundle-templates/skyhook-customizations/chainsaw-test.yaml`:
- Line 69: The command using --resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-skyhook-customizations/templates/tuning.yaml
2>/dev/null | head -1)" can yield an empty string; assign that output to a
variable (e.g., MANIFEST=$(ls ... | head -1)), then add an explicit not-found
guard [ -n "${MANIFEST}" ] || exit 1 before using it, and replace the inline
subshell with --resource "${MANIFEST}"; apply the same change for the other
occurrences of the identical pattern (the lines corresponding to the other
tuning.yaml lookups).
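The guard can be sketched as a small helper; the function name is hypothetical, while the glob mirrors the fixture layout described above:

```shell
# Resolve the manifest path once, then guard against an empty result
# before it is ever passed to --resource.
find_tuning_manifest() {
  local work="$1" manifest
  manifest=$(ls "${work}"/bundle-defaults/[0-9]*-skyhook-customizations/templates/tuning.yaml 2>/dev/null | head -1)
  if [ -z "${manifest}" ]; then
    echo "error: tuning.yaml not found under ${work}/bundle-defaults" >&2
    return 1
  fi
  echo "${manifest}"
}

# Usage in the test step:
#   MANIFEST=$(find_tuning_manifest "${WORK}") || exit 1
#   ... --resource "${MANIFEST}"
```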
---
Outside diff comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 418-425: The template currently iterates .ComponentsReversed which
omits injected "-post" wrapper releases added by localformat.Write, so update
the undeploy safety loops (the block using .ComponentsReversed and the similar
block at 620-638) to iterate the actual generated release list or a synthesized
list that includes "-post" variants before rendering; specifically, replace uses
of .ComponentsReversed with the generated releases collection (or generate extra
entries named "{{ .Name }}-post" for mixed components) so that
skip_preflight_for_release and check_release_for_stuck_crds are invoked for
those "-post" releases as well.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 060ea888-280a-471c-85e5-ff0c65d23be9
📒 Files selected for processing (76)
Makefile
docs/contributor/component.md
docs/user/cli-reference.md
pkg/bundler/bundler_test.go
pkg/bundler/deployer/helm/helm.go
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/component-README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/README.md
pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/install.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/upstream.env
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml
pkg/bundler/deployer/helm/testdata/skyhook_present/README.md
pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
pkg/bundler/deployer/localformat/doc.go
pkg/bundler/deployer/localformat/folder.go
pkg/bundler/deployer/localformat/kustomize.go
pkg/bundler/deployer/localformat/local_helm.go
pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
pkg/bundler/deployer/localformat/upstream_helm.go
pkg/bundler/deployer/localformat/writer.go
pkg/bundler/deployer/localformat/writer_test.go
pkg/bundler/handler_test.go
tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
tests/chainsaw/bundle-templates/skyhook-customizations/chainsaw-test.yaml
tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
💤 Files with no reviewable changes (1)
- pkg/bundler/deployer/helm/templates/component-README.md.tmpl
@ayuskauskas:
It is unclear to me in the test_data what is the expected and actual content there. So I was thinking it is brittle, as there are real component names, and I thought it was referencing actual values we have in recipes. It is more obviously fake data once you look closer, but I think a README would help clarify what is going on and how I would update or add to these tests.
Force-pushed: 9b57692 to f9bafb4
Coverage Report ✅
Merging this branch will increase overall coverage.
Coverage by file: changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml (1)
1-5: ⚠️ Potential issue | 🟡 Minor: Add Apache license header for consistency with other testdata fixtures.
While this is a testdata fixture, other testdata YAML files in this PR (e.g., pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml and pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml) include the Apache 2.0 license header. For consistency and to satisfy the addlicense tool, add the standard header here. As per coding guidelines, "License headers must be present and up-to-date on all source files via addlicense tool" for **/*.{yaml,yml}.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml` around lines 1 - 5, Add the standard Apache 2.0 license header to this YAML test fixture so it matches other testdata files and satisfies the addlicense tool; update the file containing the YAML keys "crds:" and "enabled: true" by prepending the canonical Apache 2.0 header comment block at the top of values.yaml (above the existing '---' marker) so the license is present and up-to-date.
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml (1)
1-3: ⚠️ Potential issue | 🟡 Minor: Add Apache license header for consistency.
Like other testdata YAML fixtures in this PR, this file should include the Apache 2.0 license header to satisfy the addlicense tool and maintain consistency across the testdata directory structure. As per coding guidelines, "License headers must be present and up-to-date on all source files via addlicense tool" for **/*.{yaml,yml}.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml` around lines 1 - 3, Add the standard Apache 2.0 license header to the top of the values.yaml test fixture (the same header used in other testdata YAML files) so it satisfies the addlicense tool and matches repository conventions; update the file named values.yaml under the skyhook operator testdata to include the multi-line Apache License, Version 2.0 comment block at the very top before the existing YAML document marker.
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)
266-287: ⚠️ Potential issue | 🔴 Critical: Variable name mismatch will cause silent taint removal failure.
Line 266 sets SKYHOOK_TAINT_KEY, but lines 280, 282, and 285 reference the undefined NODEWRIGHT_TAINT_KEY. This will cause the taint discovery and removal to fail silently, since ${NODEWRIGHT_TAINT_KEY} will be empty.
🐛 Fix variable name references
- stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true)
+ stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${SKYHOOK_TAINT_KEY}" | awk '{print $1}' || true)
  if [[ -n "${stale_nodewright}" ]]; then
-   echo "WARNING: nodes with stale ${NODEWRIGHT_TAINT_KEY} taints (no running nodewright-operator): ${stale_nodewright}"
+   echo "WARNING: nodes with stale ${SKYHOOK_TAINT_KEY} taints (no running nodewright-operator): ${stale_nodewright}"
    echo "  Removing stale taints to unblock scheduling..."
    for node in ${stale_nodewright}; do
-     kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true
+     kubectl taint node "${node}" "${SKYHOOK_TAINT_KEY}-" 2>/dev/null || true
    done
  fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 266 - 287, The script sets SKYHOOK_TAINT_KEY but later uses NODEWRIGHT_TAINT_KEY, causing empty-variable behavior; update all usages of NODEWRIGHT_TAINT_KEY in the kubectl jsonpath/grep/echo/kubectl taint commands to use SKYHOOK_TAINT_KEY so the stale taint discovery and removal logic (variables: SKYHOOK_TAINT_KEY, stale_nodewright, kubectl taint node) operates on the intended taint key.
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
482-500: ⚠️ Potential issue | 🔴 Critical: Variable name mismatch will cause silent taint removal failure.
Line 482 sets SKYHOOK_TAINT_KEY, but lines 497, 498, and 499 reference the undefined NODEWRIGHT_TAINT_KEY. This is the same bug as in deploy.sh.tmpl.
🐛 Fix variable name references
-echo "Removing ${NODEWRIGHT_TAINT_KEY} node taints..."
+echo "Removing ${SKYHOOK_TAINT_KEY} node taints..."
 for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}' 2>/dev/null); do
-  kubectl taint node "${node}" "${NODEWRIGHT_TAINT_KEY}-" 2>/dev/null || true
+  kubectl taint node "${node}" "${SKYHOOK_TAINT_KEY}-" 2>/dev/null || true
 done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 482 - 500, The script sets and possibly overrides SKYHOOK_TAINT_KEY but later echoes and uses NODEWRIGHT_TAINT_KEY, causing silent failures; update the references in the echo and kubectl taint loop to use SKYHOOK_TAINT_KEY (or make the initial variable NODEWRIGHT_TAINT_KEY and keep it consistent) so the taint removal uses the same variable (check occurrences around the echo "Removing ..." and the kubectl taint node "${node}" ... lines).
♻️ Duplicate comments (19)
Makefile (1)
150-150: ⚠️ Potential issue | 🟠 Major: Narrow the testdata exclusion to non-source patterns only.
This blanket **/testdata/** ignore bypasses license enforcement on all testdata content, including shell scripts, Go files, and other source-like fixtures. As flagged in the previous review, this violates the requirement that license headers must be present on all source files.
🔍 Verification: Identify source files in testdata that should retain license checks
#!/bin/bash
# Description: Find source-like files in testdata that would lose license enforcement
echo "=== Shell scripts in testdata ==="
fd -e sh -t f . --search-path . -g '**/testdata/**' | head -20
echo ""
echo "=== Go files in testdata ==="
fd -e go -t f . --search-path . -g '**/testdata/**' | head -20
echo ""
echo "=== YAML files in testdata ==="
fd -e yaml -e yml -t f . --search-path . -g '**/testdata/**' | head -20
echo ""
echo "=== Python files in testdata ==="
fd -e py -t f . --search-path . -g '**/testdata/**' | head -10
♻️ Recommended fix: Replace blanket ignore with specific non-source patterns
Instead of ignoring the entire testdata tree, ignore only generated/binary/fixture extensions:
```diff
 LICENSE_IGNORES = \
     -ignore '.git/**' \
     -ignore '.venv/**' \
     -ignore '**/__pycache__/**' \
     -ignore '**/.venv/**' \
     -ignore '*/.venv/**' \
     -ignore '**/.idea/**' \
     -ignore '**/*.csv' \
     -ignore '**/*.pyc' \
     -ignore '**/*.xml' \
     -ignore '**/*lock.hcl' \
     -ignore '**/*pb2*' \
     -ignore 'bundles/**' \
     -ignore 'dist/**' \
     -ignore 'vendor/**' \
-    -ignore '**/testdata/**' \
+    -ignore '**/testdata/**/*.json' \
+    -ignore '**/testdata/**/*.txt' \
+    -ignore '**/testdata/**/*.md' \
+    -ignore '**/testdata/**/manifest.yaml' \
+    -ignore '**/testdata/**/values.yaml' \
     -ignore 'site/public/**' \
     -ignore 'site/resources/**' \
     -ignore 'site/node_modules/**'
```

This ensures that source files (`.sh`, `.go`, etc.) in testdata still receive license headers while excluding only true fixture data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` at line 150, The Makefile currently ignores '**/testdata/**' which skips license enforcement for all files under testdata; update the ignore rule to remove the blanket '**/testdata/**' entry and instead add explicit non-source-only patterns (e.g., generated binaries, compressed fixtures, large data files, and known fixture extensions) so that source-like files (e.g., .sh, .go, .py, .yml/.yaml) in testdata continue to be checked; locate the ignore line with the '**/testdata/**' token and replace it with a curated list of non-source patterns to exclude only true fixture/binary files.

pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml (1)
1-2: ⚠️ Potential issue | 🟠 Major

Add the standard addlicense header above the generated banner.

This fixture still contains only the generated comment and YAML separator. Prepend the canonical license block so it matches the repo’s required YAML source header.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml` around lines 1 - 2, This file currently only has the generated banner "# Generated by Cloud Native Stack" and the YAML separator "---"; prepend the canonical addlicense header block above that banner so the file contains the standard license comment (as required for "*.{go,yaml,yml,mk}") before the existing "# Generated by Cloud Native Stack" line, ensuring the license text is formatted as a comment and preserved as the first content in the file.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml (1)
1-2: ⚠️ Potential issue | 🟠 Major

Add the repository-standard addlicense header.

This fixture still starts with only the generated banner and YAML separator. It needs the standard addlicense block prepended above them.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml` around lines 1 - 2, Prepend the repository-standard addlicense header block to this YAML fixture so the file starts with the license header before the existing generated banner and YAML separator; update the top of pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml by inserting the standard addlicense header (as used across **/*.{go,yaml,yml,mk} in the repo) above the existing "# Generated by Cloud Native Stack" line, and verify with the addlicense tool to ensure the header is present and up-to-date.

pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)
87-110: ⚠️ Potential issue | 🟡 Minor

Make the troubleshooting and references match this fixture’s component set.

This README documents only `cert-manager`, but these sections still point readers to `gpu-operator` and `network-operator` checks. Replace them with cert-manager-specific status/log commands and references so the fixture stays accurate.

As per coding guidelines, "When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines 87 - 110, Update the Troubleshooting and References to only cover cert-manager: replace the pod/status check that greps for 'gpu-operator|network-operator|cert-manager' with a cert-manager-specific status command (e.g., kubectl get pods -n cert-manager and kubectl get deployments -n cert-manager), change the logs command to target the cert-manager namespace and labels (instead of gpu-operator), and swap the GPU/Network verification command and the two external links for cert-manager-relevant checks and documentation links (e.g., cert-manager docs and common troubleshooting pages). Ensure the README sections "Troubleshooting", "Check deployment status", "View component logs", "Verify GPU access" (rename that subsection appropriately) and "References" consistently reference cert-manager only.

pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml (1)
1-1: ⚠️ Potential issue | 🟠 Major

Replace the single-line banner with the full addlicense header.

This template still has only the one-line copyright notice. It needs the repository-standard multi-line license header at the top.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml` at line 1, The file customization.yaml has only a single-line copyright banner; replace it with the repository-standard multi-line addlicense header at the very top of the file (the same full header used across other YAML files), ensuring the copyright year and owner are correct, and then run the addlicense tool to update/validate headers; target the template name "customization.yaml" and update the top-of-file comment block accordingly.

pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml (1)
1-2: ⚠️ Potential issue | 🟠 Major

License header is still missing in this YAML source.

Line 1 includes only the generated banner; the required addlicense header is not present.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml` around lines 1 - 2, This YAML file is missing the required addlicense header at the top; insert the standard project license header (the addlicense-generated comment block) as the very first lines before the existing "# Generated by Cloud Native Stack" banner, then re-run the addlicense tool to ensure the header is up-to-date and correctly formatted for YAML files (so the file starting with '---' remains unchanged after the header).

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)
1-6: ⚠️ Potential issue | 🟠 Major

Make this fixture compliant and installable (header + Service spec).

Line 1 is not the full addlicense header, and Lines 3-6 define a `Service` without a `spec`, which makes the fixture non-installable as a realistic golden output.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml` around lines 1 - 6, The file is missing the full addlicense header and defines a Service (metadata.name: dcgm-exporter) without a spec, making the fixture non-installable; update the top of dcgm-exporter.yaml to include the proper addlicense header used across the repo, then provide a realistic Service spec for the Service named "dcgm-exporter" (include at minimum spec.selector matching the DCGM exporter Deployment/Pod labels, and a spec.ports entry with port and targetPort and protocol, and optionally type: ClusterIP) so the manifest is valid and installable.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml (1)
1-2: ⚠️ Potential issue | 🟠 Major

Still missing the required addlicense header.

Line 1 includes only the generated banner; the repository-standard license block is still absent.

As per coding guidelines, "`**/*.{go,yaml,yml,mk}`: License headers must be present and up-to-date on all source files via addlicense tool."
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml` around lines 1 - 2, This YAML file currently only contains the generated banner "# Generated by Cloud Native Stack" and is missing the repository-standard addlicense header; run the addlicense tool (or insert the standard license block) so the file that contains the generated banner and YAML document start ("---") includes the required license header at the top, matching the project's license template used for **/*.go, *.yaml, *.yml, *.mk files.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml (1)
1-5: ⚠️ Potential issue | 🟡 Minor

Add Apache license header (duplicate of earlier review).
This issue was already identified in a previous review round. The file is missing the required license header that other testdata YAML files include.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml` around lines 1 - 5, Add the standard Apache 2.0 license header as a YAML comment block at the top of this values.yaml (above the existing document start '---') to match other testdata files; ensure the header text and year/owner match the project’s existing license header style and keep the rest of the YAML (the top-level "driver" key and its enabled: true) unchanged.

tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)
71-72: ⚠️ Potential issue | 🟠 Major

Tighten `kind` grep to avoid false positives.

`grep -q 'kind: Gateway'` can pass on `kind: GatewayParameters`, so this assertion can miss a missing `Gateway` object.

Suggested fix

```diff
- grep -q 'kind: GatewayParameters' "${MANIFEST}"
- grep -q 'kind: Gateway' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
```
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71 - 72, The current grep for 'kind: Gateway' can match 'kind: GatewayParameters' and yield false positives; update the assertion that checks MANIFEST so the grep for the Gateway kind only matches the exact kind value (e.g., use a regex that anchors the line and allows optional spaces such as '^kind:[[:space:]]*Gateway$' or use grep -w/word-boundary matching) while keeping the existing 'kind: GatewayParameters' check; apply this change to the second grep invocation that currently searches for 'kind: Gateway'.

pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)
41-41: ⚠️ Potential issue | 🟡 Minor

Troubleshooting/readiness text is inconsistent with the fixture’s component set.

This fixture lists only `kai-scheduler`, but these sections still reference GPU/network/cert-manager checks and docs. Please make this section `kai-scheduler`-specific (or explicitly generic examples) so the fixture remains internally consistent.

As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”
Also applies to: 89-110
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line 41, The README's troubleshooting/readiness text mentions GPU/network/cert-manager checks that don't apply to this fixture which only contains kai-scheduler; update the README.md sections (including the paragraph currently referencing the deploy script behavior and the later block around lines 89–110) to either (a) be kai-scheduler-specific — describing only kai-scheduler readiness indicators and relevant logs/commands — or (b) replace the examples with clearly labeled generic placeholders and link to the global deploy docs; ensure all references to GPU, network, and cert-manager checks are removed or annotated as not-applicable for the kai-scheduler-only fixture.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)
87-110: ⚠️ Potential issue | 🟡 Minor

Troubleshooting references components not present in this fixture.

This fixture contains `gpu-operator`, but these commands/links also reference `network-operator` and `cert-manager`. Please scope troubleshooting to `gpu-operator` (or clearly label as generic examples).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines 87 - 110, The troubleshooting section currently references components not present in this fixture (network-operator, cert-manager); update the README.md Troubleshooting block so it only targets the gpu-operator (e.g., keep the kubectl pod check, logs command, and GPU access verification but remove or rename any grep/filter for network-operator and cert-manager), or explicitly prefix the section as "Generic examples" if you want to keep other commands; ensure headings like "Troubleshooting" and commands such as "kubectl logs -n gpu-operator -l app=gpu-operator" and the GPU verification jsonpath are accurate and scoped to gpu-operator.

docs/user/cli-reference.md (1)
1152-1152: ⚠️ Potential issue | 🟡 Minor

Use the compound modifier form: `local-helm-wrapped chart`.

Please hyphenate this phrase for correct modifier grammar.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` at line 1152, Replace the un-hyphenated modifier "local-helm wrapped chart" with the hyphenated compound modifier "local-helm-wrapped chart" wherever that phrase appears (e.g., in the sentence starting "Manifest-only components... become a single local-helm wrapped chart"); update the documented phrase to "local-helm-wrapped chart" to fix modifier grammar.

pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh (2)
472-475: ⚠️ Potential issue | 🟡 Minor

Remove the stale skyhook cleanup comment.

This fixture only covers `kai-scheduler`. The comment block about removing skyhook node taints (lines 472-475) is irrelevant and misstates what the script cleans up.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around lines 472 - 475, The comment block in undeploy.sh that says "Remove skyhook node taints that persist after operator removal" is incorrect for this fixture (only covers kai-scheduler); remove or replace that stale comment with a brief accurate comment about cleaning up kai-scheduler resources. Locate the comment text "Remove skyhook node taints..." in undeploy.sh and either delete those lines or change them to a one-line note describing kai-scheduler-specific cleanup so the script comments match its actual behavior.
43-45: ⚠️ Potential issue | 🟠 Major

Validate `--timeout` before appending `s`.

`--timeout 5m` becomes `5ms` downstream, and non-numeric tokens defer failure to `helm`/`kubectl`. Reject non-integer input in the parser.

🛡️ Proposed fix

```diff
 --timeout)
   if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
+  if [[ ! "$2" =~ ^[0-9]+$ ]]; then
+    echo "Error: --timeout must be an integer number of seconds"
+    exit 1
+  fi
   HELM_TIMEOUT="$2"; shift 2 ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around lines 43 - 45, The --timeout branch stores HELM_TIMEOUT from the parser but later appends an "s", so inputs like "5m" become "5ms" and non-numeric tokens get passed to helm/kubectl; change the parser for the "--timeout" case to validate that $2 is a positive integer (e.g., using a regex like ^[0-9]+$) before assigning HELM_TIMEOUT, and if validation fails print an error and exit non-zero; reference the "--timeout" case and the HELM_TIMEOUT variable in undeploy.sh when implementing this check.

pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (1)
247-253: ⚠️ Potential issue | 🟠 Major

`skyhook-customizations` preflight skip is unreachable.

The `skip_preflight_for_release()` case pattern only matches `nodewright-operator|kgateway`, but line 418 calls it with `skyhook-customizations`. The condition at line 418 always evaluates to false, so the check at line 422 always runs instead of being exempted.

🐛 Proposed fix: Add skyhook-customizations to the case pattern

```diff
 skip_preflight_for_release() {
   case "$1" in
-    nodewright-operator|kgateway) return 0 ;;
+    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
     *) return 1 ;;
   esac
 }
```

Also applies to: 418-423
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 247 - 253, The helper function skip_preflight_for_release currently only matches "nodewright-operator" and "kgateway", so calls passing "skyhook-customizations" never hit the early return; update the case pattern in skip_preflight_for_release to include "skyhook-customizations" (e.g., add it alongside nodewright-operator|kgateway) so the function returns 0 for that release name, and also update any duplicate/other occurrences of the same case pattern to keep behavior consistent.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh (2)
527-537: ⚠️ Potential issue | 🔴 Critical

Limit force-finalization to bundle-managed namespaces.

After 60 seconds this loop force-clears finalizers on every namespace in `Terminating`, not just namespaces this bundle deleted. On a shared cluster that mutates unrelated workloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around lines 527 - 537, The loop is force-clearing finalizers on every Terminating namespace; restrict that to only namespaces owned/managed by this bundle by filtering TERMINATING before forcing deletion. Change how TERMINATING is computed (or post-filter it) so it only contains namespaces with the bundle-specific identity (e.g., a label like bundle=<name> or by intersecting with a previously recorded list of namespaces this script deleted), then call force_clear_namespace_finalizers and kubectl delete namespace only for that filtered set; update references to TERMINATING, force_clear_namespace_finalizers, and the kubectl delete namespace --wait=false call accordingly.
559-568: ⚠️ Potential issue | 🟠 Major

Do not sweep stale webhooks cluster-wide here.
This pass deletes any webhook whose backing service or namespace is missing anywhere in the cluster, which can remove another tenant’s admission webhooks during unrelated cleanup. Reuse the namespace-scoped helper for bundle namespaces instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around lines 559 - 568, The current loop scans all mutatingwebhookconfiguration/validatingwebhookconfiguration cluster-wide and deletes any webhook whose backing service/namespace is missing (using variables wh_name, svc_ns, svc_name), which can remove other tenants' webhooks; change this to only check bundle namespaces by reusing the existing namespace-scoped helper (instead of the cluster-wide `kubectl get ... | jq ...`); specifically, iterate only over your bundle namespaces (or call the helper like cleanup_webhooks_in_namespace or equivalent) and for each namespace run a scoped check (e.g., list webhooks that reference services in that namespace and verify those services/namespaces) and delete only those matching webhooks (mutatingwebhookconfiguration/validatingwebhookconfiguration) when the service in that same namespace is missing.

pkg/bundler/deployer/localformat/writer.go (1)
100-101: ⚠️ Potential issue | 🟠 Major

Still emitting 4-digit folder names will desync the generated scripts.

Once `idx` reaches 1000, this produces `1000-*`, but the generated deploy/undeploy loops only match `[0-9][0-9][0-9]-*/`, so those folders are silently skipped. Fail fast here or widen the script glob/parsing in the same change.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 100 - 101, The code uses fmt.Sprintf("%03d-%s", idx, c.Name) which will produce 4+ digit prefixes once idx >= 1000 and desync the generated scripts; add a bounds check around the existing dir formation (check the idx variable before dir := fmt.Sprintf(...)) and return a clear error (or panic) if idx >= 1000 so we fail fast instead of emitting a mismatched folder, referencing the existing idx and dir creation (dir := fmt.Sprintf("%03d-%s", idx, c.Name)); alternatively, if you prefer to support larger counts, replace "%03d" with "%04d" and ensure the matching deploy/undeploy glob/parsing is updated accordingly.
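The glob mismatch described above is easy to reproduce in a scratch directory. The folder names below are hypothetical and exist only to exercise the three-digit pattern the generated scripts use:

```shell
#!/usr/bin/env bash
# Sketch: show that the scripts' three-digit glob silently skips a
# 4-digit prefix. Folder names are made up for illustration.
set -euo pipefail
tmp=$(mktemp -d)
mkdir -p "$tmp/001-gpu-operator" "$tmp/999-last" "$tmp/1000-overflow"
cd "$tmp"
matched=()
for d in [0-9][0-9][0-9]-*/; do
  matched+=("$d")
done
# Only 001-gpu-operator/ and 999-last/ match; 1000-overflow/ fails the
# pattern because its fourth character is a digit, not '-'.
printf '%s\n' "${matched[@]}"
```

This is why the writer should either reject `idx >= 1000` or widen the glob in the same change: the extra folder is not an error anywhere, it is simply never deployed.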
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/contributor/component.md`:
- Line 21: Rewrite the sentence to clearly separate current behavior from future
intent: state which deployers currently consume pkg/bundler/deployer/localformat
(and that deployers call localformat.Write()) and then add a separate clause or
sentence describing planned or intended support (e.g., future helmfile per issue
`#632`) as prospective, not implemented; ensure you remove phrasing like “consumed
by every deployer” and instead use explicit wording such as “currently used by
X, Y” and “intended/future support for Z,” while keeping references to symbols
localformat.Write() and the per-folder contents (Chart.yaml, values.yaml, etc.)
for clarity.
In `@pkg/bundler/bundler_test.go`:
- Around line 1199-1212: The test redundantly checks for cluster-values.yaml
inside directories already asserted to not exist; remove the inner os.Stat check
that tests filepath.Join(tmpDir, dir, "cluster-values.yaml") within the loop
that iterates over []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver",
"002-aws-ebs-csi-driver"} in bundler_test.go (the tmpDir loop), leaving only the
os.Stat check that ensures the directory itself does not exist.
In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml`:
- Around line 1-2: Add the standard repository license header (the
addlicense-generated YAML header) to the top of cluster-values.yaml above the
existing generator banner line "# Generated by Cloud Native Stack"; ensure the
header matches the project's addlicense output for YAML files and is up-to-date,
then save the file so the license block is the first lines followed by the
existing "---" and generator comment.
In
`@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml`:
- Around line 1-2: This YAML fixture is missing the required addlicense header;
insert the repository's standard license comment block (SPDX or the
addlicense-produced header) as YAML comments at the very top of the file, above
the existing "# Generated by Cloud Native Stack" banner and before the document
marker "---", or run the addlicense tool to automatically add/update the header
so the file has the proper license header format and metadata.
In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml`:
- Around line 1-2: The YAML fixture values.yaml is missing the standard
addlicense header; prepend the repository's standard addlicense license block at
the very top of this file (before the generated banner line "---") so the file
contains the required license header format used for "*.go, *.yaml, *.yml, *.mk"
files; ensure the header matches the project's canonical addlicense text
exactly.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md`:
- Line 21: The table row for the manifest-only component shows empty Version and
a Source rendered as "skyhook-customizations ()"; update the README table entry
(the row containing "skyhook-customizations") so that Version is "N/A" and
Source is "local" (e.g., change the row to use "N/A" for Version and "local" for
Source) to avoid ambiguous empty placeholders in manifest-only component
documentation.
- Around line 87-110: The Troubleshooting section in README.md currently
contains GPU/network operator-specific commands and references; replace or
annotate them so they target the skyhook-customizations fixture instead. Update
the kubectl commands (the pod listing grep, the logs command targeting
namespace/label, and the node GPU jsonpath check) to use skyhook component
names/namespaces/labels (e.g., replace
'gpu-operator|network-operator|cert-manager' and '-n gpu-operator -l
app=gpu-operator' with the appropriate skyhook pod selectors), and swap the two
NVIDIA docs with skyhook-relevant links OR add a clear note that the examples
are generic and not specific to this fixture; ensure the Troubleshooting header
and examples reflect skyhook operational checks.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml`:
- Around line 1-2: This YAML fixture is missing the repository standard
multi-line license header; prepend the standard addlicense header (the same
header used across repo YAML files) above the existing generator banner ("#
Generated by Cloud Native Stack") and document separator ("---") so the file
complies with the repo-wide YAML source policy; run the addlicense tool or copy
the canonical header and place it at the very top of the file so the header
appears before any other content.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 85-93: The helm_force_uninstall function currently swallows
failures (calls helm uninstall with --no-hooks || true) so the script proceeds
even if uninstall fails; modify helm_force_uninstall to detect a non-zero exit
from the retry and propagate it (e.g., capture the exit status of the second
helm uninstall, log a clear error including release/ns and HELM_TIMEOUT, and
exit non-zero or return the failure code) instead of using "|| true" so later
cleanup only runs when the release was actually removed.
In `@pkg/bundler/deployer/helm/testdata/skyhook_present/README.md`:
- Around line 87-110: The troubleshooting section in README.md contains
GPU/network-operator-specific commands and links (e.g., the pod grep line
matching 'gpu-operator|network-operator|cert-manager', the kubectl logs command
targeting namespace 'gpu-operator' and label 'app=gpu-operator', and the NVIDIA
docs links) which do not match the skyhook-operator fixture; update this section
to use skyhook-relevant diagnostics (replace the grep/filter to match
skyhook-operator services, change the kubectl logs namespace/label to the
skyhook deployment/service names, and swap the reference links to skyhook
operator docs) or explicitly mark the examples as generic/placeholder so the
fixture reads consistently.
In `@pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh`:
- Around line 490-493: The script uses an undefined variable
NODEWRIGHT_TAINT_KEY in the node taint removal loop which will fail under set
-u; update the loop to reference the existing SKYHOOK_TAINT_KEY (or ensure
NODEWRIGHT_TAINT_KEY is defined earlier) so the variable name matches the
defined constant (see the assignment of SKYHOOK_TAINT_KEY around line ~475 and
the taint removal loop using NODEWRIGHT_TAINT_KEY).
In
`@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml`:
- Around line 1-2: This YAML fixture is missing the repository-standard
addlicense header; update the file that begins with the generator comment and
the '---' document marker so it includes the required addlicense YAML license
header block at the top of the file (above the existing comment and '---'),
matching the repository's standard license text used for all *.yaml/*.yml files.
In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 60-66: The envContent creation currently writes CHART, REPO, and
VERSION unquoted which allows shell metacharacters to execute when sourcing
upstream.env; update the envContent assembly in upstream_helm.go (where
envContent is created before calling writeFile) to safely quote each value
(CHART, REPO, c.Version) with shell-safe quoting—e.g., wrap values in single
quotes and escape any embedded single quotes or use a proper shell-quoting
utility—so the file written by writeFile("upstream.env") contains quoted,
escaped values.
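The quoting this item asks for can be sketched with bash's `printf %q`. The `write_env` helper and its sample values below are illustrative only, not the actual writer code:

```shell
#!/usr/bin/env bash
# Sketch (assumed helper): emit env entries with printf %q so that
# `source upstream.env` treats hostile values as literal strings
# instead of executing embedded shell metacharacters.
set -euo pipefail
write_env() {
  local chart=$1 repo=$2 version=$3
  printf 'CHART=%q\nREPO=%q\nVERSION=%q\n' "$chart" "$repo" "$version"
}
# A value containing a command separator stays inert after sourcing.
write_env 'cert-manager' 'https://charts.jetstack.io' 'v1.2.3; echo pwned'
```

Sourcing the emitted lines assigns the literal strings, so a value like `v1.2.3; echo pwned` remains data rather than running `echo`.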
In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-102: The grep exclude pattern in the INSTALL assignment is
inconsistent: it uses grep -v -- "-post/" while other occurrences use grep -v --
'-gpu-operator-post'; update the grep in the line that sets INSTALL (the ls |
grep -v ... | head -1 pipeline) to use the same exclude pattern
('-gpu-operator-post') so it consistently filters the same post-release files
when finding the install.sh, and keep the rest of the pipeline (ls, head, and
the subsequent grep -q) unchanged.
---
Outside diff comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 266-287: The script sets SKYHOOK_TAINT_KEY but later uses
NODEWRIGHT_TAINT_KEY, causing empty-variable behavior; update all usages of
NODEWRIGHT_TAINT_KEY in the kubectl jsonpath/grep/echo/kubectl taint commands to
use SKYHOOK_TAINT_KEY so the stale taint discovery and removal logic (variables:
SKYHOOK_TAINT_KEY, stale_nodewright, kubectl taint node) operates on the
intended taint key.
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 482-500: The script sets and possibly overrides SKYHOOK_TAINT_KEY
but later echoes and uses NODEWRIGHT_TAINT_KEY, causing silent failures; update
the references in the echo and kubectl taint loop to use SKYHOOK_TAINT_KEY (or
make the initial variable NODEWRIGHT_TAINT_KEY and keep it consistent) so the
taint removal uses the same variable (check occurrences around the echo
"Removing ..." and the kubectl taint node "${node}" ... lines).
In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml`:
- Around line 1-3: Add the standard Apache 2.0 license header to the top of the
values.yaml test fixture (the same header used in other testdata YAML files) so
it satisfies the addlicense tool and matches repository conventions; update the
file named values.yaml under the skyhook operator testdata to include the
multi-line Apache License, Version 2.0 comment block at the very top before the
existing YAML document marker.
In
`@pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml`:
- Around line 1-5: Add the standard Apache 2.0 license header to this YAML test
fixture so it matches other testdata files and satisfies the addlicense tool;
update the file containing the YAML keys "crds:" and "enabled: true" by
prepending the canonical Apache 2.0 header comment block at the top of
values.yaml (above the existing '---' marker) so the license is present and
up-to-date.
---
Duplicate comments:
In `@docs/user/cli-reference.md`:
- Line 1152: Replace the un-hyphenated modifier "local-helm wrapped chart" with
the hyphenated compound modifier "local-helm-wrapped chart" wherever that phrase
appears (e.g., in the sentence starting "Manifest-only components... become a
single local-helm wrapped chart"); update the documented phrase to
"local-helm-wrapped chart" to fix modifier grammar.
In `@Makefile`:
- Line 150: The Makefile currently ignores '**/testdata/**' which skips license
enforcement for all files under testdata; update the ignore rule to remove the
blanket '**/testdata/**' entry and instead add explicit non-source-only patterns
(e.g., generated binaries, compressed fixtures, large data files, and known
fixture extensions) so that source-like files (e.g., .sh, .go, .py, .yml/.yaml)
in testdata continue to be checked; locate the ignore line with the
'**/testdata/**' token and replace it with a curated list of non-source patterns
to exclude only true fixture/binary files.
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README's troubleshooting/readiness text mentions
GPU/network/cert-manager checks that don't apply to this fixture which only
contains kai-scheduler; update the README.md sections (including the paragraph
currently referencing the deploy script behavior and the later block around
lines 89–110) to either (a) be kai-scheduler-specific — describing only
kai-scheduler readiness indicators and relevant logs/commands — or (b) replace
the examples with clearly labeled generic placeholders and link to the global
deploy docs; ensure all references to GPU, network, and cert-manager checks are
removed or annotated as not-applicable for the kai-scheduler-only fixture.
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 472-475: The comment block in undeploy.sh that says "Remove
skyhook node taints that persist after operator removal" is incorrect for this
fixture (only covers kai-scheduler); remove or replace that stale comment with a
brief accurate comment about cleaning up kai-scheduler resources. Locate the
comment text "Remove skyhook node taints..." in undeploy.sh and either delete
those lines or change them to a one-line note describing kai-scheduler-specific
cleanup so the script comments match its actual behavior.
- Around line 43-45: The --timeout branch stores HELM_TIMEOUT from the parser
but later appends an "s", so inputs like "5m" become "5ms" and non-numeric
tokens get passed to helm/kubectl; change the parser for the "--timeout" case to
validate that $2 is a positive integer (e.g., using a regex like ^[0-9]+$)
before assigning HELM_TIMEOUT, and if validation fails print an error and exit
non-zero; reference the "--timeout" case and the HELM_TIMEOUT variable in
undeploy.sh when implementing this check.
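A minimal sketch of that validation, written as a standalone function for clarity (the real script uses an inline `case` branch; the error wording is an assumption):

```shell
# Validate that the --timeout argument is a positive integer (seconds)
# before assigning HELM_TIMEOUT; reject values like "5m" or "abc" so the
# later "${HELM_TIMEOUT}s" suffix cannot produce bogus units like "5ms".
parse_timeout() {
  if ! printf '%s' "$1" | grep -Eq '^[0-9]+$'; then
    echo "error: --timeout expects a positive integer number of seconds, got '$1'" >&2
    return 1
  fi
  HELM_TIMEOUT="$1"
}
```

With this guard in place, `parse_timeout 300` sets `HELM_TIMEOUT=300`, while `parse_timeout 5m` fails before the value can reach helm or kubectl.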
In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml`:
- Around line 1-2: This YAML file is missing the required addlicense header at
the top; insert the standard project license header (the addlicense-generated
comment block) as the very first lines before the existing "# Generated by Cloud
Native Stack" banner, then re-run the addlicense tool to ensure the header is
up-to-date and correctly formatted for YAML files (so the file starting with
'---' remains unchanged after the header).
In
`@pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml`:
- Line 1: The file customization.yaml has only a single-line copyright banner;
replace it with the repository-standard multi-line addlicense header at the very
top of the file (the same full header used across other YAML files), ensuring
the copyright year and owner are correct, and then run the addlicense tool to
update/validate headers; target the template name "customization.yaml" and
update the top-of-file comment block accordingly.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 247-253: The helper function skip_preflight_for_release currently
only matches "nodewright-operator" and "kgateway", so calls passing
"skyhook-customizations" never hit the early return; update the case pattern in
skip_preflight_for_release to include "skyhook-customizations" (e.g., add it
alongside nodewright-operator|kgateway) so the function returns 0 for that
release name, and also update any duplicate/other occurrences of the same case
pattern to keep behavior consistent.
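The widened case pattern might look like this (release names are taken from the fixture; the rest of the script body is omitted):

```shell
# Return 0 (skip pre-flight finalizer checks) for releases with no upstream
# chart; "skyhook-customizations" is added alongside the existing entries.
skip_preflight_for_release() {
  case "$1" in
    nodewright-operator|kgateway|skyhook-customizations)
      return 0
      ;;
  esac
  return 1
}
```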
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml`:
- Around line 1-2: Prepend the repository-standard addlicense header block to
this YAML fixture so the file starts with the license header before the existing
generated banner and YAML separator; update the top of
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
by inserting the standard addlicense header (as used across
**/*.{go,yaml,yml,mk} in the repo) above the existing "# Generated by Cloud
Native Stack" line, and verify with the addlicense tool to ensure the header is
present and up-to-date.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml`:
- Around line 1-5: Add the standard Apache 2.0 license header as a YAML comment
block at the top of this values.yaml (above the existing document start '---')
to match other testdata files; ensure the header text and year/owner match the
project’s existing license header style and keep the rest of the YAML (the
top-level "driver" key and its enabled: true) unchanged.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 1-6: The file is missing the full addlicense header and defines a
Service (metadata.name: dcgm-exporter) without a spec, making the fixture
non-installable; update the top of dcgm-exporter.yaml to include the proper
addlicense header used across the repo, then provide a realistic Service spec
for the Service named "dcgm-exporter" (include at minimum spec.selector matching
the DCGM exporter Deployment/Pod labels, and a spec.ports entry with port and
targetPort and protocol, and optionally type: ClusterIP) so the manifest is
valid and installable.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml`:
- Around line 1-2: This YAML file currently only contains the generated banner
"# Generated by Cloud Native Stack" and is missing the repository-standard
addlicense header; run the addlicense tool (or insert the standard license
block) so the file that contains the generated banner and YAML document start
("---") includes the required license header at the top, matching the project's
license template used for **/*.go, *.yaml, *.yml, *.mk files.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 87-110: The troubleshooting section currently references
components not present in this fixture (network-operator, cert-manager); update
the README.md Troubleshooting block so it only targets the gpu-operator (e.g.,
keep the kubectl pod check, logs command, and GPU access verification but remove
or rename any grep/filter for network-operator and cert-manager), or explicitly
prefix the section as "Generic examples" if you want to keep other commands;
ensure headings like "Troubleshooting" and commands such as "kubectl logs -n
gpu-operator -l app=gpu-operator" and the GPU verification jsonpath are accurate
and scoped to gpu-operator.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 527-537: The loop is force-clearing finalizers on every
Terminating namespace; restrict that to only namespaces owned/managed by this
bundle by filtering TERMINATING before forcing deletion. Change how TERMINATING
is computed (or post-filter it) so it only contains namespaces with the
bundle-specific identity (e.g., a label like bundle=<name> or by intersecting
with a previously recorded list of namespaces this script deleted), then call
force_clear_namespace_finalizers and kubectl delete namespace only for that
filtered set; update references to TERMINATING,
force_clear_namespace_finalizers, and the kubectl delete namespace --wait=false
call accordingly.
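One way to implement the intersection approach is a small filter that keeps only namespaces this script previously recorded as deleted; the function name and the recording mechanism are assumptions, not code from the PR:

```shell
# Keep only the Terminating namespaces that also appear in the list of
# namespaces this bundle deleted earlier (exact, full-line matches only).
filter_bundle_namespaces() {
  # $1: newline-separated Terminating namespaces (cluster-wide scan)
  # $2: newline-separated namespaces recorded as deleted by this script
  deleted_list=$(mktemp)
  printf '%s\n' "$2" > "$deleted_list"
  printf '%s\n' "$1" | grep -Fx -f "$deleted_list" || true
  rm -f "$deleted_list"
}
```

The filtered result would then drive `force_clear_namespace_finalizers` and the `kubectl delete namespace --wait=false` call instead of the raw cluster-wide list.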
- Around line 559-568: The current loop scans all
mutatingwebhookconfiguration/validatingwebhookconfiguration cluster-wide and
deletes any webhook whose backing service/namespace is missing (using variables
wh_name, svc_ns, svc_name), which can remove other tenants' webhooks; change
this to only check bundle namespaces by reusing the existing namespace-scoped
helper (instead of the cluster-wide `kubectl get ... | jq ...`); specifically,
iterate only over your bundle namespaces (or call the helper like
cleanup_webhooks_in_namespace or equivalent) and for each namespace run a scoped
check (e.g., list webhooks that reference services in that namespace and verify
those services/namespaces) and delete only those matching webhooks
(mutatingwebhookconfiguration/validatingwebhookconfiguration) when the service
in that same namespace is missing.
In
`@pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml`:
- Around line 1-2: This file currently only has the generated banner "#
Generated by Cloud Native Stack" and the YAML separator "---"; prepend the
canonical addlicense header block above that banner so the file contains the
standard license comment (as required for "*.{go,yaml,yml,mk}") before the
existing "# Generated by Cloud Native Stack" line, ensuring the license text is
formatted as a comment and preserved as the first content in the file.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 87-110: Update the Troubleshooting and References to only cover
cert-manager: replace the pod/status check that greps for
'gpu-operator|network-operator|cert-manager' with a cert-manager-specific status
command (e.g., kubectl get pods -n cert-manager and kubectl get deployments -n
cert-manager), change the logs command to target the cert-manager namespace and
labels (instead of gpu-operator), and swap the GPU/Network verification command
and the two external links for cert-manager-relevant checks and documentation
links (e.g., cert-manager docs and common troubleshooting pages). Ensure the
README sections "Troubleshooting", "Check deployment status", "View component
logs", "Verify GPU access" (rename that subsection appropriately) and
"References" consistently reference cert-manager only.
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 100-101: The code uses fmt.Sprintf("%03d-%s", idx, c.Name) which
will produce 4+ digit prefixes once idx >= 1000 and desync the generated
scripts; add a bounds check around the existing dir formation (check the idx
variable before dir := fmt.Sprintf(...)) and return a clear error (or panic) if
idx >= 1000 so we fail fast instead of emitting a mismatched folder, referencing
the existing idx and dir creation (dir := fmt.Sprintf("%03d-%s", idx, c.Name));
alternatively, if you prefer to support larger counts, replace "%03d" with
"%04d" and ensure the matching deploy/undeploy glob/parsing is updated
accordingly.
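The desync is easy to demonstrate against the 3-digit glob the generated scripts use (the directories below are throwaway temp paths, not real bundle output):

```shell
# A fourth digit breaks the NNN- glob: formatting 1000 with %03d yields
# "1000", which [0-9][0-9][0-9]-* no longer matches.
layout=$(mktemp -d)
mkdir -p "$layout/999-gpu-operator" "$layout/1000-extra-component"
ls -d "$layout"/[0-9][0-9][0-9]-*   # lists only 999-gpu-operator
rm -rf "$layout"
```

This is why failing fast at `idx >= 1000` (or widening both the format string and the glob together) keeps the folder layout and the scripts in sync.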
In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The current grep for 'kind: Gateway' can match 'kind:
GatewayParameters' and yield false positives; update the assertion that checks
MANIFEST so the grep for the Gateway kind only matches the exact kind value
(e.g., use a regex that anchors the line and allows optional spaces such as
'^kind:[[:space:]]*Gateway$' or use grep -w/word-boundary matching) while
keeping the existing 'kind: GatewayParameters' check; apply this change to the
second grep invocation that currently searches for 'kind: Gateway'.
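The difference between the loose and anchored patterns can be seen on a two-line sample (the manifest content below is fabricated for illustration):

```shell
# 'kind: Gateway' as a substring also hits GatewayParameters; the anchored
# form matches only the exact kind line.
manifest=$(mktemp)
printf 'kind: GatewayParameters\nkind: Gateway\n' > "$manifest"
grep -c 'kind: Gateway' "$manifest"               # both lines match
grep -c '^kind:[[:space:]]*Gateway$' "$manifest"  # only the Gateway line
rm -f "$manifest"
```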
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d9deff84-4ec2-416e-98ce-029175ef15ab
📒 Files selected for processing (77)
Makefile
docs/contributor/component.md
docs/user/cli-reference.md
pkg/bundler/bundler_test.go
pkg/bundler/deployer/helm/helm.go
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
pkg/bundler/deployer/helm/testdata/README.md
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
pkg/bundler/deployer/helm/testdata/manifest_only/README.md
pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/install.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/upstream.env
pkg/bundler/deployer/helm/testdata/skyhook_present/001-skyhook-operator/values.yaml
pkg/bundler/deployer/helm/testdata/skyhook_present/README.md
pkg/bundler/deployer/helm/testdata/skyhook_present/deploy.sh
pkg/bundler/deployer/helm/testdata/skyhook_present/undeploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
pkg/bundler/deployer/localformat/doc.go
pkg/bundler/deployer/localformat/folder.go
pkg/bundler/deployer/localformat/kustomize.go
pkg/bundler/deployer/localformat/local_helm.go
pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
pkg/bundler/deployer/localformat/testdata/README.md
pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
pkg/bundler/deployer/localformat/upstream_helm.go
pkg/bundler/deployer/localformat/writer.go
pkg/bundler/deployer/localformat/writer_test.go
pkg/bundler/handler_test.go
tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
fi
fi
fi
stale_nodewright=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{range .spec.taints[*]}{.key}{" "}{end}{"\n"}{end}' 2>/dev/null | grep "${NODEWRIGHT_TAINT_KEY}" | awk '{print $1}' || true)
[critical] Keep the nodewright taint key variable and folder name consistent.
This block initializes SKYHOOK_TAINT_KEY, but the grep, warning, and kubectl taint calls expand ${NODEWRIGHT_TAINT_KEY} under set -u, so any bundle containing nodewright-operator aborts before installation on a fresh cluster. The directory lookup above also searches for NNN-skyhook-operator, but the registry component and generated folder are NNN-nodewright-operator, so custom runtimeRequiredTaint values are never read. Use one variable name consistently and look up the nodewright-operator folder.
Fixed in the latest push. The variable is now NODEWRIGHT_TAINT_KEY consistently throughout the block (declaration, default assignment, parse, grep, warning, and kubectl taint), and the directory lookup uses the registry-correct [0-9][0-9][0-9]-nodewright-operator glob. The taint key default value is still skyhook.nvidia.com because the operator's taint domain didn't change in #642 - only the component/repo name did.
Reference: pkg/bundler/deployer/helm/templates/deploy.sh.tmpl:266-285.
# Handle both key=value:effect and key:effect formats
NODEWRIGHT_TAINT_KEY="${taint_value%%=*}"
NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}"
{{- if eq .Name "skyhook-operator" }}
[critical] This nodewright taint cleanup block never renders for real recipes.
Current recipes use the component name nodewright-operator; there is no registry component named skyhook-operator. Because this condition never fires, undeploy removes the operator but leaves the skyhook.nvidia.com runtime taint on nodes. The block also initializes SKYHOOK_TAINT_KEY but later uses ${NODEWRIGHT_TAINT_KEY} under set -u, so a rendered skyhook fixture would fail as well. Gate on nodewright-operator and use one taint-key variable consistently.
Fixed in the latest push. The block now gates on nodewright-operator ({{- if eq .Name "nodewright-operator" }} at undeploy.sh.tmpl:494) and uses NODEWRIGHT_TAINT_KEY consistently throughout. Same fix applied to the deploy.sh.tmpl counterpart (Thread 5).
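The two parameter expansions quoted in the excerpt above can be exercised standalone; the sample taint values are illustrative, not taken from the recipes:

```shell
# Extract the key from either "key=value:effect" or "key:effect".
taint_value="skyhook.nvidia.com/runtime=present:NoSchedule"
NODEWRIGHT_TAINT_KEY="${taint_value%%=*}"            # drop "=value:effect"
NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}"   # drop ":effect" if no "="
echo "$NODEWRIGHT_TAINT_KEY"

taint_value="example.com/key:NoExecute"
NODEWRIGHT_TAINT_KEY="${taint_value%%=*}"
NODEWRIGHT_TAINT_KEY="${NODEWRIGHT_TAINT_KEY%%:*}"
echo "$NODEWRIGHT_TAINT_KEY"
```

Both forms reduce to the bare taint key, which is what the later `kubectl taint node` calls need.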
Please address coderabbit's comments as needed too.
Force-pushed from f9bafb4 to e281666.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/bundler/bundler_test.go (1)
469-485: ⚠️ Potential issue | 🟡 Minor
Check the YAML key instead of searching for `enabled: true`.
This assertion is too loose for the behavior it’s trying to verify: it can miss a leaked top-level `enabled: false`, and it can also fail on unrelated nested `enabled: true` values.
♻️ Proposed fix
-	valuesStr := string(valuesData)
-	if strings.Contains(valuesStr, "enabled: true") {
-		t.Errorf("enabled key leaked into Helm values:\n%s", valuesStr)
-	}
+	var values map[string]any
+	if err := yaml.Unmarshal(valuesData, &values); err != nil {
+		t.Fatalf("failed to parse values.yaml: %v", err)
+	}
+	if _, ok := values["enabled"]; ok {
+		t.Errorf("enabled key leaked into Helm values:\n%s", valuesData)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/bundler_test.go` around lines 469 - 485, The current string search for "enabled: true" is too brittle; instead parse values.yaml into a YAML object and assert the top-level "enabled" key is not present or not set to true by unmarshalling valuesData into a map[string]interface{} and checking map["enabled"]; update the check that uses valuesStr to use the parsed map so it fails only on a leaked top-level enabled, and keep the existing override assertion by verifying the nested controller.replicaCount key exists in the parsed structure (e.g., assert parsed["controller"].(map[string]interface{})["replicaCount"] is present).
♻️ Duplicate comments (27)
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)
3-7: ⚠️ Potential issue | 🟠 Major
Service manifest is incomplete (`spec`/`ports` missing).
This fixture currently defines a `Service` that is not installable as-is. Please add a minimal `spec` with at least one port (and selector if appropriate).
Proposed fix
 apiVersion: v1
 kind: Service
 metadata:
   name: dcgm-exporter
+spec:
+  selector:
+    app: dcgm-exporter
+  ports:
+    - name: metrics
+      port: 9400
+      targetPort: 9400
+      protocol: TCP
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml` around lines 3 - 7, The Service manifest for kind: Service named "dcgm-exporter" is missing a spec (ports and optional selector); update the dcgm-exporter Service resource to include a minimal spec block that declares at least one port (port and targetPort and protocol) and, if appropriate for this fixture, a selector matching the Deployment/Pod labels (e.g., app/dcgm-exporter) so the Service is selectable and installable; ensure the resource remains valid YAML and aligns with the Deployment/Pod labels in this testdata set.
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)
71-72: ⚠️ Potential issue | 🟠 Major
Tighten the `Gateway` assertion.
`grep -q 'kind: Gateway'` still matches `GatewayParameters`, so this test can pass even when the `Gateway` object is missing. Use an anchored full-line match here.
Suggested fix
- grep -q 'kind: Gateway' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71 - 72, The test currently uses grep -q 'kind: Gateway' which also matches GatewayParameters; update the test to assert an exact full-line match for the Gateway kind (e.g., replace the grep -q 'kind: Gateway' invocation with a grep that matches the entire line "kind: Gateway" using either line anchors or grep -x) so the MANIFEST check detects the Gateway object specifically while keeping the existing GatewayParameters check intact.
pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)
91-110: ⚠️ Potential issue | 🟡 Minor
Fix the cert-manager fixture’s troubleshooting section.
This golden still points at GPU/network-operator commands and references, so the cert-manager-only bundle reads incorrectly.
✏️ Proposed fix
- kubectl get pods -A | grep -E 'gpu-operator|network-operator|cert-manager'
+ kubectl get pods -n cert-manager
- kubectl logs -n gpu-operator -l app=gpu-operator
+ kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines 91 - 110, The README's troubleshooting section currently shows GPU/network-operator commands and references; update the "kubectl get pods -A | grep -E 'gpu-operator|network-operator|cert-manager'" command, the "View component logs" kubectl logs command, the "Verify GPU access" jsonpath check, and the References list to be cert-manager specific. Replace the pod listing/grep with a cert-manager namespace/label lookup (e.g., namespace cert-manager), change the logs command to target cert-manager pods (use -n cert-manager and appropriate label), remove the GPU allocatable jsonpath check and instead include a basic cert-manager health/status check (kubectl get deployments, pods, or certificate statuses), and update the References to point to cert-manager docs/installation pages so the cert-manager-only bundle reads correctly.
docs/user/cli-reference.md (1)
1152-1152: ⚠️ Potential issue | 🟡 Minor
Use compound-modifier hyphenation here.
Please change “local-helm wrapped chart” to “local-helm-wrapped chart” for clarity and consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` at line 1152, Update the wording in the docs where the phrase "local-helm wrapped chart" appears (specifically the sentence "Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.") to use the compound-modifier hyphenation: change it to "local-helm-wrapped chart" so the modifier is clear and consistent.
pkg/bundler/deployer/helm/testdata/nodewright_present/README.md (1)
87-110: ⚠️ Potential issue | 🟡 Minor
Troubleshooting section still targets GPU/Network Operator, not nodewright.
The commands and links in this section are inconsistent with the `nodewright-operator` fixture (`skyhook` namespace) and should be replaced with nodewright-relevant checks/references (or explicitly labeled as generic examples).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md` around lines 87 - 110, The Troubleshooting section incorrectly targets GPU/Network operators; replace the GPU-operator and network-operator checks/logs/links with nodewright-specific commands and references: update the kubectl pod check to filter for nodewright-operator (e.g., grep for 'nodewright-operator' or namespace 'skyhook'), change the logs command to kubectl logs -n skyhook -l app=nodewright-operator (or the actual label used by the fixture), replace the GPU node allocatable check with a relevant nodewright health or CR status check (e.g., kubectl get nodes or kubectl get <NodewrightCR> -n skyhook -o yaml/jsonpath), and swap the external links to the nodewright-operator documentation or the project's README; alternatively label these entries explicitly as generic examples if you intend to keep them non-nodewright-specific.
pkg/bundler/deployer/localformat/doc.go (1)
15-17: ⚠️ Potential issue | 🟠 Major
Package doc should not imply unimplemented deployer adoption.
Please distinguish current consumers from future targets (e.g., “currently used by X; intended for Y/Z”) instead of listing all as present-tense consumers.
As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/doc.go` around lines 15 - 17, Update the package doc comment for package localformat in pkg/bundler/deployer/localformat/doc.go to avoid implying unimplemented deployer adoption: explicitly state which deployers currently consume the local-chart bundle layout (e.g., "currently used by X and Y") and mark others as intended or future targets (e.g., "planned for/compatible with Z in the future"), removing any present-tense claim that Helmfile, ArgoCD, or Flux already consume it if they do not; ensure the text references the package name localformat and clearly separates current consumers from intended future targets.
tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml (1)
99-100: 🧹 Nitpick | 🔵 Trivial
Unify the post-folder exclusion pattern for readability.
This exclusion uses `"-post/"` while the rest of the file generally uses `-gpu-operator-post`. Aligning to one pattern avoids ambiguity when maintaining these assertions.
♻️ Suggested consistency tweak
- INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
-   | grep -v -- "-post/" | head -1)
+ INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
+   | grep -v -- "-gpu-operator-post" | head -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml` around lines 99 - 100, The exclusion pattern for determining INSTALL is inconsistent: replace the grep exclusion string "-post/" with the unified "-gpu-operator-post" so the command that sets INSTALL (the shell pipeline producing INSTALL=$(ls ... | grep -v -- "...") ) uses the same "-gpu-operator-post" filter used elsewhere; update the grep -v argument accordingly to grep -v -- "-gpu-operator-post".
docs/contributor/component.md (1)
21-21: ⚠️ Potential issue | 🟠 Major
Separate implemented deployers from future intent.
This sentence still reads as if all listed deployers currently consume
localformat.Write(). Please split into “currently used by …” vs “planned/future …” to avoid overstating shipped behavior.As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/contributor/component.md` at line 21, The sentence in docs referencing pkg/bundler/deployer/localformat and the list of deployers conflates current and future behavior; update the paragraph to explicitly separate which deployers currently consume localformat.Write() (e.g., helms, argocd, argocd-helm if they truly do) from planned or proposed ones (e.g., future helmfile per issue `#632`), by rewriting the line to state "currently used by ..." vs "planned/future ..." and ensure the symbol localformat.Write() and the deployer names (`helm`, `helmfile`, `argocd`, `argocd-helm`) are mentioned so readers can locate the relevant implementation and roadmap note.
pkg/bundler/deployer/helm/testdata/manifest_only/README.md (2)
87-110: ⚠️ Potential issue | 🟡 Minor
Troubleshooting content is not aligned with this fixture.

These commands/links target GPU/network operators, not `skyhook-customizations`. Replace with skyhook-specific checks (or explicitly label them as generic examples).

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` around lines 87 - 110, The Troubleshooting section in README.md currently lists GPU/network operator commands and links that don't apply to this manifest_only fixture; update the "Troubleshooting" block under the "Check deployment status", "View component logs", and "Verify GPU access" headings to include skyhook-customizations-specific commands or explicitly label the existing examples as generic/non-applicable, e.g., replace the kubectl commands and NVIDIA links with checks that target skyhook-customizations (or add a clear note like "Generic examples — not specific to skyhook-customizations") so the content accurately reflects the fixture and operational guidance.
21-21: ⚠️ Potential issue | 🟡 Minor
Use explicit metadata values for manifest-only components.

The row leaves `Version` empty and renders `Source` as `skyhook-customizations ()`. Please use explicit values like `N/A` and `local` to avoid ambiguous placeholders.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` at line 21, The table row for the manifest-only component "skyhook-customizations" leaves Version blank and shows Source as "skyhook-customizations ()"; update the README table so the Version cell explicitly reads "N/A" and the Source cell reads "local" (i.e., replace the empty Version and the "()"-style placeholder with explicit values for the "skyhook-customizations" row).

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)
92-110: ⚠️ Potential issue | 🟡 Minor
Troubleshooting and references include components outside this bundle.

This fixture bundles `gpu-operator`, but these sections still include network/cert-manager content. Please scope checks/links to `gpu-operator` or clearly mark them as generic examples.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines 92 - 110, The README mixes components outside this bundle (network-operator and cert-manager) with the bundled gpu-operator; restrict commands and links to GPU Operator only or label the examples as generic. Update the grep in the "Get pods" example to only search for gpu-operator, remove or annotate the network-operator/cert-manager references in the troubleshooting text, ensure the "View component logs" and "Verify GPU access" sections explicitly reference the gpu-operator namespace and resource (keep kubectl logs -n gpu-operator -l app=gpu-operator and the nvidia.com/gpu jsonpath) and replace the Network Operator link with either a note that it’s external or remove it so the "References" list only contains GPU Operator docs (or clearly mark others as external examples).

pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)
41-41: ⚠️ Potential issue | 🟡 Minor
Fixture docs still reference unrelated components.

This README is for `kai-scheduler`, but these sections still reference GPU/network/cert-manager checks and docs. Please replace with `kai-scheduler`-specific diagnostics (or clearly mark examples as generic) so the fixture remains internally consistent.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor… Ensure accuracy… and document operational clarity.”
Also applies to: 89-110
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line 41, The README's “Note” paragraph and other sections (around the shown Note and lines 89–110) reference GPU/network/cert-manager checks and external GPU docs which are unrelated to the kai-scheduler fixture; update those sections to contain kai-scheduler-specific diagnostics (for example: expected Deployment/DaemonSet status, scheduler logs, CR status, webhook health, and how to run kubectl describe/logs for kai-scheduler) or explicitly mark those examples as generic/templated, and replace or annotate the AICR CLI link/notes to point to kai-scheduler behavior docs or a clearly labeled generic deploy behavior section so the fixture README consistently documents kai-scheduler operational checks.

pkg/bundler/deployer/localformat/upstream_helm.go (1)
60-66: ⚠️ Potential issue | 🔴 Critical
Shell-quote `upstream.env` values before writing.

`CHART`, `REPO`, and `VERSION` are written unquoted and later sourced by shell. Metacharacters in values can execute during source.

Suggested fix
```diff
 import (
 	"embed"
 	"fmt"
 	"os"
 	"path/filepath"
 	"strings"
 	"text/template"
@@
-	envContent := fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", chart, repo, c.Version)
+	envContent := fmt.Sprintf(
+		"CHART=%s\nREPO=%s\nVERSION=%s\n",
+		shellSingleQuote(chart),
+		shellSingleQuote(repo),
+		shellSingleQuote(c.Version),
+	)
@@
 }
+
+func shellSingleQuote(v string) string {
+	return "'" + strings.ReplaceAll(v, "'", `'"'"'`) + "'"
+}
```

```bash
#!/bin/bash
set -euo pipefail
# Verify unquoted serialization in upstream_helm.go
rg -n 'envContent := fmt.Sprintf\("CHART=%s\\nREPO=%s\\nVERSION=%s\\n"' pkg/bundler/deployer/localformat/upstream_helm.go
# Verify upstream.env is sourced/consumed by install template(s)
fd 'install-upstream-helm.sh.tmpl' pkg --type f -x rg -n 'upstream\.env|(^|\s)(source|\.)\s+.*upstream\.env' {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/upstream_helm.go` around lines 60 - 66, The env file is written unquoted which allows shell metacharacters to be executed when sourced; update the envContent construction in upstream_helm.go to shell-quote CHART, REPO and VERSION before writing (e.g. escape any single quotes and wrap each value in single quotes or use a dedicated shell-escape helper), replace the current envContent := fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", ...) with a version that produces CHART='...'\nREPO='...'\nVERSION='...'\n, and keep using deployer.SafeJoin and writeFile as-is so the file path/permissions are unchanged.

pkg/bundler/deployer/helm/helm.go (1)
124-149: ⚠️ Potential issue | 🟠 Major
Reused output directories can install stale components.

`Generate` writes new folders but does not prune old `NNN-*` directories. Since `deploy.sh` installs by filesystem glob, stale folders can be deployed on subsequent runs with the same output path.

Suggested fix
```diff
 import (
 	"context"
 	_ "embed"
 	"fmt"
 	"log/slog"
 	"os"
+	"path/filepath"
 	"strings"
 	"time"
@@
 	// Create output directory
 	if err := os.MkdirAll(outputDir, 0755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to create output directory", err)
 	}
+
+	// Remove stale numbered component folders from previous bundle generations.
+	staleDirs, globErr := filepath.Glob(filepath.Join(outputDir, "[0-9][0-9][0-9]-*"))
+	if globErr != nil {
+		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to enumerate stale component folders", globErr)
+	}
+	for _, d := range staleDirs {
+		if rmErr := os.RemoveAll(d); rmErr != nil {
+			return nil, errors.Wrap(errors.ErrCodeInternal, fmt.Sprintf("failed to remove stale folder %s", d), rmErr)
+		}
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/helm.go` around lines 124 - 149, Generate currently writes new component folders but doesn't remove previous NNN-* directories, allowing stale components to be deployed; update Generate to prune old staged directories before writing (or immediately after localformat.Write but before appending folder files) by listing entries in outputDir and removing any directory matching the NNN-* pattern that are not part of the newly returned folders; reference the localformat.Write result variable folders, the outputDir variable, and ensure use of deployer.SafeJoin when resolving paths so pruning enforces containment and avoids deleting outside paths.

pkg/bundler/deployer/localformat/writer.go (3)
154-156: ⚠️ Potential issue | 🟡 Minor
Only append `?ref=` when a tag is provided.

For git-sourced kustomize targets with empty `Tag`, the current code emits `repo//path?ref=`. Keep `repo//path` unless `Tag` is set.

Suggested fix
```diff
 target := c.Path
 if c.Repository != "" {
-	target = fmt.Sprintf("%s//%s?ref=%s", c.Repository, c.Path, c.Tag)
+	target = fmt.Sprintf("%s//%s", c.Repository, c.Path)
+	if c.Tag != "" {
+		target = fmt.Sprintf("%s?ref=%s", target, c.Tag)
+	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 154 - 156, The target string construction currently always appends "?ref=" even when c.Tag is empty; update the code that builds target (the branch where c.Repository != "" that sets target) to only append "?ref=%s" when c.Tag is non-empty (use c.Tag to decide), e.g. construct base := fmt.Sprintf("%s//%s", c.Repository, c.Path) and then set target = base only, or target = base + "?ref="+c.Tag when c.Tag != "". Ensure you modify the block that references c.Repository, c.Path, c.Tag and assigns target.
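The conditional fix above is easy to exercise in isolation. This standalone sketch uses an illustrative helper name (`kustomizeTarget` is not the writer's actual API) to show the three target shapes: tagged git source, untagged git source, and a plain local path.

```go
package main

import "fmt"

// kustomizeTarget builds a kustomize target string. The rules mirror the
// suggested fix: repo//path, with ?ref= appended only when a tag exists.
func kustomizeTarget(repo, path, tag string) string {
	if repo == "" {
		return path
	}
	target := fmt.Sprintf("%s//%s", repo, path)
	if tag != "" {
		target = fmt.Sprintf("%s?ref=%s", target, tag)
	}
	return target
}

func main() {
	fmt.Println(kustomizeTarget("https://github.com/org/repo", "deploy/base", "v1.2.0"))
	fmt.Println(kustomizeTarget("https://github.com/org/repo", "deploy/base", "")) // no dangling ?ref=
	fmt.Println(kustomizeTarget("", "local/overlay", ""))
}
```

The key property is that an empty `Tag` never produces a trailing `?ref=`.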
81-102: ⚠️ Potential issue | 🔴 Critical
Fail fast before emitting folder indices beyond 999.

Folder names are `%03d-*`, but deploy/undeploy scripts glob only `NNN-*`. Once `idx` reaches 1000, generated folders are silently skipped by orchestration.

Suggested fix
```diff
 for _, c := range opts.Components {
+	if idx > 999 {
+		return nil, errors.New(errors.ErrCodeInvalidRequest,
+			"too many bundle folders; max 999 supported by NNN-* layout")
+	}
 	if err := ctx.Err(); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 	}
@@
 	if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
+		if idx > 999 {
+			return nil, errors.New(errors.ErrCodeInvalidRequest,
+				"too many bundle folders; max 999 supported by NNN-* layout")
+		}
 		postName := c.Name + "-post"
```

Also applies to: 119-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 81 - 102, The loop that builds folder names uses idx with fmt.Sprintf("%03d-%s", idx, c.Name) so once idx >= 1000 the glob "NNN-*" used by orchestration will skip folders; add a fail-fast guard that checks idx >= 1000 (or > 999) before formatting dir and return an errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("too many components: index %d would overflow 3-digit folder prefix", idx)); apply the same guard in the other similar block around where dir is created (the other place referenced in the comment) so the routine returns a clear error instead of silently producing unmatchable folder names.
83-90: ⚠️ Potential issue | 🟠 Major
Reject mixed-component `<name>-post` collisions up front.

A real component named `x-post` can collide with the synthetic folder emitted for mixed component `x`, causing ambiguous release/folder mapping.

Suggested fix
```diff
 func Write(ctx context.Context, opts Options) ([]Folder, error) {
+	declared := make(map[string]struct{}, len(opts.Components))
+	for _, c := range opts.Components {
+		if _, exists := declared[c.Name]; exists {
+			return nil, errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("duplicate component name %q", c.Name))
+		}
+		declared[c.Name] = struct{}{}
+	}
+
 	if err := os.MkdirAll(opts.OutputDir, 0o755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal, "create output dir", err)
 	}
@@
 	if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
 		postName := c.Name + "-post"
+		if _, exists := declared[postName]; exists {
+			return nil, errors.New(errors.ErrCodeInvalidRequest,
+				fmt.Sprintf("component %q injects %q which collides with a declared component", c.Name, postName))
+		}
```

Also applies to: 117-124
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 83 - 90, When validating opts.Components in the writer (the loop over opts.Components that calls deployer.IsSafePathComponent), add a check to reject name collisions where a real component named "<base>-post" would conflict with the synthetic folder emitted for a mixed component "<base>": build a map of component names (and track which components are mixed if there is a field like Mixed/IsMixed) and, before accepting a component, if its Name has suffix "-post" and the map already contains a component with the same base name (Name without "-post") that is mixed, return an errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("invalid component name %q: reserved for mixed-component synthetic folder", c.Name)); apply the same collision check in the second validation block (lines 117-124) so both validation sites reject these mixed-component "-post" collisions up front while preserving the IsSafePathComponent check.

pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (4)
43-46: ⚠️ Potential issue | 🟠 Major
Validate `--timeout` before appending `s`.

`HELM_TIMEOUT` accepts arbitrary strings, but later commands force an `s` suffix. Inputs like `--timeout 5m` become `5ms`, and invalid tokens fail later in Helm/kubectl calls.

Suggested fix
```diff
 --timeout)
 	if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
-	HELM_TIMEOUT="$2"; shift 2 ;;
+	if ! [[ "$2" =~ ^[0-9]+$ ]]; then
+		echo "Error: --timeout must be an integer number of seconds"
+		exit 1
+	fi
+	HELM_TIMEOUT="$2"; shift 2 ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 43 - 46, The CLI parsing currently blindly sets HELM_TIMEOUT="$2" and later forces a trailing "s", which mangles values like "5m" into "5ms"; modify the --timeout handling in undeploy.sh.tmpl so that after assigning HELM_TIMEOUT="$2" you validate the token (e.g., if it matches a pure number /^[0-9]+$/ then append an "s", if it already contains a valid time unit like s|m|ms|h accept as-is, otherwise reject with "Error: invalid --timeout value"). Update the branch that handles --timeout to perform this regex check and either normalize numeric-only inputs by appending "s" or exit with an error for unrecognized formats.
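The integer-seconds rule can also be expressed outside the template for quick verification. This hedged Go sketch (the function name is illustrative, not part of the bundler) mirrors the regex check in the suggested fix: bare integers are normalized by appending the unit, anything else is rejected rather than mangled.

```go
package main

import (
	"fmt"
	"regexp"
)

var timeoutRe = regexp.MustCompile(`^[0-9]+$`)

// normalizeTimeout mirrors the shell check: accept bare integers (seconds)
// and append the unit; reject everything else instead of silently mangling it.
func normalizeTimeout(v string) (string, error) {
	if !timeoutRe.MatchString(v) {
		return "", fmt.Errorf("invalid --timeout %q: expected an integer number of seconds", v)
	}
	return v + "s", nil
}

func main() {
	for _, v := range []string{"300", "5m"} {
		got, err := normalizeTimeout(v)
		fmt.Println(v, "->", got, err)
	}
}
```

Without the check, `5m` would be turned into `5ms`, which is the exact corruption the review flags.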
452-475: ⚠️ Potential issue | 🔴 Critical
Delete hook-created resources before uninstalling wrapper releases.

For wrapper charts that use Helm hooks, `helm uninstall` does not remove hook-created objects. Without an explicit pre-uninstall sweep, resources/finalizers can remain after controller uninstall.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 452 - 475, Add a pre-uninstall sweep for hook-created resources by invoking a cleanup function before calling helm_force_uninstall inside the per-component loop: call pre_uninstall_hook_cleanup "${name}" "${ns}" (or another appropriately named function) immediately before helm_force_uninstall "${name}" "${ns}"; implement that function to locate and delete Helm hook-created objects and remove finalizers for resources owned by the wrapper/chart so they don't remain after controller uninstall (use component name/namespace to scope deletions). Ensure delete_release_cluster_resources and delete_orphaned_webhooks_for_ns remain after the uninstall to handle cluster-wide cleanup.
418-425: ⚠️ Potential issue | 🟡 Minor
Safety scans still omit injected `-post` releases.

The pre-flight and post-flight CRD scans iterate `.ComponentsReversed`, but uninstall iterates actual `NNN-*` folders (which include injected `-post` wrappers). This leaves wrapper releases outside the same safety surface.

Also applies to: 506-515, 620-638
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 418 - 425, The pre-flight/post-flight CRD checks currently iterate .ComponentsReversed, which misses injected "-post" wrapper releases and therefore doesn't cover the same set the uninstall loop (which iterates the actual NNN-* folders) processes; update each pre/post-flight range (the blocks that call skip_preflight_for_release and check_release_for_stuck_crds) to iterate the same collection the uninstall code uses (the list of actual installed release folders) instead of .ComponentsReversed so wrapper "-post" releases are included—locate the template ranges that call skip_preflight_for_release and check_release_for_stuck_crds and replace the range source with the unified/installed-releases collection used by the uninstall loop (use the same variable name used by uninstall to ensure parity).
477-480: ⚠️ Potential issue | 🟡 Minor
Move Nodewright taint comments inside the conditional block.

These comments render even when `nodewright-operator` is not in the bundle, which makes generated scripts misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 477 - 480, The comments about Nodewright taints are placed before the components loop and thus appear in generated scripts even when nodewright-operator isn't present; move those comment lines so they are inside the conditional that handles the nodewright component (e.g., inside the {{- range .ComponentsReversed }} block and specifically within the branch that checks for the nodewright-operator component), and keep the reference to runtimeRequiredTaint (defaults to skyhook.nvidia.com) next to the taint-removal logic so the comment only renders when nodewright-related code is generated.

pkg/bundler/deployer/localformat/writer_test.go (2)
117-119: ⚠️ Potential issue | 🟡 Minor
Avoid dereferencing `folders[0]` in the fatal path.

If `Write` returns zero folders, these `t.Fatalf` calls panic while formatting `folders[0].Kind` instead of reporting the real assertion failure.

Suggested fix
```diff
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
-	}
+	if len(folders) != 1 {
+		t.Fatalf("want 1 local-helm folder, got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
+	}
```

Also applies to: 310-312
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The test currently dereferences folders[0] inside t.Fatalf which will panic when folders is empty; change the assertion in the test to guard the index first (check len(folders) == 0 and fail with a clear message) before inspecting folders[0].Kind (or split into two checks: length check then kind check) for the occurrences around the assertion referencing folders, localformat.KindLocalHelm, and t.Fatalf (also fix the duplicate at the later occurrence).
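The failure mode is easy to reproduce in isolation: format arguments are evaluated by the caller before the reporting function ever runs, so an out-of-range index panics first. A minimal sketch (names are illustrative, with `recover` standing in for a test harness):

```go
package main

import "fmt"

// report mimics a t.Fatalf-style call: its arguments are fully evaluated
// by the caller before report itself runs.
func report(format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

func describeEmpty() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint("panicked: ", r)
		}
	}()
	var folders []string
	// folders[0] is evaluated here, before report is called, and panics
	// with an index-out-of-range error on the empty slice.
	report("want 1 folder, got %d kind=%v", len(folders), folders[0])
	return "reported"
}

func main() {
	fmt.Println(describeEmpty())
}
```

The length check never reaches the reader; the runtime panic replaces the intended assertion message, which is why the suggested fix splits the two checks.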
325-364: ⚠️ Potential issue | 🟡 Minor
Include a kustomize component in the determinism assertion.

The determinism test currently exercises upstream and mixed-manifest paths, but not the kustomize render path (`templates/manifest.yaml`), which is part of the new layout contract.

pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh (3)
85-93: ⚠️ Potential issue | 🟠 Major
`helm_force_uninstall` still suppresses terminal uninstall failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around lines 85 - 93, The helper helm_force_uninstall currently swallows failures (it redirects stderr on first attempt and uses "|| true" on the second), so if both uninstalls fail the script still continues; modify helm_force_uninstall to propagate errors instead: remove the stderr discard on the first helm uninstall and remove the "|| true" on the second uninstall (or explicitly check both command exit codes), and if both attempts fail print an error message naming the release and namespace and then exit or return a non-zero status (e.g., return 1) so callers know the uninstall failed; the function name to update is helm_force_uninstall.
527-537: ⚠️ Potential issue | 🔴 Critical
Namespace force-finalization is still applied to all terminating namespaces.
This remains unsafe on shared clusters because unrelated namespaces can be patched/deleted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around lines 527 - 537, The current loop force-clears finalizers for every terminating namespace (TERMINATING) which is unsafe on shared clusters; change the logic to first restrict TERMINATING to only the known/test namespaces you own (e.g., compute a filtered list like OWN_TERMINATING by matching names against a whitelist, label/annotation, or the specific expected namespaces for this mixed_gpu_operator test), then iterate over that filtered list when calling force_clear_namespace_finalizers and kubectl delete; ensure you reference the existing TERMINATING variable and the force_clear_namespace_finalizers function so the code only patches/deletes namespaces that match the allowed set and explicitly skip system or unrelated namespaces (e.g., kube-system, kube-public).
559-568: ⚠️ Potential issue | 🟠 Major
Global stale-webhook sweep is still deleting outside bundle scope.
This can remove webhook configurations owned by other tenants/workloads.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh` around lines 559 - 568, The sweep is deleting webhook configurations globally without verifying ownership; update the while loop that reads wh_name, svc_ns, svc_name to only delete webhooks that are owned by this bundle (e.g., check metadata.ownerReferences or a bundle-specific label) and skip others. Concretely, before the kubectl delete mutatingwebhookconfiguration/validatingwebhookconfiguration calls, query the webhook JSON for an identifying ownerReference or label (using kubectl get mutatingwebhookconfiguration/validatingwebhookconfiguration "$wh_name" -o jsonpath=...) and only perform the delete when that ownerReference or label matches the bundle's expected value; otherwise echo a skip message and continue. Ensure checks use wh_name, svc_ns, svc_name and guard both delete commands so unrelated tenants are never removed.

pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (1)
249-253: ⚠️ Potential issue | 🟠 Major
`skip_preflight_for_release` does not match `skyhook-customizations`, so the skip branch is unreachable.

At Line 418 you call `skip_preflight_for_release "skyhook-customizations"`, but this function never returns success for that value.

Suggested fix
```diff
 skip_preflight_for_release() {
 	case "$1" in
-		nodewright-operator|kgateway) return 0 ;;
+		nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
 		*) return 1 ;;
 	esac
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 249 - 253, The skip_preflight_for_release function never returns success for "skyhook-customizations", making the call skip_preflight_for_release "skyhook-customizations" unreachable; update the function (skip_preflight_for_release) to include "skyhook-customizations" in the successful case pattern (e.g., add it alongside nodewright-operator and kgateway) so the function returns 0 for that release, or alternatively adjust the caller to use a release name that the function recognizes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 559-568: The deletion loop currently inspects all
mutatingwebhookconfigurations/validatingwebhookconfigurations cluster-wide and
may remove unrelated webhooks; restrict scope by filtering the jq pipeline to
only consider webhooks whose .clientConfig.service.namespace equals the target
release namespace (e.g., an environment variable like RELEASE_NAMESPACE or a
provided NAMESPACE variable) before attempting deletion, and use the existing
variables (wh_name, svc_ns, svc_name) to only delete when svc_ns matches that
target namespace; update the kubectl/jq command that produces the list and the
if-condition in the while loop to enforce svc_ns == "$TARGET_NAMESPACE" so only
webhooks tied to the release namespace are removed.
- Around line 85-93: The helm_force_uninstall function currently swallows
failures from the second uninstall (it uses "|| true"), so if both attempts fail
the script continues and masks a partial undeploy; change helm_force_uninstall
to detect the exit status of the second helm uninstall (the one with --no-hooks)
and propagate a non-zero failure (e.g., remove the "|| true" and return the
command's exit code or explicitly log an error and return non-zero) so callers
see the undeploy failure; reference the helm_force_uninstall function and the
helm uninstall invocation that uses --no-hooks and HELM_TIMEOUT to locate and
update the behavior.
- Around line 527-537: The loop currently collects all Terminating namespaces
into TERMINATING and force-clears finalizers for every one; restrict this to
only bundle-managed namespaces by filtering the kubectl query to bundle
namespaces (e.g. use a label selector like kubectl get ns -l
"app.kubernetes.io/managed-by=bundler" --no-headers or a name-prefix filter
using a BUNDLE_PREFIX variable) instead of the global get; update the
TERMINATING assignment and the subsequent for-loop to only iterate over those
filtered names, and keep using force_clear_namespace_finalizers and kubectl
delete namespace "${ns}" --ignore-not-found --wait=false as before.
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Line 41: The README note is too broad for this nodewright-only fixture; update
the note in README.md to either (a) scope it to nodewright-specific post-install
convergence (mention Nodewright node tuning and that convergence continues
asynchronously) and remove references to GPU Operator operand rollout and DRA
kubelet plugin registration, or (b) make the note explicitly generic by
prefacing with “If deploying GPU workloads or additional components…” and link
to the CLI reference; locate and edit the paragraph that begins “Note: The
deploy script's final status…” and adjust the text to only mention
nodewright-related async actions or clearly state it applies to broader
deployments.
---
Outside diff comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 469-485: The current string search for "enabled: true" is too
brittle; instead parse values.yaml into a YAML object and assert the top-level
"enabled" key is not present or not set to true by unmarshalling valuesData into
a map[string]interface{} and checking map["enabled"]; update the check that uses
valuesStr to use the parsed map so it fails only on a leaked top-level enabled,
and keep the existing override assertion by verifying the nested
controller.replicaCount key exists in the parsed structure (e.g., assert
parsed["controller"].(map[string]interface{})["replicaCount"] is present).
---
Duplicate comments:
In `@docs/contributor/component.md`:
- Line 21: The sentence in docs referencing pkg/bundler/deployer/localformat and
the list of deployers conflates current and future behavior; update the
paragraph to explicitly separate which deployers currently consume
localformat.Write() (e.g., helms, argocd, argocd-helm if they truly do) from
planned or proposed ones (e.g., future helmfile per issue `#632`), by rewriting
the line to state "currently used by ..." vs "planned/future ..." and ensure the
symbol localformat.Write() and the deployer names (`helm`, `helmfile`, `argocd`,
`argocd-helm`) are mentioned so readers can locate the relevant implementation
and roadmap note.
In `@docs/user/cli-reference.md`:
- Line 1152: Update the wording in the docs where the phrase "local-helm wrapped
chart" appears (specifically the sentence "Manifest-only components (no upstream
Helm chart, just raw manifests) become a single local-helm wrapped chart.") to
use the compound-modifier hyphenation: change it to "local-helm-wrapped chart"
so the modifier is clear and consistent.
In `@pkg/bundler/deployer/helm/helm.go`:
- Around line 124-149: Generate currently writes new component folders but
doesn't remove previous NNN-* directories, allowing stale components to be
deployed; update Generate to prune old staged directories before writing (or
immediately after localformat.Write but before appending folder files) by
listing entries in outputDir and removing any directory matching the NNN-*
pattern that are not part of the newly returned folders; reference the
localformat.Write result variable folders, the outputDir variable, and ensure
use of deployer.SafeJoin when resolving paths so pruning enforces containment
and avoids deleting outside paths.
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 43-46: The CLI parsing currently blindly sets HELM_TIMEOUT="$2"
and later forces a trailing "s", which mangles values like "5m" into "5ms";
modify the --timeout handling in undeploy.sh.tmpl so that after assigning
HELM_TIMEOUT="$2" you validate the token (e.g., if it matches a pure number
/^[0-9]+$/ then append an "s", if it already contains a valid time unit like
s|m|ms|h accept as-is, otherwise reject with "Error: invalid --timeout value").
Update the branch that handles --timeout to perform this regex check and either
normalize numeric-only inputs by appending "s" or exit with an error for
unrecognized formats.
- Around line 452-475: Add a pre-uninstall sweep for hook-created resources by
invoking a cleanup function before calling helm_force_uninstall inside the
per-component loop: call pre_uninstall_hook_cleanup "${name}" "${ns}" (or
another appropriately named function) immediately before helm_force_uninstall
"${name}" "${ns}"; implement that function to locate and delete Helm
hook-created objects and remove finalizers for resources owned by the
wrapper/chart so they don't remain after controller uninstall (use component
name/namespace to scope deletions). Ensure delete_release_cluster_resources and
delete_orphaned_webhooks_for_ns remain after the uninstall to handle
cluster-wide cleanup.
- Around line 418-425: The pre-flight/post-flight CRD checks currently iterate
.ComponentsReversed, which misses injected "-post" wrapper releases and
therefore doesn't cover the same set the uninstall loop (which iterates the
actual NNN-* folders) processes; update each pre/post-flight range (the blocks
that call skip_preflight_for_release and check_release_for_stuck_crds) to
iterate the same collection the uninstall code uses (the list of actual
installed release folders) instead of .ComponentsReversed so wrapper "-post"
releases are included—locate the template ranges that call
skip_preflight_for_release and check_release_for_stuck_crds and replace the
range source with the unified/installed-releases collection used by the
uninstall loop (use the same variable name used by uninstall to ensure parity).
- Around line 477-480: The comments about Nodewright taints are placed before
the components loop and thus appear in generated scripts even when
nodewright-operator isn't present; move those comment lines so they are inside
the conditional that handles the nodewright component (e.g., inside the {{-
range .ComponentsReversed }} block and specifically within the branch that
checks for the nodewright-operator component), and keep the reference to
runtimeRequiredTaint (defaults to skyhook.nvidia.com) next to the taint-removal
logic so the comment only renders when nodewright-related code is generated.
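The `--timeout` normalization described in the first item above can be sketched as a small POSIX helper; the function name `normalize_timeout` and the exact unit set are assumptions for illustration, not code from the template.

```shell
# Hypothetical helper: bare integers get an "s" appended, values that
# already carry a recognized time unit pass through, anything else fails.
normalize_timeout() {
  if printf '%s' "$1" | grep -Eq '^[0-9]+$'; then
    printf '%ss\n' "$1"        # "300" -> "300s"
  elif printf '%s' "$1" | grep -Eq '^[0-9]+(ms|s|m|h)$'; then
    printf '%s\n' "$1"         # "5m" stays "5m" instead of becoming "5ms"
  else
    echo "Error: invalid --timeout value: $1" >&2
    return 1
  fi
}
```

The `--timeout` branch could then assign `HELM_TIMEOUT=$(normalize_timeout "$2") || exit 1` and drop the later unconditional `s` suffix.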
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README's “Note” paragraph and other sections (around the shown
Note and lines 89–110) reference GPU/network/cert-manager checks and external
GPU docs which are unrelated to the kai-scheduler fixture; update those sections
to contain kai-scheduler-specific diagnostics (for example: expected
Deployment/DaemonSet status, scheduler logs, CR status, webhook health, and how
to run kubectl describe/logs for kai-scheduler) or explicitly mark those
examples as generic/templated, and replace or annotate the AICR CLI link/notes
to point to kai-scheduler behavior docs or a clearly labeled generic deploy
behavior section so the fixture README consistently documents kai-scheduler
operational checks.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md`:
- Around line 87-110: The Troubleshooting section in README.md currently lists
GPU/network operator commands and links that don't apply to this manifest_only
fixture; update the "Troubleshooting" block under the "Check deployment status",
"View component logs", and "Verify GPU access" headings to include
skyhook-customizations-specific commands or explicitly label the existing
examples as generic/non-applicable, e.g., replace the kubectl commands and
NVIDIA links with checks that target skyhook-customizations (or add a clear note
like "Generic examples — not specific to skyhook-customizations") so the content
accurately reflects the fixture and operational guidance.
- Line 21: The table row for the manifest-only component
"skyhook-customizations" leaves Version blank and shows Source as
"skyhook-customizations ()"; update the README table so the Version cell
explicitly reads "N/A" and the Source cell reads "local" (i.e., replace the
empty Version and the "()"-style placeholder with explicit values for the
"skyhook-customizations" row).
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 249-253: The skip_preflight_for_release function never returns
success for "skyhook-customizations", making the call skip_preflight_for_release
"skyhook-customizations" unreachable; update the function
(skip_preflight_for_release) to include "skyhook-customizations" in the
successful case pattern (e.g., add it alongside nodewright-operator and
kgateway) so the function returns 0 for that release, or alternatively adjust
the caller to use a release name that the function recognizes.
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 3-7: The Service manifest for kind: Service named "dcgm-exporter"
is missing a spec (ports and optional selector); update the dcgm-exporter
Service resource to include a minimal spec block that declares at least one port
(port and targetPort and protocol) and, if appropriate for this fixture, a
selector matching the Deployment/Pod labels (e.g., app/dcgm-exporter) so the
Service is selectable and installable; ensure the resource remains valid YAML
and aligns with the Deployment/Pod labels in this testdata set.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 92-110: The README mixes components outside this bundle
(network-operator and cert-manager) with the bundled gpu-operator; restrict
commands and links to GPU Operator only or label the examples as generic. Update
the grep in the "Get pods" example to only search for gpu-operator, remove or
annotate the network-operator/cert-manager references in the troubleshooting
text, ensure the "View component logs" and "Verify GPU access" sections
explicitly reference the gpu-operator namespace and resource (keep kubectl logs
-n gpu-operator -l app=gpu-operator and the nvidia.com/gpu jsonpath) and replace
the Network Operator link with either a note that it’s external or remove it so
the "References" list only contains GPU Operator docs (or clearly mark others as
external examples).
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh`:
- Around line 85-93: The helper helm_force_uninstall currently swallows failures
(it redirects stderr on first attempt and uses "|| true" on the second), so if
both uninstalls fail the script still continues; modify helm_force_uninstall to
propagate errors instead: remove the stderr discard on the first helm uninstall
and remove the "|| true" on the second uninstall (or explicitly check both
command exit codes), and if both attempts fail print an error message naming the
release and namespace and then exit or return a non-zero status (e.g., return 1)
so callers know the uninstall failed; the function name to update is
helm_force_uninstall.
- Around line 527-537: The current loop force-clears finalizers for every
terminating namespace (TERMINATING) which is unsafe on shared clusters; change
the logic to first restrict TERMINATING to only the known/test namespaces you
own (e.g., compute a filtered list like OWN_TERMINATING by matching names
against a whitelist, label/annotation, or the specific expected namespaces for
this mixed_gpu_operator test), then iterate over that filtered list when calling
force_clear_namespace_finalizers and kubectl delete; ensure you reference the
existing TERMINATING variable and the force_clear_namespace_finalizers function
so the code only patches/deletes namespaces that match the allowed set and
explicitly skip system or unrelated namespaces (e.g., kube-system, kube-public).
- Around line 559-568: The sweep is deleting webhook configurations globally
without verifying ownership; update the while loop that reads wh_name, svc_ns,
svc_name to only delete webhooks that are owned by this bundle (e.g., check
metadata.ownerReferences or a bundle-specific label) and skip others.
Concretely, before the kubectl delete
mutatingwebhookconfiguration/validatingwebhookconfiguration calls, query the
webhook JSON for an identifying ownerReference or label (using kubectl get
mutatingwebhookconfiguration/validatingwebhookconfiguration "$wh_name" -o
jsonpath=...) and only perform the delete when that ownerReference or label
matches the bundle's expected value; otherwise echo a skip message and continue.
Ensure checks use wh_name, svc_ns, svc_name and guard both delete commands so
unrelated tenants are never removed.
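The error-propagating `helm_force_uninstall` described in the first item above might look like the sketch below; the retry flags and messages are assumptions, and the `HELM_BIN` indirection is introduced here only so the control flow can be exercised without a real `helm`.

```shell
HELM_BIN="${HELM_BIN:-helm}"

# Hypothetical rewrite: no discarded stderr, no "|| true" -- if both
# attempts fail, name the release and propagate a non-zero status.
helm_force_uninstall() {
  name="$1"; ns="$2"
  if "$HELM_BIN" uninstall "$name" -n "$ns"; then
    return 0
  fi
  echo "Warning: uninstall of ${name} failed, retrying with --no-hooks" >&2
  if "$HELM_BIN" uninstall "$name" -n "$ns" --no-hooks; then
    return 0
  fi
  echo "Error: failed to uninstall release ${name} in namespace ${ns}" >&2
  return 1
}
```

Callers in the per-component loop can then decide whether to abort or collect failures instead of silently continuing.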
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Around line 87-110: The Troubleshooting section incorrectly targets
GPU/Network operators; replace the GPU-operator and network-operator
checks/logs/links with nodewright-specific commands and references: update the
kubectl pod check to filter for nodewright-operator (e.g., grep for
'nodewright-operator' or namespace 'skyhook'), change the logs command to
kubectl logs -n skyhook -l app=nodewright-operator (or the actual label used by
the fixture), replace the GPU node allocatable check with a relevant nodewright
health or CR status check (e.g., kubectl get nodes or kubectl get <NodewrightCR>
-n skyhook -o yaml/jsonpath), and swap the external links to the
nodewright-operator documentation or the project's README; alternatively label
these entries explicitly as generic examples if you intend to keep them
non-nodewright-specific.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 91-110: The README's troubleshooting section currently shows
GPU/network-operator commands and references; update the "kubectl get pods -A |
grep -E 'gpu-operator|network-operator|cert-manager'" command, the "View
component logs" kubectl logs command, the "Verify GPU access" jsonpath check,
and the References list to be cert-manager specific. Replace the pod
listing/grep with a cert-manager namespace/label lookup (e.g., namespace
cert-manager), change the logs command to target cert-manager pods (use -n
cert-manager and appropriate label), remove the GPU allocatable jsonpath check
and instead include a basic cert-manager health/status check (kubectl get
deployments, pods, or certificate statuses), and update the References to point
to cert-manager docs/installation pages so the cert-manager-only bundle reads
correctly.
In `@pkg/bundler/deployer/localformat/doc.go`:
- Around line 15-17: Update the package doc comment for package localformat in
pkg/bundler/deployer/localformat/doc.go to avoid implying unimplemented deployer
adoption: explicitly state which deployers currently consume the local-chart
bundle layout (e.g., "currently used by X and Y") and mark others as intended or
future targets (e.g., "planned for/compatible with Z in the future"), removing
any present-tense claim that Helmfile, ArgoCD, or Flux already consume it if
they do not; ensure the text references the package name localformat and clearly
separates current consumers from intended future targets.
In `@pkg/bundler/deployer/localformat/upstream_helm.go`:
- Around line 60-66: The env file is written unquoted which allows shell
metacharacters to be executed when sourced; update the envContent construction
in upstream_helm.go to shell-quote CHART, REPO and VERSION before writing (e.g.
escape any single quotes and wrap each value in single quotes or use a dedicated
shell-escape helper), replace the current envContent :=
fmt.Sprintf("CHART=%s\nREPO=%s\nVERSION=%s\n", ...) with a version that produces
CHART='...'\nREPO='...'\nVERSION='...'\n, and keep using deployer.SafeJoin and
writeFile as-is so the file path/permissions are unchanged.
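The quoting rule requested above can be illustrated in shell; `shell_single_quote` is a hypothetical helper showing the transformation the Go writer would need to apply (escape embedded single quotes, then wrap), not code from `upstream_helm.go`.

```shell
# Wrap a value in single quotes, turning each embedded ' into '\'' so a
# sourced env file cannot execute shell metacharacters in the value.
shell_single_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Building env-file content with quoted values (illustrative inputs):
printf 'CHART=%s\nREPO=%s\nVERSION=%s\n' \
  "$(shell_single_quote 'cert-manager')" \
  "$(shell_single_quote 'https://charts.jetstack.io')" \
  "$(shell_single_quote 'v1.14.0; echo pwned')"
```

Sourcing the resulting file assigns the literal string `v1.14.0; echo pwned` to `VERSION` instead of running the embedded command.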
In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 117-119: The test currently dereferences folders[0] inside
t.Fatalf which will panic when folders is empty; change the assertion in the
test to guard the index first (check len(folders) == 0 and fail with a clear
message) before inspecting folders[0].Kind (or split into two checks: length
check then kind check) for the occurrences around the assertion referencing
folders, localformat.KindLocalHelm, and t.Fatalf (also fix the duplicate at the
later occurrence).
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 154-156: The target string construction currently always appends
"?ref=" even when c.Tag is empty; update the code that builds target (the branch
where c.Repository != "" that sets target) to only append "?ref=%s" when c.Tag
is non-empty (use c.Tag to decide), e.g. construct base := fmt.Sprintf("%s//%s",
c.Repository, c.Path) and then set target = base only, or target = base +
"?ref="+c.Tag when c.Tag != "". Ensure you modify the block that references
c.Repository, c.Path, c.Tag and assigns target.
- Around line 81-102: The loop that builds folder names uses idx with
fmt.Sprintf("%03d-%s", idx, c.Name) so once idx >= 1000 the glob "NNN-*" used by
orchestration will skip folders; add a fail-fast guard that checks idx >= 1000
(or > 999) before formatting dir and return an
errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("too many components: index
%d would overflow 3-digit folder prefix", idx)); apply the same guard in the
other similar block around where dir is created (the other place referenced in
the comment) so the routine returns a clear error instead of silently producing
unmatchable folder names.
- Around line 83-90: When validating opts.Components in the writer (the loop
over opts.Components that calls deployer.IsSafePathComponent), add a check to
reject name collisions where a real component named "<base>-post" would conflict
with the synthetic folder emitted for a mixed component "<base>": build a map of
component names (and track which components are mixed if there is a field like
Mixed/IsMixed) and, before accepting a component, if its Name has suffix "-post"
and the map already contains a component with the same base name (Name without
"-post") that is mixed, return an errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("invalid component name %q: reserved for mixed-component synthetic
folder", c.Name)); apply the same collision check in the second validation block
(lines 117-124) so both validation sites reject these mixed-component "-post"
collisions up front while preserving the IsSafePathComponent check.
In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The test currently uses grep -q 'kind: Gateway' which also
matches GatewayParameters; update the test to assert an exact full-line match
for the Gateway kind (e.g., replace the grep -q 'kind: Gateway' invocation with
a grep that matches the entire line "kind: Gateway" using either line anchors or
grep -x) so the MANIFEST check detects the Gateway object specifically while
keeping the existing GatewayParameters check intact.
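The substring hazard called out above is easy to reproduce; the manifest content below is a stand-in for the rendered fixture output, not the test's real data.

```shell
MANIFEST=$(mktemp)
printf 'kind: GatewayParameters\n' > "$MANIFEST"

# The loose check wrongly matches the GatewayParameters line...
grep -q 'kind: Gateway' "$MANIFEST" && echo "loose grep: false positive"
# ...while -x demands a whole-line match and correctly reports absence.
grep -qx 'kind: Gateway' "$MANIFEST" || echo "anchored grep: Gateway absent"

printf 'kind: Gateway\n' >> "$MANIFEST"
grep -qx 'kind: Gateway' "$MANIFEST" && echo "anchored grep: Gateway found"
rm -f "$MANIFEST"
```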
In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-100: The exclusion pattern for determining INSTALL is
inconsistent: replace the grep exclusion string "-post/" with the unified
"-gpu-operator-post" so the command that sets INSTALL (the shell pipeline
producing INSTALL=$(ls ... | grep -v -- "...") ) uses the same
"-gpu-operator-post" filter used elsewhere; update the grep -v argument
accordingly to grep -v -- "-gpu-operator-post".
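For reference, the unified exclusion filter behaves like this on a stand-in folder listing (the paths are illustrative):

```shell
# Only the base gpu-operator install script survives; the injected
# -post folder is filtered out.
printf '001-gpu-operator/install.sh\n002-gpu-operator-post/install.sh\n' \
  | grep -v -- "-gpu-operator-post"
# prints: 001-gpu-operator/install.sh
```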
e281666 to 811ecac
Actionable comments posted: 7
♻️ Duplicate comments (21)
pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml (1)
3-6: ⚠️ Potential issue | 🟠 Major

Service manifest is incomplete and not installable.

Line 4 declares `kind: Service`, but lines 3-6 omit `spec` entirely. A non-ExternalName Service requires `spec` (including `ports`), otherwise the rendered output is invalid.

Suggested fix

```diff
 apiVersion: v1
 kind: Service
 metadata:
   name: dcgm-exporter
+spec:
+  selector:
+    app.kubernetes.io/name: dcgm-exporter
+  ports:
+    - name: metrics
+      port: 9400
+      targetPort: 9400
+      protocol: TCP
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml` around lines 3 - 6, The Service manifest for "metadata.name: dcgm-exporter" declares "kind: Service" but lacks a required "spec" block; add a proper spec for a non-ExternalName Service including a selector that matches the exporter pods (or appropriate labels) and at least one ports entry (e.g., port, targetPort, protocol) so the rendered Service is valid and installable. Ensure the manifest contains "spec:" with "selector:" referencing the pod labels and "ports:" with the correct port definitions for dcgm-exporter.

docs/contributor/component.md (1)
21-21: ⚠️ Potential issue | 🟠 Major

Separate current support from future intent in this sentence.

“Consumed by every deployer” conflicts with “future `helmfile` per #632” and overstates what is implemented today. Please rewrite to explicitly state the currently supported deployers, then add a separate future-intent sentence.

As per coding guidelines: “Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/contributor/component.md` at line 21, The sentence overstates current support by saying the `pkg/bundler/deployer/localformat` layout is "consumed by every deployer" while also mentioning a future `helmfile` support; update the text to separate current behavior from future intent: explicitly state that `localformat.Write()` is currently used by the listed deployers (e.g., `helm`, `argocd`, `argocd-helm`) and then add a distinct, clearly marked sentence indicating that support for `helmfile` is planned per issue `#632` (or similar future intent), ensuring you keep `localformat` ownership and the fact that deployers add top-level orchestration files (deploy.sh, helmfile.yaml, Application CRs) unchanged.

tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)
71-72: ⚠️ Potential issue | 🟠 Major

Harden kind assertions to avoid false positives.

Line 72 can pass when only `GatewayParameters` exists, because `grep -q 'kind: Gateway'` matches that substring too. Anchor both kind checks to full lines.

✅ Tighten assertions

```diff
-          grep -q 'kind: GatewayParameters' "${MANIFEST}"
-          grep -q 'kind: Gateway' "${MANIFEST}"
+          grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+          grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71 - 72, The grep checks for manifest kinds are too loose and can false-match (e.g., "kind: GatewayParameters" satisfies the "kind: Gateway" check); update the two assertions that run grep -q 'kind: GatewayParameters' "${MANIFEST}" and grep -q 'kind: Gateway' "${MANIFEST}" to anchor the pattern to the full line (e.g., use start/end anchors or grep -x) so they only match exact lines "kind: GatewayParameters" and "kind: Gateway" in the ${MANIFEST}.

pkg/bundler/bundler_test.go (1)
1199-1206: 🧹 Nitpick | 🔵 Trivial

Remove redundant nested file checks under already-missing directories.

Once the directory non-existence assertion passes, checking `cluster-values.yaml` beneath that same directory adds no extra signal.

♻️ Suggested simplification

```diff
 // Disabled component should NOT have a directory at all (under any numbering)
 for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
 	if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
 		t.Errorf("expected %s directory to NOT be created (component is disabled)", dir)
 	}
-	if _, statErr := os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")); !os.IsNotExist(statErr) {
-		t.Errorf("expected %s/cluster-values.yaml to NOT exist (component is disabled)", dir)
-	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/bundler_test.go` around lines 1199 - 1206, The test is redundantly checking for cluster-values.yaml under directories already asserted absent; remove the nested file existence checks and only assert the directory non-existence for each of the dirs (e.g., "aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver") using os.Stat on filepath.Join(tmpDir, dir) in the test in pkg/bundler/bundler_test.go so the loop that references tmpDir and dir only contains the directory non-existence assertion and not the additional filepath.Join(tmpDir, dir, "cluster-values.yaml") checks.

tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml (1)
99-100: ⚠️ Potential issue | 🟡 Minor

Normalize the post-folder exclusion pattern in the install script lookup.

Line 100 uses `-post/` while the rest of this file uses `-gpu-operator-post`; keeping one pattern avoids confusion and drift.

Suggested patch

```diff
-          INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
-            | grep -v -- "-post/" | head -1)
+          INSTALL=$(ls "${WORK}"/helm-dynamic/[0-9]*-gpu-operator/install.sh 2>/dev/null \
+            | grep -v -- "-gpu-operator-post" | head -1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml` around lines 99 - 100, The install script lookup uses a mismatched exclusion pattern "-post/" which is inconsistent with the rest of the file; update the grep pattern in the INSTALL assignment (the line that builds INSTALL) to exclude "-gpu-operator-post" instead of "-post/" so the filter matches the project's existing post-folder naming convention (update the grep -v argument used when selecting from "${WORK}/helm-dynamic/[0-9]*-gpu-operator/install.sh").

pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)
41-41: ⚠️ Potential issue | 🟡 Minor

Align readiness note and references with the cert-manager-only fixture.

Line 41 and lines 106-107 describe unrelated GPU/network topics; please scope these to `cert-manager` or mark them as generic guidance.

As per coding guidelines: “When writing documentation... Ensure accuracy... and document operational clarity.”
Also applies to: 106-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` at line 41, The README contains a "Note:" paragraph that references GPU/node convergence and network topics unrelated to the cert-manager-only fixture; update that paragraph (the "**Note:**" block) and the other similar guidance later in the file so they either (a) explicitly scope the content to GPU/network deployments (e.g., prefix with "For GPU/network deployments:") or (b) reword to generic deployment readiness guidance applicable to cert-manager-only fixtures, and align the references to the AICR CLI docs accordingly (ensure the cert-manager-only fixture README does not mention Nodewright/GPU operator/DRA kubelet plugin unless scoped).

docs/user/cli-reference.md (1)
1152-1152: ⚠️ Potential issue | 🟡 Minor

Hyphenate the compound modifier in line 1152.

Use `local-helm-wrapped chart` for consistent grammar.
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` at line 1152, The sentence "Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart." uses an unhyphenated compound modifier; update that phrase to "local-helm-wrapped chart" so the compound modifier is hyphenated correctly (search for the exact sentence "Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart." and replace the modifier).

pkg/bundler/deployer/helm/testdata/nodewright_present/README.md (1)
41-41: ⚠️ Potential issue | 🟡 Minor

Scope this fixture README to nodewright-only behavior and references.

Line 41 and lines 106-107 currently describe broader GPU stack concerns/docs, but this fixture documents only `nodewright-operator`. Please make the note and references component-accurate (or explicitly mark them as generic).

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy... and document operational clarity.”
Also applies to: 106-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md` at line 41, The README note currently references broad GPU-stack convergence and the AICR CLI entry about deploy script behavior but this fixture documents only nodewright-operator; update the text around the deploy script note (line referencing the deploy script) and the references at lines 106-107 to be component-accurate by limiting scope to nodewright-operator behavior (e.g., mention Nodewright node tuning and any nodewright-specific async actions) or explicitly label the statements as “generic GPU-stack behavior” if you intend to keep the broader content; ensure references to the AICR CLI or GPU operator are either removed or replaced with nodewright-specific docs/links and that occurrences of “GPU operator”/“DRA kubelet plugin” are not asserted as part of this fixture’s expected state unless you mark them as generic.

pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)
41-41: ⚠️ Potential issue | 🟡 Minor

Update note and references to be kai-scheduler-specific (or explicitly generic).
Line 41 and Lines 106-107 are currently GPU-stack oriented and don’t match this fixture’s component set.
As per coding guidelines: “When writing documentation... Ensure accuracy, use neutral tone... and document operational clarity.”
Also applies to: 106-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` at line 41, The README note paragraph that begins with "**Note:** The deploy script's final status..." and the later GPU-specific sentences (referencing `--best-effort`, "GPU workloads", "fresh GPU nodes", "Nodewright node tuning", "GPU operator operand rollout", "DRA kubelet plugin registration") must be rewritten to either be kai-scheduler-specific or explicitly generic: remove or replace GPU-stack terms with kai-scheduler component names or neutral wording about asynchronous convergence of cluster components after the script exits, retain mention of `--best-effort` behavior and the link to the AICR CLI Reference but adjust its context to match kai-scheduler (or state it's a generic deploy behavior link), and ensure the revised text clarifies that final script exit does not guarantee full operational readiness and users should check warnings and logs.

pkg/bundler/deployer/helm/templates/README.md.tmpl (2)
131-134: ⚠️ Potential issue | 🟡 Minor

Only emit reference links that apply to the generated bundle.
This template always appends GPU Operator and Network Operator docs, so manifest-only bundles and gpu-operator-only bundles end up with unrelated references. Please gate these links on component membership, or clearly label them as generic examples. As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 131 - 134, The README.md.tmpl currently unconditionally emits GPU Operator and Network Operator links in the "References" section; update the template (README.md.tmpl) to only include each operator link when the corresponding component is present (e.g., use template conditionals like has_gpu_operator / has_network_operator or the bundle/component membership variables available in the templating context), or alternatively add a clear header such as "Example operator references (may not apply to all bundles)" to avoid implying they apply to manifest-only or single-operator bundles; adjust the "References" block generation accordingly so links are gated by component membership or clearly labeled as generic examples.
20-27: ⚠️ Potential issue | 🟡 Minor

Render local/mixed bundle metadata explicitly in the components section.

For manifest-only/local-chart bundles this currently generates blank Version fields and `name ()` sources, and for mixed bundles the intro still implies a 1:1 component-to-folder mapping even though the bundle can emit an injected `NNN+1-<name>-post/` folder. Please make the local case explicit (`N/A`/`local`) and mention the synthetic `-post` folder behavior so the generated README matches the actual layout.

As per coding guidelines: “When writing documentation, act as a senior open-source documentation editor with CNCF/Linux Foundation experience. Ensure accuracy, use neutral tone, clearly structure content, consider audience awareness, and document operational clarity.”
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 20 - 27, The README template currently renders empty Version and source fields for manifest-only/local-chart bundles; update the template logic around the Components loop (referencing .Components, .Name, .Version, .ChartName, .Repository, .Namespace) to print "N/A" for missing .Version and to use a clear local source label such as "local" or "manifest-only" when .ChartName/.Repository are blank, and also update the intro text to explicitly state that mixed bundles may emit an extra synthetic folder named "NNN+1-<name>-post/" so the numbering and component-to-folder mapping in the generated README accurately reflect the actual layout.

pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)
309-321: ⚠️ Potential issue | 🟠 Major

Don't install arbitrary stale `NNN-*` folders from a reused output directory.

This loop executes whatever numbered folders happen to exist on disk. If a user regenerates into an existing `--output` path, removed or renumbered components from the previous bundle are still picked up and installed. Please either clean the output directory before writing or render an explicit generated-folder manifest and iterate that instead.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 309 - 321, The current install loop in deploy.sh.tmpl iterates whatever NNN-* directories exist under "${SCRIPT_DIR}", which can install stale items from a reused output dir; change it to either (A) purge/clean the output directory before writing generated folders so the for loop over "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ only sees fresh folders, or (B) stop globbing the filesystem and instead read an explicit generated-folder manifest (e.g., a file written at bundle render time like "${SCRIPT_DIR}/generated-manifest.txt") and iterate that list to set dir/base/name/namespace (keep using the same variable names dir, base, name, namespace and the awk extraction for namespace) so only intended folders are installed. Ensure whichever approach you pick is implemented where the loop currently starts (the for dir in ... block) and that the manifest is produced by the renderer or the cleanup occurs before this loop runs.

pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh (2)
485-488: ⚠️ Potential issue | 🟡 Minor

Stale comment about nodewright taints in kai-scheduler fixture.
This comment block mentions "nodewright node taints" and "runtimeRequiredTaint (defaults to skyhook.nvidia.com)" but this is the kai-scheduler-only test fixture. The comment is misleading since no taint removal code follows. Consider removing or updating the comment to match the fixture's scope.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around lines 485 - 488, Remove or update the stale comment block that references "nodewright node taints" and "runtimeRequiredTaint (defaults to skyhook.nvidia.com)" in the undeploy.sh fixture so it accurately reflects kai-scheduler's scope; specifically either delete the comment or replace it with a short note that no node taint removal occurs in this kai-scheduler-only fixture (search for the comment text "nodewright" or "runtimeRequiredTaint" to find the exact spot).
43-45: ⚠️ Potential issue | 🟡 Minor

Validate `--timeout` to reject non-integer values.

The `--timeout` parser accepts any string but later appends an `s` suffix (e.g., line 88: `"${HELM_TIMEOUT}s"`). If a user passes `--timeout 5m`, it becomes `5ms`, causing silent misbehavior.

🛡️ Proposed fix

```diff
     --timeout)
       if [[ $# -lt 2 ]]; then echo "Error: --timeout requires a value"; exit 1; fi
+      if ! [[ "$2" =~ ^[0-9]+$ ]]; then
+        echo "Error: --timeout must be an integer number of seconds"
+        exit 1
+      fi
       HELM_TIMEOUT="$2"; shift 2 ;;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh` around lines 43 - 45, The --timeout parsing block assigns HELM_TIMEOUT from the next arg but doesn't validate it, causing suffix concatenation like "${HELM_TIMEOUT}s" to misinterpret values like "5m"; update the parsing case that sets HELM_TIMEOUT to validate that the supplied value is a positive integer (e.g., match against a numeric regex such as ^[0-9]+$) and if it fails print a clear error and exit non-zero; place this check immediately after HELM_TIMEOUT="$2" in the same case handling so invalid values are rejected before shift 2 and subsequent use.

pkg/bundler/deployer/localformat/writer_test.go (2)
310-312: ⚠️ Potential issue | 🟡 Minor: Same panic guard needed here.
Same issue as above: if `folders` is empty, the `t.Fatalf` format string panics when evaluating `folders[0].Kind`.
🐛 Proposed fix

```diff
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder (kustomize wrapped), got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder (kustomize wrapped), got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder (kustomize wrapped), got kind=%v", folders[0].Kind)
 	}
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 310 - 312, Test panics if folders is empty because folders[0].Kind is evaluated inside t.Fatalf; guard the access by first checking len(folders) and handling the zero-length case separately (e.g., if len(folders) == 0 then fail with a message that 0 folders were returned), otherwise proceed to assert Kind on folders[0]; update the assertion around folders and t.Fatalf usage in writer_test.go to avoid indexing when folders is empty.
117-119: ⚠️ Potential issue | 🟡 Minor: Guard against panic when `folders` is empty.
If `Write` returns zero folders, the `t.Fatalf` format string evaluates `folders[0].Kind`, causing a panic before the assertion failure is reported. Split the length check from the kind check.
🐛 Proposed fix

```diff
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder, got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
 	}
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The test currently indexes folders[0].Kind which panics if folders is empty; change the assertions in the test to first assert len(folders) == 1 (using a t.Fatalf message that only references len(folders)), and only after that check folders[0].Kind == localformat.KindLocalHelm (with a separate t.Fatalf message if the kind mismatches). Ensure you update the test that references the folders slice so the length guard executes before any indexing (look for the folders variable and localformat.KindLocalHelm usage in writer_test.go).

pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh (4)
249-253: ⚠️ Potential issue | 🟠 Major: Add `skyhook-customizations` to the preflight skip list.
The generated call site later checks `skip_preflight_for_release "skyhook-customizations"`, but this case statement can never return success for that release. That makes the advertised skip path unreachable and can block undeploy on CRs the bundle deletes before uninstall.
Suggested fix

```diff
 skip_preflight_for_release() {
   case "$1" in
-    nodewright-operator|kgateway) return 0 ;;
+    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
     *) return 1 ;;
   esac
 }
```
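As a sanity check, the extended skip list from the suggested fix can be run standalone; release names here mirror this fixture:

```shell
# Standalone copy of the fixed function for quick verification.
skip_preflight_for_release() {
  case "$1" in
    nodewright-operator|kgateway|skyhook-customizations) return 0 ;;
    *) return 1 ;;
  esac
}

skip_preflight_for_release "skyhook-customizations" && echo "skyhook-customizations: skipped"
skip_preflight_for_release "cert-manager" || echo "cert-manager: checked"
```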
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 249 - 253, The case in the skip_preflight_for_release function currently only returns success for "nodewright-operator" and "kgateway", but the code later calls skip_preflight_for_release "skyhook-customizations", so add "skyhook-customizations" to the case list so the function returns 0 for that release; update the case pattern in skip_preflight_for_release to include skyhook-customizations alongside nodewright-operator and kgateway so the preflight skip path becomes reachable.
572-581: ⚠️ Potential issue | 🟠 Major: Do not sweep stale webhooks cluster-wide.
This post-flight pass deletes any webhook whose service or namespace is missing anywhere in the cluster. That reaches outside the bundle and can remove another workload's admission webhooks. Reuse `delete_orphaned_webhooks_for_ns "skyhook"` here instead of the global scan.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 572 - 581, The script currently performs a cluster-wide scan deleting any webhook whose backing service/namespace is missing; replace that global sweep with a call to the existing helper delete_orphaned_webhooks_for_ns with the target namespace "skyhook" so only webhooks tied to this bundle are removed. Locate the block that runs kubectl/jq to enumerate webhooks and remove it, and instead invoke delete_orphaned_webhooks_for_ns "skyhook" (ensuring the function is defined/imported earlier in the script) so deletion is scoped to the skyhook namespace only.
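A rough sketch of the scoped alternative, assuming the webhook-to-namespace pairs have already been extracted. In the real script they would come from `kubectl`; here they are inlined so the filter itself is runnable:

```shell
# Hypothetical filter (not the bundle's helper): keep only webhooks whose
# backing service namespace is the bundle's own namespace, instead of
# sweeping the whole cluster.
webhooks_in_ns() {
  target="$1"
  while read -r hook ns; do
    [ "$ns" = "$target" ] && echo "$hook"
  done
}

printf '%s\n' \
  "skyhook-validating-webhook skyhook" \
  "tenant-a-webhook tenant-a" | webhooks_in_ns "skyhook"
# → skyhook-validating-webhook
```

Only the webhooks selected by the filter would then be deleted, so other tenants' admission webhooks are never touched.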
540-550: ⚠️ Potential issue | 🔴 Critical: Only force-finalize namespaces owned by this bundle.
After 60 seconds this loop strips finalizers from every `Terminating` namespace in the cluster, not just `skyhook`. On a shared cluster that can mutate unrelated tenants' teardown flows. Filter `TERMINATING` to the bundle namespace set before calling `force_clear_namespace_finalizers`.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 540 - 550, The loop that sets TERMINATING currently collects all Terminating namespaces cluster-wide and later force-clears finalizers for each, which can affect unrelated tenants; modify the selection so TERMINATING is filtered to only include the bundle's namespaces (e.g., the skyhook namespace set) before iterating. Specifically, update the logic that computes TERMINATING (used in the loop and in the call sites to force_clear_namespace_finalizers and kubectl delete) to intersect/grep against the bundle namespace list (the set you use elsewhere for this deployment), so force_clear_namespace_finalizers and kubectl delete are only invoked for those matching namespaces rather than every Terminating namespace. Ensure references to TERMINATING and force_clear_namespace_finalizers are used consistently after filtering.
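The intersection step the prompt describes can be sketched in pure shell; the namespace names and the `BUNDLE_NAMESPACES` variable are illustrative:

```shell
# Hypothetical guard: only namespaces in the bundle's own set survive the
# filter, so finalizer surgery never touches other tenants' namespaces.
BUNDLE_NAMESPACES="skyhook"

filter_to_bundle() {
  while read -r ns; do
    for owned in ${BUNDLE_NAMESPACES}; do
      [ "$ns" = "$owned" ] && echo "$ns"
    done
  done
}

printf '%s\n' "skyhook" "tenant-a" "tenant-b" | filter_to_bundle   # → skyhook
```

In the real script, the cluster's `Terminating` namespaces would be piped through this filter before `force_clear_namespace_finalizers` runs.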
85-93: ⚠️ Potential issue | 🟠 Major: Propagate the fallback `helm uninstall` failure.
If the `--no-hooks` retry also fails, the script keeps deleting cluster resources and still ends with `Undeployment complete.`, which hides a partial teardown. Return non-zero from `helm_force_uninstall` so the undeploy stops on an actual uninstall failure. This same generated block appears in the other new undeploy goldens as well.
Suggested fix

```diff
 helm_force_uninstall() {
   local release="$1"
   local ns="$2"
   if helm uninstall "${release}" -n "${ns}" --timeout "${HELM_TIMEOUT}s" --ignore-not-found 2>/dev/null; then
     return
   fi
   echo "  Retrying ${release} removal with --no-hooks..."
-  helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found || true
+  if ! helm uninstall "${release}" -n "${ns}" --no-hooks --timeout "${HELM_TIMEOUT}s" --ignore-not-found; then
+    echo "ERROR: failed to uninstall ${release} from ${ns}" >&2
+    return 1
+  fi
 }
```
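The behavioral difference is easy to demonstrate with a stubbed `helm` that always fails: with the fix, the caller sees the non-zero status instead of a silent success. `helm` here is a shell function standing in for the real binary, and the flags are trimmed relative to the generated script:

```shell
helm() { return 1; }   # stub: every helm invocation fails

helm_force_uninstall() {
  release="$1"; ns="$2"
  if helm uninstall "${release}" -n "${ns}" 2>/dev/null; then
    return 0
  fi
  # Fallback retry; its exit status must reach the caller instead of
  # being swallowed by "|| true".
  if ! helm uninstall "${release}" -n "${ns}" --no-hooks; then
    echo "ERROR: failed to uninstall ${release} from ${ns}" >&2
    return 1
  fi
}

helm_force_uninstall demo demo-ns 2>/dev/null || echo "failure propagated"
# → failure propagated
```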
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh` around lines 85 - 93, The helm_force_uninstall function currently swallows failures of the fallback uninstall (it calls helm uninstall ... --no-hooks ... || true) so undeploy proceeds even on real uninstall failure; change helm_force_uninstall to propagate the fallback command's exit status: run the second helm uninstall without "|| true", capture its exit code (or simply let it return), and return a non-zero value when that uninstall fails so callers see the failure (i.e., ensure helm_force_uninstall returns non-zero on failure and zero on success).

pkg/bundler/deployer/localformat/writer.go (1)
113-163: ⚠️ Potential issue | 🟠 Major: Reject bundles that exceed the three-digit folder contract.
The writer formats `%03d-...`, but once `idx` reaches 1000 it emits `1000-*` while the generated deploy/undeploy scripts only match `[0-9][0-9][0-9]-*/`. Mixed components consume two indices, so this can happen before 1000 declared components. Fail fast before formatting both the primary and injected `-post` folder names.
Suggested fix

```diff
 folders := make([]Folder, 0, len(opts.Components))
 idx := 1
 for _, c := range opts.Components {
+	if idx > 999 {
+		return nil, errors.New(
+			errors.ErrCodeInvalidRequest,
+			"too many bundle folders; max 999 supported by the current NNN-* layout",
+		)
+	}
 	if err := ctx.Err(); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 	}
@@
 	if manifests := opts.ComponentManifests[c.Name]; len(manifests) > 0 {
+		if idx > 999 {
+			return nil, errors.New(
+				errors.ErrCodeInvalidRequest,
+				"too many bundle folders; max 999 supported by the current NNN-* layout",
+			)
+		}
 		postName := c.Name + "-post"
 		postDir := fmt.Sprintf("%03d-%s", idx, postName)
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 113 - 163, The loop uses idx to format folder names with "%03d-%s" but never checks overflow, so when idx >= 1000 (or idx+1 for mixed components) the produced names break the three-digit contract; add a guard before creating primary and injected "-post" folders: if idx > 999 return a clear errors.New(errors.ErrCodeInvalidRequest, ...) about too many components, and for the mixed-case (KindUpstreamHelm branch) also check if idx+1 > 999 and fail before calling writeUpstreamHelmFolder or writeLocalHelmFolder; reference the loop’s idx, opts.Components/opts.ComponentManifests, the KindUpstreamHelm branch, and the two helper functions writeUpstreamHelmFolder and writeLocalHelmFolder when adding the checks and error returns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 121-129: The template's use of trimming (the {{- ... -}} markers)
around the gpu conditional removes the blank line between the preceding code
fence and the "### Verify GPU access" heading causing Markdown lint errors;
update the template by removing the trailing/leading hyphen(s) from the
conditional delimiters (e.g., change {{- if $hasGPU }} / {{- end }} to {{ if
$hasGPU }} / {{ end }}) or explicitly emit a blank line before the heading so
that the block guarded by $hasGPU (set in the range .Components logic) preserves
the separator in the generated README.
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh`:
- Around line 43-45: The --timeout option currently assigns any string to
HELM_TIMEOUT (in the --timeout case) and later uses it as "${HELM_TIMEOUT}s",
which corrupts values like "5m"; fix this by validating the value is an integer
before accepting it: in the --timeout case that sets HELM_TIMEOUT, check the
supplied value against an integer regex (e.g. digits-only) and emit an
error/exit if it fails, otherwise assign and shift; reference the --timeout
case, the HELM_TIMEOUT variable, and the later use of "${HELM_TIMEOUT}s" when
applying the validation.
In `@pkg/bundler/deployer/localformat/doc.go`:
- Around line 16-17: Update the package doc comment in
pkg/bundler/deployer/localformat/doc.go to separate current supported deployers
from planned ones: replace the single line that mixes both with two clear
statements such as one saying "Currently consumed by deployers: helm, helmfile."
and a separate sentence prefaced with "Planned/intent: for future use by argocd
and Flux" (or similar wording) so the godoc clearly distinguishes shipped
behavior from intended future support.
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 82-88: The Write function currently performs filesystem operations
before honoring context cancellation; check the provided ctx for cancellation
before mutating the output tree by calling ctx.Err() (or select on ctx.Done())
at the top of Write, and return an appropriate error when canceled instead of
proceeding to os.MkdirAll or pruneStaleFolders; ensure subsequent long-running
loops or calls like pruneStaleFolders also observe ctx (propagate ctx to
pruneStaleFolders or add checks there) so pre-canceled contexts cause an
immediate return from Write rather than creating directories or returning
success.
In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml`:
- Around line 64-68: The glob used to find the manifest uses a permissive prefix
"[0-9]*-gpu-operator-post" which can match non-NNN prefixes; change the glob to
require exactly three digits (e.g., "[0-9][0-9][0-9]-gpu-operator-post")
wherever it's used (the MANIFEST assignment and the similar pattern around lines
97-98) so the folder contract is enforced; update both occurrences to use the
three-digit glob.
In
`@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml`:
- Line 69: The inline resource substitution --resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml
2>/dev/null | head -1)" can yield an empty string and cause a non-actionable
failure; before the corresponding chainsaw assert invocation capture the ls
output into a variable (e.g. template_path=$(ls ... | head -1)), test that the
variable is non-empty (if [ -z "$template_path" ]; then echo "ERROR: tuning
template not found" >&2; exit 1; fi) and then pass --resource "$template_path"
to chainsaw assert; apply the same guard pattern for the other three assertion
steps that use similar --resource "$(ls ... | head -1)" calls (the ones around
the chainsaw assert usages).
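The guard described above, reduced to a reusable helper; the path and error message are placeholders for the chainsaw step, where the glob expands at the call site:

```shell
# Fail loudly when the ls lookup matches nothing, instead of handing an
# empty --resource value to chainsaw assert.
require_template() {
  path=$(ls "$1" 2>/dev/null | head -1)
  if [ -z "$path" ]; then
    echo "ERROR: tuning template not found" >&2
    return 1
  fi
  echo "$path"
}

require_template "/nonexistent/templates/tuning.yaml" 2>/dev/null || echo "guard tripped"
# → guard tripped
```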
In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml`:
- Around line 52-55: Both shell globs are too permissive; tighten them to
require exactly three leading digits. Replace the first pattern
basic/[0-9]*/values.yaml with basic/[0-9][0-9][0-9]*/values.yaml (or
basic/[0-9][0-9][0-9]-*/values.yaml if the directory includes a dash) and
replace the gpu-operator check basic/[0-9]*-gpu-operator with
basic/[0-9][0-9][0-9]-gpu-operator so the tests only match the deterministic
NNN-<component> layout.
---
Duplicate comments:
In `@docs/contributor/component.md`:
- Line 21: The sentence overstates current support by saying the
`pkg/bundler/deployer/localformat` layout is "consumed by every deployer" while
also mentioning a future `helmfile` support; update the text to separate current
behavior from future intent: explicitly state that `localformat.Write()` is
currently used by the listed deployers (e.g., `helm`, `argocd`, `argocd-helm`)
and then add a distinct, clearly marked sentence indicating that support for
`helmfile` is planned per issue `#632` (or similar future intent), ensuring you
keep `localformat` ownership and the fact that deployers add top-level
orchestration files (deploy.sh, helmfile.yaml, Application CRs) unchanged.
In `@docs/user/cli-reference.md`:
- Line 1152: The sentence "Manifest-only components (no upstream Helm chart,
just raw manifests) become a single local-helm wrapped chart." uses an
unhyphenated compound modifier; update that phrase to "local-helm-wrapped chart"
so the compound modifier is hyphenated correctly (search for the exact sentence
"Manifest-only components (no upstream Helm chart, just raw manifests) become a
single local-helm wrapped chart." and replace the modifier).
In `@pkg/bundler/bundler_test.go`:
- Around line 1199-1206: The test is redundantly checking for
cluster-values.yaml under directories already asserted absent; remove the nested
file existence checks and only assert the directory non-existence for each of
the dirs (e.g., "aws-ebs-csi-driver", "001-aws-ebs-csi-driver",
"002-aws-ebs-csi-driver") using os.Stat on filepath.Join(tmpDir, dir) in the
test in pkg/bundler/bundler_test.go so the loop that references tmpDir and dir
only contains the directory non-existence assertion and not the additional
filepath.Join(tmpDir, dir, "cluster-values.yaml") checks.
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 309-321: The current install loop in deploy.sh.tmpl iterates
whatever NNN-* directories exist under "${SCRIPT_DIR}", which can install stale
items from a reused output dir; change it to either (A) purge/clean the output
directory before writing generated folders so the for loop over
"${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ only sees fresh folders, or (B) stop globbing
the filesystem and instead read an explicit generated-folder manifest (e.g., a
file written at bundle render time like "${SCRIPT_DIR}/generated-manifest.txt")
and iterate that list to set dir/base/name/namespace (keep using the same
variable names dir, base, name, namespace and the awk extraction for namespace)
so only intended folders are installed. Ensure whichever approach you pick is
implemented where the loop currently starts (the for dir in ... block) and that
the manifest is produced by the renderer or the cleanup occurs before this loop
runs.
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 131-134: The README.md.tmpl currently unconditionally emits GPU
Operator and Network Operator links in the "References" section; update the
template (README.md.tmpl) to only include each operator link when the
corresponding component is present (e.g., use template conditionals like
has_gpu_operator / has_network_operator or the bundle/component membership
variables available in the templating context), or alternatively add a clear
header such as "Example operator references (may not apply to all bundles)" to
avoid implying they apply to manifest-only or single-operator bundles; adjust
the "References" block generation accordingly so links are gated by component
membership or clearly labeled as generic examples.
- Around line 20-27: The README template currently renders empty Version and
source fields for manifest-only/local-chart bundles; update the template logic
around the Components loop (referencing .Components, .Name, .Version,
.ChartName, .Repository, .Namespace) to print "N/A" for missing .Version and to
use a clear local source label such as "local" or "manifest-only" when
.ChartName/.Repository are blank, and also update the intro text to explicitly
state that mixed bundles may emit an extra synthetic folder named
"NNN+1-<name>-post/" so the numbering and component-to-folder mapping in the
generated README accurately reflect the actual layout.
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Line 41: The README note paragraph that begins with "**Note:** The deploy
script's final status..." and the later GPU-specific sentences (referencing
`--best-effort`, "GPU workloads", "fresh GPU nodes", "Nodewright node tuning",
"GPU operator operand rollout", "DRA kubelet plugin registration") must be
rewritten to either be kai-scheduler-specific or explicitly generic: remove or
replace GPU-stack terms with kai-scheduler component names or neutral wording
about asynchronous convergence of cluster components after the script exits,
retain mention of `--best-effort` behavior and the link to the AICR CLI
Reference but adjust its context to match kai-scheduler (or state it's a generic
deploy behavior link), and ensure the revised text clarifies that final script
exit does not guarantee full operational readiness and users should check
warnings and logs.
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh`:
- Around line 485-488: Remove or update the stale comment block that references
"nodewright node taints" and "runtimeRequiredTaint (defaults to
skyhook.nvidia.com)" in the undeploy.sh fixture so it accurately reflects
kai-scheduler's scope; specifically either delete the comment or replace it with
a short note that no node taint removal occurs in this kai-scheduler-only
fixture (search for the comment text "nodewright" or "runtimeRequiredTaint" to
find the exact spot).
- Around line 43-45: The --timeout parsing block assigns HELM_TIMEOUT from the
next arg but doesn't validate it, causing suffix concatenation like
"${HELM_TIMEOUT}s" to misinterpret values like "5m"; update the parsing case
that sets HELM_TIMEOUT to validate that the supplied value is a positive integer
(e.g., match against a numeric regex such as ^[0-9]+$) and if it fails print a
clear error and exit non‑zero; place this check immediately after
HELM_TIMEOUT="$2" in the same case handling so invalid values are rejected
before shift 2 and subsequent use.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh`:
- Around line 249-253: The case in the skip_preflight_for_release function
currently only returns success for "nodewright-operator" and "kgateway", but the
code later calls skip_preflight_for_release "skyhook-customizations", so add
"skyhook-customizations" to the case list so the function returns 0 for that
release; update the case pattern in skip_preflight_for_release to include
skyhook-customizations alongside nodewright-operator and kgateway so the
preflight skip path becomes reachable.
- Around line 572-581: The script currently performs a cluster-wide scan
deleting any webhook whose backing service/namespace is missing; replace that
global sweep with a call to the existing helper delete_orphaned_webhooks_for_ns
with the target namespace "skyhook" so only webhooks tied to this bundle are
removed. Locate the block that runs kubectl/jq to enumerate webhooks and remove
it, and instead invoke delete_orphaned_webhooks_for_ns "skyhook" (ensuring the
function is defined/imported earlier in the script) so deletion is scoped to the
skyhook namespace only.
- Around line 540-550: The loop that sets TERMINATING currently collects all
Terminating namespaces cluster-wide and later force-clears finalizers for each,
which can affect unrelated tenants; modify the selection so TERMINATING is
filtered to only include the bundle's namespaces (e.g., the skyhook namespace
set) before iterating. Specifically, update the logic that computes TERMINATING
(used in the loop and in the call sites to force_clear_namespace_finalizers and
kubectl delete) to intersect/grep against the bundle namespace list (the set you
use elsewhere for this deployment), so force_clear_namespace_finalizers and
kubectl delete are only invoked for those matching namespaces rather than every
Terminating namespace. Ensure references to TERMINATING and
force_clear_namespace_finalizers are used consistently after filtering.
- Around line 85-93: The helm_force_uninstall function currently swallows
failures of the fallback uninstall (it calls helm uninstall ... --no-hooks ...
|| true) so undeploy proceeds even on real uninstall failure; change
helm_force_uninstall to propagate the fallback command's exit status: run the
second helm uninstall without "|| true", capture its exit code (or simply let it
return), and return a non-zero value when that uninstall fails so callers see
the failure (i.e., ensure helm_force_uninstall returns non-zero on failure and
zero on success).
In
`@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml`:
- Around line 3-6: The Service manifest for "metadata.name: dcgm-exporter"
declares "kind: Service" but lacks a required "spec" block; add a proper spec
for a non-ExternalName Service including a selector that matches the exporter
pods (or appropriate labels) and at least one ports entry (e.g., port,
targetPort, protocol) so the rendered Service is valid and installable. Ensure
the manifest contains "spec:" with "selector:" referencing the pod labels and
"ports:" with the correct port definitions for dcgm-exporter.
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Line 41: The README note currently references broad GPU-stack convergence and
the AICR CLI entry about deploy script behavior but this fixture documents only
nodewright-operator; update the text around the deploy script note (line
referencing the deploy script) and the references at lines 106-107 to be
component-accurate by limiting scope to nodewright-operator behavior (e.g.,
mention Nodewright node tuning and any nodewright-specific async actions) or
explicitly label the statements as “generic GPU-stack behavior” if you intend to
keep the broader content; ensure references to the AICR CLI or GPU operator are
either removed or replaced with nodewright-specific docs/links and that
occurrences of “GPU operator”/“DRA kubelet plugin” are not asserted as part of
this fixture’s expected state unless you mark them as generic.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Line 41: The README contains a "Note:" paragraph that references GPU/node
convergence and network topics unrelated to the cert-manager-only fixture;
update that paragraph (the "**Note:**" block) and the other similar guidance
later in the file so they either (a) explicitly scope the content to GPU/network
deployments (e.g., prefix with "For GPU/network deployments:") or (b) reword to
generic deployment readiness guidance applicable to cert-manager-only fixtures,
and align the references to the AICR CLI docs accordingly (ensure the
cert-manager-only fixture README does not mention Nodewright/GPU operator/DRA
kubelet plugin unless scoped).
In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 310-312: Test panics if folders is empty because folders[0].Kind
is evaluated inside t.Fatalf; guard the access by first checking len(folders)
and handling the zero-length case separately (e.g., if len(folders) == 0 then
fail with a message that 0 folders were returned), otherwise proceed to assert
Kind on folders[0]; update the assertion around folders and t.Fatalf usage in
writer_test.go to avoid indexing when folders is empty.
- Around line 117-119: The test currently indexes folders[0].Kind which panics
if folders is empty; change the assertions in the test to first assert
len(folders) == 1 (using t.Fatalf/t.Fatalf message that only references
len(folders)), and only after that check folders[0].Kind ==
localformat.KindLocalHelm (with a separate t.Fatalf message if the kind
mismatches). Ensure you update the test that references the folders slice so the
length guard executes before any indexing (look for the folders variable and
localformat.KindLocalHelm usage in writer_test.go).
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 113-163: The loop uses idx to format folder names with "%03d-%s"
but never checks overflow, so when idx >= 1000 (or idx+1 for mixed components)
the produced names break the three-digit contract; add a guard before creating
primary and injected "-post" folders: if idx > 999 return a clear
errors.New(errors.ErrCodeInvalidRequest, ...) about too many components, and for
the mixed-case (KindUpstreamHelm branch) also check if idx+1 > 999 and fail
before calling writeUpstreamHelmFolder or writeLocalHelmFolder; reference the
loop’s idx, opts.Components/opts.ComponentManifests, the KindUpstreamHelm
branch, and the two helper functions writeUpstreamHelmFolder and
writeLocalHelmFolder when adding the checks and error returns.
In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The grep checks for manifest kinds are too loose and can
false-match (e.g., "kind: GatewayParameters" satisfies the "kind: Gateway"
check); update the two assertions that run grep -q 'kind: GatewayParameters'
"${MANIFEST}" and grep -q 'kind: Gateway' "${MANIFEST}" to anchor the pattern to
the full line (e.g., use start/end anchors or grep -x) so they only match exact
lines "kind: GatewayParameters" and "kind: Gateway" in the ${MANIFEST}.
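The false positive and the anchored fix, side by side on sample input:

```shell
# The loose pattern matches the superstring kind; the anchored pattern does not.
printf 'kind: GatewayParameters\n' | grep -q 'kind: Gateway' && echo "loose: false positive"
printf 'kind: GatewayParameters\n' | grep -q '^kind:[[:space:]]*Gateway$' || echo "anchored: no match"
printf 'kind: Gateway\n' | grep -q '^kind:[[:space:]]*Gateway$' && echo "anchored: exact match"
```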
In `@tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml`:
- Around line 99-100: The install script lookup uses a mismatched exclusion
pattern "-post/" which is inconsistent with the rest of the file; update the
grep pattern in the INSTALL assignment (the line that builds INSTALL) to exclude
"-gpu-operator-post" instead of "-post/" so the filter matches the project's
existing post-folder naming convention (update the grep -v argument used when
selecting from "${WORK}/helm-dynamic/[0-9]*-gpu-operator/install.sh").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7784bcaa-fe35-49c3-9754-e17d419df09b
📒 Files selected for processing (78)
- Makefile
- docs/contributor/component.md
- docs/user/cli-reference.md
- pkg/bundler/bundler_test.go
- pkg/bundler/deployer/helm/helm.go
- pkg/bundler/deployer/helm/helm_test.go
- pkg/bundler/deployer/helm/templates/README.md.tmpl
- pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
- pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
- pkg/bundler/deployer/helm/testdata/README.md
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/README.md
- pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml
- pkg/bundler/deployer/helm/testdata/nodewright_present/README.md
- pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
- pkg/bundler/deployer/localformat/doc.go
- pkg/bundler/deployer/localformat/folder.go
- pkg/bundler/deployer/localformat/kustomize.go
- pkg/bundler/deployer/localformat/local_helm.go
- pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
- pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
- pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
- pkg/bundler/deployer/localformat/testdata/README.md
- pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
- pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
- pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
- pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
- pkg/bundler/deployer/localformat/upstream_helm.go
- pkg/bundler/deployer/localformat/writer.go
- pkg/bundler/deployer/localformat/writer_test.go
- pkg/bundler/handler_test.go
- tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
- tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
811ecac to 24374b4 (Compare)
Actionable comments posted: 5
♻️ Duplicate comments (20)
tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml (1)
71-72: ⚠️ Potential issue | 🟠 Major

Anchor `kind` assertions to avoid false positives.

At Line 72, `grep -q 'kind: Gateway'` can still pass on `kind: GatewayParameters`, so this assertion can mask regressions.

Proposed fix

```diff
- grep -q 'kind: GatewayParameters' "${MANIFEST}"
- grep -q 'kind: Gateway' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*GatewayParameters$' "${MANIFEST}"
+ grep -q '^kind:[[:space:]]*Gateway$' "${MANIFEST}"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml` around lines 71 - 72, The grep asserting 'kind: Gateway' can be satisfied by 'kind: GatewayParameters' and thus give false positives; update the check that reads from MANIFEST to anchor the kind assertion so it matches only the exact line (e.g., match start/end and optional leading whitespace) for 'kind: Gateway' rather than a substring, leaving the existing 'kind: GatewayParameters' assertion intact; target the two grep assertions that reference MANIFEST and replace the loose substring match for 'kind: Gateway' with an anchored exact-line match.tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml (1)
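The anchoring concern above can be reproduced outside the chainsaw test. The real check is a shell `grep`; the following is only an illustrative Go sketch showing why an unanchored substring match accepts `kind: GatewayParameters` while a line-anchored pattern does not (function names here are hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// looseHasGateway mimics `grep -q 'kind: Gateway'`: a substring match only.
func looseHasGateway(manifest string) bool {
	return strings.Contains(manifest, "kind: Gateway")
}

// strictHasGateway anchors the match to a whole line, analogous to
// `grep -q '^kind:[[:space:]]*Gateway$'`.
func strictHasGateway(manifest string) bool {
	return regexp.MustCompile(`(?m)^kind:\s*Gateway$`).MatchString(manifest)
}

func main() {
	manifest := "kind: GatewayParameters\n" // no Gateway resource present
	fmt.Println(looseHasGateway(manifest))  // true: false positive
	fmt.Println(strictHasGateway(manifest)) // false: correctly rejected
}
```

The `(?m)` flag makes `^` and `$` match per line, which is the regexp analogue of grep's default line-oriented anchoring.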
52-55: ⚠️ Potential issue | 🟡 Minor

Tighten glob patterns to enforce the exact `NNN-` layout.

Line 52 and Line 54 still accept any digit-prefixed folder, so this test can pass even when the folder naming contract is violated.

🔧 Suggested patch

```diff
- ls "${WORK}"/basic/[0-9]*/values.yaml >/dev/null 2>&1
+ ls "${WORK}"/basic/[0-9][0-9][0-9]-*/values.yaml >/dev/null 2>&1
@@
- ls -d "${WORK}"/basic/[0-9]*-gpu-operator >/dev/null 2>&1 \
+ ls -d "${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator >/dev/null 2>&1 \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml` around lines 52 - 55, The glob patterns in the two shell checks (ls "${WORK}"/basic/[0-9]*/values.yaml and ls -d "${WORK}"/basic/[0-9]*-gpu-operator) are too permissive; tighten them to require exactly three digits for the NNN prefix by replacing [0-9]* with [0-9][0-9][0-9] (so use ls "${WORK}"/basic/[0-9][0-9][0-9]/values.yaml and ls -d "${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator).pkg/bundler/bundler_test.go (1)
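The difference between the loose and strict globs above is easy to demonstrate: Go's `path.Match` implements the same `[...]` and `*` pattern syntax as the shell globs used by `ls`. A minimal sketch (folder names chosen for illustration):

```go
package main

import (
	"fmt"
	"path"
)

// matches reports whether a folder name satisfies a shell-style glob.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	for _, n := range []string{"001-gpu-operator", "12-gpu-operator", "0001-gpu-operator"} {
		fmt.Printf("%-18s loose=%v strict=%v\n", n,
			matches("[0-9]*-gpu-operator", n),            // any digit prefix
			matches("[0-9][0-9][0-9]-gpu-operator", n))   // exactly three digits
	}
}
```

Only `001-gpu-operator` satisfies the strict pattern; the loose pattern accepts all three names, which is exactly the regression-masking behavior the review comment flags.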
1204-1206: 🧹 Nitpick | 🔵 Trivial

Remove redundant `cluster-values.yaml` checks under non-existent directories.

These checks are unreachable once the parent directory non-existence assertion passes, so they add noise without additional signal.

Suggested patch

```diff
 for _, dir := range []string{"aws-ebs-csi-driver", "001-aws-ebs-csi-driver", "002-aws-ebs-csi-driver"} {
 	if _, statErr := os.Stat(filepath.Join(tmpDir, dir)); !os.IsNotExist(statErr) {
 		t.Errorf("expected %s directory to NOT be created (component is disabled)", dir)
 	}
-	if _, statErr := os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")); !os.IsNotExist(statErr) {
-		t.Errorf("expected %s/cluster-values.yaml to NOT exist (component is disabled)", dir)
-	}
 }
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/bundler_test.go` around lines 1204 - 1206, The test contains a redundant unreachable check using os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")) after already asserting the parent directory does not exist; remove this extra cluster-values.yaml existence assertion (the os.Stat + os.IsNotExist branch) so the test only asserts the parent directory non-existence using tmpDir and dir, and does not perform the unreachable filepath.Join("cluster-values.yaml") check.tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml (1)
69-69: ⚠️ Potential issue | 🟡 Minor

Add explicit non-empty path guards before `chainsaw assert`.

Each inline `--resource "$(ls ... | head -1)"` can resolve to an empty string and produce a non-actionable failure. Capture into a variable, validate, then assert.

Suggested pattern (apply to all four assertions)

```diff
- chainsaw assert \
-   --resource "$(ls "${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)" \
+ TUNING_PATH=$(ls "${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)
+ [ -n "${TUNING_PATH}" ] || { echo "tuning.yaml not found in bundle-defaults" >&2; exit 1; }
+ chainsaw assert \
+   --resource "${TUNING_PATH}" \
    --file ./assert-tuning-defaults.yaml \
    --no-color --timeout 5s
```

Also applies to: 95-95, 121-121, 149-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml` at line 69, Replace each inline --resource "$(ls ... | head -1)" usage with a two-step guard: capture the command output into a variable, test that the variable is non-empty and points to an existing file, then pass that variable to the chainsaw assert invocation; specifically update occurrences of the pattern --resource "$(ls "${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)" so you first set something like resource_path=$(ls ... | head -1) (unique symbol: resource_path), validate [ -n "$resource_path" ] && [ -f "$resource_path" ] (or exit with a clear error) and only then call chainsaw assert --resource "$resource_path"; apply the same change to the three other occurrences noted.tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml (1)
67-67: ⚠️ Potential issue | 🟡 Minor

Enforce an exact `NNN-` prefix in both glob checks.

Using `[0-9]*-...` is too permissive and can mask regressions away from the numbered-folder contract. Please require exactly three digits in both places.

Suggested patch

```diff
- MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9]*-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | head -1)
+ MANIFEST=$(ls "${WORK}"/bundle-defaults/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | head -1)
@@
- MATCHES=$(ls "${WORK}"/bundle-dcgm-disabled/[0-9]*-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | wc -l)
+ MATCHES=$(ls "${WORK}"/bundle-dcgm-disabled/[0-9][0-9][0-9]-gpu-operator-post/templates/dcgm-exporter.yaml 2>/dev/null | wc -l)
```

Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml` at line 67, Update the loose glob used when assigning MANIFEST to require exactly three leading digits: replace occurrences of the pattern "[0-9]*-gpu-operator-post/..." with an exact-three-digit pattern "[0-9][0-9][0-9]-gpu-operator-post/..." (the line assigning MANIFEST and the other occurrence around line 97); ensure both glob checks use [0-9][0-9][0-9]- so only folders with a 3-digit prefix match.pkg/bundler/deployer/helm/templates/README.md.tmpl (1)
121-129: ⚠️ Potential issue | 🟡 Minor

Preserve the blank line before the GPU heading.

The trim markers around the `$hasGPU` conditional can collapse spacing after the code fence; remove trim on the `if`/`end` delimiters to keep markdown structure stable.

🧩 Proposed template fix

````diff
-{{- if $hasGPU }}
+{{ if $hasGPU }}

### Verify GPU access

```bash
kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | jq '.["nvidia.com/gpu"]'
```

-{{- end }}
+{{ end }}
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 121 - 129,
The template's trim markers on the $hasGPU conditional are collapsing the blank
line before the "### Verify GPU access" heading; update the Helm template by
removing the leading/trailing dash trim on the conditional delimiters so the
block uses {{ if $hasGPU }} ... {{ end }} (keeping the inner code fence and
content unchanged) so the markdown blank line and code fence are preserved;
locate the $hasGPU declaration and the conditional around the "### Verify GPU
access" heading and change {{- if $hasGPU }} / {{- end }} to {{ if $hasGPU }} /
{{ end }}.

docs/user/cli-reference.md (1)

1152-1152: ⚠️ Potential issue | 🟡 Minor

Hyphenate the compound modifier for clarity.

Use `local-helm-wrapped chart` instead of `local-helm wrapped chart`.

✏️ Proposed doc fix

```diff
-- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart.
+- Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm-wrapped chart.
```
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` at line 1152, Replace the unhyphenated phrase "local-helm wrapped chart" with the hyphenated compound modifier "local-helm-wrapped chart" in the sentence "Manifest-only components (no upstream Helm chart, just raw manifests) become a single local-helm wrapped chart." so the compound modifier is clear and grammatically correct; locate and update that exact phrase in docs/user/cli-reference.md.pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md (1)
109-112: ⚠️ Potential issue | 🟡 Minor

References include a component not present in this fixture.
Line 112 links Network Operator docs in a gpu-operator-only fixture. Please remove it or explicitly mark the references list as generic.
As per coding guidelines: “When writing documentation… Ensure accuracy… and document operational clarity.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md` around lines 109 - 112, The README references a component not present in this fixture: remove or mark the Network Operator link ("Network Operator Documentation" / https://docs.nvidia.com/networking/display/cokan10/network+operator) so the references reflect this GPU-operator-only fixture; either delete that list item or replace it with a generic note such as "Additional NVIDIA operator docs (not included in this fixture)" to make the references accurate and avoid implying the Network Operator is part of this testdata.docs/contributor/component.md (1)
21-21: ⚠️ Potential issue | 🟠 Major

Separate implemented behavior from roadmap language.
Line 21 still reads as if all deployers currently consume `localformat.Write()`. Please split this into: (1) deployers that use it today, and (2) future intent (e.g., helmfile) as prospective.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/contributor/component.md` at line 21, Split the sentence to clearly separate current implemented behavior from future intent: state that pkg/bundler/deployer/localformat produces the NNN-<component>/ layout and that current deployers (list the ones actually using it today, e.g., helm, argocd, argocd-helm if true) call localformat.Write(), then add a second clarifying sentence that mentions helmfile (and any other roadmap items) only as prospective future consumers; keep the reference to localformat.Write() and the per-folder contents (Chart.yaml, values.yaml, cluster-values.yaml, install.sh, templates/, upstream.env) intact while removing any language implying that all listed deployers already use localformat.Write().pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md (1)
104-107: ⚠️ Potential issue | 🟡 Minor

Reference links are not aligned with the kai-scheduler fixture scope.

Line 106–107 points to GPU/Network Operator docs in a `kai_scheduler_present` README. Please replace with kai-scheduler-relevant references or label these links as generic.

As per coding guidelines: "When writing documentation… Ensure accuracy… and document operational clarity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md` around lines 104 - 107, The References section in the kai_scheduler_present README currently lists GPU/Network Operator links that are out of scope for the kai-scheduler fixture; update the "References" under the "## References" header so it only contains kai-scheduler-relevant links or change those two entries into a clearly labeled "Generic operator references" note. Locate the README's "References" block (the lines containing the NVIDIA GPU Operator and Network Operator URLs) and either replace them with kai-scheduler documentation links or prepend a label like "Generic operator docs (not directly related to kai-scheduler):" to make the scope clear.pkg/bundler/deployer/localformat/doc.go (1)
15-16: ⚠️ Potential issue | 🟠 Major

Separate current consumers from future intent in the package doc.

This wording reads as if helmfile/argocd/Flux already consume `localformat`, but this PR only wires Helm to it. Please split this into "currently used by …" vs "intended for …" so the godoc does not overstate shipped support.

As per coding guidelines: "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/doc.go` around lines 15 - 16, Package doc comment in pkg/bundler/deployer/localformat/doc.go overstates consumers by implying helmfile/argocd/Flux already use localformat; update the package comment to clearly separate current consumers from intended future consumers. Edit the top-level godoc for package localformat to state which tools currently consume the format (e.g., Helm) and then add a distinct sentence like "intended for" or "planned for" listing helmfile, argocd, Flux as future targets, ensuring no language implies they already consume it; keep the package name localformat and any existing descriptive text otherwise unchanged.pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md (1)
104-107: ⚠️ Potential issue | 🟡 Minor

Replace the unrelated operator references.

This fixture only documents `cert-manager`, so linking GPU Operator and Network Operator docs here makes the README inaccurate. Point this section at cert-manager docs or generic bundle docs instead.

As per coding guidelines: "When writing documentation, act as a senior open-source documentation editor... Ensure accuracy... and document operational clarity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md` around lines 104 - 107, Update the References section in the README.md by removing the unrelated "GPU Operator Documentation" and "Network Operator Documentation" links and replacing them with authoritative cert-manager documentation (e.g., cert-manager installation and usage pages) or a generic bundle docs link; target the "## References" header and the two existing bullet link entries so the fixture only documents cert-manager and stays accurate.pkg/bundler/deployer/helm/testdata/manifest_only/README.md (1)
104-107: ⚠️ Potential issue | 🟡 Minor

Align the references with `skyhook-customizations`.

These links are still GPU/network-operator specific, but this README documents only skyhook-customizations. Replace them with skyhook-relevant docs or generic bundle documentation so the fixture stays accurate.

As per coding guidelines: "When writing documentation, act as a senior open-source documentation editor... Ensure accuracy... and document operational clarity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md` around lines 104 - 107, The References section currently under the "## References" heading contains GPU/operator-specific links; update that section in pkg/bundler/deployer/helm/testdata/manifest_only/README.md to reference skyhook-relevant resources (e.g., the skyhook-customizations documentation or a generic bundle/operator overview) instead of the NVIDIA GPU Operator and Network Operator links; specifically replace the two links currently pointing to "GPU Operator Documentation" and "Network Operator Documentation" with appropriate skyhook-customizations doc URLs or canonical bundle docs and adjust link labels accordingly so the fixture accurately documents skyhook-customizations.pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)
309-321: ⚠️ Potential issue | 🟠 Major

Don't drive installs from whatever `NNN-*` folders happen to be on disk.

This loop will run every numbered directory under the bundle root. If regeneration reuses an existing `--output` path and leaves stale folders behind, removed or renumbered components are installed again. Prefer rendering an explicit generated-folder manifest and iterating that, or clear old numbered folders before writing.

🔎 Verification script

```bash
#!/bin/bash
set -euo pipefail

echo "=== localformat.Write call sites ==="
rg -n -C3 'localformat\.Write' pkg --type go
echo
echo "=== output-dir cleanup around bundle generation ==="
rg -n -C3 'RemoveAll|os\.RemoveAll|ReadDir|MkdirAll|NNN-|[0-9]\[0-9]\[0-9\]-' pkg/bundler --type go
echo
echo "=== filesystem-driven folder enumeration in templates ==="
rg -n -C2 '\[0-9\]\[0-9\]\[0-9\]-\*/|NNN-\*' pkg/bundler/deployer/helm/templates
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 309 - 321, The current install loop in deploy.sh.tmpl (the for dir in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; ... block that reads namespace from "${dir}/install.sh") is unsafe because it installs whatever numbered folders happen to exist on disk; change it to iterate only over an explicit generated manifest or clean the output before generation: either (A) have the bundler emit a deterministic list file (e.g. GENERATED_MANIFEST or "${HELM_WORKDIR}/components.list") and replace the glob loop with reading that list and using each listed path to read install.sh and extract --namespace, or (B) ensure the bundler removes/cleans all existing "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ folders before writing new output so the existing glob cannot pick up stale dirs; update the code that writes the numbered folders (bundle generation) or add a pre-install cleanup step in deploy.sh.tmpl to call out/RemoveAll on those dirs accordingly so only intended components are installed.pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (1)
463-466: ⚠️ Potential issue | 🟠 Major

Include injected `-post` releases in the CRD/finalizer safety scans.

The uninstall loop now treats `<name>-post` as a real Helm release, but the pre-flight `check_release_for_stuck_crds`, orphan-CRD cleanup, and post-flight orphaned-CRD scan still render only from `.ComponentsReversed`. Any CRDs or finalizer-bearing CRs owned by the wrapper release bypass those safeguards.
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 463 - 466, The pre/post-flight CRD and finalizer safety scans currently iterate only .ComponentsReversed and therefore miss injected "-post" releases; update the scans (check_release_for_stuck_crds, the orphan-CRD cleanup loop, and the post-flight orphaned-CRD scan) to also consider the injected "<name>-post" releases by either: (a) when looping .ComponentsReversed, treat both the component.Name and component.Name+"-post" (using the same component.Namespace mapping set earlier), or (b) build a combined list that appends "{{ .Name }}-post" for any component that has injected mixed-component post folders and iterate that combined list for all CRD/finalizer checks; ensure you reference the same namespace resolution logic used in the existing injected "-post" if-block so those releases are included in check_release_for_stuck_crds and the orphan-CRD scans.pkg/bundler/deployer/localformat/writer.go (2)
82-88: ⚠️ Potential issue | 🟠 Major

Check context cancellation before filesystem I/O.

The function performs `os.MkdirAll` and `pruneStaleFolders` before the first `ctx.Err()` check at line 116. A pre-canceled context will still create/prune directories before returning an error, and with zero components it returns success instead of a timeout error.

🐛 Proposed fix: add context check at function entry

```diff
 func Write(ctx context.Context, opts Options) ([]Folder, error) {
+	if err := ctx.Err(); err != nil {
+		return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
+	}
 	if err := os.MkdirAll(opts.OutputDir, 0o755); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeInternal, "create output dir", err)
 	}
```

As per coding guidelines: "Always check `ctx.Done()` in long-running operations and loops to respect context cancellation".
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 82 - 88, The Write function performs filesystem operations (os.MkdirAll and pruneStaleFolders) before honoring context cancellation; add an early check for ctx cancellation at the start of Write (use ctx.Err() or select on ctx.Done()) and return an appropriate wrapped cancellation error instead of proceeding with os.MkdirAll/pruneStaleFolders so that a pre-canceled context does not create/prune directories or return success.
113-134: ⚠️ Potential issue | 🟠 Major

Fail fast when folder index exceeds 999.

The folder name format `%03d-` produces 4+ digit prefixes once `idx >= 1000`, but `deploy.sh`/`undeploy.sh` only match `[0-9][0-9][0-9]-*/`. Components beyond 999 would be silently skipped during deployment.

🐛 Proposed fix: add a cap check before formatting

```diff
 folders := make([]Folder, 0, len(opts.Components))
 idx := 1
 for _, c := range opts.Components {
+	if idx > 999 {
+		return nil, errors.New(errors.ErrCodeInvalidRequest,
+			"too many components; max 999 supported by NNN-* folder layout")
+	}
 	if err := ctx.Err(); err != nil {
 		return nil, errors.Wrap(errors.ErrCodeTimeout, "context cancelled", err)
 	}
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer.go` around lines 113 - 134, The loop that builds directory names with dir := fmt.Sprintf("%03d-%s", idx, c.Name) can produce 4+ digit prefixes when idx >= 1000 which break deploy.sh pattern matching; add a guard before formatting (e.g., if idx >= 1000) to return an appropriate error (use the same errors.New signature and ErrCodeInvalidRequest used elsewhere) so components beyond 999 fail fast; update the loop around idx, fmt.Sprintf and classify(...) to perform this cap check and return early with a clear message like "too many components; maximum supported is 999".pkg/bundler/deployer/localformat/writer_test.go (2)
117-119: ⚠️ Potential issue | 🟡 Minor

Guard against panic when formatting error message with empty `folders`.

The condition `len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm` uses short-circuit evaluation, but the `t.Fatalf` message still formats `folders[0].Kind` unconditionally. If `Write` returns zero folders, this panics while trying to report the failure.

🐛 Proposed fix: separate length check from kind check

```diff
-	if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm {
-		t.Fatalf("want 1 local-helm folder, got %d folders kind=%v", len(folders), folders[0].Kind)
+	if len(folders) != 1 {
+		t.Fatalf("want 1 folder, got %d", len(folders))
+	}
+	if folders[0].Kind != localformat.KindLocalHelm {
+		t.Fatalf("want local-helm folder, got kind=%v", folders[0].Kind)
 	}
```
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 117 - 119, The current failure message can panic because it always formats folders[0].Kind even when folders is empty; change the checks in the test to first assert len(folders) == 1 and call t.Fatalf with only len(folders) if that fails, then separately check folders[0].Kind != localformat.KindLocalHelm and call t.Fatalf including folders[0].Kind only in that branch—look for the test using the local variable folders and the comparison to localformat.KindLocalHelm and split the length check from the kind check to avoid indexing an empty slice.
310-312:⚠️ Potential issue | 🟡 MinorSame panic risk as above — separate length check from kind check.
If
Writereturns zero folders,folders[0].Kindin the error message will panic.🐛 Proposed fix
- if len(folders) != 1 || folders[0].Kind != localformat.KindLocalHelm { - t.Fatalf("want 1 local-helm folder (kustomize wrapped), got %d folders kind=%v", len(folders), folders[0].Kind) + if len(folders) != 1 { + t.Fatalf("want 1 folder (kustomize wrapped), got %d", len(folders)) + } + if folders[0].Kind != localformat.KindLocalHelm { + t.Fatalf("want local-helm folder (kustomize wrapped), got kind=%v", folders[0].Kind) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/localformat/writer_test.go` around lines 310 - 312, The test currently accesses folders[0].Kind inside the same conditional as the length check, which can panic when len(folders) == 0; change the logic to first assert len(folders) == 1 (and call t.Fatalf or t.Fatalf with a message that does not index folders) and only after that check folders[0].Kind against localformat.KindLocalHelm, using separate t.Fatalf calls for the length vs kind failures; refer to the variables/symbols folders, localformat.KindLocalHelm and the t.Fatalf usages in writer_test.go to locate and update the checks.pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh (1)
485-487: ⚠️ Potential issue | 🟡 Minor

Drop the unconditional nodewright cleanup banner from non-nodewright outputs.
This fixture never emits the corresponding taint-removal commands, so the generated script claims cleanup it does not perform. Please gate this comment block on the same condition as the actual nodewright taint-removal section in
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh` around lines 485 - 487, The fixture's undeploy.sh testdata emits a banner about removing nodewright taints even when no taint-removal commands are generated; update the test fixture to gate that comment block behind the same condition used in the template (the nodewright taint-removal section) so the banner only appears when nodewright taints are actually removed—match the condition used in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (the runtimeRequiredTaint/nodewright presence check) and ensure the comment block in pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh is wrapped with that same conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 102-103: The markdown has no blank line between a fenced code
block and the following heading "### Verify GPU access", which triggers
MD031/MD022; add a single blank line after the closing triple-backtick fence so
there is an empty line before the "### Verify GPU access" heading to satisfy the
linter.
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/README.md`:
- Around line 106-107: The README.md in the nodewright_present test fixture
currently includes links titled "GPU Operator Documentation" and "Network
Operator Documentation" that are not specific to nodewright-operator; update the
fixture's README.md so those two entries either (a) are replaced with
nodewright-operator–relevant references (e.g., official nodewright operator
docs, repo, API reference) or (b) are clearly annotated as generic/example
operator docs (prefix with "Generic example:" or similar) to avoid mis-scoping;
ensure the change is applied within the nodewright_present README.md where those
link texts appear.
In `@pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh`:
- Around line 489-499: The grep that sets custom_taint_line is too permissive
and matches commented examples; update the extraction so it only matches actual
YAML keys by anchoring to line start and allowing leading whitespace (but not
'#'). Replace the grep call used to set custom_taint_line (the one reading
"${nodewright_dir}/values.yaml") with a regex like
'^[[:space:]]*runtimeRequiredTaint:' (e.g. grep -E -m1
'^[[:space:]]*runtimeRequiredTaint:' "${nodewright_dir}/values.yaml") so only
real key lines are captured; leave subsequent taint_value and
NODEWRIGHT_TAINT_KEY processing unchanged.
In `@pkg/bundler/deployer/localformat/kustomize.go`:
- Around line 33-47: The current buildKustomize checks ctx.Err() only before
calling k.Run, so cancellation during the long-running k.Run is ignored; change
buildKustomize to run k.Run(fs, path) in a goroutine and send its result (rm,
err) over a channel, then select on ctx.Done() and that channel so you can
return an errors.ErrCodeTimeout wrapped error if the context is cancelled while
k.Run is still executing; after receiving the k.Run result handle the existing
error wrapping for "kustomize build failed" and the rm.AsYaml() flow (rm and out
variables) as before.
In `@tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml`:
- Around line 78-79: The glob in the assignment to VAL is too permissive (it
uses [0-9]*); tighten it to the fixed three-digit layout by matching exactly
three digits and the trailing dash (e.g., use [0-9][0-9][0-9]-) and remove the
now-unnecessary grep -v filter; update the command that sets VAL (the ls
"${WORK}"/bundle/.../values.yaml invocation) to use the stricter pattern so only
NNN-gpu-operator folders are matched.
---
Duplicate comments:
In `@docs/contributor/component.md`:
- Line 21: Split the sentence to clearly separate current implemented behavior
from future intent: state that pkg/bundler/deployer/localformat produces the
NNN-<component>/ layout and that current deployers (list the ones actually using
it today, e.g., helm, argocd, argocd-helm if true) call localformat.Write(),
then add a second clarifying sentence that mentions helmfile (and any other
roadmap items) only as prospective future consumers; keep the reference to
localformat.Write() and the per-folder contents (Chart.yaml, values.yaml,
cluster-values.yaml, install.sh, templates/, upstream.env) intact while removing
any language implying that all listed deployers already use localformat.Write().
In `@docs/user/cli-reference.md`:
- Line 1152: Replace the unhyphenated phrase "local-helm wrapped chart" with the
hyphenated compound modifier "local-helm-wrapped chart" in the sentence
"Manifest-only components (no upstream Helm chart, just raw manifests) become a
single local-helm wrapped chart." so the compound modifier is clear and
grammatically correct; locate and update that exact phrase in
docs/user/cli-reference.md.
In `@pkg/bundler/bundler_test.go`:
- Around line 1204-1206: The test contains a redundant unreachable check using
os.Stat(filepath.Join(tmpDir, dir, "cluster-values.yaml")) after already
asserting the parent directory does not exist; remove this extra
cluster-values.yaml existence assertion (the os.Stat + os.IsNotExist branch) so
the test only asserts the parent directory non-existence using tmpDir and dir,
and does not perform the unreachable filepath.Join("cluster-values.yaml") check.
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 309-321: The current install loop in deploy.sh.tmpl (the for dir
in "${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/; ... block that reads namespace from
"${dir}/install.sh") is unsafe because it installs whatever numbered folders
happen to exist on disk; change it to iterate only over an explicit generated
manifest or clean the output before generation: either (A) have the bundler emit
a deterministic list file (e.g. GENERATED_MANIFEST or
"${HELM_WORKDIR}/components.list") and replace the glob loop with reading that
list and using each listed path to read install.sh and extract --namespace, or
(B) ensure the bundler removes/cleans all existing
"${SCRIPT_DIR}"/[0-9][0-9][0-9]-*/ folders before writing new output so the
existing glob cannot pick up stale dirs; update the code that writes the
numbered folders (bundle generation) or add a pre-install cleanup step in
deploy.sh.tmpl to call out/RemoveAll on those dirs accordingly so only intended
components are installed.
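Option (A), a list-driven loop, can be sketched as below. The layout and `components.list` name are assumptions from the comment, not the shipped template:

```shell
# Hypothetical bundle layout: a stale numbered folder exists on disk but is
# absent from the generated manifest, so it must not be installed.
work=$(mktemp -d)
for d in 001-cert-manager 002-gpu-operator 999-stale-leftover; do
  mkdir -p "${work}/${d}"
  printf '#!/usr/bin/env sh\necho installing %s\n' "${d}" > "${work}/${d}/install.sh"
  chmod +x "${work}/${d}/install.sh"
done
# The bundler would emit this list deterministically at generation time.
printf '001-cert-manager\n002-gpu-operator\n' > "${work}/components.list"

installed=""
while IFS= read -r dir; do
  [ -x "${work}/${dir}/install.sh" ] || { echo "missing ${dir}" >&2; exit 1; }
  "${work}/${dir}/install.sh"
  installed="${installed} ${dir}"
done < "${work}/components.list"
echo "installed:${installed}"
rm -rf "${work}"
```

The stale `999-stale-leftover` folder is skipped because the loop never consults the filesystem for component discovery.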
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 121-129: The template's trim markers on the $hasGPU conditional
are collapsing the blank line before the "### Verify GPU access" heading; update
the Helm template by removing the leading/trailing dash trim on the conditional
delimiters so the block uses {{ if $hasGPU }} ... {{ end }} (keeping the inner
code fence and content unchanged) so the markdown blank line and code fence are
preserved; locate the $hasGPU declaration and the conditional around the "###
Verify GPU access" heading and change {{- if $hasGPU }} / {{- end }} to {{ if
$hasGPU }} / {{ end }}.
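The whitespace-trimming behavior behind this fix is easy to demonstrate with `text/template`; this is a minimal standalone sketch, not the actual README template:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render executes a template against {HasGPU: true} and returns the output.
func render(tmpl string) string {
	t := template.Must(template.New("readme").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, map[string]bool{"HasGPU": true}); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// {{- ... }} eats the preceding newline, collapsing the blank line
	// that markdown needs before a heading.
	trimmed := render("intro\n{{- if .HasGPU }}\n### Verify GPU access\n{{- end }}\n")
	// {{ ... }} leaves surrounding whitespace intact.
	plain := render("intro\n{{ if .HasGPU }}\n### Verify GPU access\n{{ end }}\n")
	fmt.Printf("%q\n%q\n", trimmed, plain)
}
```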
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 463-466: The pre/post-flight CRD and finalizer safety scans
currently iterate only .ComponentsReversed and therefore miss injected "-post"
releases; update the scans (check_release_for_stuck_crds, the orphan-CRD cleanup
loop, and the post-flight orphaned-CRD scan) to also consider the injected
"<name>-post" releases by either: (a) when looping .ComponentsReversed, treat
both the component.Name and component.Name+"-post" (using the same
component.Namespace mapping set earlier), or (b) build a combined list that
appends "{{ .Name }}-post" for any component that has injected mixed-component
post folders and iterate that combined list for all CRD/finalizer checks; ensure
you reference the same namespace resolution logic used in the existing injected
"-post" if-block so those releases are included in check_release_for_stuck_crds
and the orphan-CRD scans.
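Option (b), a combined release list, can be sketched in plain shell. The component names and the `mixed` membership test are illustrative only:

```shell
# Build one combined release list (components plus injected "<name>-post"
# wrappers) and run every safety scan over that list instead of the
# recipe components alone.
components="gpu-operator cert-manager"
mixed="gpu-operator"   # components that also ship raw manifests

releases=""
for name in ${components}; do
  releases="${releases} ${name}"
  # Append the injected -post wrapper for mixed components.
  case " ${mixed} " in *" ${name} "*) releases="${releases} ${name}-post" ;; esac
done
echo "scanning releases:${releases}"
```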
In `@pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md`:
- Around line 104-107: The References section in the kai_scheduler_present
README currently lists GPU/Network Operator links that are out of scope for the
kai-scheduler fixture; update the "References" under the "## References" header
so it only contains kai-scheduler-relevant links or change those two entries
into a clearly labeled "Generic operator references" note. Locate the README's
"References" block (the lines containing the NVIDIA GPU Operator and Network
Operator URLs) and either replace them with kai-scheduler documentation links or
prepend a label like "Generic operator docs (not directly related to
kai-scheduler):" to make the scope clear.
In `@pkg/bundler/deployer/helm/testdata/manifest_only/README.md`:
- Around line 104-107: The References section currently under the "##
References" heading contains GPU/operator-specific links; update that section in
pkg/bundler/deployer/helm/testdata/manifest_only/README.md to reference
skyhook-relevant resources (e.g., the skyhook-customizations documentation or a
generic bundle/operator overview) instead of the NVIDIA GPU Operator and Network
Operator links; specifically replace the two links currently pointing to "GPU
Operator Documentation" and "Network Operator Documentation" with appropriate
skyhook-customizations doc URLs or canonical bundle docs and adjust link labels
accordingly so the fixture accurately documents skyhook-customizations.
In `@pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md`:
- Around line 109-112: The README references a component not present in this
fixture: remove or mark the Network Operator link ("Network Operator
Documentation" /
https://docs.nvidia.com/networking/display/cokan10/network+operator) so the
references reflect this GPU-operator-only fixture; either delete that list item
or replace it with a generic note such as "Additional NVIDIA operator docs (not
included in this fixture)" to make the references accurate and avoid implying
the Network Operator is part of this testdata.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md`:
- Around line 104-107: Update the References section in the README.md by
removing the unrelated "GPU Operator Documentation" and "Network Operator
Documentation" links and replacing them with authoritative cert-manager
documentation (e.g., cert-manager installation and usage pages) or a generic
bundle docs link; target the "## References" header and the two existing bullet
link entries so the fixture only documents cert-manager and stays accurate.
In `@pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh`:
- Around line 485-487: The fixture's undeploy.sh testdata emits a banner about
removing nodewright taints even when no taint-removal commands are generated;
update the test fixture to gate that comment block behind the same condition
used in the template (the nodewright taint-removal section) so the banner only
appears when nodewright taints are actually removed—match the condition used in
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl (the
runtimeRequiredTaint/nodewright presence check) and ensure the comment block in
pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh is wrapped
with that same conditional.
In `@pkg/bundler/deployer/localformat/doc.go`:
- Around line 15-16: Package doc comment in
pkg/bundler/deployer/localformat/doc.go overstates consumers by implying
helmfile/argocd/Flux already use localformat; update the package comment to
clearly separate current consumers from intended future consumers. Edit the
top-level godoc for package localformat to state which tools currently consume
the format (e.g., Helm) and then add a distinct sentence like "intended for" or
"planned for" listing helmfile, argocd, Flux as future targets, ensuring no
language implies they already consume it; keep the package name localformat and
any existing descriptive text otherwise unchanged.
In `@pkg/bundler/deployer/localformat/writer_test.go`:
- Around line 117-119: The current failure message can panic because it always
formats folders[0].Kind even when folders is empty; change the checks in the
test to first assert len(folders) == 1 and call t.Fatalf with only len(folders)
if that fails, then separately check folders[0].Kind !=
localformat.KindLocalHelm and call t.Fatalf including folders[0].Kind only in
that branch—look for the test using the local variable folders and the
comparison to localformat.KindLocalHelm and split the length check from the kind
check to avoid indexing an empty slice.
- Around line 310-312: The test currently accesses folders[0].Kind inside the
same conditional as the length check, which can panic when len(folders) == 0;
change the logic to first assert len(folders) == 1 (and call t.Fatalf or
t.Fatalf with a message that does not index folders) and only after that check
folders[0].Kind against localformat.KindLocalHelm, using separate t.Fatalf calls
for the length vs kind failures; refer to the variables/symbols folders,
localformat.KindLocalHelm and the t.Fatalf usages in writer_test.go to locate
and update the checks.
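The split the comments ask for can be sketched as a plain helper; `Folder` and `KindLocalHelm` here are hypothetical stand-ins for localformat's types:

```go
package main

import "fmt"

// Folder mirrors (hypothetically) localformat.Folder for this sketch.
type Folder struct{ Kind string }

const KindLocalHelm = "local-helm"

// checkSingleLocalHelm separates the length assertion from the kind
// assertion, so an empty slice produces a clean failure instead of an
// out-of-range panic on folders[0].
func checkSingleLocalHelm(folders []Folder) error {
	if len(folders) != 1 {
		return fmt.Errorf("expected 1 folder, got %d", len(folders))
	}
	if folders[0].Kind != KindLocalHelm {
		return fmt.Errorf("expected kind %q, got %q", KindLocalHelm, folders[0].Kind)
	}
	return nil
}

func main() {
	fmt.Println(checkSingleLocalHelm(nil))                          // length failure, no panic
	fmt.Println(checkSingleLocalHelm([]Folder{{Kind: "upstream"}})) // kind failure
	fmt.Println(checkSingleLocalHelm([]Folder{{Kind: KindLocalHelm}}))
}
```

In the real test the two branches would be two `t.Fatalf` calls with the same ordering.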
In `@pkg/bundler/deployer/localformat/writer.go`:
- Around line 82-88: The Write function performs filesystem operations
(os.MkdirAll and pruneStaleFolders) before honoring context cancellation; add an
early check for ctx cancellation at the start of Write (use ctx.Err() or select
on ctx.Done()) and return an appropriate wrapped cancellation error instead of
proceeding with os.MkdirAll/pruneStaleFolders so that a pre-canceled context
does not create/prune directories or return success.
- Around line 113-134: The loop that builds directory names with dir :=
fmt.Sprintf("%03d-%s", idx, c.Name) can produce 4+ digit prefixes when idx >=
1000 which break deploy.sh pattern matching; add a guard before formatting
(e.g., if idx >= 1000) to return an appropriate error (use the same errors.New
signature and ErrCodeInvalidRequest used elsewhere) so components beyond 999
fail fast; update the loop around idx, fmt.Sprintf and classify(...) to perform
this cap check and return early with a clear message like "too many components;
maximum supported is 999".
In `@tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml`:
- Line 67: Update the loose glob used when assigning MANIFEST to require exactly
three leading digits: replace occurrences of the pattern
"[0-9]*-gpu-operator-post/..." with an exact-three-digit pattern
"[0-9][0-9][0-9]-gpu-operator-post/..." (the line assigning MANIFEST and the
other occurrence around line 97); ensure both glob checks use [0-9][0-9][0-9]-
so only folders with a 3-digit prefix match.
In `@tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml`:
- Around line 71-72: The grep asserting 'kind: Gateway' can be satisfied by
'kind: GatewayParameters' and thus give false positives; update the check that
reads from MANIFEST to anchor the kind assertion so it matches only the exact
line (e.g., match start/end and optional leading whitespace) for 'kind: Gateway'
rather than a substring, leaving the existing 'kind: GatewayParameters'
assertion intact; target the two grep assertions that reference MANIFEST and
replace the loose substring match for 'kind: Gateway' with an anchored
exact-line match.
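The false-positive the comment flags, and the anchored fix, can be reproduced against a tiny manifest:

```shell
# A substring match for 'kind: Gateway' also matches GatewayParameters;
# anchoring the whole line matches only the exact kind.
manifest=$(mktemp)
cat > "${manifest}" <<'EOF'
kind: GatewayParameters
---
kind: Gateway
EOF

loose=$(grep -c 'kind: Gateway' "${manifest}")                     # counts both lines
strict=$(grep -c -E '^[[:space:]]*kind: Gateway$' "${manifest}")   # exact line only
echo "loose=${loose} strict=${strict}"
rm -f "${manifest}"
```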
In
`@tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml`:
- Line 69: Replace each inline --resource "$(ls ... | head -1)" usage with a
two-step guard: capture the command output into a variable, test that the
variable is non-empty and points to an existing file, then pass that variable to
the chainsaw assert invocation; specifically update occurrences of the pattern
--resource "$(ls
"${WORK}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml
2>/dev/null | head -1)" so you first set something like resource_path=$(ls ... |
head -1) (unique symbol: resource_path), validate [ -n "$resource_path" ] && [
-f "$resource_path" ] (or exit with a clear error) and only then call chainsaw
assert --resource "$resource_path"; apply the same change to the three other
occurrences noted.
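The two-step guard can be sketched as below; the directory layout is a minimal stand-in for the chainsaw workspace:

```shell
# Capture the glob result first, validate it, then use it, instead of
# inlining an ls that silently expands to an empty string.
work=$(mktemp -d)
mkdir -p "${work}/bundle-defaults/001-nodewright-customizations/templates"
echo 'kind: Tuning' > "${work}/bundle-defaults/001-nodewright-customizations/templates/tuning.yaml"

resource_path=$(ls "${work}"/bundle-defaults/[0-9]*-nodewright-customizations/templates/tuning.yaml 2>/dev/null | head -1)
if [ -n "${resource_path}" ] && [ -f "${resource_path}" ]; then
  # Only here would the real test call: chainsaw assert --resource "${resource_path}"
  echo "would assert against: ${resource_path}"
else
  echo "tuning.yaml not found" >&2
  exit 1
fi
rm -rf "${work}"
```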
In `@tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml`:
- Around line 52-55: The glob patterns in the two shell checks (ls
"${WORK}"/basic/[0-9]*/values.yaml and ls -d
"${WORK}"/basic/[0-9]*-gpu-operator) are too permissive; tighten them to require
exactly three digits for the NNN prefix by replacing [0-9]* with [0-9][0-9][0-9]
(so use ls "${WORK}"/basic/[0-9][0-9][0-9]/values.yaml and ls -d
"${WORK}"/basic/[0-9][0-9][0-9]-gpu-operator).
📒 Files selected for processing (78)

- Makefile
- docs/contributor/component.md
- docs/user/cli-reference.md
- pkg/bundler/bundler_test.go
- pkg/bundler/deployer/helm/helm.go
- pkg/bundler/deployer/helm/helm_test.go
- pkg/bundler/deployer/helm/templates/README.md.tmpl
- pkg/bundler/deployer/helm/templates/deploy.sh.tmpl
- pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
- pkg/bundler/deployer/helm/testdata/README.md
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/install.sh
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/upstream.env
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/001-kai-scheduler/values.yaml
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/README.md
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.sh
- pkg/bundler/deployer/helm/testdata/kai_scheduler_present/undeploy.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/Chart.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/install.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/templates/customization.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/001-skyhook-customizations/values.yaml
- pkg/bundler/deployer/helm/testdata/manifest_only/README.md
- pkg/bundler/deployer/helm/testdata/manifest_only/deploy.sh
- pkg/bundler/deployer/helm/testdata/manifest_only/undeploy.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/install.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/upstream.env
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/001-gpu-operator/values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/Chart.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/install.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/templates/dcgm-exporter.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/002-gpu-operator-post/values.yaml
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/README.md
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.sh
- pkg/bundler/deployer/helm/testdata/mixed_gpu_operator/undeploy.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/install.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/upstream.env
- pkg/bundler/deployer/helm/testdata/nodewright_present/001-nodewright-operator/values.yaml
- pkg/bundler/deployer/helm/testdata/nodewright_present/README.md
- pkg/bundler/deployer/helm/testdata/nodewright_present/deploy.sh
- pkg/bundler/deployer/helm/testdata/nodewright_present/undeploy.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/cluster-values.yaml
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/install.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/upstream.env
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/001-cert-manager/values.yaml
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/README.md
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
- pkg/bundler/deployer/helm/testdata/upstream_helm_only/undeploy.sh
- pkg/bundler/deployer/localformat/doc.go
- pkg/bundler/deployer/localformat/folder.go
- pkg/bundler/deployer/localformat/kustomize.go
- pkg/bundler/deployer/localformat/local_helm.go
- pkg/bundler/deployer/localformat/templates/chart.yaml.tmpl
- pkg/bundler/deployer/localformat/templates/install-local-helm.sh.tmpl
- pkg/bundler/deployer/localformat/templates/install-upstream-helm.sh.tmpl
- pkg/bundler/deployer/localformat/testdata/README.md
- pkg/bundler/deployer/localformat/testdata/kustomize_input/cm.yaml
- pkg/bundler/deployer/localformat/testdata/kustomize_input/kustomization.yaml
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/Chart.yaml
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/install.sh
- pkg/bundler/deployer/localformat/testdata/local_helm_manifest_only/001-skyhook-customizations/templates/customization.yaml
- pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/install.sh
- pkg/bundler/deployer/localformat/testdata/upstream_helm_only/001-nfd/upstream.env
- pkg/bundler/deployer/localformat/upstream_helm.go
- pkg/bundler/deployer/localformat/writer.go
- pkg/bundler/deployer/localformat/writer_test.go
- pkg/bundler/handler_test.go
- tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/gpu-operator/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/kgateway/chainsaw-test.yaml
- tests/chainsaw/bundle-templates/nodewright-customizations/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-dynamic/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-scheduling/chainsaw-test.yaml
- tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml
- tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
Force-pushed from 24374b4 to 42ab5e1.
yuanchen8911
left a comment
Re-checked current head 42ab5e1. Several earlier inline issues are addressed, but the five blockers from my latest changes-requested summary are still unresolved.
- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
- Raw manifests for mixed components now apply **post-install only**, via the injected `-post` wrapped chart. The earlier pre-apply mechanism with a CRD-race retry wrapper is gone — Helm now owns CRD ordering for mixed components natively.
- Tooling that parsed bundle paths by bare component name must account for the `NNN-` prefix.
This path-format change still has an in-tree consumer using the old layout: kwok/scripts/validate-scheduling.sh:384 still points at ${WORK_DIR}/bundle/kube-prometheus-stack/values.yaml. Under the new layout it should glob ${WORK_DIR}/bundle/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml; otherwise the KWOK storage patch is silently skipped.
Fixed in the latest push. kwok/scripts/validate-scheduling.sh:381-394 now globs the NNN-prefixed folder:
prom_values=$(ls -1 "${WORK_DIR}/bundle"/[0-9][0-9][0-9]-kube-prometheus-stack/values.yaml 2>/dev/null | head -1)
if [[ -n "$prom_values" && -f "$prom_values" ]] && yq eval ...This restores the KWOK storage patch under the new layout.
// Generate creates a per-component Helm bundle from the configured generator fields.
// Per-component folder content (Chart.yaml, values.yaml, install.sh, templates/*)
// is delegated to pkg/bundler/deployer/localformat. The helm deployer owns only
This code comment is updated, but the public godoc still describes the legacy Helm layout. Please update pkg/bundler/handler.go, pkg/bundler/doc.go, and pkg/bundler/deployer/helm/doc.go, which still mention <component>/values.yaml, <component>/README.md, and <component>/manifests/.
Fixed in the latest push. Updated all three godoc files to describe the NNN-prefixed local-chart layout:
- `pkg/bundler/handler.go:58-66` — API zip-archive description now lists `NNN-<component>/install.sh`, `values.yaml`, `cluster-values.yaml`, plus `undeploy.sh`.
- `pkg/bundler/doc.go:55-65` — Helm output format now lists the per-folder layout (`install.sh`, `values.yaml`, `cluster-values.yaml`, `upstream.env`, or `Chart.yaml + templates/`).
- `pkg/bundler/deployer/helm/doc.go:15-30` — package-level godoc now distinguishes per-folder content (owned by `localformat`) from top-level files (owned by this deployer).
Previous releases used a flat `<component>/` layout with `manifests/` siblings and a `--deployer helm` script that branched on component kind. The new format is uniform:

- All folders carry a rendered `install.sh`. The top-level `deploy.sh` is a generic loop with no per-component branching — name-matched special-case blocks (nodewright-operator taint cleanup, kai-scheduler async timeout, orphan-CRD scan, DRA kubelet-plugin restart) live around the loop, not inside it.
This new layout section correctly says folders carry install.sh, but the later deployment note still says each component subdirectory contains a README.md with the manual Helm command. That is no longer true in the NNN-* layout; please update the note around line 1317 to point at per-folder install.sh instead.
Fixed in the latest push. docs/user/cli-reference.md:1300-1317 now describes the NNN-prefixed layout and points at per-folder install.sh:
# Navigate to bundle
cd bundles
# Review root README and a component's values
cat README.md
cat 001-gpu-operator/values.yaml
...

Note: `deploy.sh` and `undeploy.sh` are convenience scripts — not the only deployment path. Each `NNN-<component>/` folder contains a rendered `install.sh` that runs the exact `helm upgrade --install` command for manual or pipeline-driven deployment.
// Generate creates a per-component Helm bundle from the configured generator fields.
// Per-component folder content (Chart.yaml, values.yaml, install.sh, templates/*)
// is delegated to pkg/bundler/deployer/localformat. The helm deployer owns only
// the top-level orchestration: README.md, deploy.sh, undeploy.sh, and checksums.
With the helm deployer now owning only top-level files and no longer embedding/rendering component-README.md.tmpl, pkg/bundler/deployer/helm/templates/component-README.md.tmpl appears orphaned. Please delete it or restore a real use.
Fixed in the latest push. pkg/bundler/deployer/helm/templates/component-README.md.tmpl had no remaining call sites and is now removed. Per-folder content (including any user-facing per-component documentation) is owned by pkg/bundler/deployer/localformat; the helm deployer's templates dir now only carries top-level orchestration templates (README.md.tmpl, deploy.sh.tmpl, undeploy.sh.tmpl).
fi
export COMPONENT_WAIT_ARGS

# Invoke the per-folder install.sh with retry. On failure, dump kai-scheduler
This loop now implements retry inline, but the old retry() helper is still defined earlier in this template and has no call sites. Please remove the dead helper to avoid future drift.
Fixed in the latest push. The retry() helper at the top of deploy.sh.tmpl had no remaining call sites (the per-component install loop at line ~340 implements its own retry inline) and is now removed. backoff_seconds() is kept since the inline loop still uses it. Updated the corresponding TestGenerate_DeployScriptExecutable assertion (helm_test.go:179-193) to drop "retry()" from the expected structural markers.
Force-pushed from f4d774c to 8c40e99.
yuanchen8911
left a comment
All five blockers resolved on 8c40e99 — KWOK glob, godoc layout in 3 files, cli-reference note, orphan template deletion, and dead retry() removal. LGTM.
Introduce pkg/bundler/deployer/localformat — a deployer-agnostic writer
that produces the NNN-<component>/ numbered folder layout. Each folder
is one of two kinds: upstream-helm (no Chart.yaml; upstream.env carries
{CHART,REPO,VERSION}) or local-helm (Chart.yaml + templates/). Mixed
components (Helm + raw manifests) emit two adjacent folders: a primary
upstream-helm + an injected -post local-helm wrapper. Each folder
carries a rendered install.sh; deploy.sh is now a generic loop with
no per-component branching, only name-matched special-case blocks.
Rewire --deployer helm to consume localformat. Generator.Generate
delegates per-component content (values.yaml, cluster-values.yaml,
Chart.yaml, templates/, install.sh) to localformat.Write and owns only
top-level orchestration files (README.md, deploy.sh, undeploy.sh,
checksums). Deletes ~195 lines of dead code: generateComponentDirectories,
buildComponentDataList (full version), writeClusterValuesFile,
hasYAMLObjects (relocated to localformat). Deletes ComponentData fields
HasChart, HasManifests, IsKustomize.
Rewrite deploy.sh.tmpl: generic install loop replaces ~90 lines of
{{ if .IsKustomize/.HasChart/.HasManifests }} branching. Delete the
~50-line CRD-race retry block — its trigger (pre-apply raw manifests)
is structurally eliminated; mixed components now apply manifests
post-install via the injected -post wrapped chart, where Helm owns
CRD ordering natively. Rewrite undeploy.sh.tmpl to a generic reverse
loop. Preserve name-matched blocks (skyhook taint cleanup, kai-scheduler
async timeout/diagnostics, DRA kubelet-plugin restart, orphaned-CRD
scan) around the loops; their simplification is deferred.
Refactor helm_test.go: introduce assertBundleGolden helper (mirroring
localformat's writer_test) + 5 golden-file scenarios under testdata/
exercising upstream-helm-only, manifest-only, mixed gpu-operator,
kai-scheduler-present, skyhook-present bundles. Goldens double as
reference examples of rendered output. Delete 11 redundant tests
covering removed code or duplicating golden coverage; add focused
error-path tests. Net: 48 → 37 tests, 3408 → 2130 lines (−37%),
coverage 87.6% → 87.9%.
Update bundler_test.go, handler_test.go, and 6 chainsaw e2e fixtures
to assert on NNN-prefixed paths.
Breaking changes for --deployer helm consumers:
- Bundle directory layout: <component>/ → NNN-<component>/
- Mixed components emit two folders (primary + -post)
- Raw manifests for mixed components apply post-install only
- CRD-race retry block removed; pre-apply mechanism gone
- No more per-component README.md (top-level README still present)
Docs: docs/user/cli-reference.md updated with new layout, folder kinds,
mixed-component -post injection, and breaking-change callout.
docs/contributor/component.md describes localformat as the shared
base format and the load-bearing invariants (localformat never writes
deployer-specific files; install.sh never name-customized; Write is
deterministic).
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>
Force-pushed from 215930a to 61ac420.
The registry entry for `gke-nccl-tcpxo` declared its namespace under a
top-level `manifest:` block:
- name: gke-nccl-tcpxo
...
manifest:
defaultNamespace: kube-system
`manifest:` is **not** a parsed field on the registry's `ComponentConfig`
struct — only `helm:` and `kustomize:` are recognized — so
`manifest.defaultNamespace` was silently ignored. The
established manifest-only Helm-wrapper pattern (used today by
`nodewright-customizations`) is to declare the component as `helm:`
with an empty `defaultRepository`:
helm:
defaultRepository: ""
defaultNamespace: kube-system
Bug surfacing timeline:
- Pre-NVIDIA#706, manifest-only components were installed by the root
`deploy.sh` via raw `kubectl apply -f .../manifests/`. Those manifests
carry inline `metadata.namespace: kube-system`, so the empty registry
default was harmless; `kubectl apply` did not need
`ComponentRef.Namespace` for routing.
- NVIDIA#706 (`feat(bundler)!: uniform NNN-folder bundle layout via
localformat`) wraps every component — manifest-only included — as a
local Helm chart. The generated `install.sh` now always emits
`helm upgrade --install <name> ./ --namespace <ns> --create-namespace`,
which requires `ComponentRef.Namespace`. With the unparsed `manifest:`
block, that field is empty, producing:
helm upgrade --install gke-nccl-tcpxo ./ \
--namespace --create-namespace \
Shell argument collapsing makes Helm parse the literal
`--create-namespace` as the namespace name and fails with:
Error: create: failed to create:
namespaces "--create-namespace" not found
- The first KWOK GPU run after NVIDIA#706 was cancelled, and earlier runs
used the pre-NVIDIA#706 deployer path where the empty namespace was inert.
PR NVIDIA#715 is one of the first post-NVIDIA#706 runs to actually complete the
H100 GKE-COS training jobs (its registry/base.yaml changes
auto-promote the GKE-COS Tier-2 KWOK matrix), and it surfaced the
failure.
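The empty-namespace collapse described in the timeline can be reproduced in isolation. This is a minimal model of the shell behavior, not the generated install.sh; the release name is illustrative:

```shell
# An unquoted empty variable expands to zero words, so the word that should
# have been the --namespace value simply disappears and the next flag
# (--create-namespace) slides into its place.
ns=""
set -- helm upgrade --install demo ./ --namespace $ns --create-namespace
echo "argv: $*"
echo "argc: $#"
```

The argument vector ends up with 7 words instead of 8, with `--create-namespace` immediately following `--namespace`, which is exactly what Helm then misparses as the namespace name.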
Fix: switch `gke-nccl-tcpxo` to the existing manifest-only Helm
pattern, matching `nodewright-customizations`. Verified locally:
$ aicr recipe --service gke --accelerator h100 \
--intent training --os cos -o /tmp/recipe.yaml
$ aicr bundle -r /tmp/recipe.yaml -o /tmp/bundle
$ grep "helm upgrade" /tmp/bundle/*-gke-nccl-tcpxo/install.sh
helm upgrade --install gke-nccl-tcpxo ./ \
--namespace kube-system --create-namespace \
Refs: NVIDIA#706, NVIDIA#715
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required.
- aws-ebs-csi-driver 2.55.0 -> 2.59.0
- cert-manager v1.17.2 -> v1.20.2
- kube-prometheus-stack 82.8.0 -> 84.4.0
- kueue 0.17.0 -> 0.17.1
- nodewright-operator v0.14.0 -> v0.15.1
- nvsentinel v1.1.0 -> v1.3.0
Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus the KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup, including a real security-posture change (the chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred from the NVIDIA/ org to the kai-scheduler/ org and chart publishing moved with it. The new OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump — coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — the chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes.
Companion changes:
- examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).
- examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it.
- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry entry to use the established manifest-only Helm pattern (empty `helm.defaultRepository` plus `defaultNamespace: kube-system`) instead of the unparsed `manifest:` block. The `manifest:` field is not on the ComponentConfig struct, so its `defaultNamespace` was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only components were installed via raw `kubectl apply`, which routed via inline `metadata.namespace`). After NVIDIA#706 wraps every component as a local Helm chart, the generated install.sh emits `--namespace --create-namespace` (empty) and Helm fails. This blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including this PR's CI, which auto-promotes the GKE-COS Tier-2 matrix when registry.yaml or base.yaml change. Switches to the same pattern used by `nodewright-customizations`. Verified the bundled install.sh now contains `--namespace kube-system`.
Supersedes NVIDIA#718.
Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
The assert-bundle-topology-nfd chainsaw step hardcoded
${WORK}/bundle/nfd/values.yaml, which doesn't exist after the
NNN-folder bundle layout introduced by NVIDIA#706. CI fails with
"sed: can't read .../bundle/nfd/values.yaml".
Resolve the path via the same glob the sibling assert-bundle-scheduling-nfd
step uses (`bundle/[0-9][0-9][0-9]-nfd/values.yaml`).
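The glob-based resolution can be demonstrated standalone. The tree below is a fabricated stand-in for the chainsaw `${WORK}` directory, purely for illustration:

```shell
# Build a fake bundle tree, then resolve values.yaml without hardcoding the
# NNN prefix, using the same character-class glob as the chainsaw step.
WORK=$(mktemp -d)
mkdir -p "$WORK/bundle/004-nfd"
: > "$WORK/bundle/004-nfd/values.yaml"
values_file=$(ls "$WORK"/bundle/[0-9][0-9][0-9]-nfd/values.yaml | head -n 1)
echo "$values_file"
```

The glob matches whatever index the component was assigned, so the step keeps working even when `-post` injection shifts the numbering.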
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
Introduce `pkg/bundler/deployer/localformat`, a deployer-agnostic writer producing a uniform `NNN-<component>/` bundle layout, and rewire `--deployer helm` to consume it. `deploy.sh` becomes a generic, branch-free install loop with name-matched special-case blocks around it.
Motivation / Context
Today's `--deployer helm` `deploy.sh` branches every component on three kinds (Helm / Kustomize / raw manifests) with distinct install mechanisms. The complexity makes the script fragile, hard to share with the planned helmfile deployer (#632) or a future Flux deployer, and conflates classification with execution. A uniform on-disk layout lets every deployer consume one shape and eliminates component-type branching from the install path.
This is scope A of #516 — pure format refactor. Vendoring upstream Helm chart bytes lands separately.
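A minimal sketch of the branch-free driver loop such a layout enables; the directory names and the install.sh bodies below are fabricated for illustration, not the generated scripts:

```shell
# Create a fake bundle with two numbered folders, each shipping its own
# install.sh, then drive them in lexical (= numeric) order with no
# per-component branching in the driver.
bundle=$(mktemp -d)
for d in 001-cert-manager 002-gpu-operator; do
  mkdir -p "$bundle/$d"
  printf '#!/bin/sh\necho "installing %s"\n' "$d" > "$bundle/$d/install.sh"
  chmod +x "$bundle/$d/install.sh"
done
order=""
for dir in "$bundle"/[0-9][0-9][0-9]-*/; do
  order="$order$(cd "$dir" && ./install.sh) "
done
order="${order% }"
echo "$order"
```

Because folder names sort lexically by their numeric prefix, a plain glob already yields install order; the kind-specific logic lives inside each folder's install.sh.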
Fixes: #662
Related: #516, #632
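To make "one shape for every deployer" concrete, here is a hedged sketch of how a consumer could classify a folder by `Chart.yaml` presence. The file names follow the layout described in this PR; the `CHART`/`REPO`/`VERSION` values are placeholders, not real pins:

```shell
# Classify a bundle folder: Chart.yaml present means a local chart to install
# as ./; absent means upstream.env drives a helm --repo install.
dir=$(mktemp -d)
printf 'CHART=gpu-operator\nREPO=https://example.invalid/charts\nVERSION=v0.0.0\n' \
  > "$dir/upstream.env"
if [ -f "$dir/Chart.yaml" ]; then
  kind="local-helm"        # Chart.yaml + templates/: install ./ as a local chart
else
  kind="upstream-helm"     # no Chart.yaml: source the env file for chart coordinates
  . "$dir/upstream.env"
fi
echo "$kind $CHART $VERSION"
```

Any deployer (helm, helmfile, Flux) can reuse this two-way dispatch because the distinction is carried entirely by the on-disk shape.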
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`
- `tests/chainsaw/**`
Implementation Notes
New package: `pkg/bundler/deployer/localformat`
Writes per-folder content (`Chart.yaml`, `values.yaml`, `cluster-values.yaml`, `install.sh`, `templates/`, `upstream.env`) for every deployer. Two folder kinds, distinguished by `Chart.yaml` presence:
- upstream-helm: no `Chart.yaml`; `upstream.env` carries `{CHART, REPO, VERSION}`; `install.sh` installs the upstream chart via `helm upgrade --install ... --repo ${REPO}`.
- local-helm: `Chart.yaml` + `templates/` present; `install.sh` installs `./` as a local chart.
Mixed components (Helm chart + raw manifests, e.g., gpu-operator with dcgm-exporter) emit two adjacent folders: a primary upstream-helm
`NNN-<name>/` plus an injected `(NNN+1)-<name>-post/` local-helm wrapper for the manifests. The "mixed" concept lives only at the bundle layer — recipe types don't know `-post` exists. Subsequent components shift by 1.
Kustomize-typed components run `kustomize build` at bundle time (via the already-vendored `sigs.k8s.io/kustomize/api`), producing a single `templates/manifest.yaml` inside a local-helm folder.
Helm deployer rewire
`Generator.Generate` delegates per-component content to `localformat.Write` and owns only top-level orchestration files (`README.md`, `deploy.sh`, `undeploy.sh`, checksums). Deletes ~195 lines of dead code: `generateComponentDirectories`, `buildComponentDataList` (full version), `writeClusterValuesFile`. `ComponentData` shrinks: `HasChart`, `HasManifests`, `IsKustomize` removed.
The `deploy.sh.tmpl` install loop replaces ~90 lines of `{{ if .IsKustomize/.HasChart/.HasManifests }}` branching with a generic loop that delegates to each folder's rendered `install.sh`. The ~50-line CRD-race retry block is removed entirely — its trigger (pre-apply raw manifests) is structurally eliminated; mixed components now apply manifests post-install via the injected `-post` wrapped chart, where Helm owns CRD ordering natively.
`undeploy.sh.tmpl` similarly replaces three-way kind branching with a generic reverse loop. Name-matched blocks around the loops (skyhook taint cleanup, kai-scheduler async timeout/diagnostics, DRA kubelet-plugin restart, orphaned-CRD scan) are preserved; their migration into per-component install.sh hooks is deferred to a follow-up.
Hard-rule invariants documented in `pkg/bundler/deployer/localformat/doc.go`:
- `localformat` never writes deployer-specific files (deploy.sh, helmfile.yaml, Application CRs, etc. stay with their respective deployers).
- `install.sh` is never name-customized — rendered from exactly two templates parameterized only by data. Name-keyed quirks stay in `deploy.sh`.
- `Write` is deterministic and idempotent.
Test reorganization
`pkg/bundler/deployer/localformat/writer_test.go` — 9 tests covering upstream-helm, local-helm-manifest-only, mixed `-post` injection, ordering, kustomize, kustomize+manifests rejection, determinism, path containment, ctx cancellation. 75.4% package coverage.
`pkg/bundler/deployer/helm/helm_test.go` rewritten to use a golden-file harness (`assertBundleGolden`) covering 5 representative bundle scenarios; goldens under `testdata/{upstream_helm_only,manifest_only,mixed_gpu_operator,kai_scheduler_present,skyhook_present}/` double as reference examples. 11 redundant tests deleted; net: 48 → 37 tests, 3408 → 2130 lines (−37%), coverage 87.6% → 87.9%.
`pkg/bundler/bundler_test.go`, `pkg/bundler/handler_test.go`, and 6 chainsaw e2e fixtures updated for the new `NNN-<component>/` paths.
Testing
`make test-coverage` — total 72.7% (threshold 70%); `localformat` 75.4%; combined deployer family 79.7%.
`make lint` — go vet, golangci-lint, license headers, AGENTS.md sync, docs sidebar all clean.
`make e2e` — 17/17 chainsaw tests pass (5 skipped, attestation-related, unchanged).
`make scan` — no new vulnerabilities introduced; 3 pre-existing minor findings (postcss/pygments/uuid in node/python deps).
`make license-check` — pass.
(`README.md.tmpl` no longer hardcodes `001-gpu-operator`; chainsaw `grep -v` patterns dropped trailing slashes.)
Rollout notes:
Breaking changes for
--deployer helmconsumers (called out indocs/user/cli-reference.md):<component>/→NNN-<component>/. Tooling that parses bundle paths by bare component name must account for theNNN-prefix.-post).-postwrapped chart. The earlier pre-apply mechanism with CRD-race retry is gone — Helm now owns CRD ordering natively.README.mdfiles no longer generated; the top-level bundleREADME.mdremains.--deployer argocdand--deployer argocd-helmoutputs are unchanged in this PR.No data migration; users regenerate bundles on every run anyway.
Checklist
- `make test` (with `-race`)
- `make lint`
- `git commit -S`