feat: move to latest NVSentinel #309
Merged
mchmarny
previously approved these changes
Mar 9, 2026
yuanchen8911
added a commit
to yuanchen8911/aicr
that referenced
this pull request
Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required.

aws-ebs-csi-driver     2.55.0  -> 2.59.0
cert-manager           v1.17.2 -> v1.20.2
kube-prometheus-stack  82.8.0  -> 84.4.0
kueue                  0.17.0  -> 0.17.1
nodewright-operator    v0.14.0 -> v0.15.1
nvsentinel             v1.1.0  -> v1.3.0

Excluded from this PR:

- kgateway / kgateway-crds (v2.0.0 -> v2.2.3): v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework; deferred.
- aws-efa (v0.5.3 -> v0.5.26): 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1): KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump: coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0): chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).
- examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it. Verified locally.
- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md: resync with the v0.26.2 content already declared in vendor/modules.txt. The prior dep-update commit on main (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't refresh those two doc files, so `go mod vendor` in CI produces a diff against the committed vendor and the `tests/Test` gate fails. Running `go mod vendor` here picks up the consistent v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
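The orphaned `manifestFiles:` reference described in the commit message above is the kind of breakage a small pre-bundle check could surface early. What follows is only a minimal sketch under stated assumptions: the function name `missing_manifest_files`, the `present.yaml` file, and the idea of passing the references as a plain list are all hypothetical, and none of this is AICR's bundler code. It simply reports which referenced paths are absent under a given root.

```python
from pathlib import Path
import tempfile


def missing_manifest_files(repo_root, manifest_files):
    """Return the referenced relative paths that are not files under repo_root.

    Hypothetical helper: the real recipe format and loader are not shown
    in the source, so the references are passed in as a plain list.
    """
    root = Path(repo_root)
    return [rel for rel in manifest_files if not (root / rel).is_file()]


# Demo against a throwaway directory tree (layout is illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "components/nvsentinel/manifests/present.yaml"
    present.parent.mkdir(parents=True)
    present.write_text("kind: ConfigMap\n")

    refs = [
        "components/nvsentinel/manifests/present.yaml",
        # The stale reference named in the commit message:
        "components/nvsentinel/manifests/allow-intra-namespace.yaml",
    ]
    missing = missing_manifest_files(tmp, refs)
    print(missing)
```

Run over every example recipe, a check along these lines would presumably have flagged the stale reference before bundling failed with "file does not exist".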
yuanchen8911
added a commit
to yuanchen8911/aicr
that referenced
this pull request
Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required.

aws-ebs-csi-driver     2.55.0  -> 2.59.0
cert-manager           v1.17.2 -> v1.20.2
kube-prometheus-stack  82.8.0  -> 84.4.0
kueue                  0.17.0  -> 0.17.1
nodewright-operator    v0.14.0 -> v0.15.1
nvsentinel             v1.1.0  -> v1.3.0

Excluded from this PR:

- kgateway / kgateway-crds (v2.0.0 -> v2.2.3): v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework; deferred.
- aws-efa (v0.5.3 -> v0.5.26): 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1): KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump: coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0): chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).
- examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it.
- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry entry to use the established manifest-only Helm pattern (empty `helm.defaultRepository` plus `defaultNamespace: kube-system`) instead of the unparsed `manifest:` block. The `manifest:` field is not on the ComponentConfig struct, so its `defaultNamespace` was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only components were installed via raw `kubectl apply`, which routed via inline `metadata.namespace`). After NVIDIA#706 wraps every component as a local Helm chart, the generated install.sh emits `--namespace --create-namespace` (empty) and Helm fails. This blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when registry.yaml or base.yaml change. Switches to the same pattern used by `nodewright-customizations`. Verified bundled install.sh now contains `--namespace kube-system`. Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
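The gke-nccl-tcpxo failure above comes down to an empty namespace being interpolated into a Helm command line, which yields `--namespace --create-namespace`. As a rough illustration only, the sketch below shows how a generator could refuse to emit such a command. The function name `helm_install_args`, the chart path, and the use of `helm upgrade --install` are assumptions; the source does not show what the generated install.sh actually runs.

```python
def helm_install_args(release, chart_dir, namespace):
    """Build a Helm argument list, rejecting an empty namespace up front.

    Hypothetical helper: the command shape is an assumption, chosen only
    to show where an empty value would otherwise slip through.
    """
    if not namespace or not namespace.strip():
        # Without this guard the list would contain "--namespace", ""
        # and the rendered shell line would read
        # "--namespace --create-namespace", as described in the commit.
        raise ValueError(f"component {release!r} has no namespace configured")
    return [
        "helm", "upgrade", "--install", release, chart_dir,
        "--namespace", namespace, "--create-namespace",
    ]


# Illustrative call; the chart directory is a made-up path.
args = helm_install_args("gke-nccl-tcpxo", "./charts/gke-nccl-tcpxo", "kube-system")
print(" ".join(args))
```

Failing at generation time, rather than when Helm parses the flags, would turn a silently ignored registry field into an immediate and attributable error.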
yuanchen8911
added a commit
to yuanchen8911/aicr
that referenced
this pull request
Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch bumps across registry defaults and overlay/mixin pins. No values schema changes required.

aws-ebs-csi-driver     2.55.0  -> 2.59.0
cert-manager           v1.17.2 -> v1.20.2
kube-prometheus-stack  82.8.0  -> 84.4.0
kueue                  0.17.0  -> 0.17.1
nodewright-operator    v0.14.0 -> v0.15.1
nvsentinel             v1.1.0  -> v1.3.0

Excluded from this PR:

- kgateway / kgateway-crds (v2.0.0 -> v2.2.3): v2.2.3 silently drops the `inferenceExtension.enabled` value (no longer in the chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml (ClusterRole granting access to inference.networking.x-k8s.io inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env; v2.2.3 renders neither. AICR uses kgateway specifically for the CNCF AI Conformance "Advanced Ingress for AI/ML Inference" requirement, so a silent feature regression here would break inference bundles. Migration to v2.2.3 needs a values + RBAC rework; deferred.
- aws-efa (v0.5.3 -> v0.5.26): 23 minors require values cleanup including a real security-posture change (chart now defaults to privileged: true for EFA hardware access, conflicting with our hardened allowPrivilegeEscalation: false override). Deferred to a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1): KAI-Scheduler was transferred from NVIDIA/ to kai-scheduler/ org and chart publishing moved with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler` (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0). This is an OCI-source migration plus a bump: coupled changes worth their own follow-up PR rather than mixing into pure pin bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0): chart bump is coupled to a Go change in validators/performance/trainer_lifecycle.go (the hardcoded fallback archive URL needs to track the chart pin). The validator + chart bumps belong together in a follow-up PR to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-ubuntu-training-with-validation}.yaml: refresh the cert-manager, nodewright-operator, kube-prometheus-stack, and nvsentinel pins to match the bumped registry defaults. Matches the convention from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).
- examples/recipes/aks-training.yaml: also remove an orphaned `manifestFiles:` reference to components/nvsentinel/manifests/allow-intra-namespace.yaml that has been broken since NVIDIA#415 (the workaround source file was deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the AKS example was added later by copying from another template and kept the now-stale reference). Bundling examples/recipes/aks-training.yaml currently fails with "file does not exist"; this fix restores it.

Refs: NVIDIA#698
Closes: NVIDIA#716
Summary
Upgrade to NVSentinel v0.10.1 and clean up configuration
Motivation / Context
NVSentinel released a patch version with fixes for some orchestrators
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
- cmd/aicr, pkg/cli
- cmd/aicrd, pkg/api, pkg/server
- pkg/recipe
- pkg/bundler, pkg/component/*
- pkg/collector, pkg/snapshotter
- pkg/validator
- pkg/errors, pkg/k8s
- docs/, examples/
Implementation Notes
Testing
# Commands run (prefer `make qualify` for non-trivial changes)
make qualify
Risk Assessment
Rollout notes:
Checklist
- make test with -race
- make lint
- git commit -S (GPG signing info)