Releases: NVIDIA/aicr
v0.12.1
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- cf3cd33: feat(bundler)!: uniform NNN-folder bundle layout via localformat (#662) (#706) (@lockwobr)
- 8843981: feat(bundler): add headless OIDC paths for bundle --attest (#707) (@lockwobr)
- 6593894: feat(cli): add skill command for AI agent integration (#691) (@yuanchen8911)
- b1e38fe: feat(cli): add snapshot analysis skill for Claude Code (@mchmarny)
- af8def3: feat(collector): add Talos OS support via Kubernetes Node info (#714) (@ayuskauskas)
- 639c53e: feat(recipes): enable NFD Topology Updater on production GPU recipes (#711) (@ArangoGutierrez)
- c1703eb: feat(release): publish THIRD_PARTY_NOTICES.md as a release asset (#722) (@ayuskauskas)
Bug Fixes
- af4df7c: fix(bundler): demote nodewright selector warnings to info severity (#704) (@mchmarny)
- 67ea746: fix(bundler): layer-neutral dynamic declaration errors (#703) (@mchmarny)
- cb6b98c: fix(bundler): preserve inner error codes instead of double-wrapping (#702) (@mchmarny)
- f94c66c: fix(ci): centralize GPU CI runtime pins (#710) (@yuanchen8911)
- 3c9f6ec: fix(ci): eliminate redundant CI workflow executions (@mchmarny)
- 2c99373: fix(ci): move organization-projects permission to workflow level (@mchmarny)
- 2fb1719: fix(ci): only send Slack notification on critical/high vulns (@mchmarny)
- 6900259: fix(ci): remove invalid organization-projects permission key (@mchmarny)
- 4c5c748: fix(ci): remove project board integration from issue report (@mchmarny)
- 7b96afa: fix(ci): trigger H100 GPU tests on shared recipe changes (#717) (@yuanchen8911)
- cd3abd2: fix(ci): use project board priority field instead of labels for issue report (@mchmarny)
- 39c8c29: fix(recipes): correct nvsentinel registry default to OCI source (#725) (@yuanchen8911)
- d66ba76: fix(recipes): drop hook-succeeded from torch-distributed runtime (#719) (@yuanchen8911)
- 604a324: fix(recipes): handle kubeflow-trainer v2.2.0 API changes (#724) (@yuanchen8911)
- 2038255: fix(recipes): use Helm manifest-only pattern for gke-nccl-tcpxo (#718) (@yuanchen8911)
- 8d0168e: fix(recipes): use NFD chart version 0.18.3 without v prefix (#688) (@yuanchen8911)
- eec81c5: fix(tools): pin golangci-lint installer URL to version tag (@mchmarny)
- 7274cab: fix(verifier): add trust level reason to verify output (#705) (@mchmarny)
- de84b0f: fix: address top-7 code-review findings across packages (#721) (@mchmarny)
- 07d9ab9: fix: post-release code quality and correctness cleanup (@mchmarny)
- 0a04439: fix: update license check (#712) (@lockwobr)
- c56f142: fix: update license check (#713) (@lockwobr)
Other Tasks
- 8ffea23: Refactor and harden H100 GPU CI workflow (#694) (@yuanchen8911)
- eb1a673: chore(deps): bump controller-runtime, apiextensions-apiserver, kube-openapi, semver (@mchmarny)
- a165fae: chore(recipe): bump dynamo-platform from 0.9.x to 1.0.2 and add Grove chart (#459) (@Jont828)
- fa5c02b: chore(recipes): bump 6 components to upstream latest (phase 1) (#715) (@yuanchen8911)
- 14ff3fa: chore(recipes): bump kai-scheduler v0.14.1 and kubeflow-trainer 2.2.0 (#720) (@yuanchen8911)
- e2da266: chore: bump postcss from 8.5.8 to 8.5.10 in /site in the npm_and_yarn group across 1 directory (#672) (@dependabot[bot])
- 0c939ce: chore: dep update (@mchmarny)
- 3cc4e26: chore: deps: bump goreleaser/goreleaser-action from 7.1.0 to 7.2.1 (#692) (@dependabot[bot])
- b266684: chore: update change log (@mchmarny)
- fc0daca: ci: enable CodeRabbit auto-review on draft PRs (#690) (@yuanchen8911)
- 3b7e970: ci: retry grype install on transient github 502s (#701) (@yuanchen8911)
- c7c3154: docs(kwok): add prerequisites and fix copy-paste pitfalls (#709) (@arun-gupta)
- fc2eeca: docs(roadmap): restructure around v1 objectives (#708) (@mchmarny)
- 0a8d6e1: feat(nodewright-customizations): add gb200 eks support (#699) (@ayuskauskas)
v0.12.0
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- cbaba36: feat(bundler): add --dynamic flag for install-time values (#515) (#527) (@lockwobr)
- 1e550c7: feat(bundler): enable --attest and --data for argocd-helm (#573) (#627) (@lockwobr)
- 142c0d2: feat(ci): add aggregate merge-gate workflow (#651) (@mchmarny)
- 1caf260: feat(ci): add daily Slack issue status report (@mchmarny)
- ad682ef: feat(ci): add daily image vulnerability scan workflow (@mchmarny)
- 42cfd26: feat(ci): auto-assign issues based on area labels (#513) (@mchmarny)
- 9b09c94: feat(cli): add dynamic shell completion for flag values (#339) (#512) (@lockwobr)
- 1b25135: feat(cli): auto-hydrate RecipeMetadata overlays in validate and bundle (#595) (@njhensley)
- f2aeaf2: feat(evidence): add NIM support to evidence collection and restructure conformance docs (#479) (@yuanchen8911)
- 6137c0b: feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images (#463) (@yuanchen8911)
- 4e158cf: feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check (#640) (@njhensley)
- 7f91140: feat(recipe): add NFD as standalone shared component (#518) (@ArangoGutierrez)
- f47e95f: feat(recipe): add mixin composition for OS and platform fragments (#501) (@yuanchen8911)
- 94fb041: feat(recipe): merge external validator catalog with embedded when provided through DataProvider (#588) (@njhensley)
- a66de21: feat(recipes): add NIM Operator recipe for CNCF AI Conformance (#478) (@yuanchen8911)
- 16d670d: feat(release): add pre-release support (#639) (@mchmarny)
- 306cb9b: feat(validator): add --node-selector and --toleration flags for validation workload scheduling (#444) (@atif1996)
- 1db88e4: feat(validator): add AICR_VALIDATOR_IMAGE_TAG env-var override (#666) (@yuanchen8911)
- 3a86364: feat(validator): add inference performance validation (#641) (@yuanchen8911)
- 6f7b4c1: feat: Add AKS UAT chainsaw tests for training and inference CUJs (#476) (@Jont828)
- 306b785: feat: GB200 EKS NET/NVLS NCCL validation and driver bump (#668) (@njhensley)
- ba20188: feat: add HardwareDetector interface and measurement keys for NFD integration (#482) (@ArangoGutierrez)
- 83c18bc: feat: add component contributor test harness (#508) (@ArangoGutierrez)
- 340452b: feat: add support for Akamai (#517) (@lalitadithya)
- 5a33265: feat: auto-install shell completions via install (#504) (@lockwobr)
- 81cf701: feat: implement NFD-based GPU hardware detection (#494) (@ArangoGutierrez)
- 7283e9c: feat: two-phase GPU collection with hardware detection support (#495) (@ArangoGutierrez)
- 02002ca: feat: wire NFDHardwareDetector into production snapshot pipeline (#502) (@ArangoGutierrez)
Bug Fixes
- 9d57dfb: fix(build): use FullCommit in goreleaser to match CI image tags (#658) (@mchmarny)
- 228e518: fix(bundler): add pre-flight finalizer check to undeploy.sh (#406) (#561) (@lockwobr)
- 64b8759: fix(bundler): allow Helm-style array indexing in --set paths (#643) (@yuanchen8911)
- c12c783: fix(bundler): fix undeploy template pre/post-flight checks (#602) (@yuanchen8911)
- 6f4ec0e: fix(bundler): harden filepath.Join with SafeJoin for path-traversal protection (#578) (@lockwobr)
- 966d775: fix(bundler): resolve ArgoCD RepoURL placeholder in child applications (#520) (@mchmarny)
- ca9d96c: fix(bundler): scope cleanup to bundle components and remove stale skyhook taints (#477) (@yuanchen8911)
- 50825cc: fix(bundler): skip helm commands for manifest-only components in README (@mchmarny)
- 8eac760: fix(ci): add --platform to aiperf-bench E2E docker build (#674) (@xdu31)
- 57e6b8d: fix(ci): add -mod=vendor to snapshot agent build (#534) (@yuanchen8911)
- 0a78409: fix(ci): add MDX safety check for non-self-closing img tags (#620) (@pdmack)
- 9e481d7: fix(ci): add diagnostic logging and multi-assignee support to issue triage (@mchmarny)
- db9f3ab: fix(ci): add failure diagnostics and fix Grafana resource starvation in Kind (#563) (@yuanchen8911)
- e2586f0: fix(ci): auto-label new issues by area and assign owners (#535) (@yuanchen8911)
- 7ad4f96: fix(ci): correct artifact action SHA pins in vuln scan workflow (@mchmarny)
- f5f7387: fix(ci): deduplicate conformance coverage in GPU CI (#577) (@yuanchen8911)
- 4a08c63: fix(ci): enable manual trigger for fern-docs-ci workflow (@mchmarny)
- d30e235: fix(ci): expand GPU test triggers to cover collector, snapshotter, validator, and add run-gpu-tests label (#514) (@xdu31)
- f489db3: fix(ci): fix fern instances URL basepath and surface publish URL in step summary (#568) (@pdmack)
- 42877f6: fix(ci): fix fern preview metadata and add continuous staging publish (#546) (@pdmack)
- d821306: fix(ci): improve GPU test reliability and deploy timeout handling (#539) (@yuanchen8911)
- 4988346: fix(ci): install gke-gcloud-auth-plugin before cluster connect (@mchmarny)
- 7b0dbb1: fix(ci): make issue report counts clickable Slack links (@mchmarny)
- 39e3114: fix(ci): match artifact download pattern to upload names (@mchmarny)
- 1fb1695: fix(ci): move GPU concurrency to test jobs (#581) (@yuanchen8911)
- 945a57d: fix(ci): pin e2e goreleaser and exclude local build artifacts (#580) (@yuanchen8911)
- c334bfc: fix(ci): query GPU snapshot by subtype name instead of index (#509) (@yuanchen8911)
- c294788: fix(ci): remove invalid --base-image flag from ko build (@mchmarny)
- abad89a: fix(ci): replace middle-dot separators with commas in issue report (@mchmarny)
- d988b02: fix(ci): replace push path filters with runtime path gate in GPU workflows (#558) (@yuanchen8911)
- 352b006: fix(ci): safe manifest publishing (#586) (@njhensley)
- 40bb85d: fix(ci): set GKE cluster name at correct config path (@mchmarny)
- d61dbfa: fix(ci): set deployment.destroy as boolean, not string (@mchmarny)
- 9e44eb7: fix(ci): shorten GKE deployment ID to fit SA name limit (@mchmarny)
- 5785d81: fix(ci): skip capacity pre-check for shared GCP reservations (@mchmarny)
- a7c8bf6: fix(ci): surface fern generate errors in preview (#650) (@mchmarny)
- 0eaa16f: fix(ci): use --bare flag for ko build in vuln scan workflow (@mchmarny)
- 67f03f4: fix(ci): use KEY_CONTENT env var for GKE provisioner credentials (@mchmarny)
- 56f20f6: fix(ci): use anchored regex for lychee exclude-path patterns (#547) (@pdmack)
- d117fbd: fix(ci): use config-based destroy for GKE provisioner (@mchmarny)
- 984b244: fix(ci): use correct field name 'subtype' in GPU snapshot validation (#511) (@yuanchen8911)
- bb072b1: fix(ci): use explicit empty mapping for workflow_dispatch (@mchmarny)
- b2f20fa: fix(ci): use search API for first-time contributor detection (#524) (@yuanchen8911)
- 8015fa8: fix(cli): --no-cluster must not deploy the snapshot-capture agent (#604) (@yuanchen8911)
- 173dba5: fix(recipe): handle null override in mergeValues to delete keys (#458) (@Jont828)
- a228540: fix(recipe): scope mixin fallback to affected candidates (#521) (@yuanchen8911)
- 0b98847: fix(recipes): disable Dynamo ssh-keygen on Kind (#670) (@yuanchen8911)
- 08b2cb2: fix(recipes): fix NIM operator validation and demo script issues (#483) (@yuanchen8911)
- 7466275: fix(scan): add pillow and python CVEs to grype ignore list (@mchmarny)
- 97be223: fix(scan): revert aiperf-bench base image to python:3-slim (@mchmarny)
- 5788bca: fix(scan): revert aiperf-bench base image to python:3.12-slim (@mchmarny)
- 7c547ad: fix(scan): suppress all critical/high C...
v0.11.1
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- 76d27c7: feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling (#450) (@yuanchen8911)
Bug Fixes
- 0d267c9: fix(api): add b200 accelerator to OpenAPI spec enum (#455) (@nvidiajeff)
- cdc9bf4: fix(cli): replace broken shell completion with full flag+alias support (#454) (@nvidiajeff)
- 692bbf0: fix(validator): templatize EKS NCCL runtime for dynamic EFA and instance type discovery (#447) (@xdu31)
Other Tasks
v0.11.0
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- 500b561: feat(recipes): add GKE COS inference and Dynamo overlay recipes (#414) (@yuanchen8911)
- 3e46e47: feat(snapshot): add --runtime-class flag for CDI environments (#434) (@atif1996)
- d3fd483: feat(validator): add EKS/GKE cluster autoscaling fallback (#438) (@yuanchen8911)
- 87fd28f: feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays (#415) (@Jont828)
- 0866ef0: feat: add B200 accelerator type support (#437) (@atif1996)
- 46736f8: feat: add query command for hydrated recipe value extraction (#445) (@mchmarny)
Bug Fixes
- 7c377c1: fix(bundler): clean up orphaned KAI and Kubeflow Trainer CRDs on undeploy (#416) (@yuanchen8911)
- 437126c: fix(gke): remove CAP_ prefix from capability names in TCPXO manifests (#428) (@yuanchen8911)
- f2ec6b2: fix(gke): update TCPXO to NRI profile without hostNetwork (#420) (@yuanchen8911)
- 8a65335: fix(validator): add retry for ai-service-metrics Prometheus query (#393) (@yuanchen8911)
- d99235e: fix(validator): remove hostNetwork and privileged from GKE NCCL runtime, use NRI device injection (#427) (@xdu31)
- e15a3c6: fix(validator): source NCCL env from host profile instead of hardcoding (#422) (@xdu31)
- 70efe82: fix: ArgoCD deployer generates valid YAML, add structural validation (#410) (#413) (@lockwobr)
Other Tasks
- 84f3c4c: chore: bump nvsentinel from v0.10.x to v1.1.0 (#423) (@mchmarny)
- 75092d8: chore: deps: bump github.com/in-toto/attestation from 1.1.2 to 1.2.0 (#431) (@dependabot[bot])
- ea19bdf: chore: deps: bump github/codeql-action from 4.32.6 to 4.33.0 (#418) (@dependabot[bot])
- a10d4b3: chore: deps: bump google.golang.org/grpc from 1.79.2 to 1.79.3 (#430) (@dependabot[bot])
- 9e81d69: chore: deps: bump the kubernetes group with 3 updates (#446) (@dependabot[bot])
- f23ade5: chore: ignore movies (@mchmarny)
- d4e818f: ci(kwok): implement tiered testing strategy per ADR-003 (#432) (@mchmarny)
- 9101d29: ci: build and publish validator images on merge to main (#412) (@yuanchen8911)
- ff9c66d: docs(conformance): update CNCF evidence for multi-platform and training (#425) (@yuanchen8911)
- 5d4aa7c: docs(validator): add custom image testing and private registry guide (#417) (@xdu31)
v0.10.16
Immutable release. Only release title and notes can be modified.
v0.10.15
v0.10.14
Immutable release. Only release title and notes can be modified.
Changelog
Bug Fixes
- 23f2a02: fix(brew): escape backslashes in caveats for proper multiline display (#402) (@mchmarny)
- 7d79830: fix(bundler): clean up kai-resource-reservation namespace on undeploy (#394) (@yuanchen8911)
- 87cb118: fix(evidence): track check results at runtime instead of scanning directory (#396) (@yuanchen8911)
Other Tasks
- d3ff136: chore: deps: bump actions/stale from 10.1.1 to 10.2.0 (#400) (@dependabot[bot])
- 220ed15: chore: deps: bump actions/upload-pages-artifact from 3.0.1 to 4.0.0 (#399) (@dependabot[bot])
- e44b763: chore: deps: bump sigstore/cosign-installer from 4.0.0 to 4.1.0 (#398) (@dependabot[bot])
- c06950e: site: eliminate docs duplication with build-time sync (#385) (@tabern)
v0.10.13
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- d992630: feat(recipes): add GKE COS training overlays for H100 (#383) (@yuanchen8911)
Bug Fixes
- a5d501b: fix(bundler): skip components with overrides.enabled: false (#382) (@xdu31)
- 8550939: fix(install): cosign version grep fails silently due to pipefail (#384) (@lockwobr)
- d802b3d: fix(test): update offline e2e to skip disabled aws-ebs-csi-driver (@mchmarny)
- 9bb2c7b: fix(validator): remove helm-values check (Helm values stored in secrets, never available in snapshot) (#388) (@xdu31)
Other Tasks
v0.10.12
Immutable release. Only release title and notes can be modified.
v0.10.11
Immutable release. Only release title and notes can be modified.
Changelog
New Features
- 4267972: feat(bundler): add pre-flight checks to deploy.sh and post-flight to undeploy.sh (#364) (@yuanchen8911)
- 8312960: feat(validator): add Kubeflow Trainer to robust-controller and skip inference-gateway on training clusters (#349) (@yuanchen8911)
Bug Fixes
- 662809b: fix(ci): use root directory for github-actions dependabot scanning (@mchmarny)
- ca0551d: fix(recipe): bump NCCL all-reduce bandwidth threshold to 300 Gbps (#350) (@xdu31)
- 48e878b: fix(test): eliminate dead tests, non-deterministic skips, and flaky sleeps (@mchmarny)
- a9162f0: fix(validator): truncate long stdout lines to prevent oversized reports (#363) (@xdu31)
- 103f5b0: fix: replace magic duration literals with named constants from pkg/defaults (@mchmarny)
- 8945569: fix: wrap bare errors and check writable Close() returns (@mchmarny)
Other Tasks
- bb53543: chore(ci): bump actions/cache to v5.0.3 and goreleaser-action to v7.0.0 (@mchmarny)
- 4ea330f: chore: dep update (@mchmarny)
- 99a96f9: chore: deps: bump actions/download-artifact from 4.1.8 to 8.0.1 (#370) (@dependabot[bot])
- 208b836: chore: deps: bump actions/github-script from 7.0.1 to 8.0.0 (#376) (@dependabot[bot])
- a2364eb: chore: deps: bump actions/setup-go from 6.2.0 to 6.3.0 (#368) (@dependabot[bot])
- 204aafa: chore: deps: bump actions/setup-node from 4.4.0 to 6.3.0 (#372) (@dependabot[bot])
- 1d5a104: chore: deps: bump actions/upload-artifact from 6.0.0 to 7.0.0 (#369) (@dependabot[bot])
- 9481db5: chore: deps: bump aquasecurity/trivy-action from 0.34.1 to 0.35.0 (#367) (@dependabot[bot])
- 9057566: chore: deps: bump aws-actions/configure-aws-credentials from 5.1.1 to 6.0.0 (#371) (@dependabot[bot])
- 24e41ad: chore: deps: bump docker/build-push-action from 6.15.0 to 7.0.0 (#373) (@dependabot[bot])
- 091e497: chore: deps: bump docker/setup-buildx-action from 3.10.0 to 4.0.0 (#375) (@dependabot[bot])
- 77aade3: chore: deps: bump github/codeql-action from 4.32.0 to 4.32.6 (#374) (@dependabot[bot])
- 15584cd: chore: deps: update hashicorp/aws requirement from ~> 5.0 to ~> 6.36 in /infra/uat-aws-account (#366) (@dependabot[bot])
- 413b808: chore: ignore GHSA-67mh-4wv8-2f99 (esbuild) in grype scan (@mchmarny)
- eab65d9: core: image update (@mchmarny)
- 9c8b1af: docs(api): add missing bundle params and document CLI-only gaps (@mchmarny)
- c5a9115: docs(install): add Homebrew installation option (#357) (@mchmarny)
- 42e8ff5: docs(site): align Go version requirements to 1.26 (#362) (@yuanchen8911)
- 647d10b: site: migrate from Hugo/Docsy to VitePress (#360) (@tabern)