AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.
Note: Not every component listed here appears in every recipe; the recipe engine includes components as appropriate for the target environment.
The source of truth is recipes/registry.yaml. Each entry in the registry defines the component's Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.
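For illustration, a registry entry might look like the following. The field names and values here are assumptions based on the description above, not the exact schema; consult recipes/registry.yaml for the real layout.

```yaml
# Hypothetical registry entry -- field names are illustrative only.
cert-manager:
  chart:
    repo: https://charts.jetstack.io
    name: cert-manager
  version: v1.14.4          # default version, overridable per recipe
  namespace: cert-manager   # namespace the release is installed into
  nodeSelector:             # node scheduling configuration
    kubernetes.io/os: linux
```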
| Component | Description | Source |
|---|---|---|
| gpu-operator | Manages the GPU driver and runtime lifecycle on Kubernetes nodes. Handles driver installation, container runtime configuration, device plugin, and GPU feature discovery. | NVIDIA GPU Operator |
| network-operator | Manages high-performance networking for GPU workloads. Configures RDMA, SR-IOV, and host networking for multi-node communication. | NVIDIA Network Operator |
| nfd | Node Feature Discovery — labels nodes with hardware features (PCI device IDs, kernel modules, CPU capabilities). Both gpu-operator and network-operator consume these labels. On production GPU recipes, the Topology Updater publishes per-node NodeResourceTopology CRDs describing NUMA zones and GPU/NIC affinity for downstream NUMA-aware schedulers. | Node Feature Discovery |
| gke-nccl-tcpxo | NCCL TCPxO network plugin for GKE. Provides optimized collective communication for multi-node GPU workloads on Google Kubernetes Engine. GKE-specific. | — |
| aws-efa | Device plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific. | AWS EFA K8s Device Plugin |
| cert-manager | Automates TLS certificate management. Required by several operators for webhook and API server certificates. | cert-manager |
| nodewright-operator | OS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes. | Nodewright |
| nodewright-customizations | Environment-specific node tuning profiles applied via Nodewright. Extends the operator with kernel params, hugepages, and other host-level configurations. | — |
| nvsentinel | GPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes. | NVSentinel |
| nvidia-dra-driver-gpu | Dynamic Resource Allocation (DRA) driver for GPUs. Advertises GPUs via the Kubernetes resource.k8s.io/v1 API instead of the legacy device plugin. Requires Kubernetes 1.34+ (DRA is GA in 1.34). See AKS GPU Setup for details. CLI alias: dradriver. | NVIDIA DRA Driver |
| kube-prometheus-stack | Cluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards. | kube-prometheus-stack |
| prometheus-adapter | Exposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics. | prometheus-adapter |
| aws-ebs-csi-driver | CSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific. | AWS EBS CSI Driver |
| k8s-ephemeral-storage-metrics | Exports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes. | k8s-ephemeral-storage-metrics |
| kai-scheduler | DRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology. | KAI Scheduler |
| grove | Pod lifecycle management for Dynamo inference platform. Installed as a standalone component. | Grove |
| dynamo-platform | NVIDIA Dynamo inference serving platform with bundled CRDs. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode. | Dynamo |
| kgateway-crds | Custom Resource Definitions for kgateway (Kubernetes Gateway API implementation). | kgateway |
| kgateway | Kubernetes Gateway API implementation. Provides model-aware ingress routing for inference workloads. | kgateway |
| k8s-nim-operator | NVIDIA NIM Operator for managing NIM (NVIDIA Inference Microservices) deployments on Kubernetes. | K8s NIM Operator |
| kueue | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | Kueue |
| kubeflow-trainer | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | Kubeflow Trainer |
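Several of the monitoring components above compose: kube-prometheus-stack scrapes GPU metrics, prometheus-adapter exposes them through the custom metrics API, and an HPA can then target them. A hedged sketch of such an HPA follows; the metric name (a DCGM metric) and the target workload are illustrative, and the exact metric naming depends on your prometheus-adapter rules.

```yaml
# Illustrative HPA scaling on a GPU utilization metric exposed by
# prometheus-adapter. Names and thresholds are assumptions, not defaults.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server     # hypothetical workload
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL   # assumes this metric is exposed per pod
      target:
        type: AverageValue
        averageValue: "80"           # scale out above ~80% average GPU utilization
```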
Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:
- Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
- Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added when the service matches.
- Intent-specific components (kubeflow-trainer, dynamo-platform, kai-scheduler) are added based on workload intent.
- Accelerator/OS-specific tuning (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
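The overlay-chain selection above can be pictured as a set union over matching layers. This is a toy sketch only, not AICR's actual implementation; the groupings simply mirror the bullets above.

```python
# Toy model of overlay-based component selection -- illustrative only.
BASE = {"cert-manager", "kube-prometheus-stack"}
CLOUD = {
    "eks": {"aws-efa", "aws-ebs-csi-driver"},
    "gke": {"gke-nccl-tcpxo"},
}
INTENT = {
    "training": {"kubeflow-trainer", "kai-scheduler"},
    "inference": {"dynamo-platform", "kgateway"},
}

def select_components(service: str, intent: str) -> set[str]:
    """Union the base set with the overlays that match the environment."""
    return BASE | CLOUD.get(service, set()) | INTENT.get(intent, set())

print(sorted(select_components("eks", "training")))
```

Running the sketch for an EKS training recipe yields the base components plus the EKS and training overlays; a GKE inference recipe would pick up gke-nccl-tcpxo and the inference stack instead.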
Production GPU leaf recipes (H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) enable the NFD Topology Updater. It publishes per-node NodeResourceTopology CRDs that describe NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity. Runtime consumers such as NUMA-aware schedulers can read these CRDs without further configuration, and you can inspect them directly with kubectl get noderesourcetopologies.
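A published NodeResourceTopology object looks roughly like the following. The values are illustrative, and the exact schema is defined by the upstream noderesourcetopology API, so treat this as a sketch rather than a reference.

```yaml
# Illustrative NodeResourceTopology object (values are made up).
apiVersion: topology.node.k8s.io/v1alpha2
kind: NodeResourceTopology
metadata:
  name: gpu-node-1                  # one object per node, named after the node
topologyPolicies: ["SingleNUMANodePodLevel"]
zones:
- name: node-0                      # NUMA zone 0
  type: Node
  resources:
  - name: nvidia.com/gpu
    capacity: "4"
    allocatable: "4"
    available: "4"
- name: node-1                      # NUMA zone 1
  type: Node
  resources:
  - name: nvidia.com/gpu
    capacity: "4"
    allocatable: "4"
    available: "3"
```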
The Topology Updater requires the kubelet's podResources gRPC socket. The KubeletPodResources feature gate has been enabled by default since Kubernetes 1.15 (Beta) and reached GA in Kubernetes 1.28; AICR's recipe constraints on the affected leaves already require Kubernetes 1.30 or newer, so this is satisfied in practice. Only recipes targeting Kubernetes older than 1.15 would need to enable the feature gate explicitly. Kind / KWOK simulated clusters do not run a real kubelet and therefore leave the Topology Updater disabled; kind-based recipes will not see NodeResourceTopology CRDs.
See the upstream Topology Updater docs for runtime consumer examples.
To see exactly which components appear in a given recipe, generate one:
```shell
aicr recipe --service eks --accelerator h100 --os ubuntu --intent training -o recipe.yaml
```

The output lists every component with its pinned version and configuration values.
New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Bundler Development docs for details.