
Component Catalog

AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.

Note: Not every component listed here appears in every recipe; the recipe engine includes components as appropriate for each environment (see How Components Are Selected below).

The source of truth is recipes/registry.yaml. Each entry in the registry defines the component's Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.
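As a hedged illustration of what a registry entry carries, the sketch below shows the kinds of fields described above. The field names and values here are assumptions for illustration, not the authoritative schema — consult recipes/registry.yaml itself for the real shape.

```yaml
# Hypothetical registry entry -- field names are illustrative only.
components:
  cert-manager:
    chart:
      repo: https://charts.jetstack.io   # Helm chart source
      name: cert-manager
    version: v1.16.2                     # default version, pinned per recipe
    namespace: cert-manager
    scheduling:
      nodeSelector:
        kubernetes.io/os: linux          # node scheduling configuration
```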

Components

| Component | Description | Source |
| --- | --- | --- |
| `gpu-operator` | Manages the GPU driver and runtime lifecycle on Kubernetes nodes. Handles driver installation, container runtime configuration, device plugin, and GPU feature discovery. | NVIDIA GPU Operator |
| `network-operator` | Manages high-performance networking for GPU workloads. Configures RDMA, SR-IOV, and host networking for multi-node communication. | NVIDIA Network Operator |
| `nfd` | Node Feature Discovery — labels nodes with hardware features (PCI device IDs, kernel modules, CPU capabilities). Both `gpu-operator` and `network-operator` consume these labels. On production GPU recipes, the Topology Updater publishes per-node NodeResourceTopology CRDs describing NUMA zones and GPU/NIC affinity for downstream NUMA-aware schedulers. | Node Feature Discovery |
| `gke-nccl-tcpxo` | NCCL TCPxO network plugin for GKE. Provides optimized collective communication for multi-node GPU workloads on Google Kubernetes Engine. GKE-specific. | |
| `aws-efa` | Device plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific. | AWS EFA K8s Device Plugin |
| `cert-manager` | Automates TLS certificate management. Required by several operators for webhook and API server certificates. | cert-manager |
| `nodewright-operator` | OS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes. | Nodewright |
| `nodewright-customizations` | Environment-specific node tuning profiles applied via Nodewright. Extends the operator with kernel params, hugepages, and other host-level configurations. | |
| `nvsentinel` | GPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes. | NVSentinel |
| `nvidia-dra-driver-gpu` | Dynamic Resource Allocation (DRA) driver for GPUs. Advertises GPUs via the Kubernetes resource.k8s.io/v1 API instead of the legacy device plugin. Requires Kubernetes 1.34+ (DRA is GA in 1.34). See AKS GPU Setup for details. CLI alias: `dradriver`. | NVIDIA DRA Driver |
| `kube-prometheus-stack` | Cluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards. | kube-prometheus-stack |
| `prometheus-adapter` | Exposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics. | prometheus-adapter |
| `aws-ebs-csi-driver` | CSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific. | AWS EBS CSI Driver |
| `k8s-ephemeral-storage-metrics` | Exports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes. | k8s-ephemeral-storage-metrics |
| `kai-scheduler` | DRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology. | KAI Scheduler |
| `grove` | Pod lifecycle management for the Dynamo inference platform. Installed as a standalone component. | Grove |
| `dynamo-platform` | NVIDIA Dynamo inference serving platform with bundled CRDs. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode. | Dynamo |
| `kgateway-crds` | Custom Resource Definitions for kgateway (Kubernetes Gateway API implementation). | kgateway |
| `kgateway` | Kubernetes Gateway API implementation. Provides model-aware ingress routing for inference workloads. | kgateway |
| `k8s-nim-operator` | NVIDIA NIM Operator for managing NIM (NVIDIA Inference Microservices) deployments on Kubernetes. | K8s NIM Operator |
| `kueue` | Kubernetes-native job queuing system. Manages quotas and admits jobs for batch and AI workloads. | Kueue |
| `kubeflow-trainer` | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | Kubeflow Trainer |
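The `nvidia-dra-driver-gpu` entry above replaces the legacy device-plugin request model with DRA resource claims. As a hedged sketch (the `gpu.nvidia.com` device class name and the exact `resource.k8s.io/v1` field layout are assumptions — verify against your cluster with `kubectl get deviceclasses` and `kubectl explain resourceclaimtemplate`), a workload might request a GPU like this:

```yaml
# Sketch: requesting one GPU via DRA (Kubernetes 1.34+, resource.k8s.io/v1).
# Device class name is an assumption; check what your DRA driver installs.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu    # bind the claim to this container
```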

How Components Are Selected

Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:

  • Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
  • Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added when the service matches.
  • Intent-specific components (kubeflow-trainer, dynamo-platform, kai-scheduler) are added based on workload intent.
  • Accelerator/OS-specific tuning (nodewright-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
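The layering above can be pictured as a union of component sets, one set per overlay. This is only a toy model of the behavior described in the list — the real selection logic lives in the AICR recipe engine, and the component groupings here are illustrative:

```python
# Toy sketch of overlay-based component selection (illustrative only;
# the real engine reads recipes/registry.yaml and the overlay chain).
BASE = {"cert-manager", "kube-prometheus-stack"}
CLOUD = {
    "eks": {"aws-efa", "aws-ebs-csi-driver"},
}
INTENT = {
    "training": {"kubeflow-trainer", "kueue"},
    "inference": {"dynamo-platform", "kai-scheduler"},
}

def select_components(service: str, intent: str) -> set[str]:
    """Union the base, cloud-specific, and intent-specific layers."""
    return BASE | CLOUD.get(service, set()) | INTENT.get(intent, set())
```

For example, `select_components("eks", "training")` yields the base set plus the EKS and training overlays, while an unknown service contributes nothing beyond the base.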

NFD Topology Updater

Production GPU leaf recipes (H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) enable the NFD Topology Updater. It publishes per-node NodeResourceTopology CRDs that describe NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity. Runtime consumers (NUMA-aware schedulers, debugging via kubectl get noderesourcetopologies) can read these CRDs without further configuration.
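For orientation, a published NodeResourceTopology object looks roughly like the sketch below. The values are illustrative, and the exact API version and fields depend on the NFD release — verify with `kubectl explain noderesourcetopology` on your cluster:

```yaml
# Illustrative NodeResourceTopology object as published by the
# NFD Topology Updater (values are made up for this example).
apiVersion: topology.node.k8s.io/v1alpha2
kind: NodeResourceTopology
metadata:
  name: gpu-node-1                      # one object per node
topologyPolicies: ["SingleNUMANodeContainerLevel"]
zones:
- name: node-0                          # NUMA zone 0
  type: Node
  resources:
  - name: nvidia.com/gpu
    capacity: "4"
    allocatable: "4"
    available: "4"
```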

The Topology Updater requires the kubelet podResources gRPC socket. The KubeletPodResources feature gate has been on by default since Kubernetes 1.15 (Beta) and reached GA in Kubernetes 1.28; AICR's recipe constraints on the affected leaves require Kubernetes 1.30 or higher, so this is satisfied in practice. Only recipes targeting Kubernetes older than 1.15 would need to enable the feature gate explicitly. Kind / KWOK simulated clusters do not run a real kubelet and therefore leave the Topology Updater disabled — kind-based recipes will not see NodeResourceTopology CRDs.

See the upstream Topology Updater docs for runtime consumer examples.

To see exactly which components appear in a given recipe, generate one:

aicr recipe --service eks --accelerator h100 --os ubuntu --intent training -o recipe.yaml

The output lists every component with its pinned version and configuration values.

Adding Components

New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Bundler Development docs for details.