Skip to content

Latest commit

 

History

History
171 lines (133 loc) · 6.52 KB

File metadata and controls

171 lines (133 loc) · 6.52 KB

Deploying Inference Graphs to Kubernetes

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

1. Install Platform First

# 1. Set environment
export NAMESPACE=dynamo-kubernetes
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Platform
kubectl create namespace ${NAMESPACE}
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}

For more details or customization options, see Installation Guide for Dynamo Kubernetes Platform.

2. Choose Your Backend

Each backend has deployment examples and configuration options:

Backend Available Configurations
vLLM Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner
SGLang Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node
TensorRT-LLM Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router

3. Deploy Your First Model

# Set same namespace from platform install
export NAMESPACE=dynamo-cloud

# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models

What's a DynamoGraphDeployment?

It's a Kubernetes Custom Resource that defines your inference pipeline:

  • Model configuration
  • Resource allocation (GPUs, memory)
  • Scaling policies
  • Frontend/backend connections

The scripts in the components/<backend>/launch folder like agg.sh demonstrate how you can serve your models locally. The corresponding YAML files like agg.yaml show you how you could create a kubernetes deployment for your inference graph.

📖 API Reference & Documentation

For detailed technical specifications of Dynamo's Kubernetes resources:

  • API Reference - Complete CRD field specifications for DynamoGraphDeployment and DynamoComponentDeployment
  • Operator Guide - Dynamo operator configuration and management
  • Create Deployment - Step-by-step deployment creation examples

Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

  • Development / Testing - Use agg.yaml as the base configuration
  • Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference
  • High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability

Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

  • Provides OpenAI-compatible /v1/chat/completions endpoint
  • Auto-discovers backend workers via etcd
  • Routes requests and handles load balancing
  • Validates and preprocesses requests

Customizing Your Deployment

Example structure:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml

Key customization points include:

  • Model Configuration: Specify model in the args command
  • Resource Allocation: Configure GPU requirements under resources.limits
  • Scaling: Set replicas for number of worker instances
  • Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in Frontend envs
  • Worker Specialization: Add --is-prefill-worker flag for disaggregated prefill workers

Additional Resources