This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the DynamoGraphDeployment resource.
Basic deployment pattern with frontend and a single worker.
Architecture:
Frontend: OpenAI-compatible API server (with kv router mode disabled)TRTLLMWorker: Single worker handling both prefill and decode
Enhanced aggregated deployment with KV cache routing capabilities.
Architecture:
Frontend: OpenAI-compatible API server (with kv router mode enabled)TRTLLMWorker: Multiple workers handling both prefill and decode (2 replicas for load balancing)
High-performance deployment with separated prefill and decode workers.
Architecture:
Frontend: HTTP API server coordinating between workersTRTLLMDecodeWorker: Specialized decode-only workerTRTLLMPrefillWorker: Specialized prefill-only worker
Advanced disaggregated deployment with KV cache routing capabilities.
Architecture:
Frontend: HTTP API server (with kv router mode enabled)TRTLLMDecodeWorker: Specialized decode-only workerTRTLLMPrefillWorker: Specialized prefill-only worker (2 replicas for load balancing)
Aggregated deployment with custom configuration.
Architecture:
nvidia-config: ConfigMap containing a custom trtllm configurationFrontend: OpenAI-compatible API server (with kv router mode disabled)TRTLLMWorker: Single worker handling both prefill and decode with custom configuration mounted from the configmap
Advanced disaggregated deployment with SLA-based automatic scaling.
Architecture:
Frontend: HTTP API server coordinating between workersPlanner: SLA-based planner that monitors performance and scales workers automaticallyPrometheus: Metrics collection and monitoringTRTLLMDecodeWorker: Specialized decode-only workerTRTLLMPrefillWorker: Specialized prefill-only worker
Note
This deployment requires pre-deployment profiling to be completed first. See Pre-Deployment Profiling for detailed instructions.
All templates use the DynamoGraphDeployment CRD:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: <deployment-name>
spec:
services:
<ServiceName>:
# Service configurationResource Management:
resources:
requests:
cpu: "10"
memory: "20Gi"
gpu: "1"
limits:
cpu: "10"
memory: "20Gi"
gpu: "1"Container Configuration:
extraPodSpec:
mainContainer:
image: my-registry/trtllm-runtime:my-tag
workingDir: /workspace/examples/backends/trtllm
args:
- "python3"
- "-m"
- "dynamo.trtllm"
# Model-specific argumentsBefore using these templates, ensure you have:
- Dynamo Cloud Platform installed - See Quickstart Guide
- Kubernetes cluster with GPU support
- Container registry access for TensorRT-LLM runtime images
- HuggingFace token secret (referenced as
envFromSecret: hf-token-secret)
The deployment files currently require access to my-registry/trtllm-runtime. If you don't have access, build and push your own image:
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML filesNote: TensorRT-LLM uses git-lfs, which needs to be installed in advance:
apt-get update && apt-get -y install git git-lfsFor ARM machines, use:
./container/build.sh --framework tensorrtllm --platform linux/arm64Select the deployment pattern that matches your requirements:
- Use
agg.yamlfor simple testing - Use
agg_router.yamlfor production with KV cache routing and load balancing - Use
disagg.yamlfor maximum performance with separated workers - Use
disagg_router.yamlfor high-performance with KV cache routing and disaggregation
Edit the template to match your environment:
# Update image registry and tag
image: my-registry/trtllm-runtime:my-tag
# Configure your model and deployment settings
args:
- "python3"
- "-m"
- "dynamo.trtllm"
# Add your model-specific argumentsSee the Create Deployment Guide to learn how to deploy the deployment file.
First, create a secret for the HuggingFace token.
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}Then, deploy the model using the deployment file.
Export the NAMESPACE you used in your Dynamo Cloud Installation.
cd dynamo/examples/backends/trtllm/deploy
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACETo use a custom dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>
yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACEAfter deployment, forward the frontend service to access the API:
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000To change DYN_LOG level, edit the yaml file by adding:
...
spec:
envs:
- name: DYN_LOG
value: "debug" # or other log levels
...TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:
- KV Cache Transfer: Choose between UCX (default) or NIXL for disaggregated serving
- Request Migration: Enable graceful failure handling with
--migration-limit
Send a test request to verify your deployment. See the client section for detailed instructions.
Note: For multi-node deployments, target the node running python3 -m dynamo.frontend <args>.
The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.
- Frontend health endpoint:
http://<frontend-service>:8000/health - Worker health endpoints:
http://<worker-service>:9090/health - Liveness probes: Check process health every 5 seconds
- Readiness probes: Check service readiness with configurable delays
TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:
- UCX (default): Standard method for KV cache transfer
- NIXL (experimental): Alternative transfer method
For detailed configuration instructions, see the KV cache transfer guide.
You can enable request migration to handle worker failures gracefully by adding the migration limit argument to worker configurations:
args:
- "python3"
- "-m"
- "dynamo.trtllm"
- "--migration-limit"
- "3"To benchmark your deployment with AIPerf, see this utility script: perf.sh
Configure the model name and host based on your deployment.
- Deployment Guide: Creating Kubernetes Deployments
- Quickstart: Deployment Quickstart
- Platform Setup: Dynamo Cloud Installation
- Examples: Deployment Examples
- Architecture Docs: Disaggregated Serving, KV-Aware Routing
- Multinode Deployment: Multinode Examples
- Speculative Decoding: Llama 4 + Eagle Guide
- Kubernetes CRDs: Custom Resources Documentation
Common issues and solutions:
- Pod fails to start: Check image registry access and HuggingFace token secret
- GPU not allocated: Verify cluster has GPU nodes and proper resource limits
- Health check failures: Review model loading logs and increase
initialDelaySeconds - Out of memory: Increase memory limits or reduce model batch size
- Port forwarding issues: Ensure correct pod UUID in port-forward command
- Git LFS issues: Ensure git-lfs is installed before building containers
- ARM deployment: Use
--platform linux/arm64when building on ARM machines
For additional support, refer to the deployment troubleshooting guide.