TensorRT-LLM Kubernetes Deployment Configurations

This directory contains Kubernetes manifest templates for deploying TensorRT-LLM inference graphs with the DynamoGraphDeployment custom resource.

Available Deployment Patterns

1. Aggregated Deployment (`agg.yaml`)

Basic deployment pattern with a frontend and a single worker.

Architecture:

Frontend: OpenAI-compatible API server (with KV router mode disabled)
TRTLLMWorker: Single worker handling both prefill and decode

2. Aggregated Router Deployment (`agg_router.yaml`)

Enhanced aggregated deployment with KV cache routing capabilities.

Architecture:

Frontend: OpenAI-compatible API server (with KV router mode enabled)
TRTLLMWorker: Multiple workers handling both prefill and decode (2 replicas for load balancing)

3. Disaggregated Deployment (`disagg.yaml`)

High-performance deployment with separated prefill and decode workers.

Architecture:

Frontend: HTTP API server coordinating between workers
TRTLLMDecodeWorker: Specialized decode-only worker
TRTLLMPrefillWorker: Specialized prefill-only worker

4. Disaggregated Router Deployment (`disagg_router.yaml`)

Advanced disaggregated deployment with KV cache routing capabilities.

Architecture:

Frontend: HTTP API server (with KV router mode enabled)
TRTLLMDecodeWorker: Specialized decode-only worker
TRTLLMPrefillWorker: Specialized prefill-only worker (2 replicas for load balancing)

5. Aggregated Deployment with Config (`agg-with-config.yaml`)

Aggregated deployment with custom configuration.

Architecture:

nvidia-config: ConfigMap containing a custom TensorRT-LLM configuration
Frontend: OpenAI-compatible API server (with KV router mode disabled)
TRTLLMWorker: Single worker handling both prefill and decode, with the custom configuration mounted from the ConfigMap

6. Disaggregated Planner Deployment (`disagg_planner.yaml`)

Advanced disaggregated deployment with SLA-based automatic scaling.

Architecture:

Frontend: HTTP API server coordinating between workers
Planner: SLA-based planner that monitors performance and scales workers automatically
Prometheus: Metrics collection and monitoring
TRTLLMDecodeWorker: Specialized decode-only worker
TRTLLMPrefillWorker: Specialized prefill-only worker

Note

This deployment requires pre-deployment profiling to be completed first. See Pre-Deployment Profiling for detailed instructions.

CRD Structure

All templates use the DynamoGraphDeployment CRD:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration

Key Configuration Options

Resource Management:

resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"

Container Configuration:

extraPodSpec:
  mainContainer:
    image: my-registry/trtllm-runtime:my-tag
    workingDir: /workspace/examples/backends/trtllm
    args:
      - "python3"
      - "-m"
      - "dynamo.trtllm"
      # Model-specific arguments

Prerequisites

Before using these templates, ensure you have:

Dynamo Cloud Platform installed - See Quickstart Guide
Kubernetes cluster with GPU support
Container registry access for TensorRT-LLM runtime images
HuggingFace token secret (referenced as envFromSecret: hf-token-secret)
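
The prerequisites above can be checked from a shell once the secret exists. The following is a minimal sketch, not part of the templates: `check_prereqs` is a hypothetical helper name, it assumes `kubectl` already points at the target cluster, and it assumes GPUs are exposed under the `nvidia.com/gpu` resource name (the NVIDIA device plugin convention, which this README does not state).

```shell
# Hypothetical helper: verify the prerequisites for a given namespace.
# Usage: check_prereqs <namespace>
check_prereqs() {
  ns="$1"

  # The DynamoGraphDeployment resource should be registered in the nvidia.com API group.
  if ! kubectl api-resources --api-group=nvidia.com | grep -qi 'dynamographdeployment'; then
    echo "DynamoGraphDeployment resource not found" >&2
    return 1
  fi

  # The secret referenced by envFromSecret must exist in the target namespace.
  if ! kubectl get secret hf-token-secret -n "$ns" >/dev/null 2>&1; then
    echo "secret hf-token-secret not found in namespace $ns" >&2
    return 1
  fi

  # List allocatable GPUs per node (assumes the nvidia.com/gpu resource name).
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Call it as `check_prereqs "$NAMESPACE"`. A non-zero exit status means one of the first two checks failed.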

Container Images

The deployment files currently require access to my-registry/trtllm-runtime. If you don't have access, build and push your own image:

./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files

Note: TensorRT-LLM uses git-lfs, which needs to be installed in advance:

apt-get update && apt-get -y install git git-lfs

For ARM machines, use:

./container/build.sh --framework tensorrtllm --platform linux/arm64
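
The "tag and push" step mentioned above could look like the sketch below. `push_runtime_image` is a hypothetical helper, and the name of the locally built image is an assumption you need to fill in yourself (for example from the output of `docker images`), since this README does not state what `build.sh` names its result.

```shell
# Hypothetical helper: retag a locally built image and push it to your registry.
# Usage: push_runtime_image <local-image> <registry-image>
push_runtime_image() {
  local_image="$1"
  remote_image="$2"

  # Push only if tagging succeeded.
  docker tag "$local_image" "$remote_image" && docker push "$remote_image"
}
```

For example, `push_runtime_image <local-image> my-registry/trtllm-runtime:my-tag`, followed by updating the image references in the YAML files to the same registry name.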

Usage

1. Choose Your Template

Select the deployment pattern that matches your requirements:

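Use agg.yaml for simple testing
Use agg-with-config.yaml when the worker needs a custom TensorRT-LLM configuration
Use agg_router.yaml for production with KV cache routing and load balancing
Use disagg.yaml for maximum performance with separated workers
Use disagg_router.yaml for high performance with disaggregation and KV cache routing
Use disagg_planner.yaml for disaggregated serving with SLA-based automatic scaling (requires pre-deployment profiling)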

2. Customize Configuration

Edit the template to match your environment:

# Update image registry and tag
image: my-registry/trtllm-runtime:my-tag

# Configure your model and deployment settings
args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  # Add your model-specific arguments

3. Deploy

See the Create Deployment Guide for details on deploying the deployment file.

First, export the NAMESPACE you used in your Dynamo Cloud installation and create a secret for the HuggingFace token.

export NAMESPACE=<your-namespace>
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="${HF_TOKEN}" \
  -n "${NAMESPACE}"

Then, deploy the model using the deployment file.

cd dynamo/examples/backends/trtllm/deploy
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
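
After applying the file, it can help to confirm that the custom resource exists and that its pods become ready. This is a rough sketch under two assumptions: `wait_for_deployment` is a hypothetical helper, and the namespace contains only pods that are expected to reach the Ready condition (completed Job pods, for instance, would make the wait time out).

```shell
# Hypothetical helper: show the custom resource and wait for pods to become Ready.
# Usage: wait_for_deployment <namespace> [timeout]
wait_for_deployment() {
  ns="$1"
  timeout="${2:-600s}"

  # List DynamoGraphDeployment resources in the namespace.
  kubectl get dynamographdeployment -n "$ns" || return 1

  # Block until every pod in the namespace reports Ready, or the timeout expires.
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}
```

Model loading can take a while, so a generous timeout such as `wait_for_deployment "$NAMESPACE" 1200s` may be needed for larger models.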

4. Using Custom Dynamo Frameworks Image for TensorRT-LLM

To use a custom dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:

export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
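
If you want to review the generated file before applying it, something like the following sketch may be useful. `preview_generated` is a hypothetical helper. It prints a unified diff against the original template and then runs a client-side dry run, which does not create any resources.

```shell
# Hypothetical helper: review a generated manifest before applying it.
# Usage: preview_generated <original.yaml> <generated.yaml> <namespace>
preview_generated() {
  original="$1"
  generated="$2"
  ns="$3"

  # diff exits non-zero when the files differ, which is the expected case here.
  diff -u "$original" "$generated" || true

  # Client-side dry run: reports what would be applied without creating resources.
  kubectl apply --dry-run=client -f "$generated" -n "$ns"
}
```

For example: `preview_generated "$DEPLOYMENT_FILE" "$DEPLOYMENT_FILE.generated" "$NAMESPACE"`. Only the `image` lines should differ, although yq may also normalize some formatting.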

5. Port Forwarding

After deployment, forward the frontend service to access the API:

kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
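
The exact resource name depends on your deployment, so the placeholder in the command above has to be looked up. A minimal sketch for doing that, assuming the frontend resources contain "frontend" in their names (`find_frontend` is a hypothetical helper):

```shell
# Hypothetical helper: list frontend resources to find the name to forward.
# Usage: find_frontend <namespace>
find_frontend() {
  # -o name prints one TYPE/NAME entry per line, which keeps the output easy to filter.
  kubectl get pods,deployments,services -n "$1" -o name | grep -i 'frontend'
}
```

If the deployment is not in your current default namespace, add `-n "$NAMESPACE"` to the `kubectl port-forward` command as well.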

Configuration Options

Environment Variables

To change the log level, set the DYN_LOG environment variable by adding the following to the YAML file:

...
spec:
  envs:
    - name: DYN_LOG
      value: "debug" # or other log levels
  ...
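
To confirm that the variable reached a running container, you can print it from inside a pod. This is a small sketch with a hypothetical helper name, `show_log_level`. It assumes the container image provides `printenv`.

```shell
# Hypothetical helper: print the DYN_LOG value seen inside a running pod.
# Usage: show_log_level <namespace> <pod-name>
show_log_level() {
  kubectl exec "$2" -n "$1" -- printenv DYN_LOG
}
```

`printenv` exits with a non-zero status when the variable is not set, so an error here usually means the pod was started before the change was applied.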

TensorRT-LLM Worker Configuration

TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:

KV Cache Transfer: Choose between UCX (default) and NIXL for disaggregated serving
Request Migration: Enable graceful failure handling with --migration-limit

Testing the Deployment

Send a test request to verify your deployment. See the client section for detailed instructions.

Note: For multi-node deployments, target the node running python3 -m dynamo.frontend <args>.
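
As an illustration, a test request could be sent with curl as sketched below. This assumes the port forward from the Usage section is active and that the frontend serves the standard OpenAI `/v1/chat/completions` path. The README only says "OpenAI-compatible", so treat the path and the request fields as assumptions. `send_test_request` is a hypothetical helper, and the model name must match the model your workers serve.

```shell
# Hypothetical helper: send one chat completion request to the frontend.
# Usage: send_test_request <base-url> <model-name>
send_test_request() {
  base_url="$1"
  model="$2"

  # -sS hides the progress meter but still prints errors.
  curl -sS "$base_url/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": 32}"
}
```

For example: `send_test_request http://localhost:8000 <model-name>`.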

Model Configuration

The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

Monitoring and Health

Frontend health endpoint: http://<frontend-service>:8000/health
Worker health endpoints: http://<worker-service>:9090/health
Liveness probes: Check process health every 5 seconds
Readiness probes: Check service readiness with configurable delays
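
The health endpoints listed above can also be polled by hand, for example while a model is still loading. A minimal sketch, with `wait_for_health` as a hypothetical helper and the retry count and delay chosen arbitrarily:

```shell
# Hypothetical helper: poll a health endpoint until it returns an HTTP success status.
# Usage: wait_for_health <url> [attempts] [delay-seconds]
wait_for_health() {
  url="$1"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error status codes.
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

With the frontend port forwarded locally, `wait_for_health http://localhost:8000/health` checks the frontend endpoint.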

KV Cache Transfer Methods

TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:

UCX (default): Standard method for KV cache transfer
NIXL (experimental): Alternative transfer method

For detailed configuration instructions, see the KV cache transfer guide.

Request Migration

You can enable request migration to handle worker failures gracefully by adding the migration limit argument to worker configurations:

args:
  - "python3"
  - "-m"
  - "dynamo.trtllm"
  - "--migration-limit"
  - "3"

Benchmarking

To benchmark your deployment with AIPerf, see this utility script: perf.sh

Configure the model name and host based on your deployment.

Troubleshooting

Common issues and solutions:

Pod fails to start: Check image registry access and HuggingFace token secret
GPU not allocated: Verify cluster has GPU nodes and proper resource limits
Health check failures: Review model loading logs and increase initialDelaySeconds
Out of memory: Increase memory limits or reduce model batch size
Port forwarding issues: Ensure correct pod UUID in port-forward command
Git LFS issues: Ensure git-lfs is installed before building containers
ARM deployment: Use --platform linux/arm64 when building on ARM machines
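
For most of the issues above, the first step is to collect the pod description, recent logs, and namespace events. A rough sketch using standard kubectl commands (`diagnose_pod` is a hypothetical helper):

```shell
# Hypothetical helper: gather basic diagnostics for one pod.
# Usage: diagnose_pod <namespace> <pod-name>
diagnose_pod() {
  ns="$1"
  pod="$2"

  # Scheduling, image pull, and probe failures appear in the Events section.
  kubectl describe pod "$pod" -n "$ns"

  # Recent container output, such as model loading progress or memory errors.
  kubectl logs "$pod" -n "$ns" --tail=100

  # Namespace events sorted by their last timestamp.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```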

For additional support, refer to the deployment troubleshooting guide.
