28 changes: 27 additions & 1 deletion components/backends/sglang/deploy/README.md
@@ -103,8 +103,34 @@ args:
```

### 3. Deploy

Use the following commands to deploy the model.

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```
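
As a quick sanity check (the `dynamographdeployments` resource name is an assumption, following standard Kubernetes pluralization of the CRD kind), you can watch the deployment come up:

```bash
# List the custom resource and watch its pods start
kubectl get dynamographdeployments -n ${NAMESPACE}
kubectl get pods -n ${NAMESPACE} -w
```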

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```

## Model Configuration
49 changes: 3 additions & 46 deletions components/backends/trtllm/README.md
@@ -189,61 +189,18 @@ For comprehensive instructions on multinode serving, see the [multinode-examples

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
```

#### Deploy to Kubernetes

Example with `disagg.yaml`: export the NAMESPACE you used in your Dynamo Cloud installation, then apply the manifest.

```bash
cd dynamo
cd components/backends/trtllm/deploy
kubectl apply -f disagg.yaml -n $NAMESPACE
```

To change the `DYN_LOG` level, edit the YAML file by adding:

```yaml
...
spec:
envs:
- name: DYN_LOG
value: "debug" # or other log levels
...
```
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).

### Client

See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)


## Disaggregation Strategy
Expand Down
289 changes: 288 additions & 1 deletion components/backends/trtllm/deploy/README.md
@@ -1 +1,288 @@
# TensorRT-LLM Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
- `TRTLLMWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server (with kv router mode enabled)
- `TRTLLMWorker`: Multiple workers handling both prefill and decode (2 replicas for load balancing)

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker

### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
Advanced disaggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: HTTP API server (with kv router mode enabled)
- `TRTLLMDecodeWorker`: Specialized decode-only worker
- `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: <deployment-name>
spec:
services:
<ServiceName>:
# Service configuration
```

### Key Configuration Options

**Resource Management:**
```yaml
resources:
requests:
cpu: "10"
memory: "20Gi"
gpu: "1"
limits:
cpu: "10"
memory: "20Gi"
gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
mainContainer:
image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
workingDir: /workspace/components/backends/trtllm
args:
- "python3"
- "-m"
- "dynamo.trtllm"
# Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for TensorRT-LLM runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
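
Before deploying, it can help to verify these prerequisites from the command line. A minimal sketch, assuming the CRD and secret names used by the templates above:

```bash
# Check that the DynamoGraphDeployment CRD is installed
kubectl get crd | grep -i dynamographdeployment

# Check that the HuggingFace token secret exists in your namespace
kubectl get secret hf-token-secret -n ${NAMESPACE}
```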

### Container Images

The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:

```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```
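
The tag-and-push step might look like the following sketch; the local image name and registry path are placeholders, so substitute the tag reported by the build script and your own registry:

```bash
# The local image name below is illustrative; use the tag reported by
# ./container/build.sh, then push to your own registry
docker tag dynamo:latest-tensorrtllm your-registry/trtllm-runtime:your-tag
docker push your-registry/trtllm-runtime:your-tag
```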

**Note:** TensorRT-LLM uses git-lfs, which needs to be installed in advance:
```bash
apt-get update && apt-get -y install git git-lfs
```

For ARM machines, use:
```bash
./container/build.sh --framework tensorrtllm --platform linux/arm64
```

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for simple testing
- Use `agg_router.yaml` for production with KV cache routing and load balancing
- Use `disagg.yaml` for maximum performance with separated workers
- Use `disagg_router.yaml` for high-performance with KV cache routing and disaggregation

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/trtllm-runtime:your-tag

# Configure your model and deployment settings
args:
- "python3"
- "-m"
- "dynamo.trtllm"
# Add your model-specific arguments
```

### 3. Deploy

See the [Create Deployment Guide](../../../../docs/guides/dynamo_deploy/create_deployment.md) to learn how to apply a deployment file.

First, create a secret for the HuggingFace token.
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```

Then, deploy the model using the deployment file.

Export the NAMESPACE you used in your Dynamo Cloud installation.

```bash
cd dynamo/components/backends/trtllm/deploy
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
```
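
You can then watch the rollout; a minimal check, assuming you want to block until all pods in the namespace are Ready:

```bash
# Wait for the deployment's pods to become Ready (large models can take
# a while to download and load, so adjust the timeout as needed)
kubectl get pods -n $NAMESPACE
kubectl wait --for=condition=Ready pod --all -n $NAMESPACE --timeout=600s
```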

### 4. Using Custom Dynamo Frameworks Image for TensorRT-LLM

To use a custom Dynamo frameworks image for TensorRT-LLM, update the deployment file using yq:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
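
To confirm the substitution before applying, you can inspect the generated file with the same `yq` path:

```bash
# Print the image now set on each service to confirm the override
yq '.spec.services.[].extraPodSpec.mainContainer.image' $DEPLOYMENT_FILE.generated
```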

### 5. Port Forwarding

After deployment, forward the frontend service to access the API:

```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
```
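
With the port-forward active, a quick check against the frontend's health endpoint (documented under Monitoring and Health below) verifies the service is up:

```bash
# Should return a successful response once the model has loaded
curl http://localhost:8000/health
```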

## Configuration Options

### Environment Variables

To change `DYN_LOG` level, edit the yaml file by adding:

```yaml
...
spec:
envs:
- name: DYN_LOG
value: "debug" # or other log levels
...
```
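
After re-applying the file, you can confirm the new level by tailing a worker's logs; the deployment name below is a placeholder:

```bash
# Follow the logs of one of the worker deployments; the name here is
# illustrative -- find the real one with `kubectl get deploy -n $NAMESPACE`
kubectl logs deployment/<worker-deployment-name> -n $NAMESPACE -f
```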

### TensorRT-LLM Worker Configuration

TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:

- **Disaggregation Strategy**: Control request flow with `DISAGGREGATION_STRATEGY` environment variable
- **KV Cache Transfer**: Choose between UCX (default) or NIXL for disaggregated serving
- **Request Migration**: Enable graceful failure handling with `--migration-limit`

### Disaggregation Strategy

The disaggregation strategy controls how requests are distributed between prefill and decode workers:

- **`decode_first`** (default): Requests routed to decode worker first, then forwarded to prefill worker
- **`prefill_first`**: Requests routed directly to prefill worker (used with KV routing)

Set via environment variable:
```yaml
envs:
- name: DISAGGREGATION_STRATEGY
value: "prefill_first"
```

## Testing the Deployment

Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions.

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
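
For a quick manual test, a minimal OpenAI-style request through the forwarded port might look like the sketch below; the model name is a placeholder and must match the model your workers actually serve:

```bash
# Replace the model field with the model your deployment serves
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```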

## Model Configuration

The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.

### Multi-Token Prediction (MTP) Support

For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires the experimental TensorRT-LLM commit:

```bash
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
```

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Worker health endpoints**: `http://<worker-service>:9090/health`
- **Liveness probes**: Check process health every 5 seconds
- **Readiness probes**: Check service readiness with configurable delays

## KV Cache Transfer Methods

TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:

- **UCX** (default): Standard method for KV cache transfer
- **NIXL** (experimental): Alternative transfer method

For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-tranfer.md).

## Request Migration

You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:

```yaml
args:
- "python3"
- "-m"
- "dynamo.trtllm"
- "--migration-limit"
- "3"
```

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)

Configure the `model` name and `host` based on your deployment.
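
If you prefer to invoke GenAI-Perf directly rather than editing the script, a sketch like the following should be close; exact flags can vary across GenAI-Perf versions, so treat this as an assumption to verify against your installed version:

```bash
# Profile the chat endpoint exposed by the port-forwarded frontend;
# the model name is a placeholder
genai-perf profile \
  -m <your-model-name> \
  --endpoint-type chat \
  --url http://localhost:8000 \
  --streaming
```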

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
6. **Git LFS issues**: Ensure git-lfs is installed before building containers
7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).