120 changes: 62 additions & 58 deletions components/backends/trtllm/README.md
@@ -15,29 +15,10 @@
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Examples using TensorRT-LLM
# LLM Deployment using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

# User Documentation

- [Deployment Architectures](#deployment-architectures)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Build docker](#build-docker)
- [Run container](#run-container)
- [Run deployment](#run-deployment)
- [Single Node deployment](#single-node-deployments)
- [Multinode deployment](#multinode-deployment)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [More Example Architectures](#more-example-architectures)
- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)

# Quick Start

## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:
@@ -50,26 +31,52 @@

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Deployment Architectures
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)

## Feature Support Matrix

### Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |

### Large Scale P/D and WideEP Features

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | 🚧 | Not supported |
| **DP Rank Routing**| 🚧 | Not supported |
| **GB200 Support**  | ✅           |                                                                         |

Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
## Quick Start

## Getting Started
Below we provide a guide that lets you run all of the common deployment patterns on a single node.

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Start NATS and ETCD in the background

### Prerequisites
Start them using [Docker Compose](../../../deploy/docker-compose.yml):

Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
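
To confirm the services came up cleanly, a quick check (assuming Docker Compose v2):

```bash
# List the etcd and NATS containers started by the compose file
docker compose -f deploy/docker-compose.yml ps
```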

### Build docker
### Build container

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
@@ -89,17 +96,18 @@
apt-get update && apt-get -y install git git-lfs
```

### Run container

```bash
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment

This figure shows an overview of the major components to deploy:
## Single Node Examples

> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can easily take each command and run it in a separate terminal.

This figure shows an overview of the major components to deploy:

```

+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker1 |------------>| Worker2 |
| |<-----| |<-----| |<------------| |
@@ -111,29 +119,23 @@
| +---------| kv-router |
+------------->| |
+------------------+

```

**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.

### Single-Node Deployments


#### Aggregated
### Aggregated
```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg.sh
```

#### Aggregated with KV Routing
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg_router.sh
```

#### Disaggregated
### Disaggregated

> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
@@ -143,7 +145,7 @@

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg.sh
```
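
For example, to switch strategies (a minimal sketch, assuming `disagg.sh` reads `DISAGGREGATION_STRATEGY` from the environment as described above):

```bash
# Use the "prefill_first" strategy instead of the default "decode_first"
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
```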

#### Disaggregated with KV Routing
### Disaggregated with KV Routing

> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In a disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
@@ -153,7 +155,7 @@

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg_router.sh
```

#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/components/backends/trtllm

```

@@ -172,21 +174,16 @@

Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark; a short sketch follows after this list.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
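
As a sketch of the warm-up suggestion above (assumes the frontend listens on `localhost:8000` with an OpenAI-compatible endpoint; the model name is a placeholder, so substitute the one your deployment serves):

```bash
# Send a few short warm-up requests; the first two are expected to be slow
for i in 1 2 3; do
  curl -s localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 8}' \
    > /dev/null
done
```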

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**

## Disaggregation Strategy

@@ -221,6 +218,13 @@
indicates a request to this model may be migrated up to 3 times to another Backend.

The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
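
For example, a hedged sketch of what this might look like on a worker command line (the `--migration-limit` flag name and the worker entrypoint are assumptions for illustration; the exact invocation appears in the folded portion of this section and in your launch script):

```bash
# Illustrative: start a worker that allows each request to be migrated
# up to 3 times to another backend before the request fails
python3 main.py --migration-limit=3
```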

## More Example Architectures
## Client

See the [client](../llm/README.md#client) section to learn how to send a request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
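
As a quick smoke test against a local deployment (a sketch: assumes the frontend listens on `localhost:8000` with the OpenAI-compatible API; replace the model name with the one your deployment reports):

```bash
# Placeholder model name; use the model your worker actually serves
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
    "max_tokens": 64
  }'
```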

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
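
For example (a sketch; `perf.sh` defines its `model` and `host` variables near the top of the script, so edit those to match your deployment first):

```bash
# Run from the repository root after editing the variables in the script
bash benchmarks/llm/perf.sh
```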
86 changes: 66 additions & 20 deletions components/backends/vllm/README.md
@@ -7,33 +7,81 @@
SPDX-License-Identifier: Apache-2.0

This directory contains a Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

## Deployment Architectures
## Use the Latest Release

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.
We recommend using the latest stable release of Dynamo to avoid breaking changes:

## Getting Started
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

### Prerequisites
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)

## Feature Support Matrix

### Core Dynamo Features

| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | 🚧 | Not supported |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Not supported |

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start them using [Docker Compose](../../../deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

### Build and Run docker
### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:

```bash
./container/build.sh --framework VLLM
```

### Run container

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

This container includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables support for external control of the DP ranks.

## Run Deployment
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can also run each command in a separate terminal for better log visibility.

This figure shows an overview of the major components to deploy:

@@ -53,57 +101,55 @@

Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.


#### Aggregated Serving
### Aggregated Serving

```bash
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
```
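
Once the frontend and worker are up, a quick smoke test (a sketch: assumes the frontend listens on `localhost:8000` with the OpenAI-compatible API; the model name is a placeholder):

```bash
# Placeholder model name; substitute the model your worker serves
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```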

#### Aggregated Serving with KV Routing
### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
```

#### Disaggregated Serving
### Disaggregated Serving

```bash
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
```

#### Disaggregated Serving with KV Routing
### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
```

#### Single Node Data Parallel Attention / Expert Parallelism
### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant but showcases dynamo routing to data parallel workers
This example is not meant to be performant but showcases Dynamo routing to data parallel workers.

```bash
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
```


> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
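
One way to try this (the flag below is an assumption for illustration; mirror the exact arguments `launch/disagg.sh` passes to its prefill worker):

```bash
# In a new terminal inside the container, start one more prefill worker
# (--is-prefill-worker is illustrative; check launch/disagg.sh for the real flags)
cd components/backends/vllm
CUDA_VISIBLE_DEVICES=2 python3 main.py --is-prefill-worker
```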

## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
@@ -118,7 +164,7 @@

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
```bash
./container/build.sh --framework VLLM
# Tag and push to your container registry
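# For example (hypothetical image name and registry; substitute your own):
docker tag dynamo-vllm:latest myregistry.example.com/dynamo-vllm:latest
docker push myregistry.example.com/dynamo-vllm:latest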
```