120 changes: 62 additions & 58 deletions components/backends/trtllm/README.md
@@ -15,29 +15,10 @@
See the License for the specific language governing permissions and
limitations under the License.
-->

# LLM Deployment Examples using TensorRT-LLM
# LLM Deployment using TensorRT-LLM

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

# User Documentation

- [Deployment Architectures](#deployment-architectures)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Build docker](#build-docker)
- [Run container](#run-container)
- [Run deployment](#run-deployment)
- [Single Node deployment](#single-node-deployments)
- [Multinode deployment](#multinode-deployment)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [More Example Architectures](#more-example-architectures)
- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)

# Quick Start

## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes:
@@ -50,26 +31,52 @@

You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

## Deployment Architectures
---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)

## Feature Support Matrix

### Core Dynamo Features

| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |

### Large Scale P/D and WideEP Features

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture.
| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | 🚧 | Not supported |
| **DP Rank Routing**| 🚧 | Not supported |
| **GB200 Support**  | ✅           |                                                                         |

Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving.
## Quick Start

## Getting Started
Below we provide a guide that lets you run all of the common deployment patterns on a single node.

1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Start NATS and ETCD in the background

### Prerequisites
Start them using [Docker Compose](../../../deploy/docker-compose.yml):

Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
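
To confirm the services came up cleanly, a quick check (assuming Docker Compose v2):

```bash
# List the etcd and NATS containers started by the compose file
docker compose -f deploy/docker-compose.yml ps
```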

### Build docker
### Build container

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
@@ -89,17 +96,18 @@
apt-get update && apt-get -y install git git-lfs
```

### Run container

```bash
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment

This figure shows an overview of the major components to deploy:
## Single Node Examples

> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script simply runs `dynamo-run` to start the ingress and uses `python3` to start the workers. You can easily take each command and run it in a separate terminal.

This figure shows an overview of the major components to deploy:

```

+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker1 |------------>| Worker2 |
| |<-----| |<-----| |<------------| |
@@ -111,29 +119,23 @@
| +---------| kv-router |
+------------->| |
+------------------+

```

**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.

### Single-Node Deployments


#### Aggregated
### Aggregated
```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg.sh
```

#### Aggregated with KV Routing
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg_router.sh
```

#### Disaggregated
### Disaggregated

> [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
@@ -143,7 +145,7 @@

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg.sh
```
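
For example, to switch strategies (a minimal sketch, assuming `disagg.sh` reads `DISAGGREGATION_STRATEGY` from the environment as described above):

```bash
# Use the "prefill_first" strategy instead of the default "decode_first"
DISAGGREGATION_STRATEGY="prefill_first" ./launch/disagg.sh
```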

#### Disaggregated with KV Routing
### Disaggregated with KV Routing

> [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In a disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
@@ -153,7 +155,7 @@

```bash
cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg_router.sh
```

#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/components/backends/trtllm

```

@@ -172,21 +174,16 @@

Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark; a short sketch follows after this list.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
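
As a sketch of the warm-up suggestion above (assumes the frontend listens on `localhost:8000` with an OpenAI-compatible endpoint; the model name is a placeholder, so substitute the one your deployment serves):

```bash
# Send a few short warm-up requests; the first two are expected to be slow
for i in 1 2 3; do
  curl -s localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 8}' \
    > /dev/null
done
```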

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Multinode Deployment

For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.

### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**

## Disaggregation Strategy

@@ -221,6 +218,13 @@
indicates a request to this model may be migrated up to 3 times to another Backend.

The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
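
For example, a hedged sketch of what this might look like on a worker command line (the `--migration-limit` flag name and the worker entrypoint are assumptions for illustration; the exact invocation appears in the folded portion of this section and in your launch script):

```bash
# Illustrative: start a worker that allows each request to be migrated
# up to 3 times to another backend before the request fails
python3 main.py --migration-limit=3
```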

## More Example Architectures
## Client

See the [client](../llm/README.md#client) section to learn how to send a request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
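
As a quick smoke test against a local deployment (a sketch: assumes the frontend listens on `localhost:8000` with the OpenAI-compatible API; replace the model name with the one your deployment reports):

```bash
# Placeholder model name; use the model your worker actually serves
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
    "max_tokens": 64
  }'
```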

## Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
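
For example (a sketch; `perf.sh` defines its `model` and `host` variables near the top of the script, so edit those to match your deployment first):

```bash
# Run from the repository root after editing the variables in the script
bash benchmarks/llm/perf.sh
```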
86 changes: 66 additions & 20 deletions components/backends/vllm/README.md
@@ -7,33 +7,81 @@
SPDX-License-Identifier: Apache-2.0

This directory contains a Dynamo vLLM engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

## Deployment Architectures
## Use the Latest Release

See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns.
We recommend using the latest stable release of Dynamo to avoid breaking changes:

## Getting Started
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)

### Prerequisites
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:

Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

---

## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)

## Feature Support Matrix

### Core Dynamo Features

| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | 🚧 | Not supported |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Not supported |

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start them using [Docker Compose](../../../deploy/docker-compose.yml):

```bash
docker compose -f deploy/docker-compose.yml up -d
```

### Build and Run docker
### Pull or build container

We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:

```bash
./container/build.sh --framework VLLM
```

### Run container

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

This container includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790), which enables support for external control of the DP ranks.

## Run Deployment
## Run Single Node Examples

> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can also run each command in a separate terminal for better log visibility.

This figure shows an overview of the major components to deploy:

@@ -53,57 +101,55 @@

Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.


#### Aggregated Serving
### Aggregated Serving

```bash
# requires one gpu
cd components/backends/vllm
bash launch/agg.sh
```
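
Once the frontend and worker are up, a quick smoke test (a sketch: assumes the frontend listens on `localhost:8000` with the OpenAI-compatible API; the model name is a placeholder):

```bash
# Placeholder model name; substitute the model your worker serves
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```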

#### Aggregated Serving with KV Routing
### Aggregated Serving with KV Routing

```bash
# requires two gpus
cd components/backends/vllm
bash launch/agg_router.sh
```

#### Disaggregated Serving
### Disaggregated Serving

```bash
# requires two gpus
cd components/backends/vllm
bash launch/disagg.sh
```

#### Disaggregated Serving with KV Routing
### Disaggregated Serving with KV Routing

```bash
# requires three gpus
cd components/backends/vllm
bash launch/disagg_router.sh
```

#### Single Node Data Parallel Attention / Expert Parallelism
### Single Node Data Parallel Attention / Expert Parallelism

This example is not meant to be performant but showcases dynamo routing to data parallel workers
This example is not meant to be performant but showcases Dynamo routing to data parallel workers.

```bash
# requires four gpus
cd components/backends/vllm
bash launch/dep.sh
```


> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
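
One way to try this (the flag below is an assumption for illustration; mirror the exact arguments `launch/disagg.sh` passes to its prefill worker):

```bash
# In a new terminal inside the container, start one more prefill worker
# (--is-prefill-worker is illustrative; check launch/disagg.sh for the real flags)
cd components/backends/vllm
CUDA_VISIBLE_DEVICES=2 python3 main.py --is-prefill-worker
```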

## Advanced Examples

Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
@@ -118,7 +164,7 @@

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image:
- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
```bash
./container/build.sh --framework VLLM
# Tag and push to your container registry
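# For example (hypothetical image name and registry; substitute your own):
docker tag dynamo-vllm:latest myregistry.example.com/dynamo-vllm:latest
docker push myregistry.example.com/dynamo-vllm:latest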
```