24 changes: 12 additions & 12 deletions README.md
@@ -40,7 +40,7 @@ High-throughput, low-latency inference framework designed for serving generative

Large language models are quickly outgrowing the memory and compute budget of any single GPU. Tensor-parallelism solves the capacity problem by spreading each layer across many GPUs—and sometimes many servers—but it creates a new one: how do you coordinate those shards, route requests, and share KV cache fast enough to feel like one accelerator? This orchestration gap is exactly what NVIDIA Dynamo is built to close.

Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others) and captures LLM-specific capabilities such as:
Dynamo is designed to be inference engine agnostic (supports SGLang, TRT-LLM, vLLM or others) and captures LLM-specific capabilities such as:

- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates trading off throughput against latency.
- **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand
@@ -54,20 +54,20 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
| ------------------------------------------------------------------------------------------------- | ---- | ------ | ------------ |
| [**Disaggregated Serving**](/docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | ✅ | 🚧 | ✅ |
| Feature | SGLang | TensorRT-LLM | vLLM |
| ------------------------------------------------------------------------------------------------- | ------ | ------------ | ---- |
| [**Disaggregated Serving**](/docs/design_docs/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/router/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**Load Based Planner**](docs/planner/load_planner.md) | 🚧 | 🚧 | 🚧 |
| [**SLA-Based Planner**](docs/planner/sla_planner.md) | ✅ | ✅ | ✅ |
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 | ✅ | ✅ |

To learn more about each framework and its capabilities, check out each framework's README!

- **[vLLM](docs/backends/vllm/README.md)**
- **[SGLang](docs/backends/sglang/README.md)**
- **[TensorRT-LLM](docs/backends/trtllm/README.md)**
- **[vLLM](docs/backends/vllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS-first (open source software) development approach.

@@ -111,15 +111,15 @@ To run locally without etcd, pass `--store-kv file` to both the frontend and worker

## 2. Select an engine

We publish Python wheels specialized for each of our supported engines: vllm, sglang, and trtllm. The examples that follow use SGLang; continue reading for other engines.
We publish Python wheels specialized for each of our supported engines: sglang, trtllm, and vllm. The examples that follow use SGLang; continue reading for other engines.

```bash
uv venv venv
source venv/bin/activate
uv pip install pip

# Choose one
uv pip install "ai-dynamo[sglang]" #replace with [vllm], [trtllm], etc.
uv pip install "ai-dynamo[sglang]" #replace with [trtllm], [vllm], etc.
```

## 3. Run Dynamo
2 changes: 1 addition & 1 deletion components/README.md
@@ -25,9 +25,9 @@ This directory contains the core components that make up the Dynamo inference fr

Dynamo supports multiple inference engines, each with their own deployment configurations and capabilities:

- **[vLLM](/docs/backends/vllm/README.md)** - Full-featured vLLM integration with disaggregated serving, KV-aware routing, SLA-based planning, native KV cache events, and NIXL-based transfer mechanisms
- **[SGLang](/docs/backends/sglang/README.md)** - SGLang engine integration with ZMQ-based communication, supporting disaggregated serving and KV-aware routing
- **[TensorRT-LLM](/docs/backends/trtllm/README.md)** - TensorRT-LLM integration with disaggregated serving capabilities and TensorRT acceleration
- **[vLLM](/docs/backends/vllm/README.md)** - Full-featured vLLM integration with disaggregated serving, KV-aware routing, SLA-based planning, native KV cache events, and NIXL-based transfer mechanisms

Each engine provides launch and deploy scripts for different deployment patterns in the [examples](../examples/backends/) folder.

2 changes: 1 addition & 1 deletion docs/agents/tool-calling.md
@@ -13,7 +13,7 @@ To enable this feature, you should set the following flag while launching the ba
- `--dyn-tool-call-parser` : select the parser from the available parsers list using the below command

```bash
# <backend> can be vllm, sglang, trtllm, etc. based on your installation
# <backend> can be sglang, trtllm, vllm, etc. based on your installation
python -m dynamo.<backend> --help
```

2 changes: 1 addition & 1 deletion docs/design_docs/architecture.md
@@ -18,7 +18,7 @@ limitations under the License.

# High Level Architecture

Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting SGLang, TRT-LLM, vLLM and others, while capturing essential LLM capabilities:

- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
20 changes: 10 additions & 10 deletions docs/hidden_toctree.rst
@@ -44,16 +44,6 @@
fault_tolerance/request_migration.md
fault_tolerance/request_cancellation.md

backends/trtllm/multinode/multinode-examples.md
backends/trtllm/multinode/multinode-multimodal-example.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/multimodal_support.md
backends/trtllm/multimodal_epd.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md

backends/sglang/multinode-examples.md
backends/sglang/dsr1-wideep-gb200.md
backends/sglang/dsr1-wideep-h100.md
@@ -64,6 +54,16 @@
backends/sglang/sglang-disaggregation.md
backends/sglang/prometheus.md

backends/trtllm/multinode/multinode-examples.md
backends/trtllm/multinode/multinode-multimodal-example.md
backends/trtllm/llama4_plus_eagle.md
backends/trtllm/kv-cache-transfer.md
backends/trtllm/multimodal_support.md
backends/trtllm/multimodal_epd.md
backends/trtllm/gemma3_sliding_window_attention.md
backends/trtllm/gpt-oss.md
backends/trtllm/prometheus.md

examples/README.md
examples/runtime/hello_world/README.md

2 changes: 1 addition & 1 deletion docs/kvbm/kvbm_architecture.md
@@ -23,7 +23,7 @@ The KVBM serves as a critical infrastructure component for scaling LLM inference
![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../images/kvbm-architecture.png)
**High level layered architecture view of Dynamo KV Block manager and how it interfaces with different components of LLM inference ecosystem**

The KVBM has three primary logical layers. The top layer-the LLM inference runtimes (TRTLLM, vLLM and SGLang)-integrates through a dedicated connector module to the Dynamo KVBM module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBMs block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.
The KVBM has three primary logical layers. The top layer, the LLM inference runtimes (SGLang, TRTLLM, and vLLM), integrates through a dedicated connector module with the Dynamo KVBM module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.

The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout of incoming requests across runtimes and forwards them to the core memory manager. The KVBM and core modules implement the required internal functionality, such as table lookups, memory allocation, block layout management, lifecycle and state transitions, and policy-based block reuse or eviction. The KVBM layer also provides the abstractions needed for external components to override or augment its behavior.

2 changes: 1 addition & 1 deletion docs/kvbm/kvbm_integrations.md
@@ -18,7 +18,7 @@ limitations under the License.

# KVBM Integrations

KVBM Integrates with Inference frameworks (vLLM, TRTLLM, SGLang) via Connector APIs to influence KV caching behaviour, scheduling, and forward pass execution.
KVBM integrates with inference frameworks (SGLang, TRTLLM, vLLM) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
The interface has two components: a Scheduler and a Worker. The Scheduler (leader) orchestrates KV block offload/onboard and builds metadata specifying the transfers for the workers; it also maintains hooks for handling asynchronous transfer completion. The Worker reads the metadata built by the scheduler (leader) and performs asynchronous onboarding/offloading at the end of the forward pass.

## Typical KVBM Integrations
6 changes: 3 additions & 3 deletions docs/multimodal/multimodal_intro.md
@@ -44,10 +44,10 @@ The PD approach is a more traditional, aggregated method where the inference eng

## Inference Framework Support Matrix

Dynamo supports multimodal capabilities across leading LLM inference backends, including **vLLM**, **TensorRT-LLM (TRT-LLM)**, and **SGLang**. The table below details the current support level for EPD/PD and various media types for each stack.
Dynamo supports multimodal capabilities across leading LLM inference backends, including **SGLang**, **TensorRT-LLM (TRT-LLM)**, and **vLLM**. The table below details the current support level for EPD/PD and various media types for each stack.

| Stack | EPD Support | PD Support | Image | Video | Audio |
| --------- | --------- | --------- | --------- |---------| --------- |
| **vLLM** | ✅ | ✅ | ✅ | ✅ | 🚧 |
| **TRT-LLM** | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| **SGLang** | ✅ | ❌ | ✅ | ❌ | ❌ |
| **TRT-LLM** | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| **vLLM** | ✅ | ✅ | ✅ | ✅ | 🚧 |
4 changes: 2 additions & 2 deletions docs/reference/support-matrix.md
@@ -64,10 +64,10 @@ If you are using a **GPU**, the following GPU models and architectures are suppo

| **Build Dependency** | **Version** |
| :------------------- | :------------------------------------------------------------------------------- |
| **TensorRT-LLM** | 1.1.0rc5 |
| **NIXL** | 0.7.0 |
| **vLLM** | 0.10.1.1 |
| **SGLang** | 0.5.3rc0 |
| **TensorRT-LLM** | 1.1.0rc5 |
| **vLLM** | 0.10.1.1 |

> [!Important]
> Specific versions of TensorRT-LLM supported by Dynamo are subject to change. Currently, TensorRT-LLM does not support Python 3.11, so installation of `ai-dynamo[trtllm]` will fail on that version.
117 changes: 117 additions & 0 deletions docs/router/standalone_router.md
@@ -0,0 +1,117 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Router Standalone

A toy implementation of KvRouter that demonstrates standalone usage without depending on the Dynamo runtime, the etcd control plane, or the NATS event plane.

## Overview

This example shows how to use KvRouter in a standalone fashion to intelligently route requests across multiple vLLM workers based on KV cache overlap and load metrics. The router maintains a view of each worker's cached blocks and routes new requests to the worker with the best combination of cache overlap and available capacity.

> [!Tip]
> The main focus should be on `router.py`, as it contains the bulk of the non-boilerplate code and the core routing logic.

## How It Works

### Core Architecture

The router uses a **RadixTree** data structure (written in Rust) to efficiently track which blocks each worker has cached. When a new request arrives, the router:

1. Uses `find_matches` to calculate overlap scores (number of matching blocks) between the request and each worker's cached blocks
2. Combines this with current load metrics to select the optimal worker
3. Routes the request to the chosen worker for processing
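
A minimal sketch of that selection step, assuming hypothetical inputs (`overlap_scores` as returned per worker by `find_matches`, `gpu_usage` from the load metrics); the actual policy lives in the Rust core and may weigh these signals differently:

```python
# Minimal sketch of the overlap + load scoring described above. The
# normalization and the load penalty are illustrative assumptions, not
# the actual KvRouter policy.

def select_worker(overlap_scores: dict[int, int],
                  gpu_usage: dict[int, float],
                  num_request_blocks: int) -> int:
    """Pick the worker with the best mix of cache overlap and spare capacity."""
    def score(worker_id: int) -> float:
        # Fraction of the request's blocks already cached on this worker...
        overlap = overlap_scores.get(worker_id, 0) / max(num_request_blocks, 1)
        # ...minus a penalty for how busy the worker currently is.
        return overlap - gpu_usage[worker_id]
    return max(gpu_usage, key=score)
```

A higher score favors workers that can reuse more cached prefix blocks without queueing behind existing load.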

### Event-Driven Updates

The router receives two types of events from vLLM engines:

1. **KV Events**: Emitted automatically by vLLM engines when blocks are cached/evicted
2. **Load Metrics**: GPU usage percentage and waiting request count via custom callbacks

These events keep the router's view of worker state up-to-date in real-time.
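
As a rough sketch of what consuming these events might look like, assuming a hypothetical JSON payload shape and radix-tree update methods (the actual vLLM event format and KvRouter bindings differ):

```python
# Hypothetical consumer applying one worker's KV events to the router's view.
# The payload fields and radix-tree methods are illustrative assumptions.
import zmq.asyncio

async def consume_kv_events(radix_tree, worker_id: int, port: int) -> None:
    ctx = zmq.asyncio.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(f"tcp://localhost:{port}")
    sock.setsockopt_string(zmq.SUBSCRIBE, "")
    while True:
        event = await sock.recv_json()
        if event["type"] == "block_stored":
            radix_tree.apply_event(worker_id, event)  # record newly cached blocks
        elif event["type"] == "block_removed":
            radix_tree.remove_blocks(worker_id, event["block_hashes"])
```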

### Alternative: Pure Predictive Routing

While not implemented in this example, the router can also operate in a purely predictive mode, estimating the radix tree state and loads based solely on the requests it receives, without relying on backend events. This requires simulating or mocking the backend engine's block management (e.g., eviction) and scheduling policies. This mode is not recommended: without real-time feedback from the engines, the router state may drift out of sync with the engine states. Nevertheless, it is a work in progress and may be supported in the future via our mocker engines.

## Components

> [!Note]
> This is a standalone toy implementation created for pedagogical purposes to demonstrate the core KvRouter concepts in isolation.
> Our default dynamo router is already very efficient and uses NATS for event communication and etcd for endpoint registration.
> This example intentionally avoids these production components to provide a simpler, self-contained demonstration of the routing logic and cache overlap mechanics.
>
> The toy communication pattern is as follows:
> - **OpenAI Compatible Frontend** – FastAPI application serving OpenAI compatible HTTP API.
> - **Router** – Standalone FastAPI endpoint for best-worker selection, with core routines implemented in Rust and exposed via Python bindings.
> - **Workers** – Served in-process within the frontend application to reduce complexity and boilerplate, rather than as separate endpoints.

### `router.py`
- **KvRouter**: Core routing logic using RadixTree
- Subscribes to KV cache events and load metrics from workers
- Implements `get_best_worker()` to select optimal routing destination
- Runs background tasks to periodically update worker states

### `worker.py`
- **VllmWorkers**: Manages multiple vLLM worker processes
- Each worker runs on a separate port with KV cache event emission enabled
- Provides `direct()` method for sending requests to specific workers
- Handles worker lifecycle and configuration

### `api.py`
- **RouterAPI**: Minimal FastAPI server providing OpenAI-compatible chat completions endpoint
- Enables in-process communication between router and workers
- Can be easily modified to use external communication (FastAPI clients, dynamo endpoints, etc.)
- Integrates with vLLM's OpenAI serving components for request preprocessing and response formatting
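
For intuition, the request path through this endpoint looks roughly like the sketch below; `get_best_worker()` and `direct()` are the methods described above, while `preprocess_to_tokens` and the `router`/`workers` globals are assumptions for illustration:

```python
# Rough shape of the frontend request path; exact signatures in api.py differ.
from fastapi import FastAPI, Request

app = FastAPI()  # router and workers are assumed to be constructed at startup

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    token_ids = preprocess_to_tokens(body)                # assumed helper (vLLM preprocessing)
    worker_id = await router.get_best_worker(token_ids)   # cache-overlap + load decision
    return await workers.direct(body, worker_id)          # in-process call to the chosen worker
```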

### `perf.sh`
- Benchmarking script using `aiperf` to test the router setup
- Configured for streaming chat completions with synthetic workloads
- Tests concurrent requests to evaluate routing performance

## Usage

1. **Install latest vLLM**:
```bash
uv pip uninstall ai-dynamo-vllm
uv pip install vllm==0.9.0
```
*Note: This uninstalls the local vLLM patch (`ai-dynamo-vllm`) and replaces it with the standard upstream vLLM package.*

2. **Start the router API**:
For example:
```bash
python api.py \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--num-workers 4 \
--block-size 64 \
--base-kv-events-port 5557 \
--base-metrics-port 5657 \
--router-port 7000 \
--http-port 8000
```

3. **Ping the endpoint (optional)**:
```bash
./ping.sh
```

4. **Run performance benchmark**:
```bash
./perf.sh
```
2 changes: 1 addition & 1 deletion examples/README.md
@@ -35,9 +35,9 @@ Learn fundamental Dynamo concepts through these introductory examples:
These examples broadly show how Dynamo works with major inference engines.

If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Examples Backends](../examples/backends/) directory:
- **[vLLM](backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](backends/trtllm/)** – TensorRT-LLM workflows and optimizations
- **[vLLM](backends/vllm/)** – vLLM-specific deployment and configuration

## Deployment Examples
