2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -46,7 +46,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
-| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker |
+| **Attention DP** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP worker |
| **GB200 Support** | ✅ | |


2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -61,7 +61,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | |
-| **DP Rank Routing**| ✅ | |
+| **Attention DP** | ✅ | |
> **Contributor:** Let's be consistent with table across all 3 backends if possible

> **Contributor Author:** great catch, thank you. will update all of them

| **GB200 Support** | ✅ | |

## Quick Start
2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
@@ -47,7 +47,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
-| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
+| **Attention DP** | ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
64 changes: 0 additions & 64 deletions docs/API/nixl_connect/README.md
@@ -85,76 +85,13 @@ flowchart LR
e2@{ animate: true; }
```

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.

3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker.

4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.

5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker.

6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data.

7. Finally, Decode Worker performs the requested inference.

```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
p0 i1@--"url"-->p1
p1 i2@--"prompt"-->dw[Decode Worker]
p1 i3@--"url"-->dw
dw i4@--"prompt"-->pw[Prefill Worker]
dw i5@--"url"-->pw
pw i6@--"url"-->ew[Encode Worker]
ew o0@=="image embeddings"==>pw
pw o1@=="kv_cache updates"==>dw
dw o2@--"inference results"-->p0

i0@{ animate: true; }
i1@{ animate: true; }
i2@{ animate: true; }
i3@{ animate: true; }
i4@{ animate: true; }
i5@{ animate: true; }
i6@{ animate: true; }
o0@{ animate: true; }
o1@{ animate: true; }
o2@{ animate: true; }
```

> [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes the NIXL base RDMA subsystem directly without using the Dynamo NIXL Connect library.

#### Code Examples

See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example,
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.

See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md),
a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and the worker awaits for the data transfer to complete for yielding a response.


## Python Classes

- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [ReadOperation](read_operation.md)
- [ReadableOperation](readable_operation.md)
- [SerializedRequest](serialized_request.md)
- [WritableOperation](writable_operation.md)
- [WriteOperation](write_operation.md)

@@ -164,5 +101,4 @@ and the worker awaits for the data transfer to complete for yielding a response.
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
2 changes: 1 addition & 1 deletion docs/architecture/dynamo_flow.md
@@ -67,7 +67,7 @@ Coordination and messaging support:

## Technical Implementation Details

-### NIXL (NVIDIA Interchange Library):
+### NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
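The bullets above describe etcd as the coordination store for NIXL metadata. As a rough sketch of how one might inspect that metadata, assuming a reachable etcd endpoint and a Dynamo key prefix (both are assumptions, not taken from this diff):

```bash
# Endpoint and key prefix are assumptions; adjust to your deployment.
# Lists the keys under which workers publish their GPU/NIXL metadata.
etcdctl --endpoints=http://localhost:2379 get --prefix /dynamo --keys-only
```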
2 changes: 1 addition & 1 deletion docs/architecture/kv_cache_routing.md
@@ -21,7 +21,7 @@ The KV-aware routing arguments:

- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.

-- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
+- `--kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.


## Architecture
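For context on the renamed flag, here is a minimal sketch of a router launch using the arguments documented above. Only `--kv-events` and `--router-temperature` come from the docs; the entrypoint name is a hypothetical placeholder:

```bash
# Hypothetical entrypoint; substitute your actual router launcher.
# --kv-events true       -> KvIndexer tracks block creation/deletion events
# --kv-events false      -> ApproxKvIndexer assumes cached prompts persist ~120s
# --router-temperature 0 -> deterministic: always route to the min-logit worker
python -m dynamo.router --kv-events true --router-temperature 0
```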
7 changes: 2 additions & 5 deletions docs/architecture/planner_intro.rst
@@ -19,13 +19,13 @@ Planner

The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.

-Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
+Currently, the planner can scale the number of vLLM workers up and down based on the kv cache load and prefill queue size:

Key features include:

* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
-* **Multi-backend support** for both local (Circus) and Kubernetes environments
+* **Multi-backend support** for Kubernetes environments
* **Graceful scaling** that ensures no requests are dropped during scale-down operations

.. list-table::
@@ -50,9 +50,6 @@ Key features include:
* -
- ❌
- SGLang
* -
- ❌
- llama.cpp
* - **Serving Type**
- ✅
- Aggregated
4 changes: 2 additions & 2 deletions docs/examples/README.md
@@ -45,9 +45,9 @@ Consult the examples below for the CRs for your specific inference backend.

[View SGLang k8s](../../components/backends/sglang/deploy/README.md)

-[View vLLM K8s](../../components/backends/vllm/deploy/README.md)
+[View vLLM K8s](../../components/backends/vllm/README.md#kubernetes-deployment)

-[View TRTLLM k8s](../../components/backends/trtllm/deploy/README.md)
+[View TRTLLM k8s](../../components/backends/trtllm/README.md#kubernetes-deployment)

**Note 1** Example Image

2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/dynamo_operator.md
@@ -75,7 +75,7 @@ spec:

## GitOps Deployment with FluxCD

-This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../components/backends/vllm/README.md) to demonstrate the workflow.
+This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../components/backends/vllm/README.md) to demonstrate the workflow.

### Prerequisites

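As a sketch of the FluxCD workflow described above, assuming the standard `flux` CLI and a hypothetical repository layout (URL, path, and names are placeholders, not taken from this PR):

```bash
# Register the Git repository as the source of truth (URL is a placeholder).
flux create source git dynamo-deploy \
  --url=https://github.com/your-org/dynamo-deploy \
  --branch=main \
  --interval=1m

# Reconcile the aggregated vLLM manifests from that source (path is a placeholder).
flux create kustomization dynamo-vllm-agg \
  --source=GitRepository/dynamo-deploy \
  --path=./deploy/vllm-agg \
  --prune=true \
  --interval=5m
```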
14 changes: 7 additions & 7 deletions docs/guides/metrics.md
@@ -25,11 +25,11 @@ Dynamo provides built-in metrics capabilities through the `MetricsRegistry` trai

Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also adds the following labels `dynamo_namespace`, `dynamo_component`, and `dynamo_endpoint` to indicate which component is providing the metric.

-**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of frontend metrics.
+**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of frontend metrics.

-**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of component metrics.
+**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of component metrics.

-**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics.
+**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for details on specialized component metrics.

## Coming Soon

@@ -49,7 +49,7 @@ This hierarchical structure allows you to create metrics at the appropriate leve

## Getting Started

-For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](../../deploy/metrics/README.md#getting-started) in the deploy metrics documentation.
+For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](./metrics_deployment.md#getting-started) in the metrics deployment guide.

The quick start includes:
- Docker Compose setup for Prometheus and Grafana
@@ -59,7 +59,7 @@ The quick start includes:

## Implementation Examples

-See [Implementation Examples](../../deploy/metrics/README.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels.
+See [Implementation Examples](./metrics_deployment.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels.

### Grafana Dashboards

@@ -99,5 +99,5 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s
- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Backend Guide](backend.md)
-- [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples)
-- [Complete Metrics Setup Guide](../../deploy/metrics/README.md)
+- [Metrics Implementation Examples](./metrics_deployment.md#implementation-examples)
+- [Complete Metrics Setup Guide](./metrics_deployment.md)
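To see the prefixes and labels described in this file's changes, a quick sketch against a worker's Prometheus endpoint; the host and ports are assumptions:

```bash
# Ports are assumptions; point these at your deployment's metrics endpoints.
# Component-level series carry the dynamo_namespace, dynamo_component,
# and dynamo_endpoint labels.
curl -s http://localhost:8081/metrics | grep '^dynamo_component_'

# Frontend series (dynamo_frontend_*) additionally carry a `model` label.
curl -s http://localhost:8080/metrics | grep '^dynamo_frontend_'
```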