2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -46,7 +46,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
-| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker |
+| **Attention DP** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP worker |
| **GB200 Support** | ✅ | |


2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -61,7 +61,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | |
-| **DP Rank Routing**| ✅ | |
+| **Attention DP** | ✅ | |
> **Contributor:** Let's be consistent with table across all 3 backends if possible

> **Contributor Author:** great catch, thank you. will update all of them

| **GB200 Support** | ✅ | |

## Quick Start
2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
@@ -47,7 +47,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
-| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
+| **Attention DP** | ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
64 changes: 0 additions & 64 deletions docs/API/nixl_connect/README.md
@@ -85,76 +85,13 @@ flowchart LR
e2@{ animate: true; }
```

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker.

3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker.

4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker.

5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker.

6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data.

7. Finally, Decode Worker performs the requested inference.

```mermaid
---
title: Multimodal Disaggregated Workflow
---
flowchart LR
p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor]
p0 i1@--"url"-->p1
p1 i2@--"prompt"-->dw[Decode Worker]
p1 i3@--"url"-->dw
dw i4@--"prompt"-->pw[Prefill Worker]
dw i5@--"url"-->pw
pw i6@--"url"-->ew[Encode Worker]
ew o0@=="image embeddings"==>pw
pw o1@=="kv_cache updates"==>dw
dw o2@--"inference results"-->p0

i0@{ animate: true; }
i1@{ animate: true; }
i2@{ animate: true; }
i3@{ animate: true; }
i4@{ animate: true; }
i5@{ animate: true; }
i6@{ animate: true; }
o0@{ animate: true; }
o1@{ animate: true; }
o2@{ animate: true; }
```

> [!Note]
> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
> The KV Cache transfer between Decode Worker and Prefill Worker utilizes the NIXL base RDMA subsystem directly without using the Dynamo NIXL Connect library.

#### Code Examples

See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example,
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.

See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md),
a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and the worker awaits for the data transfer to complete for yielding a response.


## Python Classes

- [Connector](connector.md)
- [Descriptor](descriptor.md)
- [Device](device.md)
- [ReadOperation](read_operation.md)
- [ReadableOperation](readable_operation.md)
- [SerializedRequest](serialized_request.md)
- [WritableOperation](writable_operation.md)
- [WriteOperation](write_operation.md)

@@ -164,5 +101,4 @@ and the worker awaits for the data transfer to complete for yielding a response.
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
2 changes: 1 addition & 1 deletion docs/architecture/dynamo_flow.md
@@ -67,7 +67,7 @@ Coordination and messaging support:

## Technical Implementation Details

-### NIXL (NVIDIA Interchange Library):
+### NIXL (NVIDIA Inference Xfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
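The bullets above describe etcd as the coordination store for NIXL metadata. As a rough sketch of how one might inspect that metadata, assuming a reachable etcd endpoint and a Dynamo key prefix (both are assumptions, not taken from this diff):

```bash
# Endpoint and key prefix are assumptions; adjust to your deployment.
# Lists the keys under which workers publish their GPU/NIXL metadata.
etcdctl --endpoints=http://localhost:2379 get --prefix /dynamo --keys-only
```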
2 changes: 1 addition & 1 deletion docs/architecture/kv_cache_routing.md
@@ -21,7 +21,7 @@ The KV-aware routing arguments:

- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.

-- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
+- `--kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.


## Architecture
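For context on the renamed flag, here is a minimal sketch of a router launch using the arguments documented above. Only `--kv-events` and `--router-temperature` come from the docs; the entrypoint name is a hypothetical placeholder:

```bash
# Hypothetical entrypoint; substitute your actual router launcher.
# --kv-events true       -> KvIndexer tracks block creation/deletion events
# --kv-events false      -> ApproxKvIndexer assumes cached prompts persist ~120s
# --router-temperature 0 -> deterministic: always route to the min-logit worker
python -m dynamo.router --kv-events true --router-temperature 0
```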
7 changes: 2 additions & 5 deletions docs/architecture/planner_intro.rst
@@ -19,13 +19,13 @@ Planner

The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.

-Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
+Currently, the planner can scale the number of vLLM workers up and down based on the kv cache load and prefill queue size:

Key features include:

* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
-* **Multi-backend support** for both local (Circus) and Kubernetes environments
+* **Multi-backend support** for Kubernetes environments
* **Graceful scaling** that ensures no requests are dropped during scale-down operations

.. list-table::
@@ -50,9 +50,6 @@ Key features include:
* -
- ❌
- SGLang
* -
- ❌
- llama.cpp
* - **Serving Type**
- ✅
- Aggregated
4 changes: 2 additions & 2 deletions docs/examples/README.md
@@ -45,9 +45,9 @@ Consult the examples below for the CRs for your specific inference backend.

[View SGLang k8s](../../components/backends/sglang/deploy/README.md)

-[View vLLM K8s](../../components/backends/vllm/deploy/README.md)
+[View vLLM K8s](../../components/backends/vllm/README.md#kubernetes-deployment)

-[View TRTLLM k8s](../../components/backends/trtllm/deploy/README.md)
+[View TRTLLM k8s](../../components/backends/trtllm/README.md#kubernetes-deployment)

**Note 1** Example Image

2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/dynamo_operator.md
@@ -75,7 +75,7 @@ spec:

## GitOps Deployment with FluxCD

-This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../components/backends/vllm/README.md) to demonstrate the workflow.
+This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../components/backends/vllm/README.md) to demonstrate the workflow.

### Prerequisites

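As a sketch of the FluxCD workflow described above, assuming the standard `flux` CLI and a hypothetical repository layout (URL, path, and names are placeholders, not taken from this PR):

```bash
# Register the Git repository as the source of truth (URL is a placeholder).
flux create source git dynamo-deploy \
  --url=https://github.com/your-org/dynamo-deploy \
  --branch=main \
  --interval=1m

# Reconcile the aggregated vLLM manifests from that source (path is a placeholder).
flux create kustomization dynamo-vllm-agg \
  --source=GitRepository/dynamo-deploy \
  --path=./deploy/vllm-agg \
  --prune=true \
  --interval=5m
```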
14 changes: 7 additions & 7 deletions docs/guides/metrics.md
@@ -25,11 +25,11 @@ Dynamo provides built-in metrics capabilities through the `MetricsRegistry` trai

Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also adds the following labels `dynamo_namespace`, `dynamo_component`, and `dynamo_endpoint` to indicate which component is providing the metric.

-**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of frontend metrics.
+**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of frontend metrics.

-**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of component metrics.
+**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of component metrics.

-**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics.
+**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for details on specialized component metrics.

## Coming Soon

@@ -49,7 +49,7 @@ This hierarchical structure allows you to create metrics at the appropriate leve

## Getting Started

-For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](../../deploy/metrics/README.md#getting-started) in the deploy metrics documentation.
+For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](./metrics_deployment.md#getting-started) in the metrics deployment guide.

The quick start includes:
- Docker Compose setup for Prometheus and Grafana
@@ -59,7 +59,7 @@ The quick start includes:

## Implementation Examples

-See [Implementation Examples](../../deploy/metrics/README.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels.
+See [Implementation Examples](./metrics_deployment.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels.

### Grafana Dashboards

@@ -99,5 +99,5 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s
- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Backend Guide](backend.md)
-- [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples)
-- [Complete Metrics Setup Guide](../../deploy/metrics/README.md)
+- [Metrics Implementation Examples](./metrics_deployment.md#implementation-examples)
+- [Complete Metrics Setup Guide](./metrics_deployment.md)
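To see the prefixes and labels described in this file's changes, a quick sketch against a worker's Prometheus endpoint; the host and ports are assumptions:

```bash
# Ports are assumptions; point these at your deployment's metrics endpoints.
# Component-level series carry the dynamo_namespace, dynamo_component,
# and dynamo_endpoint labels.
curl -s http://localhost:8081/metrics | grep '^dynamo_component_'

# Frontend series (dynamo_frontend_*) additionally carry a `model` label.
curl -s http://localhost:8080/metrics | grep '^dynamo_frontend_'
```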