Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Apply remaining UX improvements from old branch
- Update backend READMEs with correct KVBM status
- Simplify Kubernetes README with cleaner structure
- Fix multinode example to use correct dynamo.sglang command
- Add missing --skip-tokenizer-init flags

Signed-off-by: athreesh <[email protected]>
  • Loading branch information
athreesh committed Sep 24, 2025
commit 14c6c353af2b2b45deabe3f00c77c81e1cc61ed4
2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | ❌ | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |

### Large Scale P/D and WideEP Features

Expand Down
2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | | |

### Large Scale P/D and WideEP Features

Expand Down
2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | | |
| [**LMCache**](./LMCache_Integration.md) | ✅ | |

### Large Scale P/D and WideEP Features
Expand Down
52 changes: 42 additions & 10 deletions docs/kubernetes/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ High-level guide to Dynamo Kubernetes deployments. Start here, then dive into sp
```bash
# 1. Set environment
export NAMESPACE=dynamo-kubernetes
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
export RELEASE_VERSION=0.5.0 # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
Expand All @@ -33,19 +33,51 @@ helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace

# For multinode deployments, enable Grove/KAI with:
# helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace \
# --set grove.enabled=true \
# --set kai.enabled=true
```

For more details or customization options (including multinode deployments), see **[Installation Guide for Dynamo Kubernetes Platform](/docs/kubernetes/installation_guide.md)**.

## 2. Choose Your Backend

Each backend has deployment examples and configuration options:

| Backend | Available Configurations |
|---------|--------------------------|
| **[vLLM](/components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
| **[SGLang](/components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| **[TensorRT-LLM](/components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |
## 2. Choose Your Backend and Deployment Pattern

### **Aggregated Serving**
Prefill and decode phases run on the same worker - simplest deployment pattern.

| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Aggregated](components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Aggregated + Router](components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated](components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated + Router](components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated](components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated + Router](components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` |

### **Disaggregated Serving**
Prefill and decode phases run on separate workers - higher performance and scalability.

| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Disaggregated](components/backends/vllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Router](components/backends/vllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Planner](components/backends/vllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated](components/backends/sglang/deploy/disagg.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated + Planner](components/backends/sglang/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated](components/backends/trtllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Router](components/backends/trtllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Planner](components/backends/trtllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |

### **Multi-node Deployment** (Model replicaes across multiple nodes)
Scale disaggregated serving across multiple Kubernetes nodes for maximum performance.

| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Multi-node](components/backends/vllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **SGLang** | [Multi-node](components/backends/sglang/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Multi-node](components/backends/trtllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
Comment on lines +52 to +80
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix broken relative links to backend deploy manifests.

These links are relative to docs/kubernetes. They should traverse up two directories.

Apply:

-| **vLLM** | [Aggregated](components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` |
+| **vLLM** | [Aggregated](../../components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` |
-| **vLLM** | [Aggregated + Router](components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
+| **vLLM** | [Aggregated + Router](../../components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
-| **SGLang** | [Aggregated](components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` |
+| **SGLang** | [Aggregated](../../components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` |
-| **SGLang** | [Aggregated + Router](components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` |
+| **SGLang** | [Aggregated + Router](../../components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` |
-| **TensorRT-LLM** | [Aggregated](components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Aggregated](../../components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` |
-| **TensorRT-LLM** | [Aggregated + Router](components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
+| **TensorRT-LLM** | [Aggregated + Router](../../components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` |

Repeat the same ../../ prefix for all Disaggregated and Multi-node tables in this section.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| **vLLM** | [Aggregated](components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Aggregated + Router](components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated](components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated + Router](components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated](components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated + Router](components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
### **Disaggregated Serving**
Prefill and decode phases run on separate workers - higher performance and scalability.
| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Disaggregated](components/backends/vllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Router](components/backends/vllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Planner](components/backends/vllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated](components/backends/sglang/deploy/disagg.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated + Planner](components/backends/sglang/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated](components/backends/trtllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Router](components/backends/trtllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Planner](components/backends/trtllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
### **Multi-node Deployment** (Model replicaes across multiple nodes)
Scale disaggregated serving across multiple Kubernetes nodes for maximum performance.
| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Multi-node](components/backends/vllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **SGLang** | [Multi-node](components/backends/sglang/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Multi-node](components/backends/trtllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **vLLM** | [Aggregated](../../components/backends/vllm/deploy/agg.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Aggregated + Router](../../components/backends/vllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated](../../components/backends/sglang/deploy/agg.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Aggregated + Router](../../components/backends/sglang/deploy/agg_router.yaml) | `kubectl apply -f components/backends/sglang/deploy/agg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated](../../components/backends/trtllm/deploy/agg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Aggregated + Router](../../components/backends/trtllm/deploy/agg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/agg_router.yaml -n ${NAMESPACE}` |
### **Disaggregated Serving**
Prefill and decode phases run on separate workers - higher performance and scalability.
| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Disaggregated](components/backends/vllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Router](components/backends/vllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **vLLM** | [Disaggregated + Planner](components/backends/vllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated](components/backends/sglang/deploy/disagg.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg.yaml -n ${NAMESPACE}` |
| **SGLang** | [Disaggregated + Planner](components/backends/sglang/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated](components/backends/trtllm/deploy/disagg.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Router](components/backends/trtllm/deploy/disagg_router.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_router.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Disaggregated + Planner](components/backends/trtllm/deploy/disagg_planner.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n ${NAMESPACE}` |
### **Multi-node Deployment** (Model replicaes across multiple nodes)
Scale disaggregated serving across multiple Kubernetes nodes for maximum performance.
| Backend | Configuration | Deploy Command |
|---------|---------------|----------------|
| **vLLM** | [Multi-node](components/backends/vllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/vllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **SGLang** | [Multi-node](components/backends/sglang/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/sglang/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
| **TensorRT-LLM** | [Multi-node](components/backends/trtllm/deploy/disagg-multinode.yaml) | `kubectl apply -f components/backends/trtllm/deploy/disagg-multinode.yaml -n ${NAMESPACE}` |
🤖 Prompt for AI Agents
In docs/kubernetes/README.md around lines 52 to 80, the relative links to
backend deploy manifests in the Disaggregated and Multi-node tables are
incorrect because they are relative to docs/kubernetes and need to traverse up
two directories; update every link target and the corresponding kubectl apply
command for all entries in the Disaggregated and Multi-node sections to prepend
"../../" so they point to ../../components/backends/... (i.e., change
components/backends/... to ../../components/backends/... for both the
Configuration link and the Deploy Command for each row).


## 3. Deploy Your First Model

Expand Down
12 changes: 6 additions & 6 deletions examples/basics/multinode/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,7 @@ Open a terminal on Node 1 and launch both workers:

```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
Expand All @@ -141,7 +141,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl &

CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
Expand All @@ -157,15 +157,15 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
> - `--page-size 16`: Sets the KV cache block size - must be identical across all workers
> - `--disaggregation-mode`: Separates prefill (prompt processing) from decode (token > generation)
> - `--disaggregation-transfer-backend nixl`: Enables high-speed GPU-to-GPU transfers
> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend > handles tokenization
> - `--skip-tokenizer-init`: Avoids duplicate tokenizer loading since the frontend handles tokenization

### Step 3: Launch Replica 2 (Node 2)

Open a terminal on Node 2 and launch both workers:

```bash
# Launch prefill worker in background
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
Expand All @@ -176,7 +176,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.sglang.worker \
--disaggregation-transfer-backend nixl &

# Launch decode worker in foreground
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang.decode_worker \
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--page-size 16 \
Expand Down Expand Up @@ -473,7 +473,7 @@ Stop all components in reverse order:
exit

# Method 3: Kill by process name (from any terminal)
pkill -f "dynamo.sglang.worker.*prefill"
pkill -f "dynamo.sglang.*prefill"
```
3. Stop infrastructure services:
```bash
Expand Down
Loading