Merged

Commits
113 commits
ac7e888
docs: fix helm chart urls (#2033)
nealvaidya Jul 21, 2025
76fd471
refactor: support for turning prefix cache off (#2034)
alec-flowers Jul 22, 2025
4449f3d
fix: never sleep on the eos (#2039)
alec-flowers Jul 22, 2025
20c5daf
fix: install torch distribution matching container cuda version (#2027)
ptarasiewiczNV Jul 22, 2025
e5a8628
feat: add a hierarchical Prometheus MetricsRegistry trait for Distrib…
keivenchang Jul 22, 2025
7882693
feat: use atomic transactions when creating etcd kv (#2044)
PeaBrane Jul 22, 2025
d65ce1b
chore(sglang): Move examples/sglang to components/backends/sglang (#2…
grahamking Jul 22, 2025
73505c7
fix: correct Nixl plugin paths in Dockerfile. (#2048)
karya0 Jul 22, 2025
c49a13e
docs: Cleanup index.rst (#2007)
atchernych Jul 22, 2025
9f2356c
chore: Remove unused portion of kv bindings test (#2052)
rmccorm4 Jul 22, 2025
f3e3d94
refactor: vLLM to new Python UX (#1983)
alec-flowers Jul 22, 2025
9cfaa7b
chore: Bump genai-perf to v0.0.15 (#2051)
ptarasiewiczNV Jul 22, 2025
22e6c96
chore: Change vllm K8s from dynamo-run to python -m dynamo.frontend (…
grahamking Jul 22, 2025
b127d95
feat: health check changes based on endpoint served (#1996)
nnshah1 Jul 23, 2025
1958b3a
build: Fixes for vLLM Blackwell Builds (#2020)
zaristei Jul 23, 2025
2c642fd
fix: vllm deployment examples (#2062)
biswapanda Jul 23, 2025
6a69ef4
fix: cryptic error message for empty messages list in /chat/completio…
heisenberglit Jul 23, 2025
c6f12f6
ci: Add RUN_SGLANG to CI variables (#1928)
pvijayakrish Jul 23, 2025
e0a5194
feat: Connect Library (#1478)
whoisj Jul 23, 2025
ffb5409
fix: endpoint changes should be prioritized over new requests in kv s…
PeaBrane Jul 23, 2025
eebc741
docs: Adjust the path to examples (#2056)
atchernych Jul 23, 2025
f9b1757
fix: Bring back ignore_eos/min_tokens support in trtllm component (#2…
rmccorm4 Jul 23, 2025
66b7d2c
fix: updates versions and adds ahashmap to BPE (#2072)
paulhendricks Jul 23, 2025
9bdceac
fix: github ci triggers (#2075)
biswapanda Jul 23, 2025
7a0013b
chore: update attributions for 0.3.2 release (#1837) (#2032)
nv-anants Jul 23, 2025
13560ab
feat: sglang examples launch and deploy (#2068)
biswapanda Jul 23, 2025
f3d784f
feat: query instance_id based on routing strategy (#1787)
biswapanda Jul 23, 2025
3c500ae
docs: Update docs for new UX (#2070)
grahamking Jul 23, 2025
19a77ae
chore(dynamo-run): Remove out=sglang|vllm|trtllm (#1920)
grahamking Jul 24, 2025
ee3a8e4
feat: add initial Grove support (#2012)
julienmancuso Jul 24, 2025
cde8db3
docs: Replace a sym link with and actual markdown link (#2074)
atchernych Jul 24, 2025
13d3cc1
feat: add nixl benchmark deployment instructions (#2060)
biswapanda Jul 24, 2025
2fc65ad
feat: dump radix tree as router events (#2057)
PeaBrane Jul 24, 2025
ba3ac23
test: add router e2e test with mockers to per-merge ci (#2073)
PeaBrane Jul 24, 2025
fe718fd
feat: deploy SLA profiler to k8s (#2030)
hhzhang16 Jul 24, 2025
a2874fd
feat: add possibility to use grove in dynamo graph helm chart (#1954)
julienmancuso Jul 24, 2025
f03f8be
docs: hello_world python binding example (#2083)
nealvaidya Jul 24, 2025
2bbbd44
chore: Remove unused trtllm requirements.txt (#2098)
rmccorm4 Jul 24, 2025
f0e382a
fix: Merge env vars correctly (#2096)
julienmancuso Jul 24, 2025
3094278
docs: Create a guide for writing dynamo deployments CR (#1999)
atchernych Jul 24, 2025
ff92053
docs: add NAMESPACE (#2105)
atchernych Jul 25, 2025
a2cb1c3
feat: update python packaging for new dynamo UX (#2054)
grahamking Jul 25, 2025
24cb926
docs: Clean index.rst (#2104)
atchernych Jul 25, 2025
412a12a
fix: rm enforce eager from vllm deploy - prefer perf over pod launch …
biswapanda Jul 25, 2025
2cd96ec
build: Add TensorRT-LLM to optional dependency and corresponding inst…
tanmayv25 Jul 25, 2025
384e449
fix: agg router test (#2123)
alec-flowers Jul 25, 2025
4dc529a
chore: remove vLLM v0 multimodal example (#2099)
GuanLuo Jul 25, 2025
4498a77
fix: move docker-compose.yml to deploy/, and update frontend port (#2…
keivenchang Jul 25, 2025
222245e
refactor: Move engine and publisher from dynamo.llm.tensorrt_llm to d…
tanmayv25 Jul 26, 2025
b8461b6
chore: updated health checks to use new probes (#2124)
nnshah1 Jul 27, 2025
e2a514b
fix: remove prints (#2142)
alec-flowers Jul 28, 2025
615580d
feat: Base metrics: add generic ingress handler metrics (#2090)
keivenchang Jul 28, 2025
e82bc4e
chore: update vLLM to 0.10.0 (#2114)
ptarasiewiczNV Jul 28, 2025
803bfa8
feat: proper local hashes for mockers + router watches endpoints (#2132)
PeaBrane Jul 28, 2025
0cb01b3
feat: updates to structured logging (#2061)
nnshah1 Jul 28, 2025
ca0035f
fix: copy whole workspace for pre-merge vllm tests (#2146)
nv-anants Jul 28, 2025
d23d48b
feat: Deploy SLA planner to Kubernetes (#2135)
hhzhang16 Jul 28, 2025
708d7c3
docs: add Llama4 eagle3 one model example and configs (#2087)
jhaotingc Jul 28, 2025
096d117
docs: update router docs (#2148)
PeaBrane Jul 28, 2025
1e6709d
feat: allow to override any podSpec property (#2116)
julienmancuso Jul 28, 2025
f809659
docs: hello world deploy example (#2102)
atchernych Jul 28, 2025
cfc6178
feat: add sglang disagg deployment examples (#2137)
biswapanda Jul 28, 2025
bbe8dbb
fix: remove containers from required property of extraPodSpec (#2153)
julienmancuso Jul 28, 2025
fdcf611
chore: Add Request Migration docs and minor enhancements (#2038)
kthui Jul 28, 2025
095ea3e
chore: updating and removing tests (#2130)
nnshah1 Jul 29, 2025
4747790
feat: deprecate sdk as dependency (#2149)
biswapanda Jul 29, 2025
3175b10
docs: Update to README.md (#2141)
athreesh Jul 29, 2025
7fbd43a
docs: Update dynamo_glossary.md (#2082)
athreesh Jul 29, 2025
358e908
docs: Adding document for running Dynamo on Azure Kubernetes Services…
saurabh-nvidia Jul 29, 2025
195c4c4
docs: Quickstart with new UX (#2005)
nealvaidya Jul 29, 2025
291df28
docs: add disagg example + explanation (#2086)
nealvaidya Jul 29, 2025
ca5b681
docs: add multinode example (#2155)
nealvaidya Jul 29, 2025
a8cb655
docs: update readme install instructions (#2170)
nv-anants Jul 29, 2025
5be23eb
Readmes + eks additions (#2157)
athreesh Jul 29, 2025
2befa38
feat: claim support for AL2023 x86_64 (#2150)
saturley-hall Jul 29, 2025
e542f00
chore: cleanup examples codeowners (#2171)
nealvaidya Jul 29, 2025
12a7b83
docs: Examples README/restructuring, framework READMEs, EKS examples …
athreesh Jul 29, 2025
8b0a035
docs: Update the operator docs (#2172)
atchernych Jul 29, 2025
8248a11
feat: gaie helm chart based example (#2168)
biswapanda Jul 29, 2025
157714a
chore: add instructions to modify SLA to profile_sla doc; update comp…
tedzhouhk Jul 29, 2025
30d4612
fix: install rdma libs in runtime image. (#2163)
karya0 Jul 29, 2025
da0c572
chore: update sgl version and fix h100 wideep example (#2169)
ishandhanani Jul 30, 2025
4c90b1b
chore: Version bump to 0.4.0 (#2179)
dmitry-tokarev-nv Jul 30, 2025
ee09de0
fix: link to point to bindings/python/README.md (#2186)
keivenchang Jul 30, 2025
dabfea3
chore: address QA broken links comments (#2184)
athreesh Jul 30, 2025
b69c507
fix: add better port logic (#2175)
alec-flowers Jul 30, 2025
7fc94da
fix(container): update sgl dockerfile install commands (#2194)
ishandhanani Jul 30, 2025
57482dc
docs: Bug 5424387 (#2196)
atchernych Jul 30, 2025
f3868b1
fix: support config without resource limit for profile sla script (#2…
tedzhouhk Jul 31, 2025
f8b0a5a
feat: Add trtllm deploy examples for k8s (#2133)
tanmayv25 Jul 31, 2025
62c7898
fix: add curl and jq for health checks (#2203)
biswapanda Jul 31, 2025
c546b63
fix: update SGLang version in instructions and Dockerfile to revert t…
ishandhanani Jul 31, 2025
97390ac
fix(k8s): sglang disagg now uses decode worker (#2206)
ishandhanani Jul 31, 2025
f10aab3
fix: Migrating trtllm examples from `1.0.0rc0` to `1.0.4rc4` (#2217)
KrishnanPrash Jul 31, 2025
3bf22bb
feat: reorganize sglang and add expert distribution endpoints (#2181)
ishandhanani Jul 31, 2025
bae25dc
feat: skip downloading model weights if using mocker (only tokenizer)…
PeaBrane Jul 31, 2025
cbc0e20
fix: fix endpoint run to return error DIS-325 (#2156)
keivenchang Jul 31, 2025
625578c
chore: update nixl version to 0.4.1 (#2221)
nv-anants Jul 31, 2025
7e3b3fa
fix: Add default configs in LLMAPI. Fixes OOM issues (#2198)
tanmayv25 Jul 31, 2025
f10e44c
fix: Integration tests fixes (#2161)
keivenchang Jul 31, 2025
f14f59c
chore: Remove multimodal readme. (#2212)
krishung5 Jul 31, 2025
dbd33df
fix: handle groveTerminationDelay and auto-detect grove installation …
julienmancuso Aug 1, 2025
66231cf
feat: reduce / revert routing overheads, do not consider output token…
PeaBrane Aug 1, 2025
8c75ed7
fix: frontend metrics to be renamed from nv_llm_http_service_* => dyn…
keivenchang Aug 1, 2025
1ad6abe
feat: add sgl deploy readme (#2238)
ishandhanani Aug 1, 2025
efd863d
fix: dynamo_component to be added in metric names (#2180)
keivenchang Aug 1, 2025
faafa5f
docs: add a docs/guides/metrics.md (#2160)
keivenchang Aug 1, 2025
cb1492a
rebase main
ziqifan617 Aug 1, 2025
ae51b3f
test: Request Migration Docs and E2E vLLM Tests (#2177)
kthui Aug 1, 2025
959f810
feat: sglang + gb200 (#2223)
ishandhanani Aug 1, 2025
fa492bb
docs: Dyn 591 (#2247)
atchernych Aug 2, 2025
357f34b
cleanup (#2250)
ziqifan617 Aug 2, 2025
2954005
Merge branch 'main' into ziqi/connector-250801
ziqifan617 Aug 2, 2025
10 changes: 2 additions & 8 deletions components/backends/llama_cpp/README.md
@@ -13,16 +13,10 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.

For example,
```bash
python3 -m dynamo.llama_cpp ... --migration-limit=3
```
indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detect a connectivity issue to the current Backend.

The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.
22 changes: 8 additions & 14 deletions components/backends/sglang/README.md
@@ -43,11 +43,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

### Large Scale P/D and WideEP Features

| Feature | SGLang | Notes |
|--------------------|--------|-----------------------------------------------------------------------|
| **WideEP** | ✅/🚧 | Full support on H100s/GB200 WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| **DP Rank Routing**| 🚧 | Direct routing supported. Process per DP rank is not supported |
| **GB200 Support** | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7556) |
| Feature | SGLang | Notes |
|---------------------|--------|--------------------------------------------------------------|
| **WideEP** | ✅ | Full support on H100s/GB200 |
| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP workers |
| **GB200 Support** | ✅ | |


## Quick Start
@@ -143,25 +143,19 @@ When using MoE models, you can also use our implementation of the native SGL

## Request Migration

In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:

The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.

For ongoing requests, there is a `--migration-limit` flag which can be set on the Backend that tells the Frontend how many times a request can be migrated to another Backend should there be a loss of connectivity to the current Backend.

For example,
```bash
python3 -m dynamo.sglang ... --migration-limit=3
```
indicates a request to this model may be migrated up to 3 times to another Backend, before failing the request, should the Frontend detect a connectivity issue to the current Backend.

The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../../docs/architecture/request_migration.md) documentation for details on how this works.

## Advanced Examples

Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!

### Run on multi-node
### Run a multi-node sized model
- **[Run a multi-node model](docs/multinode-examples.md)**

### Large scale P/D disaggregation with WideEP
171 changes: 171 additions & 0 deletions components/backends/sglang/docs/dsr1-wideep-gb200.md
@@ -0,0 +1,171 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Running DeepSeek-R1 Disaggregated with WideEP on GB200s

Dynamo supports SGLang's GB200 implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-06-16-gb200-part-1/) for more details. Full end-to-end optimization is still a work in progress, but you can get this up and running with the following steps. In this example, we will run 1 prefill worker on 2 GB200 nodes (4 GPUs each) and 1 decode worker on 12 GB200 nodes (56 GPUs in total across both workers).

## Instructions

1. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build \
-f container/Dockerfile.sglang-wideep \
-t dynamo-wideep-gb200 \
--build-arg MODE=blackwell \
--build-arg SGLANG_IMAGE_TAG=v0.4.9.post6-cu128-gb200 \
--build-arg ARCH=arm64 \
--build-arg ARCH_ALT=aarch64 \
.
```

2. You can run this container on each 4xGB200 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)

```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume /PATH_TO_DSR1_MODEL/:/model/ \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
dynamo-wideep-gb200:latest
```

3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script will also tell you which environment variables to export on each node to make deployment easier.

```bash
./utils/gen_env_vars.sh
```
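
For reference, the exported variables look roughly like the following. This is a sketch with illustrative addresses: `NATS_SERVER` and `ETCD_ENDPOINTS` are the variables the Dynamo runtime reads, and the head-node IPs are reused by the worker commands below.

```bash
# Illustrative values only -- gen_env_vars.sh prints the exact commands for your cluster.
export HEAD_PREFILL_NODE_IP=10.0.0.1
export HEAD_DECODE_NODE_IP=10.0.0.2
export NATS_SERVER="nats://${HEAD_PREFILL_NODE_IP}:4222"
export ETCD_ENDPOINTS="http://${HEAD_PREFILL_NODE_IP}:2379"
```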

4. Run the ingress and prefill worker

```bash
# run ingress
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=2048 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
MC_FORCE_MNNVL=1 \
NCCL_MNNVL_ENABLE=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode prefill \
--dist-init-addr ${HEAD_PREFILL_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--disaggregation-transfer-backend nixl \
--nnodes 2 \
--node-rank 0 \
--tp-size 8 \
--dp-size 8 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 6144 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--disable-cuda-graph \
--chunked-prefill-size 16384 \
--max-total-tokens 32768 \
--mem-fraction-static 0.8 \
--log-level debug
```
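
On the other prefill node (this example uses 2 prefill nodes, per `--nnodes 2`), run the same command but change `--node-rank` to 1.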

5. Run the decode worker on the head decode node

```bash
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=768 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=100000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DISAGGREGATION_WAITING_TIMEOUT=100000 \
SGLANG_HACK_SEQ_BOOTSTRAP_ROOM=1 \
SGLANG_MOONCAKE_CUSTOM_MEM_POOL=True \
NCCL_MNNVL_ENABLE=1 \
MC_FORCE_MNNVL=1 \
NCCL_CUMEM_ENABLE=1 \
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER=0 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
PYTHONUNBUFFERED=1 \
python3 components/decode_worker.py \
--served-model-name deepseek-ai/DeepSeek-R1 \
--model-path /model/ \
--skip-tokenizer-init \
--trust-remote-code \
--disaggregation-mode decode \
--dist-init-addr ${HEAD_DECODE_NODE_IP}:29500 \
--disaggregation-bootstrap-port 30001 \
--nnodes 12 \
--node-rank 0 \
--tp-size 48 \
--dp-size 48 \
--enable-dp-attention \
--host 0.0.0.0 \
--decode-log-interval 1 \
--max-running-requests 36864 \
--context-length 2716 \
--disable-radix-cache \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--enable-dp-lm-head \
--cuda-graph-bs 768 \
--disable-shared-experts-fusion \
--ep-num-redundant-experts 32 \
--ep-dispatch-algorithm static \
--eplb-algorithm deepseek \
--attention-backend cutlass_mla \
--watchdog-timeout 1000000 \
--chunked-prefill-size 36864 \
--mem-fraction-static 0.82 \
--log-level debug
```

On the other decode nodes (this example has 12 decode nodes in total), run the same command but change `--node-rank` to 1 through 11.
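
Rather than hand-editing the rank on every host, a thin wrapper can inject it. Below is a minimal sketch (not part of the repo), assuming your launcher exposes a per-node index such as SLURM's `SLURM_NODEID`; export the `SGLANG_*`/`NCCL_*`/`MC_*` variables from step 5 first, and pass every flag from step 5 except `--node-rank`:

```bash
#!/usr/bin/env bash
# run_decode.sh -- hypothetical wrapper, shown for illustration.
# Derives this node's rank (0..11) from the scheduler and appends it to the
# decode command; "$@" should carry every flag from step 5 except --node-rank.
NODE_RANK="${SLURM_NODEID:?SLURM_NODEID not set}"
exec python3 components/decode_worker.py "$@" --node-rank "${NODE_RANK}"
```

Once the prefill and decode workers are up, you can sanity-check the deployment through the OpenAI-compatible frontend started in step 4 (an illustrative request; adjust the host if querying from another machine):

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```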
32 changes: 15 additions & 17 deletions components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -9,22 +9,16 @@ Dynamo supports SGLang's implementation of wide expert parallelism and large sca

## Instructions

1. Pull the SGLang release `v0.4.8.post1` container. We are actively working on validating newer releases.

```bash
docker pull lmsysorg/sglang:v0.4.8.post1-cu126
```

You can also pull a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags)

2. Build the Dynamo container
1. Build the Dynamo container

```bash
cd $DYNAMO_ROOT
docker build -f container/Dockerfile.sglang-wideep . -t dynamo-wideep --no-cache
```

3. You can run this container on each 8xH100 node using the following command.
You can use a specific tag from the [lmsys dockerhub](https://hub.docker.com/r/lmsysorg/sglang/tags) by adding `--build-arg SGLANG_IMAGE_TAG=<tag>` to the build command.

2. You can run this container on each 8xH100 node using the following command.

> [!IMPORTANT]
> We recommend downloading DeepSeek-R1 and then mounting it to the container. You can find the model [here](https://huggingface.co/deepseek-ai/DeepSeek-R1)
@@ -47,17 +41,17 @@ docker run \

In each container, you should be in the `/sgl-workspace/dynamo/components/backends/sglang` directory.

4. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script will also tell you which environment variables to export on each node to make deployment easier.
3. On the head prefill node, run the provided helper script to generate the commands that start `nats-server` and `etcd`. The script will also tell you which environment variables to export on each node to make deployment easier.

```bash
./utils/gen_env_vars.sh
```

5. Run the ingress and prefill worker
4. Run the ingress and prefill worker

```bash
# run ingress
dynamo run in=http out=dyn &
python3 -m dynamo.frontend --http-port=8000 &
# optionally run the http server that allows you to flush the kv cache for all workers (see benchmarking section below)
python3 utils/sgl_http_server.py --ns dynamo &
# run prefill worker
@@ -93,7 +87,7 @@ python3 -m dynamo.sglang.worker \

On the other prefill nodes (this example has 4 prefill nodes in total), run the same command but change `--node-rank` to 1, 2, and 3.

7. Run the decode worker on the head decode node
5. Run the decode worker on the head decode node

```bash
python3 -m dynamo.sglang.decode_worker \
@@ -121,7 +115,7 @@ python3 -m dynamo.sglang.decode_worker \
--deepep-mode low_latency \
--mem-fraction-static 0.835 \
--ep-num-redundant-experts 32 \
--cuda-graph-bs 256
--cuda-graph-bs 128
```

On the other decode nodes (this example has 9 decode nodes in total), run the same command but change `--node-rank` to 1 through 8.
@@ -131,6 +125,7 @@ On the other decode nodes (this example has 9 total decode nodes), run the same
In the official [blog post repro instructions](https://github.com/sgl-project/sglang/issues/6017), SGL uses batch inference to benchmark their prefill and decode workers. They do this by pretokenizing the ShareGPT dataset and then creating a batch of 8192 requests with ISL 4096 and OSL 5 (for prefill stress test) and a batch of 40000 with ISL 2000 and OSL 100 (for decode stress test). If you want to repro these benchmarks, you will need to add the following flags to the prefill and decode commands:

prefill:

```bash
...
--max-running-requests 8192 \
@@ -142,6 +137,7 @@ prefill:
```

decode:

```bash
...
--max-running-requests 18432 \
@@ -152,9 +148,10 @@ decode:
We currently provide 2 different ways to perform an end-to-end benchmark, both of which use our OpenAI frontend and tokenization. We will continue to add better support for these sorts of large single-batch workloads in the future.

1. **GenAI Perf to benchmark end to end performance with 8k ISL 256 OSL**
We've found that 8k ISL 256 OSL provides a good baseline for measuring end-to-end disaggregated serving performance for DSR1. As WideEP allows for a higher throughput, we provide a script that runs this workload at high concurrencies. DeepGEMM kernels can sometimes take a while to warm up, so we also provide a short ramping warmup script.

Example usage:

```bash
# warmup
./utils/bench.sh HEAD_PREFILL_NODE_IP --type warmup
Expand All @@ -165,9 +162,10 @@ curl -X POST http://${HEAD_PREFILL_NODE_IP}:9001/flush_cache
```

2. **GenAI Perf to benchmark completions with custom dataset**
We provide a script that generates a JSONL file of the ShareGPT dataset, and we then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAIPerf's synthetic dataset setup, but note you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.
We provide a script that generates a JSONL file of the ShareGPT dataset, and we then use GenAI Perf to benchmark the prefill and decode workers. We use ShareGPT in order to leverage the pre-existing EPLB distributions provided by the SGLang team. If you don't want to use ShareGPT, you can also use GenAI Perf's synthetic dataset setup, but note you will have to use dynamic EPLB configurations or record your own, as the `init-expert-location` provided by SGLang is tuned specifically for the ShareGPT dataset at a 4096 ISL and 5 OSL.

Example usage:

```bash
# generate data
python3 src/dynamo/sglang/utils/generate_bench_data.py --output data.jsonl --num-prompts 8192 --input-len 4096 --output-len 5 --model deepseek-ai/DeepSeek-R1
```