2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **GB200 Support** | ✅ | |


## Quick Start
## SGLang Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |

## Quick Start
## TensorRT-LLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
## vLLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion deploy/metrics/k8s/README.md
@@ -1,3 +1,3 @@
# Dynamo Metrics Collection on Kubernetes

For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/deploy/k8s_metrics.md](../../../docs/guides/deploy/k8s_metrics.md).
For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/dynamo_deploy/k8s_metrics.md](../../../docs/guides/dynamo_deploy/k8s_metrics.md).
2 changes: 1 addition & 1 deletion docs/architecture/architecture.md
@@ -48,7 +48,7 @@ There are multi-faceted challenges:

To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.

## High level architecture and key benefits
## Key benefits

The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

4 changes: 2 additions & 2 deletions docs/architecture/sla_planner.md
@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
* **Performance interpolation**: Leverages pre-deployment profiling results for accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data

## Architecture
## Design

The SLA planner consists of several key components:

@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill

For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).

**Quick Start:**
**To deploy SLA Planner:**
```bash
cd components/backends/vllm/deploy
kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}
2 changes: 1 addition & 1 deletion docs/components/router/README.md
@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0

The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

## Quick Start
## KV Router Quick Start

To launch the Dynamo frontend with the KV Router:

153 changes: 99 additions & 54 deletions docs/guides/dynamo_deploy/README.md
@@ -17,85 +17,130 @@ limitations under the License.

# Deploying Inference Graphs to Kubernetes

We expect users to deploy their inference graphs using CRDs or helm charts.
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

# 1. Install Dynamo Cloud.
## 1. Install Platform First
**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths

Prior to deploying an inference graph, the user should deploy the Dynamo Cloud Platform. Reference the [Quickstart Guide](quickstart.md) for steps to install Dynamo Cloud with Helm.
## 2. Choose Your Backend

Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you. This is a one-time action, only necessary the first time you deploy a DynamoGraph.
# 2. Deploy your inference graph.

Each backend has deployment examples and configuration options:

| Backend | Available Configurations |
|---------|--------------------------|
| **[vLLM](../../../components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner |
| **[SGLang](../../../components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| **[TensorRT-LLM](../../../components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router |

## 3. Deploy Your First Model

We provide a Custom Resource YAML file for many examples under the components/backends/{engine}/deploy folders. Consult the examples below for the CRs for a specific inference backend.

[View SGLang K8s](../../../components/backends/sglang/deploy/README.md)

[View vLLM K8s](../../../components/backends/vllm/deploy/README.md)

[View TRT-LLM K8s](../../../components/backends/trtllm/deploy/README.md)

```bash
# Set the same namespace used for the platform install
export NAMESPACE=dynamo-cloud

# Deploy any example (this one uses vLLM with a Qwen model and aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```

### Deploying a particular example

```bash
# Set your dynamo root directory
cd <root-dynamo-folder>
export PROJECT_ROOT=$(pwd)
export NAMESPACE=<your-namespace> # the namespace where you deployed Dynamo Cloud
```

Deploying an example is a simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example:

```bash
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
```

## What's a DynamoGraphDeployment?
It's a Kubernetes Custom Resource that defines your inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections

You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployments and `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete one.

The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how to serve your models locally; the corresponding YAML files, such as `agg.yaml`, show how to create a Kubernetes deployment for the same inference graph.

We provide a Custom Resource YAML file for many examples under the `deploy/` folder; see the [vLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.

**Note 1: Example Image**

The examples use a prebuilt image from the `nvcr.io` registry. You can use public images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) or build your own image and update the image location in your CR file before applying. Either way, you will need to overwrite the image in the example YAML.

### Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case (a deployment sketch follows this list):

- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
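
As a minimal sketch, switching patterns is just applying a different YAML. This assumes the stock vLLM example files and the same `${NAMESPACE}` as above:

```bash
# Swap the aggregated example for the load-balanced (router) pattern
kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}

# Watch the deployment converge
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```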

To build your own image:

```bash
./container/build.sh --framework <your-inference-framework>
```

For example, for `sglang`, run:

```bash
./container/build.sh --framework sglang
```

### Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point (see the example request after this list) that:

- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
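
For illustration, once the frontend service is port-forwarded (as in the quick start above), you can exercise that endpoint directly. A minimal sketch, assuming the `Qwen/Qwen3-0.6B` model from the earlier example is deployed:

```bash
# Send a chat completion through the OpenAI-compatible Frontend
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```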

To overwrite the image in the example:

```yaml
extraPodSpec:
  mainContainer:
    image: <image-in-your-$DYNAMO_IMAGE>
```

### Customizing Your Deployment

Example structure:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm # keep in the same dynamo namespace as the Frontend
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```

Worker command examples per backend:

```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - >-
    python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
```

**Note 2**
Set up port forwarding if needed when deploying to Kubernetes.

List the services in your namespace:

```bash
kubectl get svc -n ${NAMESPACE}
```

Look for the one that ends in `-frontend` and use it for the port forward:

```bash
SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1)
kubectl port-forward svc/${SERVICE_NAME}-frontend 8080:8080 -n ${NAMESPACE}
```
Key customization points include:
- **Model Configuration**: Specify the model in the `args` command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for the number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend `envs` (see the sketch after this list)
- **Worker Specialization**: Add the `--is-prefill-worker` flag for disaggregated prefill workers
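
A minimal sketch of the last two points, assuming the service spec accepts standard Kubernetes-style `envs` name/value pairs and using a hypothetical `VllmPrefillWorker` service name (the other fields mirror the example structure above):

```yaml
services:
  Frontend:
    dynamoNamespace: my-llm
    componentType: frontend
    replicas: 1
    envs:
      - name: DYN_ROUTER_MODE
        value: kv # route requests with KV-cache awareness
  VllmPrefillWorker: # hypothetical name for a disaggregated prefill worker
    dynamoNamespace: my-llm
    componentType: worker
    replicas: 1
    extraPodSpec:
      mainContainer:
        image: your-image
        command: ["/bin/sh", "-c"]
        args:
          - python3 -m dynamo.vllm --model YOUR_MODEL --is-prefill-worker
```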

## Additional Resources

- **[Examples](../../examples/README.md)** - Complete working examples
- **[Examples Deployment Guide](../../examples/README.md#deploying-a-particular-example)** - Deploying a particular example
- **[Create Custom Deployments](create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](dynamo_operator.md)** - How the platform works
- **[Helm Charts](../../../deploy/helm/README.md)** - For advanced users
- **[Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)** - Kubernetes port forwarding