2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **GB200 Support** | ✅ | |


## Quick Start
## SGLang Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |

## Quick Start
## TensorRT-LLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

## Quick Start
## vLLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion deploy/metrics/k8s/README.md
@@ -1,3 +1,3 @@
# Dynamo Metrics Collection on Kubernetes

For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/deploy/k8s_metrics.md](../../../docs/guides/deploy/k8s_metrics.md).
For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/dynamo_deploy/k8s_metrics.md](../../../docs/guides/dynamo_deploy/k8s_metrics.md).
2 changes: 1 addition & 1 deletion docs/architecture/architecture.md
@@ -48,7 +48,7 @@ There are multi-faceted challenges:

To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.

## High level architecture and key benefits
## Key benefits

The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

4 changes: 2 additions & 2 deletions docs/architecture/sla_planner.md
@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
* **Performance interpolation**: Leverages pre-deployment profiling results for accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data

## Architecture
## Design

The SLA planner consists of several key components:

@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill

For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).

**Quick Start:**
**To deploy SLA Planner:**
```bash
cd components/backends/vllm/deploy
kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}
2 changes: 1 addition & 1 deletion docs/components/router/README.md
@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0

The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

## Quick Start
## KV Router Quick Start

To launch the Dynamo frontend with the KV Router:

153 changes: 99 additions & 54 deletions docs/guides/dynamo_deploy/README.md
@@ -17,85 +17,130 @@ limitations under the License.

# Deploying Inference Graphs to Kubernetes

We expect users to deploy their inference graphs using CRDs or helm charts.
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

# 1. Install Dynamo Cloud.
## 1. Install Platform First
**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths

Prior to deploying an inference graph, the user should deploy the Dynamo Cloud Platform. Reference the [Quickstart Guide](quickstart.md) for steps to install Dynamo Cloud with Helm.
## 2. Choose Your Backend

Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you. This is a one-time action, only necessary the first time you deploy a DynamoGraph.
# 2. Deploy your inference graph.

Each backend has deployment examples and configuration options:

| Backend | Available Configurations |
|---------|--------------------------|
| **[vLLM](../../../components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner |
| **[SGLang](../../../components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| **[TensorRT-LLM](../../../components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router |

## 3. Deploy Your First Model

We provide a Custom Resource YAML file for many examples under the components/backends/{engine}/deploy folders. Consult the examples below for the CRs for a specific inference backend.

[View SGLang K8s](../../../components/backends/sglang/deploy/README.md)

[View vLLM K8s](../../../components/backends/vllm/deploy/README.md)

[View TRT-LLM K8s](../../../components/backends/trtllm/deploy/README.md)

```bash
# Set the same namespace used for the platform install
export NAMESPACE=dynamo-cloud

# Deploy any example (this one uses vLLM with a Qwen model and aggregated serving)
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Test it
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```

### Deploying a particular example

```bash
# Set your dynamo root directory
cd <root-dynamo-folder>
export PROJECT_ROOT=$(pwd)
export NAMESPACE=<your-namespace> # the namespace where you deployed Dynamo Cloud
```

Deploying an example is a simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example:

```bash
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
```

## What's a DynamoGraphDeployment?
It's a Kubernetes Custom Resource that defines your inference pipeline:
- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections

You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployments and `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete one.

The scripts in the `components/<backend>/launch` folder, such as `agg.sh`, demonstrate how to serve your models locally; the corresponding YAML files, such as `agg.yaml`, show how to create a Kubernetes deployment for the same inference graph.

We provide a Custom Resource YAML file for many examples under the `deploy/` folder; see the [vLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.

**Note 1: Example Image**

The examples use a prebuilt image from the `nvcr.io` registry. You can use public images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) or build your own image and update the image location in your CR file before applying. Either way, you will need to overwrite the image in the example YAML.

### Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case (a deployment sketch follows this list):

- **Development / Testing** - Use `agg.yaml` as the base configuration
- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference
- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability
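
As a minimal sketch, switching patterns is just applying a different YAML. This assumes the stock vLLM example files and the same `${NAMESPACE}` as above:

```bash
# Swap the aggregated example for the load-balanced (router) pattern
kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}

# Watch the deployment converge
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```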

To build your own image:

```bash
./container/build.sh --framework <your-inference-framework>
```

For example, for `sglang`, run:

```bash
./container/build.sh --framework sglang
```

### Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point (see the example request after this list) that:

- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
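
For illustration, once the frontend service is port-forwarded (as in the quick start above), you can exercise that endpoint directly. A minimal sketch, assuming the `Qwen/Qwen3-0.6B` model from the earlier example is deployed:

```bash
# Send a chat completion through the OpenAI-compatible Frontend
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```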

To overwrite the image in the example:

```yaml
extraPodSpec:
  mainContainer:
    image: <image-in-your-$DYNAMO_IMAGE>
```

### Customizing Your Deployment

Example structure:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: my-llm # keep in the same dynamo namespace as the Frontend
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```

Worker command examples per backend:

```yaml
# vLLM worker
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

# SGLang worker
args:
  - >-
    python3 -m dynamo.sglang
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --tp 1
    --trust-remote-code

# TensorRT-LLM worker
args:
  - >-
    python3 -m dynamo.trtllm
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    --extra-engine-args engine_configs/agg.yaml
```

**Note 2**
Set up port forwarding if needed when deploying to Kubernetes.

List the services in your namespace:

```bash
kubectl get svc -n ${NAMESPACE}
```

Look for the one that ends in `-frontend` and use it for the port forward:

```bash
SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1)
kubectl port-forward svc/${SERVICE_NAME}-frontend 8080:8080 -n ${NAMESPACE}
```
Key customization points include:
- **Model Configuration**: Specify the model in the `args` command
- **Resource Allocation**: Configure GPU requirements under `resources.limits`
- **Scaling**: Set `replicas` for the number of worker instances
- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend `envs` (see the sketch after this list)
- **Worker Specialization**: Add the `--is-prefill-worker` flag for disaggregated prefill workers
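
A minimal sketch of the last two points, assuming the service spec accepts standard Kubernetes-style `envs` name/value pairs and using a hypothetical `VllmPrefillWorker` service name (the other fields mirror the example structure above):

```yaml
services:
  Frontend:
    dynamoNamespace: my-llm
    componentType: frontend
    replicas: 1
    envs:
      - name: DYN_ROUTER_MODE
        value: kv # route requests with KV-cache awareness
  VllmPrefillWorker: # hypothetical name for a disaggregated prefill worker
    dynamoNamespace: my-llm
    componentType: worker
    replicas: 1
    extraPodSpec:
      mainContainer:
        image: your-image
        command: ["/bin/sh", "-c"]
        args:
          - python3 -m dynamo.vllm --model YOUR_MODEL --is-prefill-worker
```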

## Additional Resources

- **[Examples](../../examples/README.md)** - Complete working examples
- **[Examples Deployment Guide](../../examples/README.md#deploying-a-particular-example)** - Deploying a particular example
- **[Create Custom Deployments](create_deployment.md)** - Build your own CRDs
- **[Operator Documentation](dynamo_operator.md)** - How the platform works
- **[Helm Charts](../../../deploy/helm/README.md)** - For advanced users
- **[Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)** - Kubernetes port forwarding