Commit 9e0ae96

kylehh authored and nealvaidya committed
docs: SNS agg k8s example (#2773)
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: nnshah1 <neelays@nvidia.com>
1 parent 043963e

2 files changed: +164 −0
Lines changed: 57 additions & 0 deletions
# Distributed Inference with Dynamo

## 1. Hosting Single-Node-Sized Models on Multiple Nodes

For an SNS (Single-Node-Sized) model, we can use Dynamo aggregated serving to deploy multiple replicas of the model and create a frontend with a configurable routing strategy.

1. Install the Dynamo CRDs
```sh
export RELEASE_VERSION=0.5.0 # any Dynamo version 0.3.2 or later works
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
```
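To confirm the CRDs registered, you can list them; the `DynamoGraphDeployment` resource used by the manifest below is served by the `nvidia.com` API group, so a matching entry should appear (exact CRD names may vary by release):

```sh
# Expect an entry such as dynamographdeployments.nvidia.com
kubectl get crd | grep -i dynamo
```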
2. Install the Dynamo platform

Create a Kubernetes namespace for your Dynamo application and install the Dynamo platform. It installs the following pods:

- etcd
- NATS
- Dynamo Operator Controller
```sh
export NAMESPACE=YOUR_DYNAMO_NAMESPACE
kubectl create namespace ${NAMESPACE}
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
```
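Before moving on, you can verify that the platform pods are up (pod names are generated by the chart and may differ slightly between releases):

```sh
# etcd, NATS, and the operator controller should all reach Ready
kubectl get pods --namespace ${NAMESPACE}
kubectl wait --for=condition=Ready pods --all --namespace ${NAMESPACE} --timeout=300s
```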
3. Host the model with the vLLM backend

This `agg_router.yaml` is adapted from the vLLM deployment [example](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/agg_router.yaml). It has the following customizations:

- Deploys the `Qwen/Qwen3-0.6B` model
- Uses KV-cache-based routing in the frontend deployment (`--router-mode kv`)
- Mounts a local cache folder, `/YOUR/LOCAL/CACHE/FOLDER`, so model artifacts can be reused across restarts
- Creates 4 replicas of the model deployment by setting `replicas: 4`
- Adds a `debug` log-level environment variable (`DYN_LOG`) for observability

Create a Kubernetes secret with your Hugging Face token, then deploy the model:
```sh
export HF_TOKEN=YOUR_HF_TOKEN
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  --namespace ${NAMESPACE}
kubectl apply -f agg_router.yaml --namespace ${NAMESPACE}
```
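The first rollout can take several minutes while each worker pulls the image and downloads the model. You can watch progress until all pods are ready (a hedged sketch; `dynamographdeployments` is the assumed plural resource name of the CRD kind):

```sh
# Check the custom resource, then watch pods come up
kubectl get dynamographdeployments --namespace ${NAMESPACE}
kubectl get pods --namespace ${NAMESPACE} --watch
```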
4. Test the deployment and run benchmarks

After deployment, forward the frontend service to access the API:
```sh
kubectl port-forward deployment/vllm-agg-router-frontend 8000:8000 -n ${NAMESPACE}
```
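Before sending inference requests, you can confirm the frontend reports healthy; this is the same `/health` route the manifest's readiness probe polls:

```sh
curl -s localhost:8000/health
```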
Then use the following request to test the deployed model:
```sh
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
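If a request fails, you can inspect a worker's logs; the worker command in the manifest tees vLLM output to `/tmp/vllm.log`, and `DYN_LOG=debug` makes the logs verbose. The deployment name below is a guess derived from the service name; check `kubectl get pods` for the actual names:

```sh
# Stream a worker's stdout/stderr (name will differ in your cluster)
kubectl logs deploy/vllm-agg-router-vllmdecodeworker -n ${NAMESPACE} -f

# Or read the teed log file inside the container
kubectl exec -n ${NAMESPACE} deploy/vllm-agg-router-vllmdecodeworker -- tail -n 50 /tmp/vllm.log
```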
You can also benchmark the performance of the endpoint with [GenAI-Perf](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html).
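A minimal GenAI-Perf invocation against the forwarded endpoint might look like the sketch below; flag names depend on your GenAI-Perf version, and the concurrency value is illustrative:

```sh
genai-perf profile \
  -m Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --url http://localhost:8000 \
  --streaming \
  --concurrency 4
```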
Lines changed: 107 additions & 0 deletions (agg_router.yaml)
1+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
4+
apiVersion: nvidia.com/v1alpha1
5+
kind: DynamoGraphDeployment
6+
metadata:
7+
name: vllm-agg-router
8+
spec:
9+
services:
10+
Frontend:
11+
livenessProbe:
12+
httpGet:
13+
path: /health
14+
port: 8000
15+
initialDelaySeconds: 60
16+
periodSeconds: 60
17+
timeoutSeconds: 30
18+
failureThreshold: 10
19+
readinessProbe:
20+
exec:
21+
command:
22+
- /bin/sh
23+
- -c
24+
- 'curl -s http://localhost:8000/health | jq -e ".status == \"healthy\""'
25+
initialDelaySeconds: 60
26+
periodSeconds: 60
27+
timeoutSeconds: 30
28+
failureThreshold: 10
29+
dynamoNamespace: vllm-agg-router
30+
componentType: main
31+
replicas: 1
32+
resources:
33+
requests:
34+
cpu: "1"
35+
memory: "2Gi"
36+
limits:
37+
cpu: "1"
38+
memory: "2Gi"
39+
extraPodSpec:
40+
mainContainer:
41+
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
42+
workingDir: /workspace/components/backends/vllm
43+
command:
44+
- /bin/sh
45+
- -c
46+
args:
47+
- "python3 -m dynamo.frontend --http-port 8000 --router-mode kv"
48+
VllmDecodeWorker:
49+
envFromSecret: hf-token-secret
50+
livenessProbe:
51+
httpGet:
52+
path: /live
53+
port: 9090
54+
periodSeconds: 5
55+
timeoutSeconds: 30
56+
failureThreshold: 1
57+
readinessProbe:
58+
httpGet:
59+
path: /health
60+
port: 9090
61+
periodSeconds: 10
62+
timeoutSeconds: 30
63+
failureThreshold: 60
64+
dynamoNamespace: vllm-agg-router
65+
componentType: worker
66+
replicas: 4
67+
resources:
68+
requests:
69+
cpu: "10"
70+
memory: "20Gi"
71+
gpu: "1"
72+
limits:
73+
cpu: "10"
74+
memory: "20Gi"
75+
gpu: "1"
76+
envs:
77+
- name: DYN_SYSTEM_ENABLED
78+
value: "true"
79+
- name: DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS
80+
value: "[\"generate\"]"
81+
- name: DYN_SYSTEM_PORT
82+
value: "9090"
83+
- name: DYN_LOG
84+
value: "debug"
85+
extraPodSpec:
86+
volumes:
87+
- name: local-model-cache
88+
hostPath:
89+
path: /YOUR/LOCAL/CACHE/FOLDER
90+
type: DirectoryOrCreate
91+
mainContainer:
92+
startupProbe:
93+
httpGet:
94+
path: /health
95+
port: 9090
96+
periodSeconds: 10
97+
failureThreshold: 60
98+
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
99+
volumeMounts:
100+
- name: local-model-cache
101+
mountPath: /root/.cache
102+
workingDir: /workspace/components/backends/vllm
103+
command:
104+
- /bin/sh
105+
- -c
106+
args:
107+
- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B 2>&1 | tee /tmp/vllm.log
