Merged
Changes from 1 commit
Commits
36 commits
1aa8c32
chore: update CODEOWNERS for multimodal examples (#2878)
biswapanda Sep 5, 2025
ece24b1
first attempt
tedzhouhk Sep 8, 2025
c5633e1
add aiofiles
tedzhouhk Sep 9, 2025
0846ae7
fix trtllm docker file
tedzhouhk Sep 9, 2025
90b5ff1
Revert "fix trtllm docker file"
tedzhouhk Sep 9, 2025
f2a66cb
install requirement.txt
tedzhouhk Sep 9, 2025
4001324
add trtllm to sla planner
tedzhouhk Sep 9, 2025
ebc7611
fix: fix hermes tool call config (#2915)
ayushag-nv Sep 8, 2025
e63ec2e
ci: OPS-724: Move to ARC runners (#2904)
dillon-cullinan Sep 8, 2025
37f2778
fix: CI is broken with a deprecated dependency on pynvml (#2926)
saturley-hall Sep 8, 2025
cd1115a
fix: fix typo in multinode example (#2931)
julienmancuso Sep 8, 2025
327f3fe
ci: Add concurrency check to auto cancel running actions. (#2438)
pvijayakrish Sep 8, 2025
8b1b24c
chore: added utility to detect possible tool call start for a chunk (…
ayushag-nv Sep 8, 2025
d34cfdd
chore: add preference logic for using tool-call and reasoning parsers…
ayushag-nv Sep 8, 2025
dad62a5
Update README.md (#2938)
harryskim Sep 8, 2025
64ba7f3
build: OPS-597: restructure sglang to follow container strategy struc…
nv-tusharma Sep 8, 2025
766d5b2
refactor: standardize e2e tests across 3 frameworks (#2827)
alec-flowers Sep 8, 2025
e41c5bb
feat: automatically setup and inject prometheus configuration (#2912)
julienmancuso Sep 9, 2025
1803db8
fix: WAR DeepGemm JIT compilation errors (#2937)
GuanLuo Sep 9, 2025
a76fd70
ci: sglang functional tests (#2943)
alec-flowers Sep 9, 2025
8f1f965
feat: update benchmarking and deploy utils (#2933)
hhzhang16 Sep 9, 2025
4db7fcf
feat: Add a checksum to ModelDeploymentCard fields (#2934)
grahamking Sep 9, 2025
351464b
ci: Fix Dockerfile mount secrets (#2960)
dillon-cullinan Sep 9, 2025
51c75e1
chore: added tool call schema validation in oai formatter (#2935)
ayushag-nv Sep 9, 2025
f7090a3
test: remove nighlty marker in kvbm tests (#2958)
nv-anants Sep 9, 2025
f5644ef
ci: remove pre-merge ignore in github workflow (#2940)
nv-anants Sep 9, 2025
1a412eb
ci: longer timeout, change model for l40 (#2951)
alec-flowers Sep 9, 2025
b19deaf
fix: aggregate logprobs (#2928)
messiaen Sep 9, 2025
7148426
fix: no reasoning parser by default (#2939)
nealvaidya Sep 9, 2025
a2e3b52
docs: fix broken links (#2965)
nv-nmailhot Sep 9, 2025
37213b6
feat: add a virtual connector for 3rd party deployments (#2913)
tedzhouhk Sep 9, 2025
3af3425
fix: dyn namespace scoping for trtllm
biswapanda Sep 9, 2025
436307c
pc
tedzhouhk Sep 10, 2025
e62f664
remove duplicate
tedzhouhk Sep 10, 2025
a4a3e66
Merge branch 'main' of https://github.com/ai-dynamo/dynamo into hzhou…
tedzhouhk Sep 10, 2025
e6ac2a7
address pr comment
tedzhouhk Sep 10, 2025
add trtllm to sla planner
Signed-off-by: hongkuanz <[email protected]>
tedzhouhk committed Sep 10, 2025
commit 40013243a5922ace30e92a5fae3129644990f798
2 changes: 1 addition & 1 deletion README.md
@@ -59,7 +59,7 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 | [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
 | [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
 | [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | 🚧 |
+| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
 | [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | ✅ | 🚧 |
 
 To learn more about each framework and their capabilities, check out each framework's README!
11 changes: 8 additions & 3 deletions benchmarks/profiler/utils/config.py
@@ -16,6 +16,7 @@
 import json
 import logging
 import re
+import shlex
 from typing import Literal, Optional, Protocol
 
 from pydantic import BaseModel
@@ -83,11 +84,13 @@ def break_arguments(args: list[str] | None) -> list[str]:
     if args is None:
         return ans
     if isinstance(args, str):
-        ans = re.split(r"[ =]", args)
+        # Use shlex.split to properly handle quoted arguments and JSON values
+        ans = shlex.split(args)
     else:
         for arg in args:
             if arg is not None:
-                ans.extend(arg.split(" "))
+                # Use shlex.split to properly handle quoted arguments
+                ans.extend(shlex.split(arg))
     return ans
 
 
@@ -102,7 +105,8 @@ def remove_valued_arguments(args: list[str], key: str) -> list[str]:
 
 
 def join_arguments(args: list[str]) -> list[str]:
-    return [" ".join(args)]
+    # Use shlex.join to properly quote arguments that contain spaces or special characters
+    return [shlex.join(args)]
 
 
 def append_argument(args: list[str], to_append) -> list[str]:
@@ -712,6 +716,7 @@ def set_config_tp_size(cls, config: dict, tp_size: int):
             raise ValueError("Missing extraPodSpec or mainContainer in worker service")
         args = worker_service.extraPodSpec.mainContainer.args
 
+        # Break arguments to handle both joined strings and lists
         args = break_arguments(args)
 
         # For TRT-LLM, we need to update the override-engine-args
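The `break_arguments` and `join_arguments` changes swap `re.split` and a plain `" ".join` for `shlex.split` and `shlex.join`. A minimal sketch of why that matters, assuming a hypothetical worker argument string that carries a quoted JSON value. The flag spelling `--override-engine-args` is borrowed from the code comment in `config.py`; the JSON payload and the helper names are invented for illustration and are not the functions in `config.py`.

```python
import json
import re
import shlex

# Hypothetical worker argument string. The flag name comes from the code comment
# in config.py; the JSON payload is invented purely for illustration.
ARGS = (
    "--model-path Qwen/Qwen3-0.6B "
    "--override-engine-args '{\"tensor_parallel_size\": 2}'"
)


def split_old(args: str) -> list[str]:
    """Tokenise like the removed string branch: break on every space or '='."""
    return re.split(r"[ =]", args)


def split_new(args: str) -> list[str]:
    """Tokenise with POSIX shell rules, so a quoted value stays one token."""
    return shlex.split(args)


def join_old(tokens: list[str]) -> str:
    """Join like the removed join_arguments: plain spaces, no quoting."""
    return " ".join(tokens)


def join_new(tokens: list[str]) -> str:
    """Join with shell quoting so split_new() recovers the same tokens."""
    return shlex.join(tokens)


old_tokens = split_old(ARGS)
new_tokens = split_new(ARGS)
print(old_tokens)  # JSON value is cut in two and keeps its literal quote characters
print(new_tokens)  # JSON value survives as a single token
print(json.loads(new_tokens[-1]))  # and still parses as JSON

# shlex.join followed by shlex.split is lossless; a plain space join is not.
print(split_new(join_new(new_tokens)) == new_tokens)
print(split_new(join_old(new_tokens)) == new_tokens)
```

One behavioural difference to keep in mind: the old string branch also split on `=`, so `--flag=value` became two tokens, whereas `shlex.split` keeps it as a single token.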
2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -55,7 +55,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ | |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ | |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
+| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ | |
 | [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
 | [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
 
13 changes: 13 additions & 0 deletions components/backends/trtllm/deploy/README.md
@@ -42,6 +42,19 @@ Aggregated deployment with custom configuration.
 - `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
 - `TRTLLMWorker`: Single worker handling both prefill and decode with custom configuration mounted from the configmap
 
+### 6. **Disaggregated Planner Deployment** (`disagg_planner.yaml`)
+Advanced disaggregated deployment with SLA-based automatic scaling.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `Planner`: SLA-based planner that monitors performance and scales workers automatically
+- `Prometheus`: Metrics collection and monitoring
+- `TRTLLMDecodeWorker`: Specialized decode-only worker
+- `TRTLLMPrefillWorker`: Specialized prefill-only worker
+
+> [!NOTE]
+> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/benchmarks/pre_deployment_profiling.md) for detailed instructions.
+
 ## CRD Structure
 
 All templates use the **DynamoGraphDeployment** CRD:
205 changes: 205 additions & 0 deletions components/backends/trtllm/deploy/disagg_planner.yaml
@@ -0,0 +1,205 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: trtllm-disagg-planner
+spec:
+  envs:
+    - name: DYNAMO_SERVICE_CONFIG
+      value: '{"Prometheus":{"global":{"scrape_interval":"5s"},"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:8000"]}]},{"job_name":"frontend","static_configs":[{"targets":["trtllm-disagg-planner-frontend:8000"]}]}]}}'
+    - name: DYNAMO_NAMESPACE
+      value: "trtllm-disagg-planner"
+  services:
+    Frontend:
+      dynamoNamespace: trtllm-disagg-planner
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.frontend
+            - --http-port
+            - "8000"
+            - --kv-cache-block-size
+            - "128"
+            - --router-mode
+            - kv
+            - --kv-overlap-score-weight
+            - "0.0"
+            - --router-temperature
+            - "0.0"
+            - --no-kv-events
+    Planner:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: planner
+      replicas: 1
+      envs:
+        - name: PROMETHEUS_PORT
+          value: "8000"
+      livenessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      readinessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        initialDelaySeconds: 60
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      pvc:
+        create: false
+        name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
+        mountPoint: /workspace/profiling_results
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/planner/src/dynamo/planner
+          ports:
+            - name: metrics
+              containerPort: 9085
+          command:
+            - python3
+          args:
+            - -m
+            - planner_sla
+            - --environment=kubernetes
+            - --backend=trtllm
+            - --adjustment-interval=60
+            - --profile-results-dir=/workspace/profiling_results
+            - --prometheus-port=9085
+    Prometheus: # NOTE: this is set on Prometheus to ensure a service is created for the Prometheus component. This is a workaround and should be managed differently.
+      dynamoNamespace: trtllm-disagg-planner
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: PYTHONPATH
+          value: "/workspace/components/planner/src"
+        - name: PROMETHEUS_PORT
+          value: "8000"
+      livenessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      readinessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        initialDelaySeconds: 30
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.planner.prometheus
+    TRTLLMDecodeWorker:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      livenessProbe:
+        httpGet:
+          path: /live
+          port: 9090
+        periodSeconds: 5
+        timeoutSeconds: 30
+        failureThreshold: 1
+      readinessProbe:
+        httpGet:
+          path: /health
+          port: 9090
+        periodSeconds: 10
+        timeoutSeconds: 30
+        failureThreshold: 60
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        terminationGracePeriodSeconds: 600
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            failureThreshold: 60
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.trtllm
+            - --model-path
+            - Qwen/Qwen3-0.6B
+            - --served-model-name
+            - Qwen/Qwen3-0.6B
+            - --extra-engine-args
+            - engine_configs/decode.yaml
+            - --disaggregation-mode
+            - decode
+            - --disaggregation-strategy
+            - decode_first
+    TRTLLMPrefillWorker:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        terminationGracePeriodSeconds: 600
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            failureThreshold: 60
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.trtllm
+            - --model-path
+            - Qwen/Qwen3-0.6B
+            - --served-model-name
+            - Qwen/Qwen3-0.6B
+            - --extra-engine-args
+            - engine_configs/prefill.yaml
+            - --disaggregation-mode
+            - prefill
+            - --disaggregation-strategy
+            - decode_first
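The `DYNAMO_SERVICE_CONFIG` value in this manifest is a JSON-encoded Prometheus scrape configuration packed into one string, which is hard to read inline. A small decoding sketch using only the standard library makes the nesting visible. How Dynamo itself consumes this variable is not part of this diff, and `scrape_targets` is a hypothetical helper written for illustration.

```python
import json

# DYNAMO_SERVICE_CONFIG value copied from disagg_planner.yaml, split for readability.
SERVICE_CONFIG = (
    '{"Prometheus":{"global":{"scrape_interval":"5s"},'
    '"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:8000"]}]},'
    '{"job_name":"frontend","static_configs":[{"targets":["trtllm-disagg-planner-frontend:8000"]}]}]}}'
)

# Port passed to dynamo.frontend via --http-port in the same manifest.
FRONTEND_HTTP_PORT = "8000"


def scrape_targets(raw: str) -> dict[str, list[str]]:
    """Hypothetical helper: map each scrape job name to its host:port targets."""
    prometheus = json.loads(raw)["Prometheus"]
    targets: dict[str, list[str]] = {}
    for job in prometheus["scrape_configs"]:
        for static_config in job["static_configs"]:
            targets.setdefault(job["job_name"], []).extend(static_config["targets"])
    return targets


interval = json.loads(SERVICE_CONFIG)["Prometheus"]["global"]["scrape_interval"]
jobs = scrape_targets(SERVICE_CONFIG)
print("scrape interval:", interval)
for name, hosts in jobs.items():
    print(name, hosts)

# The frontend job should scrape the same port the frontend serves HTTP on.
print(all(h.endswith(":" + FRONTEND_HTTP_PORT) for h in jobs["frontend"]))
```

In this manifest the `frontend` job targets port 8000 on `trtllm-disagg-planner-frontend`, which matches the `--http-port 8000` passed to `dynamo.frontend`.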
@@ -34,7 +34,7 @@ def create_sla_planner_parser() -> argparse.ArgumentParser:
     parser.add_argument(
         "--backend",
         default=SLAPlannerDefaults.backend,
-        choices=["vllm", "sglang"],
+        choices=["vllm", "sglang", "trtllm"],
         help="Backend type",
     )
     parser.add_argument(
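With `trtllm` added to `choices`, argparse accepts `--backend=trtllm` (the form used in `disagg_planner.yaml`) and still rejects any other name. A minimal, self-contained sketch of just that option follows. `DEFAULT_BACKEND` is a hypothetical placeholder because the value of `SLAPlannerDefaults.backend` is not shown in this diff, and `is_supported` is an illustrative helper, not part of the planner.

```python
import argparse
import contextlib
import io

# Hypothetical placeholder: the real SLAPlannerDefaults.backend is not shown in this diff.
DEFAULT_BACKEND = "vllm"


def create_parser() -> argparse.ArgumentParser:
    """Minimal parser that reproduces only the --backend option from the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--backend",
        default=DEFAULT_BACKEND,
        choices=["vllm", "sglang", "trtllm"],
        help="Backend type",
    )
    return parser


def is_supported(backend: str) -> bool:
    """Return True if argparse accepts the given backend name."""
    # argparse reports an invalid choice by writing to stderr and raising SystemExit.
    with contextlib.redirect_stderr(io.StringIO()):
        try:
            create_parser().parse_args(["--backend", backend])
        except SystemExit:
            return False
    return True


args = create_parser().parse_args(["--backend=trtllm"])
print(args.backend)
print(is_supported("trtllm"), is_supported("tensorrt-llm"))
```

Restricting the flag with `choices` means a misspelled backend fails at startup with a usage error instead of reaching the planner logic.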
2 changes: 1 addition & 1 deletion docs/architecture/planner_intro.rst
@@ -44,7 +44,7 @@ Key features include:
      - ✅
      - vLLM
    * -
-     -
+     - ✅
      - TensorRT-LLM
    * -
      - ❌
4 changes: 2 additions & 2 deletions docs/benchmarks/pre_deployment_profiling.md
@@ -14,7 +14,7 @@ Support matrix:
 | vLLM | MoE | 🚧 |
 | SGLang | Dense | ✅ |
 | SGLang | MoE | 🚧 |
-| TensorRT-LLM | Dense | 🚧 |
+| TensorRT-LLM | Dense | ✅ |
 | TensorRT-LLM | MoE | 🚧 |
 
 > [!NOTE]
@@ -168,7 +168,7 @@ kubectl get jobs -n $NAMESPACE
 kubectl logs job/profile-sla -n $NAMESPACE
 ```
 
-### Viewing Profiling Results
+### Viewing Profiling Results
 
 After the profiling job completes successfully, the results are stored in the persistent volume claim (PVC) created during Step 2.
 
5 changes: 4 additions & 1 deletion docs/guides/dynamo_deploy/sla_planner_deployment.md
@@ -34,18 +34,21 @@ export NAMESPACE=your-namespace
 
 ## 1. Deploy the System
 
-We use vllm as the backend engine in this guide. SLA planner also supports SGLang and will support TensorRT-LLM. Checkout `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
+We use vllm as the backend engine in this guide. SLA planner also supports SGLang and TensorRT-LLM. Checkout `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
 
 ```bash
 # Apply the disaggregated planner deployment
 kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
 # kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang
+# kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm
 
 # Check deployment status
 kubectl get pods -n $NAMESPACE
 ```
 
 Expected pods (all should be `1/1 Running`):
 ```
+# For vLLM:
 vllm-disagg-planner-frontend-* 1/1 Running
 vllm-disagg-planner-prometheus-* 1/1 Running
 vllm-disagg-planner-planner-* 1/1 Running