Add TRT-LLM support to the SLA planner

Signed-off-by: hongkuanz <[email protected]>
ai-dynamo · tedzhouhk · Sep 10, 2025 · Sep 5, 2025 · Sep 8, 2025 · Sep 9, 2025
commit 40013243a5922ace30e92a5fae3129644990f798
diff --git a/README.md b/README.md
@@ -59,7 +59,7 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
 | [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
 | [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
 | [**Load Based Planner**](/docs/architecture/load_planner.md) | 🚧 | 🚧 | 🚧 |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | 🚧 |
+| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | ✅ | ✅ |
 | [**KVBM**](/docs/architecture/kvbm_architecture.md) | ✅ | ✅ | 🚧 |
 
 To learn more about each framework and their capabilities, check out each framework's README!

@@ -16,6 +16,7 @@
 import json
 import logging
 import re
+import shlex
 from typing import Literal, Optional, Protocol
 
 from pydantic import BaseModel
@@ -83,11 +84,13 @@ def break_arguments(args: list[str] | None) -> list[str]:
     if args is None:
         return ans
     if isinstance(args, str):
-        ans = re.split(r"[ =]", args)
+        # Use shlex.split to properly handle quoted arguments and JSON values
+        ans = shlex.split(args)
     else:
         for arg in args:
             if arg is not None:
-                ans.extend(arg.split(" "))
+                # Use shlex.split to properly handle quoted arguments
+                ans.extend(shlex.split(arg))
     return ans
 
 
@@ -102,7 +105,8 @@ def remove_valued_arguments(args: list[str], key: str) -> list[str]:
 
 
 def join_arguments(args: list[str]) -> list[str]:
-    return [" ".join(args)]
+    # Use shlex.join to properly quote arguments that contain spaces or special characters
+    return [shlex.join(args)]
 
 
 def append_argument(args: list[str], to_append) -> list[str]:
@@ -712,6 +716,7 @@ def set_config_tp_size(cls, config: dict, tp_size: int):
             raise ValueError("Missing extraPodSpec or mainContainer in worker service")
         args = worker_service.extraPodSpec.mainContainer.args
 
+        # Break arguments to handle both joined strings and lists
         args = break_arguments(args)
 
         # For TRT-LLM, we need to update the override-engine-args
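
The switch from `re.split`/`str.split` to `shlex` in the hunks above can be illustrated with a minimal sketch. The argument string and the `tensor_parallel_size` key below are assumptions chosen for illustration, not values taken from this commit; only the standard-library calls are real. The sketch also shows one behavioural difference worth keeping in mind: unlike the old regex, `shlex.split` does not split on `=`.

```python
import json
import re
import shlex

# Hypothetical joined argument string; the JSON payload is illustrative only.
raw = "--model-path Qwen/Qwen3-0.6B --override-engine-args '{\"tensor_parallel_size\": 2}'"

# Old behaviour: splitting on spaces and "=" cuts the quoted JSON in two.
old_tokens = re.split(r"[ =]", raw)

# New behaviour: shell-style splitting keeps the quoted JSON as one token
# and strips the surrounding quotes.
new_tokens = shlex.split(raw)
engine_args = json.loads(new_tokens[-1])

# shlex.join re-quotes tokens that need it, so split/join round-trips.
rejoined = shlex.join(new_tokens)
round_trip = shlex.split(rejoined)

# Difference to be aware of: "--key=value" is no longer split at "=".
eq_old = re.split(r"[ =]", "--backend=trtllm")
eq_new = shlex.split("--backend=trtllm")

print(len(old_tokens), len(new_tokens))
print(engine_args)
print(eq_old, eq_new)
```

Because `--key=value` now stays a single token, any caller that relied on the old `=` splitting would need to handle that form separately; whether such callers exist is not visible in this diff.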

diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
@@ -55,7 +55,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
 | [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
 | [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
+| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | ✅ |  |
 | [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
 | [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
 

diff --git a/components/backends/trtllm/deploy/README.md b/components/backends/trtllm/deploy/README.md
@@ -42,6 +42,19 @@ Aggregated deployment with custom configuration.
 - `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
 - `TRTLLMWorker`: Single worker handling both prefill and decode with custom configuration mounted from the configmap
 
+### 6. **Disaggregated Planner Deployment** (`disagg_planner.yaml`)
+Advanced disaggregated deployment with SLA-based automatic scaling.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `Planner`: SLA-based planner that monitors performance and scales workers automatically
+- `Prometheus`: Metrics collection and monitoring
+- `TRTLLMDecodeWorker`: Specialized decode-only worker
+- `TRTLLMPrefillWorker`: Specialized prefill-only worker
+
+> [!NOTE]
+> This deployment requires pre-deployment profiling to be completed first. See [Pre-Deployment Profiling](../../../../docs/benchmarks/pre_deployment_profiling.md) for detailed instructions.
+
 ## CRD Structure
 
 All templates use the **DynamoGraphDeployment** CRD:

diff --git a/components/backends/trtllm/deploy/disagg_planner.yaml b/components/backends/trtllm/deploy/disagg_planner.yaml
@@ -0,0 +1,205 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: trtllm-disagg-planner
+spec:
+  envs:
+    - name: DYNAMO_SERVICE_CONFIG
+      value: '{"Prometheus":{"global":{"scrape_interval":"5s"},"scrape_configs":[{"job_name":"prometheus","static_configs":[{"targets":["localhost:8000"]}]},{"job_name":"frontend","static_configs":[{"targets":["trtllm-disagg-planner-frontend:8000"]}]}]}}'
+    - name: DYNAMO_NAMESPACE
+      value: "trtllm-disagg-planner"
+  services:
+    Frontend:
+      dynamoNamespace: trtllm-disagg-planner
+      componentType: frontend
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.frontend
+            - --http-port
+            - "8000"
+            - --kv-cache-block-size
+            - "128"
+            - --router-mode
+            - kv
+            - --kv-overlap-score-weight
+            - "0.0"
+            - --router-temperature
+            - "0.0"
+            - --no-kv-events
+    Planner:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: planner
+      replicas: 1
+      envs:
+        - name: PROMETHEUS_PORT
+          value: "8000"
+      livenessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      readinessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        initialDelaySeconds: 60
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      pvc:
+        create: false
+        name: dynamo-pvc # Must be pre-created before deployment and SLA profiler must have been run
+        mountPoint: /workspace/profiling_results
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/planner/src/dynamo/planner
+          ports:
+            - name: metrics
+              containerPort: 9085
+          command:
+            - python3
+          args:
+            - -m
+            - planner_sla
+            - --environment=kubernetes
+            - --backend=trtllm
+            - --adjustment-interval=60
+            - --profile-results-dir=/workspace/profiling_results
+            - --prometheus-port=9085
+    Prometheus: # NOTE: this is set on Prometheus to ensure a service is created for the Prometheus component. This is a workaround and should be managed differently.
+      dynamoNamespace: trtllm-disagg-planner
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: PYTHONPATH
+          value: "/workspace/components/planner/src"
+        - name: PROMETHEUS_PORT
+          value: "8000"
+      livenessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      readinessProbe:
+        exec:
+          command:
+            - /bin/sh
+            - -c
+            - "exit 0"
+        initialDelaySeconds: 30
+        periodSeconds: 60
+        timeoutSeconds: 30
+        failureThreshold: 10
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.planner.prometheus
+    TRTLLMDecodeWorker:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      livenessProbe:
+        httpGet:
+          path: /live
+          port: 9090
+        periodSeconds: 5
+        timeoutSeconds: 30
+        failureThreshold: 1
+      readinessProbe:
+        httpGet:
+          path: /health
+          port: 9090
+        periodSeconds: 10
+        timeoutSeconds: 30
+        failureThreshold: 60
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        terminationGracePeriodSeconds: 600
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            failureThreshold: 60
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.trtllm
+            - --model-path
+            - Qwen/Qwen3-0.6B
+            - --served-model-name
+            - Qwen/Qwen3-0.6B
+            - --extra-engine-args
+            - engine_configs/decode.yaml
+            - --disaggregation-mode
+            - decode
+            - --disaggregation-strategy
+            - decode_first
+    TRTLLMPrefillWorker:
+      dynamoNamespace: trtllm-disagg-planner
+      envFromSecret: hf-token-secret
+      componentType: worker
+      replicas: 1
+      resources:
+        limits:
+          gpu: "1"
+      extraPodSpec:
+        terminationGracePeriodSeconds: 600
+        mainContainer:
+          startupProbe:
+            httpGet:
+              path: /health
+              port: 9090
+            periodSeconds: 10
+            failureThreshold: 60
+          image: nvcr.io/nvidian/dynamo-dev/dynamo-trtllm-runtime:hzhou-0909-03
+          workingDir: /workspace/components/backends/trtllm
+          command:
+            - python3
+          args:
+            - -m
+            - dynamo.trtllm
+            - --model-path
+            - Qwen/Qwen3-0.6B
+            - --served-model-name
+            - Qwen/Qwen3-0.6B
+            - --extra-engine-args
+            - engine_configs/prefill.yaml
+            - --disaggregation-mode
+            - prefill
+            - --disaggregation-strategy
+            - decode_first
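
The `DYNAMO_SERVICE_CONFIG` value in the manifest above is a JSON document embedded in a YAML string, which is hard to read inline. As a rough sketch, the snippet below parses that same string with the standard library and lists the scrape targets per job. The variable names are hypothetical, and how Dynamo itself consumes this variable is not shown in this diff.

```python
import json

# The JSON string copied from the DYNAMO_SERVICE_CONFIG env entry above.
service_config = (
    '{"Prometheus":{"global":{"scrape_interval":"5s"},'
    '"scrape_configs":['
    '{"job_name":"prometheus","static_configs":[{"targets":["localhost:8000"]}]},'
    '{"job_name":"frontend","static_configs":'
    '[{"targets":["trtllm-disagg-planner-frontend:8000"]}]}'
    "]}}"
)

config = json.loads(service_config)
prometheus = config["Prometheus"]

# Map each scrape job to the targets of its first static config.
targets = {
    job["job_name"]: job["static_configs"][0]["targets"]
    for job in prometheus["scrape_configs"]
}

scrape_interval = prometheus["global"]["scrape_interval"]
print(scrape_interval)
for name, hosts in targets.items():
    print(name, hosts)
```

The `frontend` target uses the service name `trtllm-disagg-planner-frontend`, which matches the deployment name `trtllm-disagg-planner` in the manifest; renaming the deployment would presumably require updating this target as well.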
@@ -34,7 +34,7 @@ def create_sla_planner_parser() -> argparse.ArgumentParser:
     parser.add_argument(
         "--backend",
         default=SLAPlannerDefaults.backend,
-        choices=["vllm", "sglang"],
+        choices=["vllm", "sglang", "trtllm"],
         help="Backend type",
     )
     parser.add_argument(
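
The effect of extending `choices` can be sketched with a stand-alone parser. This is not the project's `create_sla_planner_parser`; the function name `build_parser` and the `"vllm"` default are assumptions standing in for `SLAPlannerDefaults.backend`, whose value is not shown in this diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the SLA planner parser, reduced to one flag."""
    parser = argparse.ArgumentParser(prog="planner_sla_sketch")
    parser.add_argument(
        "--backend",
        default="vllm",  # assumed default; the real one comes from SLAPlannerDefaults
        choices=["vllm", "sglang", "trtllm"],
        help="Backend type",
    )
    return parser


# "--backend=trtllm" mirrors the flag form used in disagg_planner.yaml.
args = build_parser().parse_args(["--backend=trtllm"])

# Values outside `choices` make argparse print an error and exit.
try:
    build_parser().parse_args(["--backend", "unknown"])
    rejected = False
except SystemExit:
    rejected = True

print(args.backend, rejected)
```

Before this change, passing `--backend=trtllm` would have hit the same rejection path as `unknown` in the sketch.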

diff --git a/docs/architecture/planner_intro.rst b/docs/architecture/planner_intro.rst
@@ -44,7 +44,7 @@ Key features include:
      - ✅
      - vLLM
    * -
-     - ❌
+     - ✅
      - TensorRT-LLM
    * -
      - ❌

diff --git a/docs/benchmarks/pre_deployment_profiling.md b/docs/benchmarks/pre_deployment_profiling.md
@@ -14,7 +14,7 @@ Support matrix:
 | vLLM | MoE | 🚧 |
 | SGLang | Dense | ✅ |
 | SGLang | MoE | 🚧 |
-| TensorRT-LLM | Dense | 🚧 |
+| TensorRT-LLM | Dense | ✅ |
 | TensorRT-LLM | MoE | 🚧 |
 
 > [!NOTE]
@@ -168,7 +168,7 @@ kubectl get jobs -n $NAMESPACE
 kubectl logs job/profile-sla -n $NAMESPACE
 ```
 
-### Viewing Profiling Results
+### Viewing Profiling Results 
 
 After the profiling job completes successfully, the results are stored in the persistent volume claim (PVC) created during Step 2.
 

diff --git a/docs/guides/dynamo_deploy/sla_planner_deployment.md b/docs/guides/dynamo_deploy/sla_planner_deployment.md
@@ -34,18 +34,21 @@ export NAMESPACE=your-namespace
 
 ## 1. Deploy the System
 
-We use vllm as the backend engine in this guide. SLA planner also supports SGLang and will support TensorRT-LLM. Checkout `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
+We use vLLM as the backend engine in this guide. The SLA planner also supports SGLang and TensorRT-LLM. Check out `disagg_planner.yaml` in each backend's example deployment folder for more details. The deployment steps are the same for all backends.
 
 ```bash
 # Apply the disaggregated planner deployment
 kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
+# kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang
+# kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm
 
 # Check deployment status
 kubectl get pods -n $NAMESPACE
 ```
 
 Expected pods (all should be `1/1 Running`):
 ```
+# For vLLM:
 vllm-disagg-planner-frontend-*            1/1 Running
 vllm-disagg-planner-prometheus-*          1/1 Running
 vllm-disagg-planner-planner-*             1/1 Running