162 changes: 162 additions & 0 deletions components/backends/sglang/deploy/README.md
@@ -0,0 +1,162 @@
# SGLang Kubernetes Deployment Configurations

This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource.

## Available Deployment Patterns

### 1. **Aggregated Deployment** (`agg.yaml`)
Basic deployment pattern with frontend and a single decode worker.

**Architecture:**
- `Frontend`: OpenAI-compatible API server
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
Enhanced aggregated deployment with KV cache routing capabilities.

**Architecture:**
- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
- `SGLangDecodeWorker`: Single worker handling both prefill and decode

### 3. **Disaggregated Deployment** (`disagg.yaml`)
High-performance deployment with separated prefill and decode workers.

**Architecture:**
- `Frontend`: HTTP API server coordinating between workers
- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)

## CRD Structure

All templates use the **DynamoGraphDeployment** CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: <deployment-name>
spec:
services:
<ServiceName>:
# Service configuration
```
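
The shipped templates are the source of truth, but as an illustrative sketch only, a disaggregated deployment might declare one service per component as below. The image, `workingDir`, and worker module are taken from the container configuration shown in the next section; reusing the same module for the prefill worker is an assumption:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-disagg              # illustrative name
spec:
  services:
    Frontend: {}                   # OpenAI-compatible API server; configuration elided
    SGLangDecodeWorker:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          workingDir: /workspace/components/backends/sglang
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
            - "--disaggregation-mode"
            - "decode"
            - "--disaggregation-transfer-backend"
            - "nixl"
    SGLangPrefillWorker:
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          workingDir: /workspace/components/backends/sglang
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
            - "--disaggregation-mode"
            - "prefill"
            - "--disaggregation-transfer-backend"
            - "nixl"
```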

### Key Configuration Options

**Resource Management:**
```yaml
resources:
requests:
cpu: "10"
memory: "20Gi"
gpu: "1"
limits:
cpu: "10"
memory: "20Gi"
gpu: "1"
```

**Container Configuration:**
```yaml
extraPodSpec:
mainContainer:
image: my-registry/sglang-runtime:my-tag
workingDir: /workspace/components/backends/sglang
args:
- "python3"
- "-m"
- "dynamo.sglang.worker"
# Model-specific arguments
```

## Prerequisites

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
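
As a quick sanity check before deploying (assuming the NVIDIA device plugin advertises GPUs as `nvidia.com/gpu`, which is the common setup), you can verify GPU capacity and the token secret:

```bash
# Confirm at least one node advertises GPU capacity
kubectl describe nodes | grep "nvidia.com/gpu"

# Confirm the HuggingFace token secret exists in your namespace
kubectl get secret hf-token-secret -n ${NAMESPACE}
```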

## Usage

### 1. Choose Your Template
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for development/testing
- Use `agg_router.yaml` for production with load balancing
- Use `disagg.yaml` for maximum performance

### 2. Customize Configuration
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag

# Configure your model
args:
- "--model-path"
- "your-org/your-model"
- "--served-model-name"
- "your-org/your-model"
```

### 3. Deploy

Apply the chosen deployment file with `kubectl`.

First, create a secret for your HuggingFace token:
```bash
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN=${HF_TOKEN} \
-n ${NAMESPACE}
```

Then, deploy the model by applying the deployment file:

```bash
export DEPLOYMENT_FILE=agg.yaml
kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
```
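
After applying, you can watch the rollout. The resource name below assumes the DynamoGraphDeployment kind is queryable directly; if it is not recognized on your cluster, fall back to `kubectl get pods`:

```bash
# Check the status of the custom resource (kind name assumed queryable as-is)
kubectl get dynamographdeployment -n ${NAMESPACE}

# Watch the pods created for the frontend and worker services
kubectl get pods -n ${NAMESPACE} -w
```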

### 4. Using Custom Dynamo Frameworks Image for SGLang

To use a custom Dynamo frameworks image for SGLang, update the deployment file with `yq`:

```bash
export DEPLOYMENT_FILE=agg.yaml
export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>

yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
```
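
To confirm the substitution before applying, you can read the same field back from the generated file using the yq path expression above:

```bash
yq '.spec.services.[].extraPodSpec.mainContainer.image' $DEPLOYMENT_FILE.generated
```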

## Model Configuration

All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang arguments and configuration through the worker `args`.
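
For example, switching models only requires changing the worker `args`. Everything below mirrors the container configuration above, except the tensor-parallel flag, which is an assumption included to illustrate passing extra SGLang arguments (check the SGLang server arguments for your version):

```yaml
args:
  - "python3"
  - "-m"
  - "dynamo.sglang.worker"
  - "--model-path"
  - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
  - "--served-model-name"
  - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
  - "--tp"   # assumption: tensor parallelism flag; adjust or omit as needed
  - "2"
```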

## Monitoring and Health

- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
- **Liveness probes**: Check process health every 60s
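
A minimal way to hit the health endpoint from outside the cluster, assuming you forward the frontend service locally (substitute the service name reported by `kubectl get svc`):

```bash
# Forward the frontend port locally (service name is deployment-specific)
kubectl port-forward svc/<frontend-service> 8000:8000 -n ${NAMESPACE}

# In another shell, query the health endpoint
curl http://localhost:8000/health
```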

## Further Reading

- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)

## Troubleshooting

Common issues and solutions:

1. **Pod fails to start**: Check image registry access and HuggingFace token secret
2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size
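
Standard kubectl commands cover most of these cases; pod names below are placeholders:

```bash
# Inspect scheduling, image-pull, and resource allocation problems
kubectl describe pod <pod-name> -n ${NAMESPACE}

# Review model loading and runtime logs
kubectl logs <pod-name> -n ${NAMESPACE}

# Check namespace events for GPU allocation or probe failures
kubectl get events -n ${NAMESPACE} --sort-by=.lastTimestamp
```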

For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
49 changes: 3 additions & 46 deletions components/backends/trtllm/README.md
@@ -187,61 +187,18 @@ For comprehensive instructions on multinode serving, see the [multinode-examples

### Kubernetes Deployment

For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:

- `agg.yaml` - Aggregated serving
- `agg_router.yaml` - Aggregated serving with KV routing
- `disagg.yaml` - Disaggregated serving
- `disagg_router.yaml` - Disaggregated serving with KV routing

#### Prerequisites

- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.

- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
```bash
./container/build.sh --framework tensorrtllm
# Tag and push to your container registry
# Update the image references in the YAML files
```

- **Port Forwarding**: After deployment, forward the frontend service to access the API:
```bash
kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
```

#### Deploy to Kubernetes

Example with disagg:
Export the NAMESPACE you used in your Dynamo Cloud Installation.

```bash
cd dynamo
cd components/backends/trtllm/deploy
kubectl apply -f disagg.yaml -n $NAMESPACE
```

To change `DYN_LOG` level, edit the yaml file by adding

```yaml
...
spec:
envs:
- name: DYN_LOG
value: "debug" # or other log levels
...
```
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md)

### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

### Benchmarking

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)


## Disaggregation Strategy