doc: add instruction to deploy model with inference gateway

ai-dynamo · dmitry-tokarev-nv · Aug 5, 2025 · Aug 3, 2025 · Aug 4, 2025 · Aug 4, 2025
commit ee53c2a3aa5879333e5f4d797e43a1491e969ddc
diff --git a/components/backends/sglang/deploy/README.md b/components/backends/sglang/deploy/README.md
@@ -0,0 +1,162 @@
+# SGLang Kubernetes Deployment Configurations
+
+This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource.
+
+## Available Deployment Patterns
+
+### 1. **Aggregated Deployment** (`agg.yaml`)
+Basic deployment pattern with frontend and a single decode worker.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server
+- `SGLangDecodeWorker`: Single worker handling both prefill and decode
+
+### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
+Enhanced aggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
+- `SGLangDecodeWorker`: Single worker handling both prefill and decode
+
+### 3. **Disaggregated Deployment** (`disagg.yaml`)
+High-performance deployment with separated prefill and decode workers.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
+- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
+- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)
+
+## CRD Structure
+
+All templates use the **DynamoGraphDeployment** CRD:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: <deployment-name>
+spec:
+  services:
+    <ServiceName>:
+      # Service configuration
+```
+
+### Key Configuration Options
+
+**Resource Management:**
+```yaml
+resources:
+  requests:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+  limits:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+```
+
+**Container Configuration:**
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: my-registry/sglang-runtime:my-tag
+    workingDir: /workspace/components/backends/sglang
+    args:
+      - "python3"
+      - "-m"
+      - "dynamo.sglang.worker"
+      # Model-specific arguments
+```
+
+## Prerequisites
+
+Before using these templates, ensure you have:
+
+1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+2. **Kubernetes cluster with GPU support**
+3. **Container registry access** for SGLang runtime images
+4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
+
+## Usage
+
+### 1. Choose Your Template
+Select the deployment pattern that matches your requirements:
+- Use `agg.yaml` for development/testing
+- Use `agg_router.yaml` for production with load balancing
+- Use `disagg.yaml` for maximum performance
+
+### 2. Customize Configuration
+Edit the template to match your environment:
+
+```yaml
+# Update image registry and tag
+image: your-registry/sglang-runtime:your-tag
+
+# Configure your model
+args:
+  - "--model-path"
+  - "your-org/your-model"
+  - "--served-model-name"
+  - "your-org/your-model"
+```
+
+### 3. Deploy
+
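+Deploying takes two steps: create the HuggingFace token secret, then apply the deployment file. Both commands assume `NAMESPACE` is exported as the namespace you used in your Dynamo Cloud installation.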
+
+First, create a secret for the HuggingFace token.
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+Then, deploy the model using the deployment file.
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
+```
+
+### 4. Using Custom Dynamo Frameworks Image for SGLang
+
+To use a custom Dynamo framework image for SGLang, update the deployment file with `yq`. The command below sets the image on every service and writes the result to `$DEPLOYMENT_FILE.generated`:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<sglang-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
+```
+
+## Model Configuration
+
+All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument or configuration through the worker `args`.
+
+## Monitoring and Health
+
+- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
+- **Liveness probes**: Check process health every 60s
+
+## Further Reading
+
+- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
+- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Pod fails to start**: Check image registry access and HuggingFace token secret
+2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
+3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
+4. **Out of memory**: Increase memory limits or reduce model batch size
+
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
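
A sketch for the SGLang deploy steps above: both `kubectl` commands expand `NAMESPACE`, and the secret command also expands `HF_TOKEN`, but nothing in the README checks that either is set. A guard along these lines could run first. `require_env` is a hypothetical helper, not something shipped in the Dynamo repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Dynamo repo): report every variable
# in the argument list that is unset or empty, and return non-zero if any is.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} expands the variable whose name is stored in $name,
    # falling back to an empty string when it is unset.
    if [ -z "${!name:-}" ]; then
      echo "error: ${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env NAMESPACE HF_TOKEN && kubectl apply -f "$DEPLOYMENT_FILE" -n "$NAMESPACE"` stops before contacting the cluster when a variable is missing.
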
diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
@@ -212,15 +212,41 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
 
 #### Deploy to Kubernetes
 
-Example with disagg:
+See the [Create Deployment Guide](../../../docs/guides/dynamo_deploy/create_deployment.md) for details on applying a deployment file.
+
+First, create a secret for the HuggingFace token.
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+Then, deploy the model using the deployment file.
+
 Export the NAMESPACE  you used in your Dynamo Cloud Installation.
 
 ```bash
 cd dynamo
 cd components/backends/trtllm/deploy
-kubectl apply -f disagg.yaml -n $NAMESPACE
+export DEPLOYMENT_FILE=agg.yaml
+kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
+```
+
+#### Using Custom Dynamo Frameworks Image for TensorRT-LLM
+
+To use a custom Dynamo framework image for TensorRT-LLM, update the deployment file with `yq`. The command below sets the image on every service and writes the result to `$DEPLOYMENT_FILE.generated`:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<trtllm-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
 ```
 
+#### Configuration Options
+
 To change `DYN_LOG` level, edit the yaml file by adding
 
 ```yaml

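The `yq` step in the TensorRT-LLM instructions above rewrites the image of every service and saves the result as `$DEPLOYMENT_FILE.generated`. Before applying that file, a quick text-level check can confirm the substitution took effect. This is a sketch under assumptions: `images_match` is a hypothetical helper, and it relies on each image appearing on its own `image:` line rather than parsing the YAML.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the manifest has at least one
# "image:" line and every such line contains the expected image string.
images_match() {
  local file=$1 expected=$2
  # No image lines at all counts as a failure.
  grep -qE '^[[:space:]]*image:' "$file" || return 1
  # Fail if any image line does NOT contain the expected string.
  ! grep -E '^[[:space:]]*image:' "$file" | grep -vqF -- "$expected"
}
```

A possible use is `images_match "$DEPLOYMENT_FILE.generated" "$FRAMEWORK_RUNTIME_IMAGE" && kubectl apply -f "$DEPLOYMENT_FILE.generated" -n "$NAMESPACE"`, so that a manifest still pointing at another image is never applied.
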
diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md
@@ -180,15 +180,41 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
 
 #### Deploy to Kubernetes
 
-Example with disagg:
+Deploying takes two steps: create the HuggingFace token secret, then apply the deployment file.
+
+First, create a secret for the HuggingFace token.
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+Then, deploy the model using the deployment file.
+
 Export the NAMESPACE  you used in your Dynamo Cloud Installation.
 
 ```bash
-cd dynamo
-cd components/backends/vllm/deploy
-kubectl apply -f disagg.yaml -n $NAMESPACE
+cd <dynamo-source-root>/components/backends/vllm/deploy
+export DEPLOYMENT_FILE=agg.yaml
+
+kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
 ```
 
+#### Using Custom Dynamo Frameworks Image for vLLM
+
+To use a custom Dynamo framework image for vLLM, update the deployment file with `yq`. The command below sets the image on every service and writes the result to `$DEPLOYMENT_FILE.generated`:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<vllm-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE  > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
+```
+
+#### Configuration Options
+
 To change `DYN_LOG` level, edit the yaml file by adding
 
 ```yaml

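The `kubectl create secret` command in the vLLM instructions above fails with an `AlreadyExists` error when `hf-token-secret` is already present, for example on a second run. One common workaround is to render the secret locally with `--dry-run=client -o yaml` and pipe it to `kubectl apply`. A minimal sketch, assuming `HF_TOKEN` is exported; the function name `apply_hf_secret` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create hf-token-secret in the given namespace, or
# update it if it already exists.
# The first kubectl call only renders the Secret manifest locally
# (--dry-run=client); the second applies it to the cluster.
apply_hf_secret() {
  local namespace=$1
  kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN="${HF_TOKEN}" \
    -n "${namespace}" \
    --dry-run=client -o yaml |
    kubectl apply -n "${namespace}" -f -
}
```

Calling `apply_hf_secret "$NAMESPACE"` can then replace the plain `kubectl create secret` step when the instructions are run more than once.
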
@@ -69,7 +69,17 @@ kubectl get gateway inference-gateway -n my-model
 # inference-gateway   kgateway   x.x.x.x   True         1m
 ```
 
-3. **Install dynamo model and dynamo gaie helm chart**
+3. **Deploy model**
+
+Follow the steps in [model deployment](../../components/backends/vllm/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregated mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.
+
+Sample commands to deploy the model:
+```bash
+cd <dynamo-source-root>/components/backends/vllm/deploy
+kubectl apply -f agg.yaml -n my-model
+```
+
+4. **Install dynamo gaie helm chart**
 
 The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.