diff --git a/components/backends/sglang/deploy/README.md b/components/backends/sglang/deploy/README.md
new file mode 100644
index 0000000000..c41b6793ff
--- /dev/null
+++ b/components/backends/sglang/deploy/README.md
@@ -0,0 +1,162 @@
+# SGLang Kubernetes Deployment Configurations
+
+This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the **DynamoGraphDeployment** resource.
+
+## Available Deployment Patterns
+
+### 1. **Aggregated Deployment** (`agg.yaml`)
+Basic deployment pattern with a frontend and a single decode worker.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server
+- `SGLangDecodeWorker`: Single worker handling both prefill and decode
+
+### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
+Enhanced aggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
+- `SGLangDecodeWorker`: Single worker handling both prefill and decode
+
+### 3. **Disaggregated Deployment** (`disagg.yaml`)
+High-performance deployment with separated prefill and decode workers.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `SGLangDecodeWorker`: Specialized decode-only worker (`--disaggregation-mode decode`)
+- `SGLangPrefillWorker`: Specialized prefill-only worker (`--disaggregation-mode prefill`)
+- Communication via NIXL transfer backend (`--disaggregation-transfer-backend nixl`)
+
+## CRD Structure
+
+All templates use the **DynamoGraphDeployment** CRD:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: <deployment-name>
+spec:
+  services:
+    <ServiceName>:
+      # Service configuration
+```
+
+### Key Configuration Options
+
+**Resource Management:**
+```yaml
+resources:
+  requests:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+  limits:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+```
+
+**Container Configuration:**
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: my-registry/sglang-runtime:my-tag
+    workingDir: /workspace/components/backends/sglang
+    args:
+      - "python3"
+      - "-m"
+      - "dynamo.sglang.worker"
+      # Model-specific arguments
+```
+
+## Prerequisites
+
+Before using these templates, ensure you have:
+
+1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+2. **Kubernetes cluster with GPU support**
+3. **Container registry access** for SGLang runtime images
+4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
+
+## Usage
+
+### 1. Choose Your Template
+Select the deployment pattern that matches your requirements:
+- Use `agg.yaml` for development/testing
+- Use `agg_router.yaml` for production with load balancing
+- Use `disagg.yaml` for maximum performance
+
+### 2. Customize Configuration
+Edit the template to match your environment:
+
+```yaml
+# Update image registry and tag
+image: your-registry/sglang-runtime:your-tag
+
+# Configure your model
+args:
+  - "--model-path"
+  - "your-org/your-model"
+  - "--served-model-name"
+  - "your-org/your-model"
+```
+
+### 3. Deploy
+
+First, create a secret for the HuggingFace token:
+
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
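+
+You can confirm the secret exists before deploying (a quick sanity check; assumes `NAMESPACE` is still exported):
+
+```bash
+kubectl get secret hf-token-secret -n ${NAMESPACE}
+```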
+
+Then, apply the deployment file:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+kubectl apply -f $DEPLOYMENT_FILE -n ${NAMESPACE}
+```
+
+### 4. Using Custom Dynamo Frameworks Image for SGLang
+
+To use a custom Dynamo frameworks image for SGLang, you can update the deployment file using yq:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<your-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
+```
+
+## Model Configuration
+
+All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model, but you can pass any SGLang argument and configuration through the worker `args`.
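+
+For example, serving a different model is just a change to those arguments (a sketch mirroring the customize step above; any other SGLang flag can be appended the same way):
+
+```yaml
+args:
+  - "python3"
+  - "-m"
+  - "dynamo.sglang.worker"
+  - "--model-path"
+  - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
+  - "--served-model-name"
+  - "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
+```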
+
+## Monitoring and Health
+
+- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
+- **Liveness probes**: Check process health every 60s
+
+## Further Reading
+
+- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
+- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Pod fails to start**: Check image registry access and HuggingFace token secret
+2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
+3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
+4. **Out of memory**: Increase memory limits or reduce model batch size
+
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
index 452b8f1f6b..67d63bd8d5 100644
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -187,61 +187,18 @@ For comprehensive instructions on multinode serving, see the [multinode-examples
 
 ### Kubernetes Deployment
 
-For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
-
-- `agg.yaml` - Aggregated serving
-- `agg_router.yaml` - Aggregated serving with KV routing
-- `disagg.yaml` - Disaggregated serving
-- `disagg_router.yaml` - Disaggregated serving with KV routing
-
-#### Prerequisites
-
-- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
-
-- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
-  ```bash
-  ./container/build.sh --framework tensorrtllm
-  # Tag and push to your container registry
-  # Update the image references in the YAML files
-  ```
-
-- **Port Forwarding**: After deployment, forward the frontend service to access the API:
-  ```bash
-  kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
-  ```
-
-#### Deploy to Kubernetes
-
-Example with disagg:
-Export the NAMESPACE you used in your Dynamo Cloud Installation.
-
-```bash
-cd dynamo
-cd components/backends/trtllm/deploy
-kubectl apply -f disagg.yaml -n $NAMESPACE
-```
-
-To change `DYN_LOG` level, edit the yaml file by adding
-
-```yaml
-...
-spec:
-  envs:
-    - name: DYN_LOG
-      value: "debug" # or other log levels
-  ...
-```
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [TensorRT-LLM Kubernetes Deployment Guide](deploy/README.md).
 
 ### Client
 
 See [client](../llm/README.md#client) section to learn how to send request to the deployment.
 
-NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`.
+NOTE: To send a request to a multi-node deployment, target the node that is running `python3 -m dynamo.frontend`.
 
 ### Benchmarking
 
 To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
+`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
 
 ## Disaggregation Strategy
 
diff --git a/components/backends/trtllm/deploy/README.md b/components/backends/trtllm/deploy/README.md
index 1829d46c6a..9add8791da 100644
--- a/components/backends/trtllm/deploy/README.md
+++ b/components/backends/trtllm/deploy/README.md
@@ -1 +1,288 @@
-This folder contains deployment examples for the TRTLLM inference backend.
\ No newline at end of file
+# TensorRT-LLM Kubernetes Deployment Configurations
+
+This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying TensorRT-LLM inference graphs using the **DynamoGraphDeployment** resource.
+
+## Available Deployment Patterns
+
+### 1. **Aggregated Deployment** (`agg.yaml`)
+Basic deployment pattern with a frontend and a single worker.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
+- `TRTLLMWorker`: Single worker handling both prefill and decode
+
+### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
+Enhanced aggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server (with kv router mode enabled)
+- `TRTLLMWorker`: Multiple workers handling both prefill and decode (2 replicas for load balancing)
+
+### 3. **Disaggregated Deployment** (`disagg.yaml`)
+High-performance deployment with separated prefill and decode workers.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `TRTLLMDecodeWorker`: Specialized decode-only worker
+- `TRTLLMPrefillWorker`: Specialized prefill-only worker
+
+### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
+Advanced disaggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: HTTP API server (with kv router mode enabled)
+- `TRTLLMDecodeWorker`: Specialized decode-only worker
+- `TRTLLMPrefillWorker`: Specialized prefill-only worker (2 replicas for load balancing)
+
+## CRD Structure
+
+All templates use the **DynamoGraphDeployment** CRD:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: <deployment-name>
+spec:
+  services:
+    <ServiceName>:
+      # Service configuration
+```
+
+### Key Configuration Options
+
+**Resource Management:**
+```yaml
+resources:
+  requests:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+  limits:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+```
+
+**Container Configuration:**
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: nvcr.io/nvidian/nim-llm-dev/trtllm-runtime:dep-233.17
+    workingDir: /workspace/components/backends/trtllm
+    args:
+      - "python3"
+      - "-m"
+      - "dynamo.trtllm"
+      # Model-specific arguments
+```
+
+## Prerequisites
+
+Before using these templates, ensure you have:
+
+1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
+2. **Kubernetes cluster with GPU support**
+3. **Container registry access** for TensorRT-LLM runtime images
+4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
+
+### Container Images
+
+The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/trtllm-runtime`. If you don't have access, build and push your own image:
+
+```bash
+./container/build.sh --framework tensorrtllm
+# Tag and push to your container registry
+# Update the image references in the YAML files
+```
+
+**Note:** TensorRT-LLM uses git-lfs, which needs to be installed in advance:
+```bash
+apt-get update && apt-get -y install git git-lfs
+```
+
+For ARM machines, use:
+```bash
+./container/build.sh --framework tensorrtllm --platform linux/arm64
+```
+
+## Usage
+
+### 1. Choose Your Template
+Select the deployment pattern that matches your requirements:
+- Use `agg.yaml` for simple testing
+- Use `agg_router.yaml` for production with KV cache routing and load balancing
+- Use `disagg.yaml` for maximum performance with separated workers
+- Use `disagg_router.yaml` for high performance with KV cache routing and disaggregation
+
+### 2. Customize Configuration
+Edit the template to match your environment:
+
+```yaml
+# Update image registry and tag
+image: your-registry/trtllm-runtime:your-tag
+
+# Configure your model and deployment settings
+args:
+  - "python3"
+  - "-m"
+  - "dynamo.trtllm"
+  # Add your model-specific arguments
+```
+
+### 3. Deploy
+
+See the [Create Deployment Guide](../../../../docs/guides/dynamo_deploy/create_deployment.md) for a general walkthrough of applying these files.
+
+First, create a secret for the HuggingFace token:
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+Then, deploy the model using the deployment file.
+
+Export the NAMESPACE you used in your Dynamo Cloud installation.
+
+```bash
+cd dynamo/components/backends/trtllm/deploy
+export DEPLOYMENT_FILE=agg.yaml
+kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
+```
+
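+After applying, you can watch the graph come up before moving on (a quick check; `dynamographdeployments` assumes the CRD's conventional plural name):
+
+```bash
+kubectl get dynamographdeployments -n $NAMESPACE
+kubectl get pods -n $NAMESPACE
+```
+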
+### 4. Using Custom Dynamo Frameworks Image for TensorRT-LLM
+
+To use a custom Dynamo frameworks image for TensorRT-LLM, you can update the deployment file using yq:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<your-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
+```
+
+### 5. Port Forwarding
+
+After deployment, forward the frontend service to access the API:
+
+```bash
+kubectl port-forward deployment/trtllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
+```
+
+## Configuration Options
+
+### Environment Variables
+
+To change the `DYN_LOG` level, edit the yaml file by adding:
+
+```yaml
+...
+spec:
+  envs:
+    - name: DYN_LOG
+      value: "debug" # or other log levels
+  ...
+```
+
+### TensorRT-LLM Worker Configuration
+
+TensorRT-LLM workers are configured through command-line arguments in the deployment YAML. Key configuration areas include:
+
+- **Disaggregation Strategy**: Control request flow with the `DISAGGREGATION_STRATEGY` environment variable
+- **KV Cache Transfer**: Choose between UCX (default) or NIXL for disaggregated serving
+- **Request Migration**: Enable graceful failure handling with `--migration-limit`
+
+### Disaggregation Strategy
+
+The disaggregation strategy controls how requests are distributed between prefill and decode workers:
+
+- **`decode_first`** (default): Requests are routed to the decode worker first, then forwarded to the prefill worker
+- **`prefill_first`**: Requests are routed directly to the prefill worker (used with KV routing)
+
+Set via environment variable:
+```yaml
+envs:
+  - name: DISAGGREGATION_STRATEGY
+    value: "prefill_first"
+```
+
+## Testing the Deployment
+
+Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions.
+
+**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend`.
+
+## Model Configuration
+
+The deployment templates support various TensorRT-LLM models and configurations. You can customize model-specific arguments in the worker configuration sections of the YAML files.
+
+### Multi-Token Prediction (MTP) Support
+
+For models supporting Multi-Token Prediction (such as DeepSeek R1), special configuration is available. Note that MTP requires the experimental TensorRT-LLM commit:
+
+```bash
+./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
+```
+
+## Monitoring and Health
+
+- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
+- **Worker health endpoints**: `http://<worker-service>:9090/health`
+- **Liveness probes**: Check process health every 5 seconds
+- **Readiness probes**: Check service readiness with configurable delays
+
+## KV Cache Transfer Methods
+
+TensorRT-LLM supports two methods for KV cache transfer in disaggregated serving:
+
+- **UCX** (default): Standard method for KV cache transfer
+- **NIXL** (experimental): Alternative transfer method
+
+For detailed configuration instructions, see the [KV cache transfer guide](../kv-cache-tranfer.md).
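+
+As an illustrative sketch, switching a worker to NIXL is an environment toggle on the worker services; the variable name below is hypothetical and should be confirmed against that guide before use:
+
+```yaml
+envs:
+  - name: TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL  # hypothetical name; confirm in the KV cache transfer guide
+    value: "1"
+```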
+
+## Request Migration
+
+You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:
+
+```yaml
+args:
+  - "python3"
+  - "-m"
+  - "dynamo.trtllm"
+  - "--migration-limit"
+  - "3"
+```
+
+## Benchmarking
+
+To benchmark your deployment with GenAI-Perf, see this utility script: [perf.sh](../../../../benchmarks/llm/perf.sh)
+
+Configure the `model` name and `host` based on your deployment.
+
+## Further Reading
+
+- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
+- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
+- **Multinode Deployment**: [Multinode Examples](../multinode/multinode-examples.md)
+- **Speculative Decoding**: [Llama 4 + Eagle Guide](../llama4_plus_eagle.md)
+- **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Pod fails to start**: Check image registry access and HuggingFace token secret
+2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
+3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
+4. **Out of memory**: Increase memory limits or reduce model batch size
+5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
+6. **Git LFS issues**: Ensure git-lfs is installed before building containers
+7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines
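+
+If none of these match, a quick triage loop usually narrows things down (illustrative kubectl commands; substitute your pod names):
+
+```bash
+kubectl describe pod <pod-name> -n $NAMESPACE    # events: image pulls, scheduling, GPU allocation
+kubectl logs <pod-name> -n $NAMESPACE --previous # logs from the last crashed container
+```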
+
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md
index cd4de036a3..986fc32337 100644
--- a/components/backends/vllm/README.md
+++ b/components/backends/vllm/README.md
@@ -152,73 +152,7 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 
 ### Kubernetes Deployment
 
-For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
-
-- `agg.yaml` - Aggregated serving
-- `agg_router.yaml` - Aggregated serving with KV routing
-- `disagg.yaml` - Disaggregated serving
-- `disagg_router.yaml` - Disaggregated serving with KV routing
-- `disagg_planner.yaml` - Disaggregated serving with [SLA Planner](../../../docs/architecture/sla_planner.md). See [SLA Planner Deployment Guide](../../../docs/guides/dynamo_deploy/sla_planner_deployment.md) for more details.
-
-#### Prerequisites
-
-- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
-
-- **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
-  ```bash
-  ./container/build.sh --framework VLLM
-  # Tag and push to your container registry
-  # Update the image references in the YAML files
-  ```
-
-- **Pre-Deployment Profiling (if Using SLA Planner)**: Follow the [pre-deployment profiling guide](../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.
-
-- **Port Forwarding**: After deployment, forward the frontend service to access the API:
-  ```bash
-  kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8080:8000
-  ```
-
-#### Deploy to Kubernetes
-
-Example with disagg:
-Export the NAMESPACE you used in your Dynamo Cloud Installation.
-
-```bash
-cd dynamo
-cd components/backends/vllm/deploy
-kubectl apply -f disagg.yaml -n $NAMESPACE
-```
-
-To change `DYN_LOG` level, edit the yaml file by adding
-
-```yaml
-...
-spec:
-  envs:
-    - name: DYN_LOG
-      value: "debug" # or other log levels
-  ...
-```
-
-### Testing the Deployment
-
-Send a test request to verify your deployment:
-
-```bash
-curl localhost:8080/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [
-    {
-        "role": "user",
-        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
-    }
-    ],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
+For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [vLLM Kubernetes Deployment Guide](deploy/README.md).
 
 ## Configuration
 
diff --git a/components/backends/vllm/deploy/README.md b/components/backends/vllm/deploy/README.md
index 5d7b0e2db5..cb3d442836 100644
--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -1 +1,255 @@
-This folder contains examples for the VLLM inference backend.
\ No newline at end of file
+# vLLM Kubernetes Deployment Configurations
+
+This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying vLLM inference graphs using the **DynamoGraphDeployment** resource.
+
+## Available Deployment Patterns
+
+### 1. **Aggregated Deployment** (`agg.yaml`)
+Basic deployment pattern with a frontend and a single decode worker.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server (with kv router mode disabled)
+- `VLLMDecodeWorker`: Single worker handling both prefill and decode
+
+### 2. **Aggregated Router Deployment** (`agg_router.yaml`)
+Enhanced aggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: OpenAI-compatible API server (with kv router mode enabled)
+- `VLLMDecodeWorker`: Single worker handling both prefill and decode
+
+### 3. **Disaggregated Deployment** (`disagg.yaml`)
+High-performance deployment with separated prefill and decode workers.
+
+**Architecture:**
+- `Frontend`: HTTP API server coordinating between workers
+- `VLLMDecodeWorker`: Specialized decode-only worker
+- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
+- Communication via NIXL transfer backend
+
+### 4. **Disaggregated Router Deployment** (`disagg_router.yaml`)
+Advanced disaggregated deployment with KV cache routing capabilities.
+
+**Architecture:**
+- `Frontend`: HTTP API server with KV-aware routing
+- `VLLMDecodeWorker`: Specialized decode-only worker
+- `VLLMPrefillWorker`: Specialized prefill-only worker (`--is-prefill-worker`)
+
+## CRD Structure
+
+All templates use the **DynamoGraphDeployment** CRD:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: <deployment-name>
+spec:
+  services:
+    <ServiceName>:
+      # Service configuration
+```
+
+### Key Configuration Options
+
+**Resource Management:**
+```yaml
+resources:
+  requests:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+  limits:
+    cpu: "10"
+    memory: "20Gi"
+    gpu: "1"
+```
+
+**Container Configuration:**
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: my-registry/vllm-runtime:my-tag
+    workingDir: /workspace/components/backends/vllm
+    args:
+      - "python3"
+      - "-m"
+      - "dynamo.vllm"
+      # Model-specific arguments
+```
+
+## Prerequisites
+
+Before using these templates, ensure you have:
+
+1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
+2. **Kubernetes cluster with GPU support**
+3. **Container registry access** for vLLM runtime images
+4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
+
+### Container Images
+
+We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
+
+```bash
+./container/build.sh --framework VLLM
+# Tag and push to your container registry
+# Update the image references in the YAML files
+```
+
+### Pre-Deployment Profiling (SLA Planner Only)
+
+If using the SLA Planner deployment (`disagg_planner.yaml`), follow the [pre-deployment profiling guide](../../../../docs/architecture/pre_deployment_profiling.md) to run pre-deployment profiling. The results will be saved to the `profiling-pvc` PVC and queried by the SLA Planner.
+
+## Usage
+
+### 1. Choose Your Template
+Select the deployment pattern that matches your requirements:
+- Use `agg.yaml` for simple testing
+- Use `agg_router.yaml` for production with load balancing
+- Use `disagg.yaml` for maximum performance
+- Use `disagg_router.yaml` for high performance with KV cache routing
+- Use `disagg_planner.yaml` for SLA-optimized performance
+
+### 2. Customize Configuration
+Edit the template to match your environment:
+
+```yaml
+# Update image registry and tag
+image: your-registry/vllm-runtime:your-tag
+
+# Configure your model
+args:
+  - "--model"
+  - "your-org/your-model"
+```
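+
+If the model you configure needs more than one GPU, scale the worker's `resources` together with the matching vLLM flag (a sketch; `--tensor-parallel-size` is a standard vLLM argument, but check your template's arg style):
+
+```yaml
+resources:
+  limits:
+    gpu: "2"
+args:
+  - "--model"
+  - "your-org/your-model"
+  - "--tensor-parallel-size"
+  - "2"
+```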
+
+### 3. Deploy
+
+First, create a secret for the HuggingFace token:
+
+```bash
+export HF_TOKEN=your_hf_token
+kubectl create secret generic hf-token-secret \
+  --from-literal=HF_TOKEN=${HF_TOKEN} \
+  -n ${NAMESPACE}
+```
+
+Then, deploy the model using the deployment file.
+
+Export the NAMESPACE you used in your Dynamo Cloud installation.
+
+```bash
+cd <dynamo-repo-root>/components/backends/vllm/deploy
+export DEPLOYMENT_FILE=agg.yaml
+
+kubectl apply -f $DEPLOYMENT_FILE -n $NAMESPACE
+```
+
+### 4. Using Custom Dynamo Frameworks Image for vLLM
+
+To use a custom Dynamo frameworks image for vLLM, you can update the deployment file using yq:
+
+```bash
+export DEPLOYMENT_FILE=agg.yaml
+export FRAMEWORK_RUNTIME_IMAGE=<your-image>
+
+yq '.spec.services.[].extraPodSpec.mainContainer.image = env(FRAMEWORK_RUNTIME_IMAGE)' $DEPLOYMENT_FILE > $DEPLOYMENT_FILE.generated
+kubectl apply -f $DEPLOYMENT_FILE.generated -n $NAMESPACE
+```
+
+### 5. Port Forwarding
+
+After deployment, forward the frontend service to access the API:
+
+```bash
+kubectl port-forward deployment/vllm-v1-disagg-frontend-<pod-uuid-info> 8000:8000
+```
+
+## Configuration Options
+
+### Environment Variables
+
+To change the `DYN_LOG` level, edit the yaml file by adding:
+
+```yaml
+...
+spec:
+  envs:
+    - name: DYN_LOG
+      value: "debug" # or other log levels
+  ...
+```
+
+### vLLM Worker Configuration
+
+vLLM workers are configured through command-line arguments. Key parameters include:
+
+- `--endpoint`: Dynamo endpoint in format `dyn://namespace.component.endpoint`
+- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
+- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
+- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
+
+See the [vLLM CLI documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the full list of configuration options.
+
+## Testing the Deployment
+
+Send a test request to verify your deployment (with the port-forward from step 5 active):
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [
+    {
+        "role": "user",
+        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
+    }
+    ],
+    "stream": false,
+    "max_tokens": 30
+  }'
+```
+
+## Model Configuration
+
+All templates use **Qwen/Qwen3-0.6B** as the default model, but you can use any vLLM-supported LLM model and configuration arguments.
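+
+For instance, a decode worker for the default model could be launched with the following arguments (a minimal sketch assembled from the flags documented above):
+
+```yaml
+args:
+  - "python3"
+  - "-m"
+  - "dynamo.vllm"
+  - "--model"
+  - "Qwen/Qwen3-0.6B"
+```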
+
+## Monitoring and Health
+
+- **Frontend health endpoint**: `http://<frontend-service>:8000/health`
+- **Liveness probes**: Check process health regularly
+- **KV metrics**: Published via metrics endpoint port
+
+## Request Migration
+
+You can enable [request migration](../../../../docs/architecture/request_migration.md) to handle worker failures gracefully by adding the migration limit argument to worker configurations:
+
+```yaml
+args:
+  - "--migration-limit"
+  - "3"
+```
+
+## Further Reading
+
+- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
+- **SLA Planner**: [SLA Planner Deployment Guide](../../../../docs/guides/dynamo_deploy/sla_planner_deployment.md)
+- **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
+- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Pod fails to start**: Check image registry access and HuggingFace token secret
+2. **GPU not allocated**: Verify cluster has GPU nodes and proper resource limits
+3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
+4. **Out of memory**: Increase memory limits or reduce model batch size
+5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
+
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
diff --git a/deploy/inference-gateway/README.md b/deploy/inference-gateway/README.md
index 0476e978bc..6f985e3f15 100644
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -69,7 +69,17 @@ kubectl get gateway inference-gateway -n my-model
 # inference-gateway kgateway x.x.x.x True 1m
 ```
 
-3. **Install dynamo model and dynamo gaie helm chart**
+3. **Deploy model**
+
+Follow the steps in [model deployment](../../components/backends/vllm/deploy/README.md) to deploy the `Qwen/Qwen3-0.6B` model in aggregated mode using [agg.yaml](../../components/backends/vllm/deploy/agg.yaml) in the `my-model` Kubernetes namespace.
+
+Sample commands to deploy the model:
+```bash
+cd <dynamo-repo-root>/components/backends/vllm/deploy
+kubectl apply -f agg.yaml -n my-model
+```
+
+4. **Install Dynamo GAIE helm chart**
 
 The Inference Gateway is configured through the `inference-gateway-resources.yaml` file.
 
diff --git a/examples/deployments/EKS/Deploy_VLLM_example.md b/examples/deployments/EKS/Deploy_VLLM_example.md
index dd4f027da8..b395781ed5 100644
--- a/examples/deployments/EKS/Deploy_VLLM_example.md
+++ b/examples/deployments/EKS/Deploy_VLLM_example.md
@@ -25,8 +25,8 @@ dynamo-cloud vllm-agg-router-vllmdecodeworker-787d575485-zkwdd
 
 Test the Deployment
 
 ```
-kubectl port-forward deployment/vllm-agg-router-frontend 8080:8000 -n dynamo-cloud
-curl localhost:8080/v1/chat/completions \
+kubectl port-forward deployment/vllm-agg-router-frontend 8000:8000 -n dynamo-cloud
+curl localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
    "model": "Qwen/Qwen3-0.6B",