From fe860802f8374c25dcff6545592503fec729380f Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Mon, 15 Sep 2025 20:34:05 +0000 Subject: [PATCH 01/10] feat: add LLM Router deployment examples and configuration for NVIDIA Dynamo integration - Updated README.md to include a new section for LLM Router deployment with NVIDIA Dynamo. - Added new YAML files for aggregated and disaggregated worker configurations (agg.yaml, disagg.yaml). - Introduced frontend.yaml for shared API frontend service. - Created router-config-dynamo.yaml for routing policies and model configurations. - Added llm-router-values-override.yaml for Helm values specific to LLM Router integration. - Included comprehensive documentation on deployment steps and routing strategies. Signed-off-by: arunraman Signed-off-by: arunraman --- examples/README.md | 1 + examples/deployments/LLM Router/README.md | 1206 +++++++++++++++++ examples/deployments/LLM Router/agg.yaml | 26 + examples/deployments/LLM Router/disagg.yaml | 43 + examples/deployments/LLM Router/frontend.yaml | 16 + .../helm-enhancement-implementation.yaml | 189 +++ .../llm-router-values-override.yaml | 110 ++ .../LLM Router/router-config-dynamo.yaml | 139 ++ 8 files changed, 1730 insertions(+) create mode 100644 examples/deployments/LLM Router/README.md create mode 100644 examples/deployments/LLM Router/agg.yaml create mode 100644 examples/deployments/LLM Router/disagg.yaml create mode 100644 examples/deployments/LLM Router/frontend.yaml create mode 100644 examples/deployments/LLM Router/helm-enhancement-implementation.yaml create mode 100644 examples/deployments/LLM Router/llm-router-values-override.yaml create mode 100644 examples/deployments/LLM Router/router-config-dynamo.yaml diff --git a/examples/README.md b/examples/README.md index 2571ccbd8e..33b1cd797b 100644 --- a/examples/README.md +++ b/examples/README.md @@ -36,6 +36,7 @@ Platform-specific deployment guides for production environments: - **[Amazon EKS](deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service - **[Azure AKS](deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service +- **[LLM Router](deployments/LLM%20Router/)** - Intelligent LLM request routing with NVIDIA Dynamo integration - **[Router Standalone](deployments/router_standalone/)** - Standalone router deployment patterns - **Amazon ECS** - _Coming soon_ - **Google GKE** - _Coming soon_ diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md new file mode 100644 index 0000000000..05023f6dc2 --- /dev/null +++ b/examples/deployments/LLM Router/README.md @@ -0,0 +1,1206 @@ +# LLM Router with NVIDIA Dynamo Cloud Platform +## Kubernetes Deployment Guide + +
+ +[![NVIDIA](https://img.shields.io/badge/NVIDIA-76B900?style=for-the-badge&logo=nvidia&logoColor=white)](https://nvidia.com) +[![Kubernetes](https://img.shields.io/badge/kubernetes-%23326ce5.svg?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io) +[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com) +[![Helm](https://img.shields.io/badge/Helm-0F1689?style=for-the-badge&logo=Helm&labelColor=0F1689)](https://helm.sh) + +**Intelligent LLM Request Routing with Distributed Inference Serving** + +
+ +--- + +This comprehensive guide provides step-by-step instructions for deploying the [**NVIDIA LLM Router**](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [**NVIDIA Dynamo Cloud Platform**](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. + +## NVIDIA LLM Router and Dynamo Integration + +### Overview + +This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: + +### NVIDIA Dynamo +- **Distributed inference serving framework** +- **Disaggregated serving capabilities** +- **Multi-model deployment support** +- **Kubernetes-native scaling** + +### NVIDIA LLM Router +- **Intelligent request routing** +- **Task classification (12 categories)** +- **Complexity analysis (7 categories)** +- **Rust-based performance** + +> **Result**: A complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both **performance** and **cost efficiency**. + +### Kubernetes Architecture Overview + +
+ +```mermaid +graph TB + subgraph "Kubernetes Cluster" + subgraph "Ingress Layer" + LB[Load Balancer/Ingress] + end + + subgraph "LLM Router (Helm)" + RC[Router Controller] + RS[Router Server + GPU] + end + + subgraph "Dynamo Platform - Shared Frontend Architecture" + FE[Shared Frontend Service] + PR[Processor] + + subgraph "Model 1 Workers" + VW1[VllmDecodeWorker-8B + GPU] + PW1[VllmPrefillWorker-8B + GPU] + end + + subgraph "Model 2 Workers" + VW2[VllmDecodeWorker-70B + GPU] + PW2[VllmPrefillWorker-70B + GPU] + end + + subgraph "Model 3 Workers" + VW3[VllmDecodeWorker-Mixtral + GPU] + PW3[VllmPrefillWorker-Mixtral + GPU] + end + end + end + + LB --> RC + RC --> RS + RS --> FE + FE --> PR + PR --> VW1 + PR --> VW2 + PR --> VW3 + PR --> PW1 + PR --> PW2 + PR --> PW3 + + style LB fill:#e1f5fe + style RC fill:#f3e5f5 + style RS fill:#f3e5f5 + style FE fill:#e8f5e8 + style PR fill:#e8f5e8 + style VW1 fill:#fff3e0 + style VW2 fill:#fff3e0 + style VW3 fill:#fff3e0 + style PW1 fill:#ffecb3 + style PW2 fill:#ffecb3 + style PW3 fill:#ffecb3 +``` + +
+ +### Key Benefits + +
+ +| **Feature** | **Benefit** | **Impact** | +|:---:|:---:|:---:| +| **Intelligent Routing** | Auto-routes by task/complexity | **Optimal Model Selection** | +| **Cost Optimization** | Small models for simple tasks | **Reduced Infrastructure Costs** | +| **High Performance** | Rust-based minimal latency | **Sub-millisecond Routing** | +| **Scalability** | Disaggregated multi-model serving | **Enterprise-Grade Throughput** | +| **OpenAI Compatible** | Drop-in API replacement | **Zero Code Changes** | + +
+ +### Integration Components + +
+1. NVIDIA Dynamo Cloud Platform + +- **Purpose**: Distributed LLM inference serving +- **Features**: Disaggregated serving, KV cache management, multi-model support +- **Deployment**: Kubernetes-native with custom resources +- **Models Supported**: Multiple LLMs (Llama, Mixtral, Phi, Nemotron, etc.) + +
+ +
+2. NVIDIA LLM Router + +- **Purpose**: Intelligent request routing and model selection +- **Features**: OpenAI API compliant, flexible policy system, configurable backends +- **Architecture**: Rust-based controller + Triton inference server +- **Routing Policies**: Task classification (12 categories), complexity analysis (7 categories) +- **Customization**: Fine-tune models for domain-specific routing (e.g., banking intent classification) + +
+ +
+3. Integration Configuration + +- **Router Policies**: Define routing rules for different task types +- **Model Mapping**: Map router decisions to Dynamo-served models +- **Service Discovery**: Kubernetes-native service communication +- **Security**: API key management via Kubernetes secrets + +
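
In practice, each mapping is a single entry in a routing policy: a router decision name points at the shared Dynamo frontend plus the model it should serve. An abridged excerpt in the shape used by `router-config-dynamo.yaml` later in this guide (the `${DYNAMO_API_BASE}` and `${DYNAMO_API_KEY}` values are resolved at deployment time):

```yaml
policies:
  - name: task_router
    llms:
      - name: "Code Generation"           # router decision
        api_base: ${DYNAMO_API_BASE}/v1   # shared Dynamo frontend endpoint
        api_key: ${DYNAMO_API_KEY}        # injected from a Kubernetes secret
        model: meta-llama/Llama-3.1-70B-Instruct
```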
+ +### Routing Strategies + +
+ +#### Task-Based Routing +*Routes requests based on the type of task being performed* + +
+ +
+View Task Routing Table + +| **Task Type** | **Target Model** | **Use Case** | +|:---|:---|:---| +| Code Generation | `llama-3.1-70b-instruct` | Programming tasks | +| Brainstorming | `llama-3.1-70b-instruct` | Creative ideation | +| Chatbot | `mixtral-8x22b-instruct-v0.1` | Conversational AI | +| Summarization | `llama-3.1-8b-instruct` | Text summarization | +| Open QA | `llama-3.1-70b-instruct` | Complex questions | +| Closed QA | `llama-3.1-8b-instruct` | Simple Q&A | +| Classification | `llama-3.1-8b-instruct` | Text classification | +| Extraction | `llama-3.1-8b-instruct` | Information extraction | +| Rewrite | `llama-3.1-8b-instruct` | Text rewriting | +| Text Generation | `mixtral-8x22b-instruct-v0.1` | General text generation | +| Other | `mixtral-8x22b-instruct-v0.1` | Miscellaneous tasks | +| Unknown | `llama-3.1-8b-instruct` | Unclassified tasks | + +
+ +--- + +
+ +#### Complexity-Based Routing +*Routes requests based on the complexity of the task* + +
+ +
+View Complexity Routing Table + +| **Complexity Level** | **Target Model** | **Use Case** | +|:---|:---|:---| +| Creativity | `llama-3.1-70b-instruct` | Creative tasks | +| Reasoning | `llama-3.1-70b-instruct` | Complex reasoning | +| Contextual-Knowledge | `llama-3.1-8b-instruct` | Context-dependent tasks | +| Few-Shot | `llama-3.1-70b-instruct` | Tasks with examples | +| Domain-Knowledge | `mixtral-8x22b-instruct-v0.1` | Specialized knowledge | +| No-Label-Reason | `llama-3.1-8b-instruct` | Unclassified complexity | +| Constraint | `llama-3.1-8b-instruct` | Tasks with constraints | + +
+ +### Performance Benefits + +
+ +| **Metric** | **Improvement** | **How It Works** | +|:---:|:---:|:---| +| **Latency** | `↓ 40-60%` | Smaller models for simple tasks | +| **Cost** | `↓ 30-50%` | Large models only when needed | +| **Throughput** | `↑ 2-3x` | Better resource utilization | +| **Scalability** | `↑ 10x` | Independent component scaling | + +
+ +### API Usage Examples + +
+ +#### Task-Based Routing + +

```bash
# Code generation task → routes to llama-3.1-70b-instruct (per the task routing table above)
curl -X POST http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    "max_tokens": 512,
    "nim-llm-router": {
      "policy": "task_router",
      "routing_strategy": "triton",
      "model": ""
    }
  }'
```
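
To confirm where a request actually landed, inspect the response: Dynamo's OpenAI-compatible frontend normally echoes the serving model back in the response's `model` field. A minimal check, assuming `jq` is installed:

```bash
# Send the same routed request and print only the backend model that served it
curl -s -X POST http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    "max_tokens": 512,
    "nim-llm-router": {"policy": "task_router", "routing_strategy": "triton", "model": ""}
  }' | jq -r '.model'
```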
+ +#### Complexity-Based Routing + +
+ +```bash +# Complex reasoning task → Routes to llama-3.3-nemotron-super-49b-v1 +curl -X POST http://llm-router.local/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "Explain quantum entanglement"}], + "max_tokens": 512, + "nim-llm-router": { + "policy": "complexity_router", + "routing_strategy": "triton", + "model": "" + } + }' +``` + +### How Dynamo Model Routing Works + +The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: + +1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000/v1` +2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests +3. **OpenAI Compatibility**: Standard OpenAI API format with model selection + +Example request: +```json +{ + "model": "llama-3.1-70b-instruct", // Dynamo routes based on this + "messages": [...], + "temperature": 0.7 +} +``` + +Dynamo's internal architecture handles: +- Model registry and discovery +- Request parsing and routing +- Load balancing across replicas +- KV cache management +- Disaggregated serving coordination + +## Kubernetes Integration Deployment + +This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference on Kubernetes and route requests intelligently using the NVIDIA LLM Router. The Kubernetes deployment includes: + +1. **NVIDIA Dynamo Cloud Platform**: Distributed inference serving with Kubernetes operators and custom resources +2. **LLM Router**: Helm-deployed intelligent request routing with GPU-accelerated routing models +3. **Multiple LLM Models**: Containerized models deployed via DynamoGraphDeployment CRs + + + +### Key Components + +#### Shared Frontend Architecture + +The deployment now uses a **shared frontend architecture** that splits the original `agg.yaml` into separate components for better resource utilization and model sharing: + +- **frontend.yaml**: Shared OpenAI-compatible API frontend service + - Single frontend instance serves all models + - Handles request routing and load balancing + - Reduces resource overhead compared to per-model frontends + - Uses official NGC Dynamo vLLM Runtime container from `DYNAMO_IMAGE` variable + +- **agg.yaml / disagg.yaml**: Templates for model-specific workers + - **agg.yaml**: Aggregated worker configuration with VllmDecodeWorker (1 GPU per model) + - **disagg.yaml**: Disaggregated worker configuration with separate VllmDecodeWorker and VllmPrefillWorker (1 GPU each) + - Common: Shared configuration (model, block-size, KV connector) + - Deployed per model with unique names using environment variables + +#### Configuration Files + +- **router-config-dynamo.yaml**: Router policies for Dynamo integration (uses `${DYNAMO_API_BASE}` variable) +- **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration (defines `dynamo.api_base` variable) + +### Shared Frontend Benefits + +
+ +| **Benefit** | **Shared Frontend** | **Per-Model Frontend** | **Improvement** | +|:---:|:---:|:---:|:---:| +| **Resource Usage** | 1 Frontend + N Workers | N Frontends + N Workers | **↓ 30-50% CPU/Memory** | +| **Network Complexity** | Single Endpoint | Multiple Endpoints | **Simplified Routing** | +| **Maintenance** | Single Service | Multiple Services | **↓ 70% Ops Overhead** | +| **Load Balancing** | Built-in across models | Per-model only | **Better Utilization** | +| **API Consistency** | Single OpenAI API | Multiple APIs | **Unified Interface** | + +
+ +**Key Advantages:** +- **Resource Efficiency**: Single frontend serves all models, reducing CPU and memory overhead +- **Simplified Operations**: One service to monitor, scale, and maintain instead of multiple frontends +- **Better Load Distribution**: Intelligent request routing across all available model workers +- **Cost Optimization**: Fewer running services means lower infrastructure costs +- **Unified API Gateway**: Single endpoint for all models with consistent OpenAI API interface + +### Disaggregated Serving Configuration + +The deployment uses the official disaggregated serving architecture based on [Dynamo's vLLM backend deployment reference](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/vllm/deploy): + +**Key Features**: +- **Multi-Model Support**: Deploy multiple models (Llama-3.1-8B, Llama-3.1-70B, Mixtral-8x22B) using environment variables +- **KV Transfer**: Uses `DynamoNixlConnector` for high-performance KV cache transfer +- **Conditional Disaggregation**: Automatically switches between prefill and decode workers +- **Remote Prefill**: Offloads prefill operations to dedicated VllmPrefillWorker instances +- **Prefix Caching**: Enables intelligent caching for improved performance +- **Block Size**: 64 tokens for optimal memory utilization +- **Max Model Length**: 16,384+ tokens context window (varies by model) +- **Shared Frontend**: Single frontend serves all deployed models +- **Intelligent Routing**: LLM Router selects optimal model based on task complexity + + + +### Environment Variables + +Set the required environment variables for deployment: + +| Variable | Description | Example | Required | Used In | +|----------|-------------|---------|----------|---------| +| `NAMESPACE` | Kubernetes namespace for deployment | `dynamo-kubernetes` | Yes | All deployments | +| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | Platform install | +| `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | Model deployment | +| `MODEL_SUFFIX` | Kubernetes deployment name suffix | `llama-8b` | Yes | Model deployment | +| `DYNAMO_IMAGE` | Full Dynamo runtime image path | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` | Yes | Model deployment | +| `HF_TOKEN` | Hugging Face access token | `your_hf_token` | Yes | Model access | +| `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | No | Private images | +| `DYNAMO_API_BASE` | Dynamo service endpoint URL | `http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000` | Yes | LLM Router | +| `DYNAMO_API_KEY` | Dynamo API authentication key | `your-dynamo-api-key-here` | No | LLM Router auth | + +### Model Size Recommendations + +For optimal deployment experience, consider model size vs. 
resources: + +| Model Size | GPU Memory | Download Time | Recommended For | +|------------|------------|---------------|-----------------| +| **Small (1-2B)** | ~3-4GB | 2-5 minutes | Development, testing | +| **Medium (7-8B)** | ~8-12GB | 10-20 minutes | Production, single GPU | +| **Large (70B+)** | ~40GB+ | 30+ minutes | Multi-GPU setups | + +**Recommended Models:** +- `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance, used in router config (15GB) +- `meta-llama/Llama-3.1-70B-Instruct` - High performance, used in router config (40GB+) +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative tasks, used in router config (90GB+) +- `Qwen/Qwen2.5-1.5B-Instruct` - Fast testing model (3GB) +- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast testing (2GB) + +> **💡 Health Check Configuration**: The `frontend.yaml` and `disagg.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends. + +**NGC Setup Instructions**: +1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions +2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.1` (or latest available) +3. **Optional - NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) if you need private image access +4. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds + +**Available NGC Dynamo Images**: +- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` (recommended) +- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` +- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` +- **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest` +- **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest` + +### Configuration Variables + +The deployment uses a configurable `api_base` variable for flexible endpoint management: + +| Variable | File | Description | Default Value | +|----------|------|-------------|---------------| +| `dynamo.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000` | +| `${DYNAMO_API_BASE}` | `router-config-dynamo.yaml` | Template variable substituted during deployment | Derived from `dynamo.api_base` | + +This approach allows you to: +- **Switch environments** by changing only the `dynamo.api_base` value +- **Override during deployment** with `--set dynamo.api_base=http://new-endpoint:8000` +- **Use different values files** for different environments (dev/staging/prod) + +### Resource Requirements + +**Kubernetes Production Deployment**: + +**Minimum Requirements**: +- **Kubernetes cluster** with 4+ GPU nodes for disaggregated serving +- **Each node**: 16+ CPU cores, 64GB+ RAM, 2-4 GPUs +- **Storage**: 500GB+ for model storage (SSD recommended) +- **Network**: High-bandwidth interconnect for multi-node setups + +**Component Resource Allocation**: +- **Frontend**: 1-2 CPU cores, 2-4GB RAM (handles HTTP requests) +- **Processor**: 2-4 CPU cores, 4-8GB RAM (request processing) +- **VllmDecodeWorker**: 4+ GPU, 8+ CPU cores, 16GB+ 
RAM (model inference) +- **VllmPrefillWorker**: 2+ GPU, 4+ CPU cores, 8GB+ RAM (prefill operations) +- **Router**: 1-2 CPU cores, 2-4GB RAM (KV-aware routing) +- **LLM Router**: 1 GPU, 2 CPU cores, 4GB RAM (routing model inference) + +**Scaling Considerations**: +- **Disaggregated Serving**: Separate prefill and decode for better throughput +- **Horizontal Scaling**: Multiple VllmDecodeWorker and VllmPrefillWorker replicas +- **GPU Memory**: Adjust based on model size (70B models need 40GB+ VRAM per GPU) + +## Prerequisites + +
+ +[![Prerequisites](https://img.shields.io/badge/Prerequisites-Check%20List-blue?style=for-the-badge&logo=checkmk)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#prerequisites) + +*Ensure your environment meets all requirements before deployment* + +
+ +### Required Tools + +
+ +**Verify you have the required tools installed:** + +
+ +```bash +# Required tools verification +kubectl version --client +helm version +docker version +``` + +
+ +| **Tool** | **Requirement** | **Status** | +|:---:|:---:|:---:| +| **kubectl** | `v1.24+` | Check with `kubectl version --client` | +| **Helm** | `v3.0+` | Check with `helm version` | +| **Docker** | Running daemon | Check with `docker version` | + +
+ +**Additional Requirements:** +- **NVIDIA GPU nodes** with GPU Operator installed (for LLM inference) +- **Container registry access** (Docker Hub, NVIDIA NGC, etc.) +- **Git** for cloning repositories + +### Inference Runtime Images + +Set your inference runtime image from the available NGC options: + +```bash +# Set your inference runtime image +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 +``` + +**Available Runtime Images**: +- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` - vLLM backend (recommended) +- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` - SGLang backend +- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` - TensorRT-LLM backend + +### Hugging Face Token + +For accessing models from Hugging Face Hub, you'll need a Hugging Face token: + +```bash +# Set your Hugging Face token for model access +export HF_TOKEN=your_hf_token +``` + +Get your token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) + +### Kubernetes Cluster Requirements + +#### PVC Support with Default Storage Class +Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration: + +```bash +# Check if default storage class exists +kubectl get storageclass + +# Expected output should show at least one storage class marked as (default) +# Example: +# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE +# standard (default) kubernetes.io/gce-pd Delete Immediate true 1d +``` + +### Optional Requirements + +#### Service Mesh (Optional) +For advanced networking and security features, you may want to install: +- **Istio service mesh**: For advanced traffic management and security + +```bash +# Check if Istio is installed +kubectl get pods -n istio-system + +# Expected output should show running Istio pods +# istiod-* pods should be in Running state +``` + +If Istio is not installed, follow the [official Istio installation guide](https://istio.io/latest/docs/setup/getting-started/). + +## Pre-Deployment Validation + +
+ +[![Validation](https://img.shields.io/badge/Pre--Deployment-Validation-yellow?style=for-the-badge&logo=checkmarx)](https://kubernetes.io) + +*Validate your environment before starting deployment* + +
+ +Before starting the deployment, validate that your environment meets all requirements: + +### Validate Kubernetes Cluster + +```bash +# Verify Kubernetes cluster access and version +kubectl version --client +kubectl cluster-info + +# Check node resources and GPU availability +kubectl get nodes -o wide +kubectl describe nodes | grep -A 5 "Capacity:" + +# Verify default storage class exists +kubectl get storageclass +``` + +### Validate Container Registry Access + +```bash +# Test NGC registry access (if using NGC images) +docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY + +# Verify you can pull the Dynamo runtime image +docker pull $DYNAMO_IMAGE +``` + +### Validate Configuration Files + +```bash +# Navigate to the customization directory +cd customizations/LLM\ Router + +# Check that required files exist +ls -la frontend.yaml agg.yaml disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml + +# Validate YAML syntax +python -c "import yaml; yaml.safe_load(open('frontend.yaml'))" && echo "frontend.yaml is valid" +python -c "import yaml; yaml.safe_load(open('agg.yaml'))" && echo "agg.yaml is valid" +python -c "import yaml; yaml.safe_load(open('disagg.yaml'))" && echo "disagg.yaml is valid" +python -c "import yaml; yaml.safe_load(open('router-config-dynamo.yaml'))" && echo "router-config-dynamo.yaml is valid" +python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" && echo "llm-router-values-override.yaml is valid" +``` + +### Environment Setup + +```bash +# Core deployment variables +export NAMESPACE=dynamo-kubernetes +export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + +# Model deployment variables +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model (see recommendations above) +export MODEL_SUFFIX=llama-8b # Kubernetes-compatible deployment suffix (lowercase, alphanumeric, hyphens only) +export HF_TOKEN=your_hf_token + +# Optional variables +export NGC_API_KEY=your-ngc-api-key # Optional for public images + +# LLM Router variables (set during router deployment) +export DYNAMO_API_BASE="http://vllm-frontend-frontend.${NAMESPACE}.svc.cluster.local:8000" +export DYNAMO_API_KEY="your-dynamo-api-key-here" # Optional for local deployments +``` + +### Validate Environment Variables + +```bash +# Check required environment variables are set +echo "NAMESPACE: ${NAMESPACE:-'NOT SET'}" +echo "DYNAMO_VERSION: ${DYNAMO_VERSION:-'NOT SET'}" +echo "MODEL_NAME: ${MODEL_NAME:-'NOT SET'}" +echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" +echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" +echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET (optional for public images)'}" +echo "DYNAMO_API_BASE: ${DYNAMO_API_BASE:-'NOT SET (set during router deployment)'}" +echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (optional for local deployments)'}" +``` + +## Deployment Guide + +
+ +[![Deployment](https://img.shields.io/badge/Deployment-Step%20by%20Step-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +**Complete walkthrough for deploying NVIDIA Dynamo and LLM Router** + +
+ +--- + + +### Deployment Overview + +
+ +```mermaid +graph LR + A[Prerequisites] --> B[Install Platform] + B --> C[Deploy vLLM] + C --> D[Setup Router] + D --> E[Configure Access] + E --> F[Test Integration] + + style A fill:#e3f2fd + style B fill:#f3e5f5 + style C fill:#e8f5e8 + style D fill:#fff3e0 + style E fill:#fce4ec + style F fill:#e0f2f1 +``` + +
+ +### Step 1: Install Dynamo Platform (Path A: Production Install) + +
+ +[![Step 1](https://img.shields.io/badge/Step%201-Install%20Platform-blue?style=for-the-badge&logo=kubernetes)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#path-a-production-install) + +*Deploy the Dynamo Cloud Platform using the official **Path A: Production Install*** + +
+ + + +```bash +# 1. Install CRDs (use 'upgrade' instead of 'install' if already installed) +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default + +# 2. Install Platform (use 'upgrade' instead of 'install' if already installed) +kubectl create namespace ${NAMESPACE} +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE} + +# 3. Verify deployment +# Check CRDs +kubectl get crd | grep dynamo +# Check operator and platform pods +kubectl get pods -n ${NAMESPACE} +# Expected: dynamo-operator-* and etcd-* pods Running +kubectl get svc -n ${NAMESPACE} +``` + +### Step 2: Deploy Multiple vLLM Models + +
+ +[![Step 2](https://img.shields.io/badge/Step%202-Deploy%20Multiple%20Models-orange?style=for-the-badge&logo=nvidia)](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md) + +*Deploy multiple vLLM models for intelligent routing* + +
+ + + +Since our LLM Router routes to different models based on task complexity, we can deploy models using the environment variables already set in Step 1. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): + +```bash +# 1. Create Kubernetes secret for Hugging Face token (using variables from Step 1) +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN=${HF_TOKEN} \ + -n ${NAMESPACE} + +# 2. Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) +cd "customizations/LLM Router/" +``` + +#### Shared Frontend Deployment + +**Step 1: Deploy Shared Frontend** +```bash +# Deploy the shared frontend service (serves all models) +envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Step 2: Deploy Model Workers** + +Choose your worker deployment approach: + +**Option A: Using agg.yaml (aggregated workers)** +```bash +# Deploy model workers only (frontend extracted to frontend.yaml) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +export MODEL_SUFFIX=llama-8b +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Option B: Using disagg.yaml (disaggregated workers)** +```bash +# Deploy separate prefill and decode workers (frontend extracted to frontend.yaml) +export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct +export MODEL_SUFFIX=llama-70b +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +### Adding More Models (Optional) + +**Current Setup**: We deploy 3 models that cover most use cases: +- **Llama-3.1-8B**: Fast model for simple tasks +- **Llama-3.1-70B**: Powerful model for complex tasks +- **Mixtral-8x22B**: Creative model for conversational tasks + +**To add more models**, follow this pattern: + +#### Example: Adding Phi-3-Mini Model + +```bash +# Simply set the model name and suffix, then deploy using existing files +export MODEL_NAME=microsoft/Phi-3-mini-128k-instruct +export MODEL_SUFFIX=phi-3-mini + +# Deploy using aggregated workers +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} + +# OR deploy using disaggregated workers +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Repeat this pattern** for any additional models you want to deploy. + +### Step 3: Verify Shared Frontend Deployment + +
+ +[![Step 3](https://img.shields.io/badge/Step%203-Verify%20Deployments-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +*Verify that the shared frontend and model workers have been deployed successfully* + +
+ +```bash +# Check deployment status for shared frontend and all model workers +kubectl get pods -n ${NAMESPACE} +kubectl get svc -n ${NAMESPACE} + +# Verify shared frontend is running +kubectl logs deployment/frontend -n ${NAMESPACE} --tail=10 + +# Look for all model worker pods +kubectl get pods -n ${NAMESPACE} | grep -E "(worker|decode|prefill)" + +# Verify the shared frontend service (single port for all models) +kubectl get svc -n ${NAMESPACE} | grep frontend +``` + +### Step 4: Test Shared Frontend Service + +
+ +[![Step 4](https://img.shields.io/badge/Step%204-Test%20Services-purple?style=for-the-badge&logo=checkmarx)](https://checkmarx.com) + +*Test the shared frontend service with different models* + +

```bash
# Forward the shared frontend service port (service name matches DYNAMO_API_BASE)
kubectl port-forward svc/vllm-frontend-frontend 8000:8000 -n ${NAMESPACE} &

# Test different models through the same endpoint by specifying the model name

# Test Model 1 (e.g., Llama-3.1-8B)
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Simple question: What is 2+2?"}],
    "stream": false,
    "max_tokens": 30
  }' | jq

# Test Model 2 (e.g., different model if deployed)
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-128k-instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing briefly"}],
    "stream": false,
    "max_tokens": 100
  }' | jq

# Check health and available models
curl localhost:8000/health
curl localhost:8000/v1/models | jq
```

### Step 5: Set Up LLM Router API Keys
+ +[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=keycdn)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Configure API keys for LLM Router integration* + +
+ +**IMPORTANT**: The router configuration uses Kubernetes secrets for API key management following the [official NVIDIA pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml). + +```bash +# 1. Create the LLM Router namespace +kubectl create namespace llm-router + +# 2. Create secret for Dynamo API key (if authentication is required) +# Note: For local Dynamo deployments, API keys may not be required +kubectl create secret generic dynamo-api-secret \ + --from-literal=DYNAMO_API_KEY="your-dynamo-api-key-here" \ + --namespace=llm-router + +# 3. (Optional) Create image pull secret for private registries (only if using private container registry) +kubectl create secret docker-registry nvcr-secret \ + --docker-server=nvcr.io \ + --docker-username='$oauthtoken' \ + --docker-password="your-ngc-api-key-here" \ + --namespace=llm-router + +# 4. Verify secrets were created +kubectl get secrets -n llm-router +``` + +### Step 6: Deploy LLM Router + +
+ +[![Step 6](https://img.shields.io/badge/Step%206-Deploy%20Router-indigo?style=for-the-badge&logo=nvidia)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Deploy the NVIDIA LLM Router using Helm* + +
+ +**Note**: The NVIDIA LLM Router requires building images from source and using the official Helm charts from the GitHub repository. + +```bash +# 1. Clone the NVIDIA LLM Router repository (required for Helm charts) +git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git +cd llm-router + +# 2. Use official NVIDIA LLM Router images (no building required) +# Our values file is configured to use the official images from nvcr.io/nvidian/sae/ +# If you need custom images, build and push them to your registry: +# docker build -t /router-server:latest -f src/router-server/router-server.dockerfile . +# docker build -t /router-controller:latest -f src/router-controller/router-controller.dockerfile . +# docker push /router-server:latest +# docker push /router-controller:latest + + +# 3. Create router configuration ConfigMap using official External ConfigMap strategy +# The official Helm chart now supports external ConfigMaps natively +kubectl create configmap router-config-dynamo \ + --from-file=config.yaml=router-config-dynamo.yaml \ + --namespace=llm-router + +# 4. Prepare router models (download from NGC) +# Download the NemoCurator Prompt Task and Complexity Classifier model from NGC: +# https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/prompt-task-and-complexity-classifier/version +# Follow the main project README to download models to local 'routers/' directory +# Then create PVC and upload models: + +kubectl apply -f - < + +[![Step 7](https://img.shields.io/badge/Step%207-Configure%20Access-teal?style=for-the-badge&logo=nginx)](https://kubernetes.io) + +*Configure external access to the LLM Router* + + + +```bash +# For development/testing, use port forwarding to access LLM Router +kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router + +# Test the LLM Router API +curl http://localhost:8084/health +``` + +## Configuration + +### Ingress Configuration + +The LLM Router is configured with ingress disabled by default to avoid service name conflicts. To enable external access: + +```yaml +ingress: + enabled: false # Disabled by default - enable after deployment is working + className: "nginx" # Adjust for your ingress controller + hosts: + - host: llm-router.local # Change to your domain + paths: + - path: / + pathType: Prefix +``` + +**Important**: Update the `host` field in `llm-router-values-override.yaml` to match your domain: + +```bash +# For production, replace llm-router.local with your actual domain +sed -i 's/llm-router.local/your-domain.com/g' llm-router-values-override.yaml +``` + +**For local testing**, add the ingress IP to your `/etc/hosts`: + +```bash +# Get the ingress IP and add to hosts file +INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts +``` + +### API Key Management + +The router configuration uses **environment variable substitution** for secure API key management, following the [official NVIDIA LLM Router pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml): + +```yaml +# In router-config-dynamo.yaml +llms: + - name: Brainstorming + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: "${DYNAMO_API_KEY}" # Resolved from Kubernetes secret + model: meta-llama/Llama-3.1-70B-Instruct +``` + +The LLM Router controller: +1. 
Reads `DYNAMO_API_KEY` from the `dynamo-api-secret` Kubernetes secret +2. Replaces `${DYNAMO_API_KEY}` placeholders in the configuration +3. Uses the actual API key value for authentication with Dynamo services + +**Security Note**: Never use empty strings (`""`) for API keys. Always use proper Kubernetes secrets with environment variable references. + +### Router Configuration + +The `router-config-dynamo.yaml` configures routing policies to our deployed models. + +**Current Setup**: The configuration routes to different models based on task complexity and type: +- `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks +- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks + +**Note**: All routing goes through the shared frontend service which handles model selection: + +| **Task Router** | **Model** | **Shared Frontend** | **Use Case** | +|-----------------|-----------|--------------|--------------| +| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative ideation | +| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Conversational AI | +| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Programming tasks | +| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text summarization | +| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | General text creation | +| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex questions | +| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple Q&A | +| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text classification | +| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Information extraction | +| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text rewriting | + +| **Complexity Router** | **Model** | **Shared Frontend** | **Use Case** | +|----------------------|-----------|--------------|--------------| +| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative tasks | +| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex reasoning | +| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Knowledge-intensive | +| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Few-shot learning | +| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Specialized domains | +| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple reasoning | +| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Constrained tasks | + +**Intelligent Routing Strategy**: +- **Simple tasks** → `meta-llama/Llama-3.1-8B-Instruct` (fast, efficient) +- **Complex tasks** → `meta-llama/Llama-3.1-70B-Instruct` (powerful, detailed) +- **Creative/Conversational** → 
`mistralai/Mixtral-8x22B-Instruct-v0.1` (diverse, creative) +- **Extensible**: Add more models by deploying additional workers and updating router configuration + +## Testing the Integration + +Once both Dynamo and LLM Router are deployed, test the complete integration: + +```bash +# Test LLM Router with task-based routing +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { + "role": "user", + "content": "Write a Python function to calculate fibonacci numbers" + } + ], + "model": "", + "nim-llm-router": { + "policy": "task_router", + "routing_strategy": "triton", + "model": "" + } + }' | jq + +# Test with complexity-based routing +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { + "role": "user", + "content": "Explain quantum computing in simple terms" + } + ], + "model": "", + "nim-llm-router": { + "policy": "complexity_router", + "routing_strategy": "triton", + "model": "" + } + }' | jq + +# Monitor routing decisions in LLM Router logs +kubectl logs -f deployment/llm-router-router-controller -n llm-router + +# Monitor Dynamo inference logs +kubectl logs -f deployment/llm-deployment-frontend -n dynamo-cloud +``` + + + + + +## Troubleshooting + +If you encounter issues, the most common causes are: + +1. **Missing Prerequisites**: Ensure all environment variables are set correctly +2. **Insufficient Resources**: Verify your cluster has enough GPU and memory resources +3. **Network Issues**: Check that services can communicate across namespaces +4. **LLM Router Configuration**: The Helm chart defaults to NVIDIA Cloud API integration + +### LLM Router Issues + +**Problem**: Router Controller crashes with "Missing field 'api_key'" errors +**Cause**: Configuration issues with API key management or ConfigMap mounting +**Solution**: Verify that the external ConfigMap is properly created and the Helm chart is using the official External ConfigMap strategy + +**Problem**: Router Server crashes with "failed to stat file /model_repository/routers" +**Cause**: Duplicate `modelRepository` sections in values file or missing PVC +**Solution**: Ensure clean values file structure and router models are properly uploaded to PVC + +### Quick Health Check + +```bash +# Verify all components are running +kubectl get pods -n ${NAMESPACE} +kubectl get pods -n llm-router + +# If something isn't working, check the logs +kubectl logs -f -n +``` + +For detailed debugging, refer to the Kubernetes documentation or the specific component's logs. 
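
### End-to-End Routing Smoke Test

As a final check that routing decisions reach the right Dynamo workers, send prompts of different complexity through the router and compare which backend model answers each one. A minimal sketch, assuming the service names and port-forwards used earlier in this guide and that `jq` is installed (the serving model is normally echoed in the response's `model` field):

```bash
# Forward the router controller port (as in Step 7)
kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router &
sleep 3

# One simple and one reasoning-heavy prompt; each line prints the model that served it
for PROMPT in "What is 2+2?" "Explain quantum entanglement to a physicist"; do
  curl -s -X POST http://localhost:8084/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"\",
      \"messages\": [{\"role\": \"user\", \"content\": \"${PROMPT}\"}],
      \"max_tokens\": 64,
      \"nim-llm-router\": {\"policy\": \"complexity_router\", \"routing_strategy\": \"triton\", \"model\": \"\"}
    }" | jq -r --arg prompt "$PROMPT" '"\($prompt) -> \(.model)"'
done
```

Per the complexity routing policy, simple prompts should land on the 8B model and reasoning-heavy prompts on the 70B model; if both print the same model, recheck the router ConfigMap and the Dynamo worker deployments.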
+ +## Cleanup + + + +```bash +# Remove LLM Router +helm uninstall llm-router -n llm-router +kubectl delete namespace llm-router + +# Remove all model deployments (use the same files you deployed with) +# If you used agg.yaml: +# kubectl delete -f agg.yaml -n ${NAMESPACE} +# If you used disagg.yaml: +# kubectl delete -f disagg.yaml -n ${NAMESPACE} +# Remove shared frontend +kubectl delete -f frontend.yaml -n ${NAMESPACE} + +# Remove Hugging Face token secret +kubectl delete secret hf-token-secret -n ${NAMESPACE} + +# Remove Dynamo Cloud Platform (if desired) +helm uninstall dynamo-platform -n ${NAMESPACE} +helm uninstall dynamo-crds -n default +kubectl delete namespace ${NAMESPACE} + +# Stop supporting services (if used) +docker compose -f deploy/metrics/docker-compose.yml down +``` + +## Files in This Directory + +- **`README.md`** - This comprehensive deployment guide +- **`frontend.yaml`** - Shared OpenAI-compatible API frontend service configuration +- **`agg.yaml`** - Aggregated worker configuration (frontend extracted to frontend.yaml) +- **`disagg.yaml`** - Disaggregated worker configuration with separate prefill/decode workers (frontend extracted to frontend.yaml) +- **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration +- **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration + +## Resources + +- [NVIDIA Dynamo Cloud Platform Documentation](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) +- [NVIDIA Dynamo Kubernetes Operator](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_operator.html) +- [NVIDIA Dynamo GitHub Repository](https://github.com/ai-dynamo/dynamo) +- [LLM Router GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/llm-router) +- [LLM Router Helm Chart](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router) +- [Kubernetes Documentation](https://kubernetes.io/docs/) +- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) \ No newline at end of file diff --git a/examples/deployments/LLM Router/agg.yaml b/examples/deployments/LLM Router/agg.yaml new file mode 100644 index 0000000000..d84e506bed --- /dev/null +++ b/examples/deployments/LLM Router/agg.yaml @@ -0,0 +1,26 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-agg-${MODEL_SUFFIX} +spec: + services: + VllmDecodeWorker: + envFromSecret: hf-token-secret + dynamoNamespace: vllm-agg + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - python3 -m dynamo.vllm --model ${MODEL_NAME} diff --git a/examples/deployments/LLM Router/disagg.yaml b/examples/deployments/LLM Router/disagg.yaml new file mode 100644 index 0000000000..b391714694 --- /dev/null +++ b/examples/deployments/LLM Router/disagg.yaml @@ -0,0 +1,43 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-disagg-${MODEL_SUFFIX} +spec: + services: + VllmDecodeWorker: + dynamoNamespace: vllm-agg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model ${MODEL_NAME}" + VllmPrefillWorker: + dynamoNamespace: vllm-agg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model ${MODEL_NAME} --is-prefill-worker" \ No newline at end of file diff --git a/examples/deployments/LLM Router/frontend.yaml b/examples/deployments/LLM Router/frontend.yaml new file mode 100644 index 0000000000..8670fffaaf --- /dev/null +++ b/examples/deployments/LLM Router/frontend.yaml @@ -0,0 +1,16 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-frontend +spec: + services: + Frontend: + dynamoNamespace: vllm-agg + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} diff --git a/examples/deployments/LLM Router/helm-enhancement-implementation.yaml b/examples/deployments/LLM Router/helm-enhancement-implementation.yaml new file mode 100644 index 0000000000..df402c184d --- /dev/null +++ b/examples/deployments/LLM Router/helm-enhancement-implementation.yaml @@ -0,0 +1,189 @@ +# LLM Router Helm Chart Enhancement Implementation +# Author: LLM Router Team +# Purpose: Support both cloud and local model deployments + +# ============================================================================= +# 1. Enhanced values.yaml +# ============================================================================= + +routerController: + enabled: true + image: + repository: router-controller + tag: latest + pullPolicy: IfNotPresent + + # Configuration Strategy (choose one) + config: + # Strategy 1: External ConfigMap (highest flexibility) + existingConfigMap: "" # If set, uses existing ConfigMap instead of generating + + # Strategy 2: Inline custom config (medium flexibility) + customConfig: "" # If set, uses this YAML content directly + + # Strategy 3: Template-based config (structured approach) + template: + enabled: true # Use templated configuration + + # Backend configuration + backend: + type: "nvidia-cloud" # Options: nvidia-cloud, local-service, custom + + # For nvidia-cloud backend + nvidia: + apiBase: "https://integrate.api.nvidia.com" + apiKeySecret: "llm-api-keys" + apiKeySecretKey: "nvidia_api_key" + + # For local-service backend + local: + apiBase: "" # e.g., "http://vllm-frontend.dynamo.svc.cluster.local:8000/v1" + apiKeySecret: "" # Optional for local services + apiKeySecretKey: "" + + # For custom backend + custom: + apiBase: "" + apiKeySecret: "" + apiKeySecretKey: "" + + # Model routing configuration + policies: + - name: "task_router" + url: "http://{{ include \"llm-router.fullname\" . 
}}-router-server:8000/v2/models/task_router_ensemble/infer" + llms: + brainstorming: + model: "meta/llama-3.1-70b-instruct" # Cloud model name + localModel: "Qwen/Qwen3-0.6B" # Local model name + chatbot: + model: "mistralai/mixtral-8x22b-instruct-v0.1" + localModel: "Qwen/Qwen3-0.6B" + classification: + model: "meta/llama-3.1-8b-instruct" + localModel: "Qwen/Qwen3-0.6B" + # ... other models + + - name: "complexity_router" + url: "http://{{ include \"llm-router.fullname\" . }}-router-server:8000/v2/models/complexity_router_ensemble/infer" + llms: + creativity: + model: "meta/llama-3.1-70b-instruct" + localModel: "Qwen/Qwen3-0.6B" + reasoning: + model: "nvidia/llama-3.3-nemotron-super-49b-v1" + localModel: "Qwen/Qwen3-0.6B" + # ... other models + +# ============================================================================= +# 2. Enhanced templates/router-controller-configmap.yaml +# ============================================================================= + +{{- if not .Values.routerController.config.existingConfigMap }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ include "llm-router.fullname" . }}-router-controller-config + labels: + {{- include "llm-router.labels" . | nindent 4 }} + app.kubernetes.io/component: router-controller +data: + config.yaml: |- +{{- if .Values.routerController.config.customConfig }} + # User-provided custom configuration +{{ .Values.routerController.config.customConfig | indent 4 }} +{{- else if .Values.routerController.config.template.enabled }} + # Template-based configuration + policies: +{{- range .Values.routerController.config.template.policies }} + - name: {{ .name | quote }} + url: {{ tpl .url $ }} + llms: +{{- $backend := $.Values.routerController.config.template.backend }} +{{- range $taskName, $taskConfig := .llms }} + - name: {{ $taskName | title }} +{{- if eq $backend.type "nvidia-cloud" }} + api_base: {{ $backend.nvidia.apiBase }} + api_key: ${NVIDIA_API_KEY} + model: {{ $taskConfig.model }} +{{- else if eq $backend.type "local-service" }} + api_base: {{ $backend.local.apiBase }} +{{- if $backend.local.apiKeySecret }} + api_key: ${NVIDIA_API_KEY} +{{- else }} + api_key: ${NVIDIA_API_KEY} # Placeholder for local services +{{- end }} + model: {{ $taskConfig.localModel | default $taskConfig.model }} +{{- else if eq $backend.type "custom" }} + api_base: {{ $backend.custom.apiBase }} + api_key: ${NVIDIA_API_KEY} + model: {{ $taskConfig.localModel | default $taskConfig.model }} +{{- end }} +{{- end }} +{{- end }} +{{- else }} + # Default NVIDIA Cloud configuration (backward compatibility) + policies: + - name: "task_router" + url: http://{{ include "llm-router.fullname" . }}-router-server:8000/v2/models/task_router_ensemble/infer + llms: + - name: Brainstorming + api_base: https://integrate.api.nvidia.com + api_key: ${NVIDIA_API_KEY} + model: meta/llama-3.1-70b-instruct + # ... rest of default config +{{- end }} +{{- end }} + +# ============================================================================= +# 3. Enhanced templates/router-controller-deployment.yaml +# ============================================================================= + +# In the volumes section: + volumes: + - name: config-volume + configMap: +{{- if .Values.routerController.config.existingConfigMap }} + name: {{ .Values.routerController.config.existingConfigMap }} +{{- else }} + name: {{ include "llm-router.fullname" . }}-router-controller-config +{{- end }} + +# ============================================================================= +# 4. 
Usage Examples +# ============================================================================= + +# Example 1: Dynamo Integration (Template-based) +# values-dynamo.yaml +routerController: + config: + template: + enabled: true + backend: + type: "local-service" + local: + apiBase: "http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1" + apiKeySecret: "" # No API key needed for local services + +# Example 2: Custom Configuration (Maximum flexibility) +# values-custom.yaml +routerController: + config: + customConfig: | + policies: + - name: "task_router" + url: http://router-server:8000/v2/models/task_router_ensemble/infer + llms: + - name: Brainstorming + api_base: http://my-custom-service:8000/v1 + api_key: ${NVIDIA_API_KEY} + model: my-custom-model + +# Example 3: External ConfigMap (Advanced users) +# values-external.yaml +routerController: + config: + existingConfigMap: "my-router-config" + +# Example 4: Default Cloud (Backward compatibility) +# values-cloud.yaml (or no values file) +# Uses default NVIDIA Cloud configuration automatically diff --git a/examples/deployments/LLM Router/llm-router-values-override.yaml b/examples/deployments/LLM Router/llm-router-values-override.yaml new file mode 100644 index 0000000000..3e85dc8fca --- /dev/null +++ b/examples/deployments/LLM Router/llm-router-values-override.yaml @@ -0,0 +1,110 @@ +# LLM Router Helm Values for NVIDIA Dynamo Cloud Platform Integration +# Based on official sample: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/values.override.yaml.sample +# Uses official External ConfigMap strategy for custom configuration + +# Global configuration (following official sample structure) +global: + storageClass: "standard" + imageRegistry: "nvcr.io/nvidian/sae/" + imagePullSecrets: + - name: nvcr-secret + +# Router Controller Configuration +routerController: + enabled: true + replicas: 1 + image: + repository: llm-router-controller # Will be prefixed with global.imageRegistry + tag: latest + pullPolicy: IfNotPresent + + service: + type: ClusterIP + port: 8084 + + # Dynamo-specific environment variables + env: + - name: LOG_LEVEL + value: "INFO" + - name: ENABLE_METRICS + value: "true" + - name: DYNAMO_API_BASE + value: "http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000" + - name: DYNAMO_API_KEY + valueFrom: + secretKeyRef: + name: dynamo-api-secret + key: DYNAMO_API_KEY + + # STRATEGY 1: External ConfigMap (Official Support) + # Uses the official Helm chart's external ConfigMap feature + config: + existingConfigMap: "router-config-dynamo" # Points to our router configuration + +# Router Server Configuration +routerServer: + enabled: true + replicas: 1 # Single replica for simpler deployment + image: + repository: llm-router-server + tag: latest + pullPolicy: IfNotPresent + env: + - name: HF_HOME + value: "/tmp/huggingface" + - name: TRANSFORMERS_CACHE + value: "/tmp/huggingface/transformers" + - name: HF_HUB_CACHE + value: "/tmp/huggingface/hub" + resources: + limits: + nvidia.com/gpu: 1 + memory: "8Gi" + requests: + nvidia.com/gpu: 1 + memory: "8Gi" + # Model repository configuration + modelRepository: + path: "/model_repository/routers" + volumes: + modelRepository: + enabled: true + mountPath: "/model_repository" + storage: + persistentVolumeClaim: + enabled: true + existingClaim: "router-models-pvc" + service: + type: ClusterIP + shm_size: "8G" + +# Ingress Configuration (disabled for internal access) +ingress: + enabled: false + className: "nginx" 
+ annotations: + nginx.ingress.kubernetes.io/rewrite-target: /$2 + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/proxy-body-size: "10m" + nginx.ingress.kubernetes.io/proxy-read-timeout: "300" + nginx.ingress.kubernetes.io/proxy-send-timeout: "300" + hosts: + - host: llm-router.local + paths: + - path: /app(/|$)(.*) + pathType: ImplementationSpecific + service: app + - path: /router-controller(/|$)(.*) + pathType: ImplementationSpecific + service: router-controller + +# Demo app (disabled) +app: + enabled: true # Enable for demo web interface + replicas: 1 # Single replica for simpler deployment + image: + repository: llm-router-app + tag: latest + pullPolicy: IfNotPresent + service: + type: ClusterIP \ No newline at end of file diff --git a/examples/deployments/LLM Router/router-config-dynamo.yaml b/examples/deployments/LLM Router/router-config-dynamo.yaml new file mode 100644 index 0000000000..530bb26d11 --- /dev/null +++ b/examples/deployments/LLM Router/router-config-dynamo.yaml @@ -0,0 +1,139 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LLM Router Configuration for NVIDIA Dynamo Integration +# This configuration routes requests to the official NVIDIA Dynamo Cloud Platform +# deployment using the proper service endpoints +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html +# API Key pattern follows: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml +# +# IMPORTANT: This config references the 3 models for intelligent routing: +# - meta-llama/Llama-3.1-8B-Instruct (Fast model for simple tasks) +# - meta-llama/Llama-3.1-70B-Instruct (Powerful model for complex tasks) +# - mistralai/Mixtral-8x22B-Instruct-v0.1 (Creative model for conversational tasks) +# +# To add more models: +# 1. Deploy the model using the pattern in Step 2 of README.md +# 2. 
Add router entries below following the same format +# +# NOTE: Environment variables are resolved at runtime: +# - ${DYNAMO_API_BASE}: Points to the Dynamo service endpoint +# - ${DYNAMO_API_KEY}: API key for authenticating with Dynamo services +# +# These variables are populated from: +# - ConfigMap: DYNAMO_API_BASE (defined in llm-router-values-override.yaml) +# - Secret: DYNAMO_API_KEY (created during deployment setup) + +policies: + - name: "task_router" + url: http://llm-router-router-server.llm-router.svc.cluster.local:8000/v2/models/task_router_ensemble/infer + llms: + # === INTELLIGENT ROUTING STRATEGY === + # Route to appropriate models based on task complexity + + # Simple tasks → Fast 8B model + - name: "Closed QA" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Classification + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Extraction + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Rewrite + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Summarization + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Unknown + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + + # Complex tasks → Powerful 70B model + - name: Brainstorming + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Code Generation" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Open QA" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + - name: Other + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + + # Creative/Conversational tasks → Mixtral model + - name: Chatbot + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: "Text Generation" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + + - name: "complexity_router" + url: http://llm-router-router-server.llm-router.svc.cluster.local:8000/v2/models/complexity_router_ensemble/infer + llms: + # === INTELLIGENT COMPLEXITY ROUTING === + # Route to appropriate models based on complexity level + + # Simple complexity → Fast 8B model + - name: "Contextual-Knowledge" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: "No-Label-Reason" + api_base: 
http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + - name: Constraint + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-8B-Instruct + + # High complexity → Powerful 70B model + - name: Creativity + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + - name: Reasoning + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Few-Shot" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: meta-llama/Llama-3.1-70B-Instruct + + # Creative/Domain complexity → Mixtral model + - name: "Domain-Knowledge" + api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: ${DYNAMO_API_KEY} + model: mistralai/Mixtral-8x22B-Instruct-v0.1 \ No newline at end of file From 23da93bb14dd4cf769d616d9ae6bd8d8f336fcdf Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Tue, 16 Sep 2025 02:43:53 +0000 Subject: [PATCH 02/10] refactor: remove deprecated helm-enhancement-implementation.yaml and update README.md for model routing - Deleted the helm-enhancement-implementation.yaml file as it is no longer needed. - Updated README.md to reflect changes in model routing, including new API base URLs and model names. - Adjusted environment variable descriptions for clarity, particularly regarding the DYNAMO_API_KEY for local deployments. - Enhanced deployment instructions to include multiple model deployment examples. Signed-off-by: arunraman Signed-off-by: arunraman --- examples/deployments/LLM Router/README.md | 116 +++++------ .../helm-enhancement-implementation.yaml | 189 ------------------ .../LLM Router/router-config-dynamo.yaml | 38 ++-- 3 files changed, 79 insertions(+), 264 deletions(-) delete mode 100644 examples/deployments/LLM Router/helm-enhancement-implementation.yaml diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index 05023f6dc2..5c7c23aaf5 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -221,7 +221,7 @@ graph TB ```bash -# Code generation task → Routes to llama-3.3-nemotron-super-49b-v1 +# Code generation task → Routes to meta-llama/Llama-3.1-70B-Instruct curl -X POST http://llm-router.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -243,7 +243,7 @@ curl -X POST http://llm-router.local/v1/chat/completions \ ```bash -# Complex reasoning task → Routes to llama-3.3-nemotron-super-49b-v1 +# Complex reasoning task → Routes to meta-llama/Llama-3.1-70B-Instruct curl -X POST http://llm-router.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -262,7 +262,7 @@ curl -X POST http://llm-router.local/v1/chat/completions \ The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: -1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000/v1` +1. **Single Endpoint**: `http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1` 2. 
**Model-Based Routing**: Dynamo routes internally based on the `model` field in requests 3. **OpenAI Compatibility**: Standard OpenAI API format with model selection @@ -367,7 +367,9 @@ Set the required environment variables for deployment: | `HF_TOKEN` | Hugging Face access token | `your_hf_token` | Yes | Model access | | `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | No | Private images | | `DYNAMO_API_BASE` | Dynamo service endpoint URL | `http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000` | Yes | LLM Router | -| `DYNAMO_API_KEY` | Dynamo API authentication key | `your-dynamo-api-key-here` | No | LLM Router auth | +| `DYNAMO_API_KEY` | Dynamo API authentication key | `""` (empty) | Yes* | LLM Router auth | + +*Required for router configuration but can be empty for local deployments ### Model Size Recommendations @@ -401,20 +403,6 @@ For optimal deployment experience, consider model size vs. resources: - **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest` - **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest` -### Configuration Variables - -The deployment uses a configurable `api_base` variable for flexible endpoint management: - -| Variable | File | Description | Default Value | -|----------|------|-------------|---------------| -| `dynamo.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000` | -| `${DYNAMO_API_BASE}` | `router-config-dynamo.yaml` | Template variable substituted during deployment | Derived from `dynamo.api_base` | - -This approach allows you to: -- **Switch environments** by changing only the `dynamo.api_base` value -- **Override during deployment** with `--set dynamo.api_base=http://new-endpoint:8000` -- **Use different values files** for different environments (dev/staging/prod) - ### Resource Requirements **Kubernetes Production Deployment**: @@ -596,17 +584,23 @@ export NAMESPACE=dynamo-kubernetes export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} -# Model deployment variables -export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model (see recommendations above) -export MODEL_SUFFIX=llama-8b # Kubernetes-compatible deployment suffix (lowercase, alphanumeric, hyphens only) +# Model deployment variables (deploy all three models) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Start with this model +export MODEL_SUFFIX=llama-8b # Kubernetes-compatible deployment suffix export HF_TOKEN=your_hf_token +# Deploy other models by changing MODEL_NAME and MODEL_SUFFIX: +# export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct +# export MODEL_SUFFIX=llama-70b +# export MODEL_NAME=mistralai/Mixtral-8x22B-Instruct-v0.1 +# export MODEL_SUFFIX=mixtral-8x22b + # Optional variables export NGC_API_KEY=your-ngc-api-key # Optional for public images # LLM Router variables (set during router deployment) export DYNAMO_API_BASE="http://vllm-frontend-frontend.${NAMESPACE}.svc.cluster.local:8000" -export DYNAMO_API_KEY="your-dynamo-api-key-here" # Optional for local deployments +export DYNAMO_API_KEY="" # Empty for local deployments (no authentication required) ``` ### Validate Environment Variables @@ -620,7 +614,7 @@ echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET (optional for public images)'}" echo "DYNAMO_API_BASE: 
${DYNAMO_API_BASE:-'NOT SET (set during router deployment)'}" -echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (optional for local deployments)'}" +echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (can be empty for local deployments)'}" ``` ## Deployment Guide @@ -725,22 +719,30 @@ envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} Choose your worker deployment approach: -**Option A: Using agg.yaml (aggregated workers)** +**Deploy Model 1 (Llama 8B - Simple Tasks):** ```bash -# Deploy model workers only (frontend extracted to frontend.yaml) +# Deploy Llama 8B model for simple tasks export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct export MODEL_SUFFIX=llama-8b envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` -**Option B: Using disagg.yaml (disaggregated workers)** +**Deploy Model 2 (Llama 70B - Complex Tasks):** ```bash -# Deploy separate prefill and decode workers (frontend extracted to frontend.yaml) +# Deploy Llama 70B model for complex tasks export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct export MODEL_SUFFIX=llama-70b envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` +**Deploy Model 3 (Mixtral - Creative Tasks):** +```bash +# Deploy Mixtral model for creative/conversational tasks +export MODEL_NAME=mistralai/Mixtral-8x22B-Instruct-v0.1 +export MODEL_SUFFIX=mixtral-8x22b +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + ### Adding More Models (Optional) **Current Setup**: We deploy 3 models that cover most use cases: @@ -803,7 +805,7 @@ kubectl get svc -n ${NAMESPACE} | grep frontend ```bash # Forward the shared frontend service port -kubectl port-forward svc/frontend-service 8000:8000 -n ${NAMESPACE} & +kubectl port-forward svc/vllm-frontend-frontend 8000:8000 -n ${NAMESPACE} & # Test different models through the same endpoint by specifying the model name @@ -817,11 +819,11 @@ curl localhost:8000/v1/chat/completions \ "max_tokens": 30 }' | jq -# Test Model 2 (e.g., different model if deployed) +# Test Model 2 (e.g., Llama-3.1-70B if deployed) curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "microsoft/Phi-3-mini-128k-instruct", + "model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Explain quantum computing briefly"}], "stream": false, "max_tokens": 100 @@ -848,10 +850,10 @@ curl localhost:8000/v1/models | jq # 1. Create the LLM Router namespace kubectl create namespace llm-router -# 2. Create secret for Dynamo API key (if authentication is required) -# Note: For local Dynamo deployments, API keys may not be required +# 2. Create secret for Dynamo API key (empty for local deployments) +# Note: For local Dynamo deployments, API keys are not required kubectl create secret generic dynamo-api-secret \ - --from-literal=DYNAMO_API_KEY="your-dynamo-api-key-here" \ + --from-literal=DYNAMO_API_KEY="" \ --namespace=llm-router # 3. (Optional) Create image pull secret for private registries (only if using private container registry) @@ -944,7 +946,7 @@ kubectl delete pod model-uploader -n llm-router # 4. Create placeholder API key secret (required by router controller) kubectl create secret generic llm-api-keys \ - --from-literal=nvidia_api_key="not-needed" \ + --from-literal=nvidia_api_key="" \ --namespace=llm-router \ --dry-run=client -o yaml | kubectl apply -f - @@ -1031,41 +1033,43 @@ The LLM Router controller: 2. Replaces `${DYNAMO_API_KEY}` placeholders in the configuration 3. 
Uses the actual API key value for authentication with Dynamo services -**Security Note**: Never use empty strings (`""`) for API keys. Always use proper Kubernetes secrets with environment variable references. +**Security Note**: For local Dynamo deployments, empty strings (`""`) are acceptable since no authentication is required. For production deployments with authentication, always use proper API keys stored in Kubernetes secrets. ### Router Configuration The `router-config-dynamo.yaml` configures routing policies to our deployed models. **Current Setup**: The configuration routes to different models based on task complexity and type: -- `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks -- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks -- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks +- `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks (8B parameters) +- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks (70B parameters) +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks (8x22B parameters) + +**Note**: This guide shows the full 3-model production setup. For testing/development, you can start with fewer models (e.g., just Llama-8B + Qwen-0.6B) and add more as needed. The router will work with any subset of the configured models. **Note**: All routing goes through the shared frontend service which handles model selection: | **Task Router** | **Model** | **Shared Frontend** | **Use Case** | |-----------------|-----------|--------------|--------------| -| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative ideation | -| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Conversational AI | -| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Programming tasks | -| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text summarization | -| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | General text creation | -| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex questions | -| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple Q&A | -| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text classification | -| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Information extraction | -| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text rewriting | +| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Creative ideation | +| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Conversational AI | +| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Programming tasks | +| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Text summarization | +| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | General text creation | +| Open QA | 
`meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Complex questions | +| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Simple Q&A | +| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Text classification | +| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Information extraction | +| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Text rewriting | | **Complexity Router** | **Model** | **Shared Frontend** | **Use Case** | |----------------------|-----------|--------------|--------------| -| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative tasks | -| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex reasoning | -| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Knowledge-intensive | -| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Few-shot learning | -| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Specialized domains | -| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple reasoning | -| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Constrained tasks | +| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Creative tasks | +| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Complex reasoning | +| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Knowledge-intensive | +| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Few-shot learning | +| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Specialized domains | +| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Simple reasoning | +| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://vllm-frontend-frontend.${NAMESPACE}:8000/v1` | Constrained tasks | **Intelligent Routing Strategy**: - **Simple tasks** → `meta-llama/Llama-3.1-8B-Instruct` (fast, efficient) @@ -1118,7 +1122,7 @@ curl -X POST http://localhost:8084/v1/chat/completions \ kubectl logs -f deployment/llm-router-router-controller -n llm-router # Monitor Dynamo inference logs -kubectl logs -f deployment/llm-deployment-frontend -n dynamo-cloud +kubectl logs -f deployment/vllm-frontend-frontend -n ${NAMESPACE} ``` diff --git a/examples/deployments/LLM Router/helm-enhancement-implementation.yaml b/examples/deployments/LLM Router/helm-enhancement-implementation.yaml deleted file mode 100644 index df402c184d..0000000000 --- a/examples/deployments/LLM Router/helm-enhancement-implementation.yaml +++ /dev/null @@ -1,189 +0,0 @@ -# LLM Router Helm Chart Enhancement Implementation -# Author: LLM Router Team -# Purpose: Support both cloud and local model deployments - -# ============================================================================= -# 1. 
Enhanced values.yaml -# ============================================================================= - -routerController: - enabled: true - image: - repository: router-controller - tag: latest - pullPolicy: IfNotPresent - - # Configuration Strategy (choose one) - config: - # Strategy 1: External ConfigMap (highest flexibility) - existingConfigMap: "" # If set, uses existing ConfigMap instead of generating - - # Strategy 2: Inline custom config (medium flexibility) - customConfig: "" # If set, uses this YAML content directly - - # Strategy 3: Template-based config (structured approach) - template: - enabled: true # Use templated configuration - - # Backend configuration - backend: - type: "nvidia-cloud" # Options: nvidia-cloud, local-service, custom - - # For nvidia-cloud backend - nvidia: - apiBase: "https://integrate.api.nvidia.com" - apiKeySecret: "llm-api-keys" - apiKeySecretKey: "nvidia_api_key" - - # For local-service backend - local: - apiBase: "" # e.g., "http://vllm-frontend.dynamo.svc.cluster.local:8000/v1" - apiKeySecret: "" # Optional for local services - apiKeySecretKey: "" - - # For custom backend - custom: - apiBase: "" - apiKeySecret: "" - apiKeySecretKey: "" - - # Model routing configuration - policies: - - name: "task_router" - url: "http://{{ include \"llm-router.fullname\" . }}-router-server:8000/v2/models/task_router_ensemble/infer" - llms: - brainstorming: - model: "meta/llama-3.1-70b-instruct" # Cloud model name - localModel: "Qwen/Qwen3-0.6B" # Local model name - chatbot: - model: "mistralai/mixtral-8x22b-instruct-v0.1" - localModel: "Qwen/Qwen3-0.6B" - classification: - model: "meta/llama-3.1-8b-instruct" - localModel: "Qwen/Qwen3-0.6B" - # ... other models - - - name: "complexity_router" - url: "http://{{ include \"llm-router.fullname\" . }}-router-server:8000/v2/models/complexity_router_ensemble/infer" - llms: - creativity: - model: "meta/llama-3.1-70b-instruct" - localModel: "Qwen/Qwen3-0.6B" - reasoning: - model: "nvidia/llama-3.3-nemotron-super-49b-v1" - localModel: "Qwen/Qwen3-0.6B" - # ... other models - -# ============================================================================= -# 2. Enhanced templates/router-controller-configmap.yaml -# ============================================================================= - -{{- if not .Values.routerController.config.existingConfigMap }} -apiVersion: v1 -kind: ConfigMap -metadata: - name: {{ include "llm-router.fullname" . }}-router-controller-config - labels: - {{- include "llm-router.labels" . 
| nindent 4 }} - app.kubernetes.io/component: router-controller -data: - config.yaml: |- -{{- if .Values.routerController.config.customConfig }} - # User-provided custom configuration -{{ .Values.routerController.config.customConfig | indent 4 }} -{{- else if .Values.routerController.config.template.enabled }} - # Template-based configuration - policies: -{{- range .Values.routerController.config.template.policies }} - - name: {{ .name | quote }} - url: {{ tpl .url $ }} - llms: -{{- $backend := $.Values.routerController.config.template.backend }} -{{- range $taskName, $taskConfig := .llms }} - - name: {{ $taskName | title }} -{{- if eq $backend.type "nvidia-cloud" }} - api_base: {{ $backend.nvidia.apiBase }} - api_key: ${NVIDIA_API_KEY} - model: {{ $taskConfig.model }} -{{- else if eq $backend.type "local-service" }} - api_base: {{ $backend.local.apiBase }} -{{- if $backend.local.apiKeySecret }} - api_key: ${NVIDIA_API_KEY} -{{- else }} - api_key: ${NVIDIA_API_KEY} # Placeholder for local services -{{- end }} - model: {{ $taskConfig.localModel | default $taskConfig.model }} -{{- else if eq $backend.type "custom" }} - api_base: {{ $backend.custom.apiBase }} - api_key: ${NVIDIA_API_KEY} - model: {{ $taskConfig.localModel | default $taskConfig.model }} -{{- end }} -{{- end }} -{{- end }} -{{- else }} - # Default NVIDIA Cloud configuration (backward compatibility) - policies: - - name: "task_router" - url: http://{{ include "llm-router.fullname" . }}-router-server:8000/v2/models/task_router_ensemble/infer - llms: - - name: Brainstorming - api_base: https://integrate.api.nvidia.com - api_key: ${NVIDIA_API_KEY} - model: meta/llama-3.1-70b-instruct - # ... rest of default config -{{- end }} -{{- end }} - -# ============================================================================= -# 3. Enhanced templates/router-controller-deployment.yaml -# ============================================================================= - -# In the volumes section: - volumes: - - name: config-volume - configMap: -{{- if .Values.routerController.config.existingConfigMap }} - name: {{ .Values.routerController.config.existingConfigMap }} -{{- else }} - name: {{ include "llm-router.fullname" . }}-router-controller-config -{{- end }} - -# ============================================================================= -# 4. 
Usage Examples -# ============================================================================= - -# Example 1: Dynamo Integration (Template-based) -# values-dynamo.yaml -routerController: - config: - template: - enabled: true - backend: - type: "local-service" - local: - apiBase: "http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1" - apiKeySecret: "" # No API key needed for local services - -# Example 2: Custom Configuration (Maximum flexibility) -# values-custom.yaml -routerController: - config: - customConfig: | - policies: - - name: "task_router" - url: http://router-server:8000/v2/models/task_router_ensemble/infer - llms: - - name: Brainstorming - api_base: http://my-custom-service:8000/v1 - api_key: ${NVIDIA_API_KEY} - model: my-custom-model - -# Example 3: External ConfigMap (Advanced users) -# values-external.yaml -routerController: - config: - existingConfigMap: "my-router-config" - -# Example 4: Default Cloud (Backward compatibility) -# values-cloud.yaml (or no values file) -# Uses default NVIDIA Cloud configuration automatically diff --git a/examples/deployments/LLM Router/router-config-dynamo.yaml b/examples/deployments/LLM Router/router-config-dynamo.yaml index 530bb26d11..d52663d496 100644 --- a/examples/deployments/LLM Router/router-config-dynamo.yaml +++ b/examples/deployments/LLM Router/router-config-dynamo.yaml @@ -46,55 +46,55 @@ policies: # Simple tasks → Fast 8B model - name: "Closed QA" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Classification - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Extraction - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Rewrite - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Summarization - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Unknown - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct # Complex tasks → Powerful 70B model - name: Brainstorming - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - name: "Code Generation" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - name: "Open QA" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - name: Other - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: mistralai/Mixtral-8x22B-Instruct-v0.1 # Creative/Conversational tasks → 
Mixtral model - name: Chatbot - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: mistralai/Mixtral-8x22B-Instruct-v0.1 - name: "Text Generation" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: mistralai/Mixtral-8x22B-Instruct-v0.1 @@ -106,34 +106,34 @@ policies: # Simple complexity → Fast 8B model - name: "Contextual-Knowledge" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: "No-Label-Reason" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - name: Constraint - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct # High complexity → Powerful 70B model - name: Creativity - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - name: Reasoning - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - name: "Few-Shot" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct # Creative/Domain complexity → Mixtral model - name: "Domain-Knowledge" - api_base: http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: mistralai/Mixtral-8x22B-Instruct-v0.1 \ No newline at end of file From fe3b585671301c8945b576891a3310755e2a8798 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Tue, 16 Sep 2025 03:20:00 +0000 Subject: [PATCH 03/10] chore: update LLM Router configuration files and README for improved deployment instructions - Added SPDX license headers to llm-router-values-override.yaml. - Updated imageRegistry placeholder in llm-router-values-override.yaml for clarity. - Revised README.md to reflect changes in directory structure and emphasize the need to update imageRegistry and imagePullSecrets. - Adjusted paths in README.md for configuration file references to ensure accuracy. - Modified router-config-dynamo.yaml to enhance model routing strategies and updated model names for better clarity. 
Signed-off-by: arunraman Signed-off-by: arunraman --- examples/deployments/LLM Router/README.md | 29 +++++-- .../llm-router-values-override.yaml | 10 ++- .../LLM Router/router-config-dynamo.yaml | 81 ++++++++----------- 3 files changed, 61 insertions(+), 59 deletions(-) diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index 5c7c23aaf5..cca2fe5feb 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -562,8 +562,8 @@ docker pull $DYNAMO_IMAGE ### Validate Configuration Files ```bash -# Navigate to the customization directory -cd customizations/LLM\ Router +# Navigate to the deployment directory +cd examples/deployments/LLM\ Router # Check that required files exist ls -la frontend.yaml agg.yaml disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml @@ -704,7 +704,7 @@ kubectl create secret generic hf-token-secret \ -n ${NAMESPACE} # 2. Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) -cd "customizations/LLM Router/" +cd "examples/deployments/LLM Router/" ``` #### Shared Frontend Deployment @@ -884,9 +884,22 @@ kubectl get secrets -n llm-router git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git cd llm-router -# 2. Use official NVIDIA LLM Router images (no building required) -# Our values file is configured to use the official images from nvcr.io/nvidian/sae/ -# If you need custom images, build and push them to your registry: +# 2. Configure Docker Registry (REQUIRED) +# IMPORTANT: Update the imageRegistry in llm-router-values-override.yaml before deployment +# The file contains a placeholder "YOUR_REGISTRY_HERE/" that MUST be replaced. + +# Edit the values file: +nano ../examples/deployments/LLM\ Router/llm-router-values-override.yaml + +# Update line ~34: Replace "YOUR_REGISTRY_HERE/" with your actual registry: +# Examples: +# - "nvcr.io/nvidia/" (if you have access to NVIDIA's public registry) +# - "your-company-registry.com/llm-router/" (for private registries) +# - "docker.io/your-username/" (for Docker Hub) + +# Also update imagePullSecrets name to match your registry credentials + +# If you need to build custom images, use: # docker build -t /router-server:latest -f src/router-server/router-server.dockerfile . # docker build -t /router-controller:latest -f src/router-controller/router-controller.dockerfile . # docker push /router-server:latest @@ -896,7 +909,7 @@ cd llm-router # 3. Create router configuration ConfigMap using official External ConfigMap strategy # The official Helm chart now supports external ConfigMaps natively kubectl create configmap router-config-dynamo \ - --from-file=config.yaml=router-config-dynamo.yaml \ + --from-file=config.yaml=../examples/deployments/LLM\ Router/router-config-dynamo.yaml \ --namespace=llm-router # 4. Prepare router models (download from NGC) @@ -954,7 +967,7 @@ kubectl create secret generic llm-api-keys \ cd deploy/helm/llm-router helm upgrade --install llm-router . \ --namespace llm-router \ - --values ../../../llm-router-values-override.yaml \ + --values ../../../../examples/deployments/LLM\ Router/llm-router-values-override.yaml \ --wait --timeout=10m # 6. 
Verify LLM Router deployment diff --git a/examples/deployments/LLM Router/llm-router-values-override.yaml b/examples/deployments/LLM Router/llm-router-values-override.yaml index 3e85dc8fca..fc168bfeec 100644 --- a/examples/deployments/LLM Router/llm-router-values-override.yaml +++ b/examples/deployments/LLM Router/llm-router-values-override.yaml @@ -1,13 +1,19 @@ +## +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +## + # LLM Router Helm Values for NVIDIA Dynamo Cloud Platform Integration # Based on official sample: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/values.override.yaml.sample # Uses official External ConfigMap strategy for custom configuration # Global configuration (following official sample structure) +# NOTE: Update imageRegistry and imagePullSecrets before deployment (see README Step 6) global: storageClass: "standard" - imageRegistry: "nvcr.io/nvidian/sae/" + imageRegistry: "YOUR_REGISTRY_HERE/" # REPLACE with your Docker registry imagePullSecrets: - - name: nvcr-secret + - name: nvcr-secret # UPDATE to match your registry credentials # Router Controller Configuration routerController: diff --git a/examples/deployments/LLM Router/router-config-dynamo.yaml b/examples/deployments/LLM Router/router-config-dynamo.yaml index d52663d496..4cf874cc4d 100644 --- a/examples/deployments/LLM Router/router-config-dynamo.yaml +++ b/examples/deployments/LLM Router/router-config-dynamo.yaml @@ -41,99 +41,82 @@ policies: - name: "task_router" url: http://llm-router-router-server.llm-router.svc.cluster.local:8000/v2/models/task_router_ensemble/infer llms: - # === INTELLIGENT ROUTING STRATEGY === - # Route to appropriate models based on task complexity - - # Simple tasks → Fast 8B model - - name: "Closed QA" + - name: Brainstorming api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: Classification + model: meta-llama/Llama-3.1-70B-Instruct + - name: Chatbot api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: Extraction + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: Classification api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - - name: Rewrite + - name: Closed QA api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: Summarization + model: meta-llama/Llama-3.1-70B-Instruct + - name: Code Generation api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: Unknown + model: meta-llama/Llama-3.1-70B-Instruct + - name: Extraction api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - - # Complex tasks → Powerful 70B model - - name: Brainstorming + - name: Open QA api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - - name: "Code Generation" + - name: Other api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-70B-Instruct - - name: "Open QA" + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: Rewrite api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-70B-Instruct - - name: Other + model: meta-llama/Llama-3.1-8B-Instruct + - name: Summarization api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: 
mistralai/Mixtral-8x22B-Instruct-v0.1 - - # Creative/Conversational tasks → Mixtral model - - name: Chatbot + model: meta-llama/Llama-3.1-70B-Instruct + - name: Text Generation api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: mistralai/Mixtral-8x22B-Instruct-v0.1 - - name: "Text Generation" + - name: Unknown api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: mistralai/Mixtral-8x22B-Instruct-v0.1 - + model: meta-llama/Llama-3.1-8B-Instruct - name: "complexity_router" url: http://llm-router-router-server.llm-router.svc.cluster.local:8000/v2/models/complexity_router_ensemble/infer llms: - # === INTELLIGENT COMPLEXITY ROUTING === - # Route to appropriate models based on complexity level - - # Simple complexity → Fast 8B model - - name: "Contextual-Knowledge" + - name: Creativity api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: "No-Label-Reason" + model: meta-llama/Llama-3.1-70B-Instruct + - name: Reasoning api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct - - name: Constraint + model: meta-llama/Llama-3.1-70B-Instruct + - name: Contextual-Knowledge api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-8B-Instruct - - # High complexity → Powerful 70B model - - name: Creativity + - name: Few-Shot api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} model: meta-llama/Llama-3.1-70B-Instruct - - name: Reasoning + - name: Domain-Knowledge api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-70B-Instruct - - name: "Few-Shot" + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: No-Label-Reason api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-70B-Instruct - - # Creative/Domain complexity → Mixtral model - - name: "Domain-Knowledge" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Constraint api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: mistralai/Mixtral-8x22B-Instruct-v0.1 \ No newline at end of file + model: meta-llama/Llama-3.1-8B-Instruct \ No newline at end of file From 8cf5596e81ecf57d28969e2432ace2781f03e05b Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Tue, 16 Sep 2025 17:42:17 +0000 Subject: [PATCH 04/10] Fix trailing whitespace in YAML files for pre-commit checks Signed-off-by: arunraman Signed-off-by: arunraman --- .../LLM Router/llm-router-values-override.yaml | 10 +++++----- .../deployments/LLM Router/router-config-dynamo.yaml | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/examples/deployments/LLM Router/llm-router-values-override.yaml b/examples/deployments/LLM Router/llm-router-values-override.yaml index fc168bfeec..ad98d0fa4b 100644 --- a/examples/deployments/LLM Router/llm-router-values-override.yaml +++ b/examples/deployments/LLM Router/llm-router-values-override.yaml @@ -1,4 +1,4 @@ -## +## # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# SPDX-License-Identifier: Apache-2.0 ## @@ -23,11 +23,11 @@ routerController: repository: llm-router-controller # Will be prefixed with global.imageRegistry tag: latest pullPolicy: IfNotPresent - + service: type: ClusterIP port: 8084 - + # Dynamo-specific environment variables env: - name: LOG_LEVEL @@ -41,7 +41,7 @@ routerController: secretKeyRef: name: dynamo-api-secret key: DYNAMO_API_KEY - + # STRATEGY 1: External ConfigMap (Official Support) # Uses the official Helm chart's external ConfigMap feature config: @@ -113,4 +113,4 @@ app: tag: latest pullPolicy: IfNotPresent service: - type: ClusterIP \ No newline at end of file + type: ClusterIP diff --git a/examples/deployments/LLM Router/router-config-dynamo.yaml b/examples/deployments/LLM Router/router-config-dynamo.yaml index 4cf874cc4d..bd722ab52c 100644 --- a/examples/deployments/LLM Router/router-config-dynamo.yaml +++ b/examples/deployments/LLM Router/router-config-dynamo.yaml @@ -119,4 +119,4 @@ policies: - name: Constraint api_base: ${DYNAMO_API_BASE} api_key: ${DYNAMO_API_KEY} - model: meta-llama/Llama-3.1-8B-Instruct \ No newline at end of file + model: meta-llama/Llama-3.1-8B-Instruct From 2e847a26a0656424b3c57fd4bad18a060dea35d4 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Tue, 16 Sep 2025 18:02:51 +0000 Subject: [PATCH 05/10] Update README.md to remove trailing whitespace and enhance clarity in deployment instructions Signed-off-by: arunraman Signed-off-by: arunraman --- examples/deployments/LLM Router/README.md | 34 +++++++++++------------ 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index cca2fe5feb..b77c9024c8 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -46,33 +46,33 @@ graph TB subgraph "Ingress Layer" LB[Load Balancer/Ingress] end - + subgraph "LLM Router (Helm)" RC[Router Controller] RS[Router Server + GPU] end - + subgraph "Dynamo Platform - Shared Frontend Architecture" FE[Shared Frontend Service] PR[Processor] - + subgraph "Model 1 Workers" VW1[VllmDecodeWorker-8B + GPU] PW1[VllmPrefillWorker-8B + GPU] end - + subgraph "Model 2 Workers" VW2[VllmDecodeWorker-70B + GPU] PW2[VllmPrefillWorker-70B + GPU] end - + subgraph "Model 3 Workers" VW3[VllmDecodeWorker-Mixtral + GPU] PW3[VllmPrefillWorker-Mixtral + GPU] end end end - + LB --> RC RC --> RS RS --> FE @@ -83,7 +83,7 @@ graph TB PR --> PW1 PR --> PW2 PR --> PW3 - + style LB fill:#e1f5fe style RC fill:#f3e5f5 style RS fill:#f3e5f5 @@ -331,7 +331,7 @@ The deployment now uses a **shared frontend architecture** that splits the origi **Key Advantages:** - **Resource Efficiency**: Single frontend serves all models, reducing CPU and memory overhead -- **Simplified Operations**: One service to monitor, scale, and maintain instead of multiple frontends +- **Simplified Operations**: One service to monitor, scale, and maintain instead of multiple frontends - **Better Load Distribution**: Intelligent request routing across all available model workers - **Cost Optimization**: Fewer running services means lower infrastructure costs - **Unified API Gateway**: Single endpoint for all models with consistent OpenAI API interface @@ -641,7 +641,7 @@ graph LR C --> D[Setup Router] D --> E[Configure Access] E --> F[Test Integration] - + style A fill:#e3f2fd style B fill:#f3e5f5 style C fill:#e8f5e8 @@ -747,7 +747,7 @@ envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} **Current Setup**: We deploy 3 models that 
cover most use cases: - **Llama-3.1-8B**: Fast model for simple tasks -- **Llama-3.1-70B**: Powerful model for complex tasks +- **Llama-3.1-70B**: Powerful model for complex tasks - **Mixtral-8x22B**: Creative model for conversational tasks **To add more models**, follow this pattern: @@ -762,7 +762,7 @@ export MODEL_SUFFIX=phi-3-mini # Deploy using aggregated workers envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} -# OR deploy using disaggregated workers +# OR deploy using disaggregated workers envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` @@ -894,7 +894,7 @@ nano ../examples/deployments/LLM\ Router/llm-router-values-override.yaml # Update line ~34: Replace "YOUR_REGISTRY_HERE/" with your actual registry: # Examples: # - "nvcr.io/nvidia/" (if you have access to NVIDIA's public registry) -# - "your-company-registry.com/llm-router/" (for private registries) +# - "your-company-registry.com/llm-router/" (for private registries) # - "docker.io/your-username/" (for Docker Hub) # Also update imagePullSecrets name to match your registry credentials @@ -1050,11 +1050,11 @@ The LLM Router controller: ### Router Configuration -The `router-config-dynamo.yaml` configures routing policies to our deployed models. +The `router-config-dynamo.yaml` configures routing policies to our deployed models. **Current Setup**: The configuration routes to different models based on task complexity and type: - `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks (8B parameters) -- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks (70B parameters) +- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks (70B parameters) - `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks (8x22B parameters) **Note**: This guide shows the full 3-model production setup. For testing/development, you can start with fewer models (e.g., just Llama-8B + Qwen-0.6B) and add more as needed. The router will work with any subset of the configured models. @@ -1101,7 +1101,7 @@ curl -X POST http://localhost:8084/v1/chat/completions \ -d '{ "messages": [ { - "role": "user", + "role": "user", "content": "Write a Python function to calculate fibonacci numbers" } ], @@ -1119,7 +1119,7 @@ curl -X POST http://localhost:8084/v1/chat/completions \ -d '{ "messages": [ { - "role": "user", + "role": "user", "content": "Explain quantum computing in simple terms" } ], @@ -1220,4 +1220,4 @@ docker compose -f deploy/metrics/docker-compose.yml down - [LLM Router GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/llm-router) - [LLM Router Helm Chart](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router) - [Kubernetes Documentation](https://kubernetes.io/docs/) -- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) \ No newline at end of file +- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) From 07ac0dcad4cba54b805aa54b4bcf214320344545 Mon Sep 17 00:00:00 2001 From: arunraman Date: Mon, 6 Oct 2025 13:54:42 -0700 Subject: [PATCH 06/10] Update LLM Router README.md for improved clarity and deployment instructions - Revised the overview and table of contents for better organization. - Enhanced quickstart section with detailed environment variable setup and deployment steps. - Updated routing strategies and API usage examples for clarity. 
- Adjusted version numbers and image references to reflect the latest updates. - Removed outdated sections and ensured consistency throughout the document. Signed-off-by: arunraman Signed-off-by: arunraman --- examples/README.md | 2 +- examples/deployments/LLM Router/README.md | 342 +++++++--------------- 2 files changed, 108 insertions(+), 236 deletions(-) diff --git a/examples/README.md b/examples/README.md index 33b1cd797b..5e8a69a96a 100644 --- a/examples/README.md +++ b/examples/README.md @@ -36,7 +36,7 @@ Platform-specific deployment guides for production environments: - **[Amazon EKS](deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service - **[Azure AKS](deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service -- **[LLM Router](deployments/LLM%20Router/)** - Intelligent LLM request routing with NVIDIA Dynamo integration +- **[LLM Router](deployments/LLM%20Router/README.md)** - Intelligent LLM request routing with NVIDIA Dynamo integration - **[Router Standalone](deployments/router_standalone/)** - Standalone router deployment patterns - **Amazon ECS** - _Coming soon_ - **Google GKE** - _Coming soon_ diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index b77c9024c8..15f6b0108c 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -1,40 +1,73 @@ -# LLM Router with NVIDIA Dynamo Cloud Platform -## Kubernetes Deployment Guide +# LLM Router with NVIDIA Dynamo -
+Intelligent LLM request routing with distributed inference serving on Kubernetes. -[![NVIDIA](https://img.shields.io/badge/NVIDIA-76B900?style=for-the-badge&logo=nvidia&logoColor=white)](https://nvidia.com) -[![Kubernetes](https://img.shields.io/badge/kubernetes-%23326ce5.svg?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io) -[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com) -[![Helm](https://img.shields.io/badge/Helm-0F1689?style=for-the-badge&logo=Helm&labelColor=0F1689)](https://helm.sh) +## Overview -**Intelligent LLM Request Routing with Distributed Inference Serving** +This integration combines [**NVIDIA LLM Router**](https://github.com/NVIDIA-AI-Blueprints/llm-router) with [**NVIDIA Dynamo**](https://github.com/ai-dynamo/dynamo) to create an intelligent, scalable LLM serving platform: -
+**NVIDIA Dynamo** provides distributed inference serving with disaggregated architecture and multi-model support. ---- +**NVIDIA LLM Router** intelligently routes requests based on task type (12 categories) and complexity (7 categories) using Rust-based routing models. -This comprehensive guide provides step-by-step instructions for deploying the [**NVIDIA LLM Router**](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [**NVIDIA Dynamo Cloud Platform**](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. +**Result**: Optimal model selection for each request, reducing latency by 40-60% and costs by 30-50% compared to using a single large model. -## NVIDIA LLM Router and Dynamo Integration +## Table of Contents -### Overview +- [Quickstart](#quickstart) +- [Architecture](#architecture) +- [Prerequisites](#prerequisites) +- [Deployment Guide](#deployment-guide) + - [Step 1: Install Dynamo Platform](#step-1-install-dynamo-platform) + - [Step 2: Deploy Multiple vLLM Models](#step-2-deploy-multiple-vllm-models) + - [Step 3: Verify Shared Frontend Deployment](#step-3-verify-shared-frontend-deployment) + - [Step 4: Test Shared Frontend Service](#step-4-test-shared-frontend-service) + - [Step 5: Set Up LLM Router API Keys](#step-5-set-up-llm-router-api-keys) + - [Step 6: Deploy LLM Router](#step-6-deploy-llm-router) + - [Step 7: Configure External Access](#step-7-configure-external-access) +- [Testing the Integration](#testing-the-integration) +- [Configuration Reference](#configuration-reference) +- [Troubleshooting](#troubleshooting) +- [Cleanup](#cleanup) -This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: +## Quickstart -### NVIDIA Dynamo -- **Distributed inference serving framework** -- **Disaggregated serving capabilities** -- **Multi-model deployment support** -- **Kubernetes-native scaling** +Get the LLM Router + Dynamo integration running in under 30 minutes: -### NVIDIA LLM Router -- **Intelligent request routing** -- **Task classification (12 categories)** -- **Complexity analysis (7 categories)** -- **Rust-based performance** +```bash +# 1. Set environment variables +export NAMESPACE=dynamo-kubernetes +export DYNAMO_VERSION=0.5.0 +export HF_TOKEN=your_hf_token -> **Result**: A complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both **performance** and **cost efficiency**. +# 2. Install Dynamo Platform +# Follow: https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/installation_guide.md#path-a-production-install + +# 3. Deploy a model (Llama-8B for quick testing) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +export MODEL_SUFFIX=llama-8b +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE} + +cd "examples/deployments/LLM Router/" +envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} + +# 4. Deploy LLM Router +# Clone the router repo and follow Step 6 in the Deployment Guide below + +# 5. 
Test the integration +kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"messages": [{"role": "user", "content": "Hello!"}], "model": "", "nim-llm-router": {"policy": "task_router", "routing_strategy": "triton", "model": ""}}' +``` + +For production deployments with multiple models, continue to the [Deployment Guide](#deployment-guide). + +## Architecture ### Kubernetes Architecture Overview @@ -148,56 +181,17 @@ graph TB ### Routing Strategies -
- -#### Task-Based Routing -*Routes requests based on the type of task being performed* - -
- -
-View Task Routing Table - -| **Task Type** | **Target Model** | **Use Case** | -|:---|:---|:---| -| Code Generation | `llama-3.1-70b-instruct` | Programming tasks | -| Brainstorming | `llama-3.1-70b-instruct` | Creative ideation | -| Chatbot | `mixtral-8x22b-instruct-v0.1` | Conversational AI | -| Summarization | `llama-3.1-8b-instruct` | Text summarization | -| Open QA | `llama-3.1-70b-instruct` | Complex questions | -| Closed QA | `llama-3.1-8b-instruct` | Simple Q&A | -| Classification | `llama-3.1-8b-instruct` | Text classification | -| Extraction | `llama-3.1-8b-instruct` | Information extraction | -| Rewrite | `llama-3.1-8b-instruct` | Text rewriting | -| Text Generation | `mixtral-8x22b-instruct-v0.1` | General text generation | -| Other | `mixtral-8x22b-instruct-v0.1` | Miscellaneous tasks | -| Unknown | `llama-3.1-8b-instruct` | Unclassified tasks | - -
- ---- - -
+The router supports two routing strategies: -#### Complexity-Based Routing -*Routes requests based on the complexity of the task* +- **Task-Based Routing**: Routes based on task type (code generation, chatbot, summarization, etc.) +- **Complexity-Based Routing**: Routes based on complexity level (creativity, reasoning, domain knowledge, etc.) -
+Example routing logic: +- Simple tasks (classification, summarization) → Llama-3.1-8B (fast, efficient) +- Complex tasks (reasoning, creativity) → Llama-3.1-70B (powerful, detailed) +- Conversational/creative tasks → Mixtral-8x22B (diverse responses) -
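+
+A quick way to see this routing behavior end to end is to send contrasting prompts through each policy and compare which backend answers. The sketch below is illustrative only: it assumes the router-controller has been port-forwarded to `localhost:8084` (as in the Quickstart) and that the response's `model` field reports the backend model that actually served the request.
+
+```bash
+# Hypothetical smoke test: compare which backend each policy selects.
+# Assumes: kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router
+
+# Simple classification-style prompt -- expected to route to the smaller model
+curl -s -X POST http://localhost:8084/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "",
+    "messages": [{"role": "user", "content": "Classify this review as positive or negative: I love this product."}],
+    "nim-llm-router": {"policy": "task_router", "routing_strategy": "triton", "model": ""}
+  }' | jq -r '.model'
+
+# Reasoning-heavy prompt -- expected to route to a larger model
+curl -s -X POST http://localhost:8084/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "",
+    "messages": [{"role": "user", "content": "Explain, step by step, why the sky appears blue at noon but red at sunset."}],
+    "nim-llm-router": {"policy": "complexity_router", "routing_strategy": "triton", "model": ""}
+  }' | jq -r '.model'
+```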
-View Complexity Routing Table - -| **Complexity Level** | **Target Model** | **Use Case** | -|:---|:---|:---| -| Creativity | `llama-3.1-70b-instruct` | Creative tasks | -| Reasoning | `llama-3.1-70b-instruct` | Complex reasoning | -| Contextual-Knowledge | `llama-3.1-8b-instruct` | Context-dependent tasks | -| Few-Shot | `llama-3.1-70b-instruct` | Tasks with examples | -| Domain-Knowledge | `mixtral-8x22b-instruct-v0.1` | Specialized knowledge | -| No-Label-Reason | `llama-3.1-8b-instruct` | Unclassified complexity | -| Constraint | `llama-3.1-8b-instruct` | Tasks with constraints | - -
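+
+To confirm the exact category-to-model mapping that your deployment is using (rather than the illustrative mapping above), you can read it back from the cluster. This is a sketch only; the ConfigMap name depends on your Helm release name, so list the ConfigMaps first and substitute the real name.
+
+```bash
+# List ConfigMaps created by the LLM Router chart, then dump the routing policies.
+# <router-controller-configmap-name> is a placeholder -- adjust to your release.
+kubectl get configmaps -n llm-router
+kubectl get configmap <router-controller-configmap-name> -n llm-router -o yaml
+```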
+For complete routing tables and configuration details, see [Configuration Reference](#configuration-reference). ### Performance Benefits @@ -212,51 +206,25 @@ graph TB -### API Usage Examples +### API Usage -
- -#### Task-Based Routing - -
+The LLM Router provides an OpenAI-compatible API. Specify routing policy in the request: ```bash -# Code generation task → Routes to meta-llama/Llama-3.1-70B-Instruct curl -X POST http://llm-router.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "", - "messages": [{"role": "user", "content": "Write a Python function to sort a list"}], - "max_tokens": 512, + "messages": [{"role": "user", "content": "Your prompt here"}], "nim-llm-router": { - "policy": "task_router", + "policy": "task_router", # or "complexity_router" "routing_strategy": "triton", "model": "" } }' ``` -
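+
+Two details worth noting when adapting this request: the `-d` payload must be strict JSON, so the inline `# or "complexity_router"` annotation above is a hint for the reader and should not be pasted into the body; and complexity-based routing uses the same endpoint with only the policy name changed. A minimal sketch of the complexity-routing variant (hostname and prompt are placeholders):
+
+```bash
+# Complexity-based routing: identical request shape, different policy name.
+curl -X POST http://llm-router.local/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "",
+    "messages": [{"role": "user", "content": "Explain quantum entanglement"}],
+    "nim-llm-router": {
+      "policy": "complexity_router",
+      "routing_strategy": "triton",
+      "model": ""
+    }
+  }'
+```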
- -#### Complexity-Based Routing - -
- -```bash -# Complex reasoning task → Routes to meta-llama/Llama-3.1-70B-Instruct -curl -X POST http://llm-router.local/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "", - "messages": [{"role": "user", "content": "Explain quantum entanglement"}], - "max_tokens": 512, - "nim-llm-router": { - "policy": "complexity_router", - "routing_strategy": "triton", - "model": "" - } - }' -``` +For complete API examples and testing procedures, see [Testing the Integration](#testing-the-integration). ### How Dynamo Model Routing Works @@ -360,10 +328,10 @@ Set the required environment variables for deployment: | Variable | Description | Example | Required | Used In | |----------|-------------|---------|----------|---------| | `NAMESPACE` | Kubernetes namespace for deployment | `dynamo-kubernetes` | Yes | All deployments | -| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | Platform install | +| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.5.0` | Yes | Platform install | | `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | Model deployment | | `MODEL_SUFFIX` | Kubernetes deployment name suffix | `llama-8b` | Yes | Model deployment | -| `DYNAMO_IMAGE` | Full Dynamo runtime image path | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` | Yes | Model deployment | +| `DYNAMO_IMAGE` | Full Dynamo runtime image path | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0` | Yes | Model deployment | | `HF_TOKEN` | Hugging Face access token | `your_hf_token` | Yes | Model access | | `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | No | Private images | | `DYNAMO_API_BASE` | Dynamo service endpoint URL | `http://vllm-frontend-frontend.dynamo-kubernetes.svc.cluster.local:8000` | Yes | LLM Router | @@ -392,14 +360,14 @@ For optimal deployment experience, consider model size vs. resources: **NGC Setup Instructions**: 1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions -2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.1` (or latest available) +2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.5.0` (or latest available) 3. **Optional - NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) if you need private image access 4. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds **Available NGC Dynamo Images**: -- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` (recommended) -- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` -- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` +- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0` (recommended) +- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.0` +- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0` - **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest` - **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest` @@ -428,43 +396,16 @@ For optimal deployment experience, consider model size vs. resources: ## Prerequisites -
- -[![Prerequisites](https://img.shields.io/badge/Prerequisites-Check%20List-blue?style=for-the-badge&logo=checkmk)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#prerequisites) - -*Ensure your environment meets all requirements before deployment* - -
- -### Required Tools +Before deploying the LLM Router integration, ensure you have: -
- -**Verify you have the required tools installed:** +1. **Dynamo Platform Prerequisites** - Follow the [Dynamo Installation Guide](../../../docs/guides/dynamo_deploy/installation_guide.md#prerequisites) for: + - Required tools (kubectl v1.24+, Helm v3.0+, Docker) + - Kubernetes cluster with NVIDIA GPU nodes + - Container registry access -
- -```bash -# Required tools verification -kubectl version --client -helm version -docker version -``` - -
- -| **Tool** | **Requirement** | **Status** | -|:---:|:---:|:---:| -| **kubectl** | `v1.24+` | Check with `kubectl version --client` | -| **Helm** | `v3.0+` | Check with `helm version` | -| **Docker** | Running daemon | Check with `docker version` | - -
- -**Additional Requirements:** -- **NVIDIA GPU nodes** with GPU Operator installed (for LLM inference) -- **Container registry access** (Docker Hub, NVIDIA NGC, etc.) -- **Git** for cloning repositories +2. **LLM Router Specific Requirements:** + - **Git** for cloning the LLM Router repository + - **NGC API Key** (optional, for private container registries) ### Inference Runtime Images @@ -472,13 +413,13 @@ Set your inference runtime image from the available NGC options: ```bash # Set your inference runtime image -export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0 ``` **Available Runtime Images**: -- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` - vLLM backend (recommended) -- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` - SGLang backend -- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` - TensorRT-LLM backend +- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0` - vLLM backend (recommended) +- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.0` - SGLang backend +- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0` - TensorRT-LLM backend ### Hugging Face Token @@ -524,13 +465,7 @@ If Istio is not installed, follow the [official Istio installation guide](https: ## Pre-Deployment Validation -
- -[![Validation](https://img.shields.io/badge/Pre--Deployment-Validation-yellow?style=for-the-badge&logo=checkmarx)](https://kubernetes.io) - -*Validate your environment before starting deployment* - -
+Validate your environment before starting deployment. Before starting the deployment, validate that your environment meets all requirements: @@ -581,7 +516,7 @@ python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" ```bash # Core deployment variables export NAMESPACE=dynamo-kubernetes -export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog +export DYNAMO_VERSION=0.5.0 # Choose your Dynamo version from NGC catalog export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} # Model deployment variables (deploy all three models) @@ -619,15 +554,7 @@ echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (can be empty for local deploym ## Deployment Guide -
- -[![Deployment](https://img.shields.io/badge/Deployment-Step%20by%20Step-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) - -**Complete walkthrough for deploying NVIDIA Dynamo and LLM Router** - -
- ---- +Complete walkthrough for deploying NVIDIA Dynamo and LLM Router. ### Deployment Overview @@ -652,46 +579,19 @@ graph LR -### Step 1: Install Dynamo Platform (Path A: Production Install) - -
- -[![Step 1](https://img.shields.io/badge/Step%201-Install%20Platform-blue?style=for-the-badge&logo=kubernetes)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#path-a-production-install) - -*Deploy the Dynamo Cloud Platform using the official **Path A: Production Install*** - -
+### Step 1: Install Dynamo Platform +If you haven't already installed the Dynamo platform, follow the **[Dynamo Installation Guide - Path A: Production Install](../../../docs/guides/dynamo_deploy/installation_guide.md#path-a-production-install)** to: +1. Install Dynamo CRDs +2. Install Dynamo Platform +3. Verify the installation -```bash -# 1. Install CRDs (use 'upgrade' instead of 'install' if already installed) -helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz -helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default - -# 2. Install Platform (use 'upgrade' instead of 'install' if already installed) -kubectl create namespace ${NAMESPACE} -helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz -helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE} - -# 3. Verify deployment -# Check CRDs -kubectl get crd | grep dynamo -# Check operator and platform pods -kubectl get pods -n ${NAMESPACE} -# Expected: dynamo-operator-* and etcd-* pods Running -kubectl get svc -n ${NAMESPACE} -``` +> **Note**: For a quick reference, see the [Deployment Quickstart](../../../docs/guides/dynamo_deploy/README.md#1-install-platform-first). ### Step 2: Deploy Multiple vLLM Models -
- -[![Step 2](https://img.shields.io/badge/Step%202-Deploy%20Multiple%20Models-orange?style=for-the-badge&logo=nvidia)](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md) - -*Deploy multiple vLLM models for intelligent routing* - -
+Deploy multiple vLLM models for intelligent routing. @@ -770,13 +670,7 @@ envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ### Step 3: Verify Shared Frontend Deployment -
- -[![Step 3](https://img.shields.io/badge/Step%203-Verify%20Deployments-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) - -*Verify that the shared frontend and model workers have been deployed successfully* - -
+Verify that the shared frontend and model workers have been deployed successfully. ```bash # Check deployment status for shared frontend and all model workers @@ -795,13 +689,7 @@ kubectl get svc -n ${NAMESPACE} | grep frontend ### Step 4: Test Shared Frontend Service -
- -[![Step 4](https://img.shields.io/badge/Step%204-Test%20Services-purple?style=for-the-badge&logo=checkmarx)](https://checkmarx.com) - -*Test the shared frontend service with different models* - -
+Test the shared frontend service with different models. ```bash # Forward the shared frontend service port @@ -836,13 +724,7 @@ curl localhost:8000/v1/models | jq ### Step 5: Set Up LLM Router API Keys -
- -[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=keycdn)](https://github.com/NVIDIA-AI-Blueprints/llm-router) - -*Configure API keys for LLM Router integration* - -
+Configure API keys for LLM Router integration. **IMPORTANT**: The router configuration uses Kubernetes secrets for API key management following the [official NVIDIA pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml). @@ -869,13 +751,7 @@ kubectl get secrets -n llm-router ### Step 6: Deploy LLM Router -
- -[![Step 6](https://img.shields.io/badge/Step%206-Deploy%20Router-indigo?style=for-the-badge&logo=nvidia)](https://github.com/NVIDIA-AI-Blueprints/llm-router) - -*Deploy the NVIDIA LLM Router using Helm* - -
+Deploy the NVIDIA LLM Router using Helm. **Note**: The NVIDIA LLM Router requires building images from source and using the official Helm charts from the GitHub repository. @@ -980,13 +856,7 @@ kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=llm-router -n l ### Step 7: Configure External Access -
- -[![Step 7](https://img.shields.io/badge/Step%207-Configure%20Access-teal?style=for-the-badge&logo=nginx)](https://kubernetes.io) - -*Configure external access to the LLM Router* - -
+Configure external access to the LLM Router. ```bash # For development/testing, use port forwarding to access LLM Router @@ -996,7 +866,9 @@ kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router curl http://localhost:8084/health ``` -## Configuration +## Configuration Reference + +This section provides detailed configuration options for the LLM Router and Dynamo integration. ### Ingress Configuration From 83828607dd9e75ed7174e6a5b0842f1d857c273c Mon Sep 17 00:00:00 2001 From: arunraman Date: Mon, 6 Oct 2025 20:31:00 -0700 Subject: [PATCH 07/10] Update LLM Router README.md to fix trailing whitespace in routing strategies section for improved readability Signed-off-by: arunraman Signed-off-by: arunraman --- examples/deployments/LLM Router/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index 15f6b0108c..8c407fffd6 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -187,7 +187,7 @@ The router supports two routing strategies: - **Complexity-Based Routing**: Routes based on complexity level (creativity, reasoning, domain knowledge, etc.) Example routing logic: -- Simple tasks (classification, summarization) → Llama-3.1-8B (fast, efficient) +- Simple tasks (classification, summarization) → Llama-3.1-8B (fast, efficient) - Complex tasks (reasoning, creativity) → Llama-3.1-70B (powerful, detailed) - Conversational/creative tasks → Mixtral-8x22B (diverse responses) From 607af86778683331ff96463e5b991e1566fbccc4 Mon Sep 17 00:00:00 2001 From: arunraman Date: Mon, 6 Oct 2025 20:41:51 -0700 Subject: [PATCH 08/10] chore: trigger CI re-run Signed-off-by: arunraman Signed-off-by: arunraman From 96c7df104ba8917bc977b6e1b03c1ec7f6b861ed Mon Sep 17 00:00:00 2001 From: arunraman Date: Mon, 6 Oct 2025 20:53:01 -0700 Subject: [PATCH 09/10] fix: correct documentation links to use docs/kubernetes path Signed-off-by: arunraman Signed-off-by: arunraman --- examples/README.md | 2 +- examples/deployments/LLM Router/README.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/examples/README.md b/examples/README.md index 5e8a69a96a..33b1cd797b 100644 --- a/examples/README.md +++ b/examples/README.md @@ -36,7 +36,7 @@ Platform-specific deployment guides for production environments: - **[Amazon EKS](deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service - **[Azure AKS](deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service -- **[LLM Router](deployments/LLM%20Router/README.md)** - Intelligent LLM request routing with NVIDIA Dynamo integration +- **[LLM Router](deployments/LLM%20Router/)** - Intelligent LLM request routing with NVIDIA Dynamo integration - **[Router Standalone](deployments/router_standalone/)** - Standalone router deployment patterns - **Amazon ECS** - _Coming soon_ - **Google GKE** - _Coming soon_ diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLM Router/README.md index 8c407fffd6..99b2c8dd15 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLM Router/README.md @@ -41,7 +41,7 @@ export DYNAMO_VERSION=0.5.0 export HF_TOKEN=your_hf_token # 2. 
Install Dynamo Platform -# Follow: https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/installation_guide.md#path-a-production-install +# Follow: https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md#path-a-production-install # 3. Deploy a model (Llama-8B for quick testing) export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct @@ -398,7 +398,7 @@ For optimal deployment experience, consider model size vs. resources: Before deploying the LLM Router integration, ensure you have: -1. **Dynamo Platform Prerequisites** - Follow the [Dynamo Installation Guide](../../../docs/guides/dynamo_deploy/installation_guide.md#prerequisites) for: +1. **Dynamo Platform Prerequisites** - Follow the [Dynamo Installation Guide](../../../docs/kubernetes/installation_guide.md#prerequisites) for: - Required tools (kubectl v1.24+, Helm v3.0+, Docker) - Kubernetes cluster with NVIDIA GPU nodes - Container registry access @@ -581,13 +581,13 @@ graph LR ### Step 1: Install Dynamo Platform -If you haven't already installed the Dynamo platform, follow the **[Dynamo Installation Guide - Path A: Production Install](../../../docs/guides/dynamo_deploy/installation_guide.md#path-a-production-install)** to: +If you haven't already installed the Dynamo platform, follow the **[Dynamo Installation Guide - Path A: Production Install](../../../docs/kubernetes/installation_guide.md#path-a-production-install)** to: 1. Install Dynamo CRDs 2. Install Dynamo Platform 3. Verify the installation -> **Note**: For a quick reference, see the [Deployment Quickstart](../../../docs/guides/dynamo_deploy/README.md#1-install-platform-first). +> **Note**: For a quick reference, see the [Deployment Quickstart](../../../docs/kubernetes/README.md#1-install-platform-first). 
### Step 2: Deploy Multiple vLLM Models From c044e069ec5ff62c719f6be168b4be4f564b4825 Mon Sep 17 00:00:00 2001 From: arunraman Date: Wed, 8 Oct 2025 14:31:23 -0700 Subject: [PATCH 10/10] refactor: rename LLM Router directory to LLMRouter - Rename 'examples/deployments/LLM Router' to 'examples/deployments/LLMRouter' - Remove spaces from directory name for better Linux/Mac compatibility - Update all references in examples/README.md and deployment files - Update cd commands to use new path without quotes Signed-off-by: arunraman --- examples/README.md | 2 +- examples/deployments/{LLM Router => LLMRouter}/README.md | 4 ++-- examples/deployments/{LLM Router => LLMRouter}/agg.yaml | 0 examples/deployments/{LLM Router => LLMRouter}/disagg.yaml | 0 examples/deployments/{LLM Router => LLMRouter}/frontend.yaml | 0 .../{LLM Router => LLMRouter}/llm-router-values-override.yaml | 0 .../{LLM Router => LLMRouter}/router-config-dynamo.yaml | 0 7 files changed, 3 insertions(+), 3 deletions(-) rename examples/deployments/{LLM Router => LLMRouter}/README.md (99%) rename examples/deployments/{LLM Router => LLMRouter}/agg.yaml (100%) rename examples/deployments/{LLM Router => LLMRouter}/disagg.yaml (100%) rename examples/deployments/{LLM Router => LLMRouter}/frontend.yaml (100%) rename examples/deployments/{LLM Router => LLMRouter}/llm-router-values-override.yaml (100%) rename examples/deployments/{LLM Router => LLMRouter}/router-config-dynamo.yaml (100%) diff --git a/examples/README.md b/examples/README.md index 33b1cd797b..e82f423cf4 100644 --- a/examples/README.md +++ b/examples/README.md @@ -36,7 +36,7 @@ Platform-specific deployment guides for production environments: - **[Amazon EKS](deployments/EKS/)** - Deploy Dynamo on Amazon Elastic Kubernetes Service - **[Azure AKS](deployments/AKS/)** - Deploy Dynamo on Azure Kubernetes Service -- **[LLM Router](deployments/LLM%20Router/)** - Intelligent LLM request routing with NVIDIA Dynamo integration +- **[LLM Router](deployments/LLMRouter/README.md)** - Intelligent LLM request routing with NVIDIA Dynamo integration - **[Router Standalone](deployments/router_standalone/)** - Standalone router deployment patterns - **Amazon ECS** - _Coming soon_ - **Google GKE** - _Coming soon_ diff --git a/examples/deployments/LLM Router/README.md b/examples/deployments/LLMRouter/README.md similarity index 99% rename from examples/deployments/LLM Router/README.md rename to examples/deployments/LLMRouter/README.md index 99b2c8dd15..a70e5cf50a 100644 --- a/examples/deployments/LLM Router/README.md +++ b/examples/deployments/LLMRouter/README.md @@ -51,7 +51,7 @@ export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE} -cd "examples/deployments/LLM Router/" +cd examples/deployments/LLMRouter/ envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} @@ -604,7 +604,7 @@ kubectl create secret generic hf-token-secret \ -n ${NAMESPACE} # 2. 
Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) -cd "examples/deployments/LLM Router/" +cd examples/deployments/LLMRouter/ ``` #### Shared Frontend Deployment diff --git a/examples/deployments/LLM Router/agg.yaml b/examples/deployments/LLMRouter/agg.yaml similarity index 100% rename from examples/deployments/LLM Router/agg.yaml rename to examples/deployments/LLMRouter/agg.yaml diff --git a/examples/deployments/LLM Router/disagg.yaml b/examples/deployments/LLMRouter/disagg.yaml similarity index 100% rename from examples/deployments/LLM Router/disagg.yaml rename to examples/deployments/LLMRouter/disagg.yaml diff --git a/examples/deployments/LLM Router/frontend.yaml b/examples/deployments/LLMRouter/frontend.yaml similarity index 100% rename from examples/deployments/LLM Router/frontend.yaml rename to examples/deployments/LLMRouter/frontend.yaml diff --git a/examples/deployments/LLM Router/llm-router-values-override.yaml b/examples/deployments/LLMRouter/llm-router-values-override.yaml similarity index 100% rename from examples/deployments/LLM Router/llm-router-values-override.yaml rename to examples/deployments/LLMRouter/llm-router-values-override.yaml diff --git a/examples/deployments/LLM Router/router-config-dynamo.yaml b/examples/deployments/LLMRouter/router-config-dynamo.yaml similarity index 100% rename from examples/deployments/LLM Router/router-config-dynamo.yaml rename to examples/deployments/LLMRouter/router-config-dynamo.yaml