diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md index 10535d71d9..eddd65e4c4 100644 --- a/components/backends/sglang/README.md +++ b/components/backends/sglang/README.md @@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | **GB200 Support** | ✅ | | -## Quick Start +## SGLang Quick Start Below we provide a guide that lets you run all of our common deployment patterns on a single node. diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md index 7de2c3e610..f51043d274 100644 --- a/components/backends/trtllm/README.md +++ b/components/backends/trtllm/README.md @@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | **DP Rank Routing**| ✅ | | | **GB200 Support** | ✅ | | -## Quick Start +## TensorRT-LLM Quick Start Below we provide a guide that lets you run all of our common deployment patterns on a single node. diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md index 593c20aec2..00fd12925d 100644 --- a/components/backends/vllm/README.md +++ b/components/backends/vllm/README.md @@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | **DP Rank Routing**| ✅ | Supported via external control of DP ranks | | **GB200 Support** | 🚧 | Container functional on main | -## Quick Start +## vLLM Quick Start Below we provide a guide that lets you run all of our common deployment patterns on a single node. diff --git a/deploy/metrics/k8s/README.md b/deploy/metrics/k8s/README.md index 841cf251d7..0e1fcb6368 100644 --- a/deploy/metrics/k8s/README.md +++ b/deploy/metrics/k8s/README.md @@ -1,3 +1,3 @@ # Dynamo Metrics Collection on Kubernetes -For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/deploy/k8s_metrics.md](../../../docs/guides/deploy/k8s_metrics.md). +For detailed documentation on collecting and visualizing metrics on Kubernetes, see [docs/guides/dynamo_deploy/k8s_metrics.md](../../../docs/guides/dynamo_deploy/k8s_metrics.md). diff --git a/docs/architecture/architecture.md b/docs/architecture/architecture.md index 7b20555a18..6ef00b30e6 100644 --- a/docs/architecture/architecture.md +++ b/docs/architecture/architecture.md @@ -48,7 +48,7 @@ There are multi-faceted challenges: To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access. -## High level architecture and key benefits +## Key benefits The following diagram outlines Dynamo's high-level architecture.
To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: diff --git a/docs/architecture/sla_planner.md b/docs/architecture/sla_planner.md index 43e56f3345..532beadefe 100644 --- a/docs/architecture/sla_planner.md +++ b/docs/architecture/sla_planner.md @@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy * **Performance interpolation**: Leverages profiling results data from pre-deployment profiling for accurate scaling decisions * **Correction factors**: Adapts to real-world performance deviations from profiled data -## Architecture +## Design The SLA planner consists of several key components: @@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md). -**Quick Start:** +**To deploy SLA Planner:** ```bash cd components/backends/vllm/deploy kubectl apply -f disagg_planner.yaml -n ${NAMESPACE} ``` diff --git a/docs/components/router/README.md b/docs/components/router/README.md index b2f5d7b61b..b891ad2b19 100644 --- a/docs/components/router/README.md +++ b/docs/components/router/README.md @@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0 The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. -## Quick Start +## KV Router Quick Start To launch the Dynamo frontend with the KV Router: diff --git a/docs/guides/dynamo_deploy/README.md b/docs/guides/dynamo_deploy/README.md index eb4cd7a7ca..090c4ed983 100644 --- a/docs/guides/dynamo_deploy/README.md +++ b/docs/guides/dynamo_deploy/README.md @@ -17,85 +17,130 @@ limitations under the License. # Deploying Inference Graphs to Kubernetes - We expect users to deploy their inference graphs using CRDs or helm charts. +High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides. -# 1. Install Dynamo Cloud. +## 1. Install Platform First +**[Dynamo Kubernetes Platform](dynamo_cloud.md)** - Main installation guide with 3 paths -Prior to deploying an inference graph the user should deploy the Dynamo Cloud Platform. Reference the [Quickstart Guide](quickstart.md) for steps to install Dynamo Cloud with Helm. +## 2. Choose Your Backend -Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you. This is a one-time action, only necessary the first time you deploy a DynamoGraph. +Each backend has deployment examples and configuration options: -# 2. Deploy your inference graph.
+| Backend | Available Configurations | +|---------|--------------------------| +| **[vLLM](../../../components/backends/vllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner | +| **[SGLang](../../../components/backends/sglang/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node | +| **[TensorRT-LLM](../../../components/backends/trtllm/deploy/README.md)** | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router | -We provide a Custom Resource YAML file for many examples under the components/backends/{engine}/deploy folders. Consult the examples below for the CRs for a specific inference backend. +## 3. Deploy Your First Model -[View SGLang K8s](../../../components/backends/sglang/deploy/README.md) - -[View vLLM K8s](../../../components/backends/vllm/deploy/README.md) +```bash +# Set same namespace from platform install +export NAMESPACE=dynamo-cloud -[View TRT-LLM K8s](../../../components/backends/trtllm/deploy/README.md) +# Deploy any example (this uses vLLM with Qwen model using aggregated serving) +kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} -### Deploying a particular example +# Check status +kubectl get dynamoGraphDeployment -n ${NAMESPACE} -```bash -# Set your dynamo root directory -cd -export PROJECT_ROOT=$(pwd) -export NAMESPACE= # the namespace you used to deploy Dynamo cloud to. +# Test it +kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} +curl http://localhost:8000/v1/models ``` -Deploying an example consists of the simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example: +## What's a DynamoGraphDeployment? -```bash -kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} -``` +It's a Kubernetes Custom Resource that defines your inference pipeline: +- Model configuration +- Resource allocation (GPUs, memory) +- Scaling policies +- Frontend/backend connections -You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployment. -You can use `kubectl delete dynamoGraphDeployment -n ${NAMESPACE}` to delete the deployment. +The scripts in the `components//launch` folder like `agg.sh` demonstrate how you can serve your models locally. The corresponding YAML files like `agg.yaml` show you how you could create a kubernetes deployment for your inference graph. -We provide a Custom Resource YAML file for many examples under the `deploy/` folder. -Use [VLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example. +### Choosing Your Architecture Pattern -**Note 1** Example Image +When creating a deployment, select the architecture pattern that best fits your use case: -The examples use a prebuilt image from the `nvcr.io` registry. -You can utilize public images from [Dynamo NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo) or build your own image and update the image location in your CR file prior to applying. Either way, you will need to overwrite the image in the example YAML. 
+- **Development / Testing** - Use `agg.yaml` as the base configuration +- **Production with Load Balancing** - Use `agg_router.yaml` to enable scalable, load-balanced inference +- **High Performance / Disaggregated** - Use `disagg_router.yaml` for maximum throughput and modular scalability -To build your own image: +### Frontend and Worker Components -```bash -./container/build.sh --framework -``` +You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that: -For example for the `sglang` run -```bash -./container/build.sh --framework sglang -``` +- Provides OpenAI-compatible `/v1/chat/completions` endpoint +- Auto-discovers backend workers via etcd +- Routes requests and handles load balancing +- Validates and preprocesses requests -To overwrite the image in the example: +### Customizing Your Deployment -```bash -extraPodSpec: +Example structure: +```yaml +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: my-llm +spec: + services: + Frontend: + dynamoNamespace: my-llm + componentType: frontend + replicas: 1 + extraPodSpec: mainContainer: - image: + image: your-image + VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker + dynamoNamespace: dynamo-dev + componentType: worker + replicas: 1 + envFromSecret: hf-token-secret # for HuggingFace models + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: your-image + command: ["/bin/sh", "-c"] + args: + - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags] ``` -**Note 2** -Setup port forward if needed when deploying to Kubernetes. - -List the services in your namespace: - -```bash -kubectl get svc -n ${NAMESPACE} +Worker command examples per backend: +```yaml +# vLLM worker +args: + - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B + +# SGLang worker +args: + - >- + python3 -m dynamo.sglang + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --tp 1 + --trust-remote-code + +# TensorRT-LLM worker +args: + - python3 -m dynamo.trtllm + --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B + --extra-engine-args engine_configs/agg.yaml ``` -Look for one that ends in `-frontend` and use it for port forward. 
-```bash -SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1) -kubectl port-forward svc/${SERVICE_NAME}-frontend 8080:8080 -n ${NAMESPACE} -``` +Key customization points include: +- **Model Configuration**: Specify model in the args command +- **Resource Allocation**: Configure GPU requirements under `resources.limits` +- **Scaling**: Set `replicas` for number of worker instances +- **Routing Mode**: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in Frontend envs +- **Worker Specialization**: Add `--is-prefill-worker` flag for disaggregated prefill workers -Additional Resources: -- [Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/) -- [Examples Deployment Guide](../../examples/README.md#deploying-a-particular-example) +## Additional Resources +- **[Examples](../../examples/README.md)** - Complete working examples +- **[Create Custom Deployments](create_deployment.md)** - Build your own CRDs +- **[Operator Documentation](dynamo_operator.md)** - How the platform works +- **[Helm Charts](../../../deploy/helm/README.md)** - For advanced users \ No newline at end of file diff --git a/docs/guides/dynamo_deploy/dynamo_cloud.md b/docs/guides/dynamo_deploy/dynamo_cloud.md index 3c549b514d..0264e6d056 100644 --- a/docs/guides/dynamo_deploy/dynamo_cloud.md +++ b/docs/guides/dynamo_deploy/dynamo_cloud.md @@ -15,102 +15,167 @@ See the License for the specific language governing permissions and limitations under the License. --> -# Dynamo Cloud Kubernetes Platform +# Dynamo Kubernetes Platform -The Dynamo Cloud platform is a comprehensive solution for deploying and managing Dynamo inference graphs (also referred to as pipelines) in Kubernetes environments. It provides a streamlined experience for deploying, scaling, and monitoring your inference services. +Deploy and manage Dynamo inference graphs on Kubernetes with automated orchestration and scaling, using the Dynamo Kubernetes Platform. -## Overview +## Quick Start Paths -The Dynamo cloud platform consists of several key components: +**Path A: Production Install** +Install from published artifacts on your existing cluster → [Jump to Path A](#path-a-production-install) -- **Dynamo Operator**: A Kubernetes operator that manages the lifecycle of Dynamo inference graphs from build ➡️ deploy. For more information on the operator, see [Dynamo Kubernetes Operator Documentation](../dynamo_deploy/dynamo_operator.md) -- **Custom Resources**: Kubernetes custom resources for defining and managing Dynamo services +**Path B: Local Development** +Set up Minikube first → [Minikube Setup](minikube.md) → Then follow Path A +**Path C: Custom Development** +Build from source for customization → [Jump to Path C](#path-c-custom-development) -## Deployment Prerequisites - -Before getting started with the Dynamo cloud platform, ensure you have: - -- A Kubernetes cluster (version 1.24 or later) -- [Earthly](https://earthly.dev/) installed for building components -- Docker installed and running -- Access to a container registry (e.g., Docker Hub, NVIDIA NGC, etc.) 
-- `kubectl` configured to access your cluster -- Helm installed (version 3.0 or later) +## Prerequisites +```bash +# Required tools +kubectl version --client # v1.24+ +helm version # v3.0+ +docker version # Running daemon + +# Set your inference runtime image +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 +# Also available: sglang-runtime, tensorrtllm-runtime +``` > [!TIP] -> Don't have a Kubernetes cluster? Check out our [Minikube setup guide](../../../docs/guides/dynamo_deploy/minikube.md) to set up a local environment! 🏠 +> No cluster? See [Minikube Setup](minikube.md) for local development. -#### 🏗️ Build Dynamo inference runtime. +## Path A: Production Install -[One-time Action] -Before you could use Dynamo make sure you have setup the Inference Runtime Image. -For basic cases you could use the prebuilt image for the Dynamo Inference Runtime. -Just export the environment variable. This will be the image used by your individual components. You pick whatever dynamo version you want or use the latest (default) +Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts) in 3 steps. ```bash -export DYNAMO_IMAGE=nvcr.io/nvidia/dynamo:latest-vllm +# 1. Set environment +export NAMESPACE=dynamo-kubernetes +export RELEASE_VERSION=0.4.0 # any version of Dynamo 0.3.2+ + +# 2. Install CRDs +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default + +# 3. Install Platform +kubectl create namespace ${NAMESPACE} +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} ``` -For a custom setup build and push to your registry Dynamo Base Image for Dynamo inference runtime. This is a one-time operation. +→ [Verify Installation](#verify-installation) -```bash -# Run the script to build the default dynamo:latest-vllm image. -./container/build.sh -export IMAGE_TAG= -# Tag the image -docker tag dynamo:latest-vllm /dynamo:${IMAGE_TAG} -docker push /dynamo:${IMAGE_TAG} -``` +## Path C: Custom Development -## 🚀 Deploying the Dynamo Cloud Platform +Build and deploy from source for customization. -## Prerequisites +### Quick Deploy Script + +```bash +# 1. Set environment +export NAMESPACE=dynamo-cloud +export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/ # or your registry +export DOCKER_USERNAME='$oauthtoken' +export DOCKER_PASSWORD= +export IMAGE_TAG=0.4.0 + +# 2. Build operator +cd deploy/cloud/operator +earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG +cd - + +# 3. Create namespace and secrets +kubectl create namespace ${NAMESPACE} +kubectl create secret docker-registry docker-imagepullsecret \ + --docker-server=${DOCKER_SERVER} \ + --docker-username=${DOCKER_USERNAME} \ + --docker-password=${DOCKER_PASSWORD} \ + --namespace=${NAMESPACE} + +# 4. Deploy +helm repo add bitnami https://charts.bitnami.com/bitnami +./deploy.sh --crds +``` -Before deploying Dynamo Cloud, ensure your Kubernetes cluster meets the following requirements: +### Manual Steps (Alternative) -#### 1. 🛡️ Istio Installation -Dynamo Cloud requires Istio for service mesh capabilities. Verify Istio is installed and running: +
+Click to expand manual installation steps +**Step 1: Install CRDs** ```bash -# Check if Istio is installed -kubectl get pods -n istio-system +helm install dynamo-crds ./crds/ --namespace default +``` -# Expected output should show running Istio pods -# istiod-* pods should be in Running state +**Step 2: Install Platform** +```bash +helm dep build ./platform/ +helm install dynamo-platform ./platform/ \ + --namespace ${NAMESPACE} \ + --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \ + --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \ + --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret" ``` +
+ +→ [Verify Installation](#verify-installation) -#### 2. 💾 PVC Support with Default Storage Class -Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration: +## Verify Installation ```bash -# Check if default storage class exists -kubectl get storageclass +# Check CRDs +kubectl get crd | grep dynamo -# Expected output should show at least one storage class marked as (default) -# Example: -# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE -# standard (default) kubernetes.io/gce-pd Delete Immediate true 1d +# Check operator and platform pods +kubectl get pods -n ${NAMESPACE} +# Expected: dynamo-operator-* and etcd-* pods Running ``` -## Installation +## Next Steps -Follow [Quickstart Guide](./quickstart.md) to install the Dynamo Cloud +1. **Deploy Model/Workflow** + ```bash + # Example: Deploy a vLLM workflow with Qwen3-0.6B using aggregated serving + kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE} -⚠️ **Note:** that omitting `--crds` will skip the CRDs installation/upgrade. This is useful when installing on a shared cluster as CRDs are cluster-scoped resources. + # Port forward and test + kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE} + curl http://localhost:8000/v1/models + ``` -⚠️ **Note:** If you'd like to only generate the generated-values.yaml file without deploying to Kubernetes (e.g., for inspection, CI workflows, or dry-run testing), use: +2. **Explore Backend Guides** + - [vLLM Deployments](../../../components/backends/vllm/deploy/README.md) + - [SGLang Deployments](../../../components/backends/sglang/deploy/README.md) + - [TensorRT-LLM Deployments](../../../components/backends/trtllm/deploy/README.md) -```bash -./deploy_dynamo_cloud.py --yaml-only -``` +3. **Optional:** + - [Set up Prometheus & Grafana](k8s_metrics.md) + - [SLA Planner Deployment Guide](sla_planner_deployment.md) (for advanced SLA-aware scheduling and autoscaling) +## Troubleshooting +**Pods not starting?** +```bash +kubectl describe pod -n ${NAMESPACE} +kubectl logs -n ${NAMESPACE} +``` -### Cloud Provider-Specific deployment +**HuggingFace model access?** +```bash +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN=${HF_TOKEN} \ + -n ${NAMESPACE} +``` -#### Google Kubernetes Engine (GKE) deployment +**Clean uninstall?** +```bash +./uninstall.sh # Removes all CRDs and platform +``` -You can find detailed instructions for deployment in GKE [here](../dynamo_deploy/gke_setup.md) +## Advanced Options +- [GKE-specific setup](gke_setup.md) +- [Create custom deployments](create_deployment.md) +- [Dynamo Operator details](dynamo_operator.md) \ No newline at end of file diff --git a/docs/guides/dynamo_deploy/grove.md b/docs/guides/dynamo_deploy/grove.md new file mode 100644 index 0000000000..d6ecd0982f --- /dev/null +++ b/docs/guides/dynamo_deploy/grove.md @@ -0,0 +1,96 @@ +# Grove Deployment Guide + +Grove is a Kubernetes API specifically designed to address the orchestration challenges of modern AI workloads, particularly disaggregated inference systems. Grove provides seamless integration with NVIDIA Dynamo for comprehensive AI infrastructure management. + +## Overview + +Grove was originally motivated by the challenges of orchestrating multinode, disaggregated inference systems. It provides a consistent and unified API that allows users to define, configure, and scale prefill, decode, and any other components like routing within a single custom resource. 
+ +### How Grove Works for Disaggregated Serving + +Grove enables disaggregated serving by breaking down large language model inference into separate, specialized components that can be independently scaled and managed. This architecture provides several advantages: + +- **Component Specialization**: Separate prefill, decode, and routing components optimized for their specific tasks +- **Independent Scaling**: Each component can scale based on its individual resource requirements and workload patterns +- **Resource Optimization**: Better utilization of hardware resources through specialized workload placement +- **Fault Isolation**: Issues in one component don't necessarily affect others + +## Core Components and API Resources + +Grove implements disaggregated serving through several custom Kubernetes resources that provide declarative composition of role-based pod groups: + +### PodGangSet +The top-level Grove object that defines a group of components managed and colocated together. Key features include: +- Support for autoscaling +- Topology-aware spread of replicas for availability +- Unified management of multiple disaggregated components + +### PodClique +Represents a group of pods with a specific role (e.g., leader, worker, frontend). Each clique features: +- Independent configuration options +- Custom scaling logic support +- Role-specific resource allocation + +### PodCliqueScalingGroup +A set of PodCliques that scale and are scheduled together, ideal for tightly coupled roles like prefill leader and worker components that need coordinated scaling behavior. + +## Key Capabilities for Disaggregated Serving + +Grove provides several specialized features that make it particularly well-suited for disaggregated serving: + +### Flexible Gang Scheduling +PodCliques and PodCliqueScalingGroups allow users to specify flexible gang-scheduling requirements at multiple levels within a PodGangSet to prevent resource deadlocks and ensure all components of a disaggregated system start together. + +### Multi-level Horizontal Auto-Scaling +Supports pluggable horizontal auto-scaling solutions to scale PodGangSet, PodClique, and PodCliqueScalingGroup custom resources independently based on their specific metrics and requirements. + +### Network Topology-Aware Scheduling +Allows specifying network topology pack and spread constraints to optimize for both network performance and service availability, crucial for disaggregated systems where components need efficient inter-node communication. + +### Custom Startup Dependencies +Prescribes the order in which PodCliques must start in a declarative specification, with pod startup decoupled from pod creation or scheduling. This ensures proper initialization order for disaggregated components. 
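To make the relationship between these resources a little more concrete, the sketch below is a purely hypothetical illustration; the field names are invented for exposition and are not Grove's actual API (see the Grove documentation linked under Getting Started for the real schema):

```yaml
# Hypothetical illustration only -- NOT the real Grove schema.
# Intent: a single PodGangSet groups a frontend PodClique together with a
# prefill/decode PodCliqueScalingGroup that is gang-scheduled and scaled as a unit.
kind: PodGangSet
metadata:
  name: llm-disagg                 # example name (hypothetical)
spec:
  cliques:
    - name: frontend               # routing / entry-point role
      replicas: 1
    - name: decode                 # decode workers, scaled independently
      replicas: 2
  scalingGroups:
    - name: prefill                # prefill leader + workers scale together
      cliques: [prefill-leader, prefill-worker]
```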
+ +## Use Cases and Examples + +Grove specifically supports: + +- **Multi-node disaggregated inference** for large models such as DeepSeek-R1 and Llama-4-Maverick +- **Single-node disaggregated inference** for optimized resource utilization +- **Agentic pipelines of models** for complex AI workflows +- **Standard aggregated serving** patterns for single node or single GPU inference + +## Integration with NVIDIA Dynamo + +Grove is strategically aligned with NVIDIA Dynamo for seamless integration within the AI infrastructure stack: + +### Complementary Roles +- **Grove**: Handles the Kubernetes orchestration layer for disaggregated AI workloads +- **Dynamo**: Provides comprehensive AI infrastructure capabilities including serving backends, routing, and resource management + +### Release Coordination +Grove is aligning its release schedule with NVIDIA Dynamo to ensure seamless integration, with the finalized release cadence reflected in the project roadmap. + +### Unified AI Platform +The integration creates a comprehensive platform where: +- Grove manages complex orchestration of disaggregated components +- Dynamo provides the serving infrastructure, routing capabilities, and backend integrations +- Together they enable sophisticated AI serving architectures with simplified management + +## Architecture Benefits + +Grove represents a significant advancement in Kubernetes-based orchestration for AI workloads by: + +1. **Simplifying Complex Deployments**: Provides a unified API that can manage multiple components (prefill, decode, routing) within a single resource definition +2. **Enabling Sophisticated Architectures**: Supports advanced disaggregated inference patterns that were previously difficult to orchestrate +3. **Reducing Operational Complexity**: Abstracts away the complexity of coordinating multiple interdependent AI components +4. **Optimizing Resource Utilization**: Enables fine-grained control over component placement and scaling + +## Getting Started + +> **Note**: Grove is currently in development and aligning with NVIDIA Dynamo's release schedule. + +For installation instructions, see the [Grove Installation Guide](https://github.com/NVIDIA/grove/blob/main/docs/installation.md). + +For practical examples of Grove-based multinode deployments in action, see the [Multinode Deployment Guide](multinode-deployment.md), which demonstrates multi-node disaggregated serving scenarios. + +For the latest updates on Grove, refer to the [official project on GitHub](https://github.com/NVIDIA/grove). \ No newline at end of file diff --git a/docs/guides/deploy/k8s_metrics.md b/docs/guides/dynamo_deploy/k8s_metrics.md similarity index 95% rename from docs/guides/deploy/k8s_metrics.md rename to docs/guides/dynamo_deploy/k8s_metrics.md index 9653c58ad1..e6dc9e7a6e 100644 --- a/docs/guides/deploy/k8s_metrics.md +++ b/docs/guides/dynamo_deploy/k8s_metrics.md @@ -7,7 +7,7 @@ This guide provides a walkthrough for collecting and visualizing metrics from Dy ## Prerequisites ### Install Dynamo Operator -Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Quickstart Guide](../dynamo_deploy/quickstart.md) for detailed instructions on deploying the Dynamo operator. +Before setting up metrics collection, you'll need to have the Dynamo operator installed in your cluster. Follow our [Installation Guide](../dynamo_deploy/dynamo_cloud.md) for detailed instructions on deploying the Dynamo operator. 
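A quick sanity check before moving on is to confirm the operator is actually running; a minimal sketch, assuming the platform was installed into the namespace chosen during installation (the install guide above uses `dynamo-kubernetes`):

```bash
# Namespace is an assumption -- use the value passed to the platform install
export NAMESPACE=dynamo-kubernetes

# Dynamo CRDs should be registered
kubectl get crd | grep -i dynamo

# The dynamo-operator-* pods should be Running
kubectl get pods -n ${NAMESPACE} | grep operator
```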
### Install Prometheus Operator If you don't have an existing Prometheus setup, you'll need to install the Prometheus Operator. The Prometheus Operator introduces custom resources that make it easy to deploy and manage Prometheus monitoring in Kubernetes: @@ -39,7 +39,7 @@ This will create two components: - A Worker component exposing metrics on its system port Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about: -- Deployment configuration: See the [vLLM README](../../../components/backends/vllm/README.md) +- Deployment configuration: See the [vLLM README](../../components/backends/vllm/README.md) - Available metrics: See the [metrics guide](../metrics.md) ### Validate the Deployment @@ -47,7 +47,7 @@ Both components expose a `/metrics` endpoint following the OpenMetrics format, b Let's send some test requests to populate metrics: ```bash -curl localhost:8080/v1/chat/completions \ +curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen3-0.6B", diff --git a/docs/guides/dynamo_deploy/minikube.md b/docs/guides/dynamo_deploy/minikube.md index ee6bf20d36..27e877fd66 100644 --- a/docs/guides/dynamo_deploy/minikube.md +++ b/docs/guides/dynamo_deploy/minikube.md @@ -17,21 +17,19 @@ limitations under the License. # Minikube Setup Guide -Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the set up of everything you need to run Dynamo Cloud locally. +Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the setup of everything you need to run the Dynamo Kubernetes Platform locally. -## Setting Up Minikube - -### 1. Install Minikube +## 1. Install Minikube First things first! Start by installing Minikube. Follow the official [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/) for your operating system. -### 2. Configure GPU Support (Optional) +## 2. Configure GPU Support (Optional) Planning to use GPU-accelerated workloads? You'll need to configure GPU support in Minikube. Follow the [Minikube GPU guide](https://minikube.sigs.k8s.io/docs/tutorials/nvidia/) to set up NVIDIA GPU support before proceeding. ```{tip} Make sure to configure GPU support before starting Minikube if you plan to use GPU workloads! ``` -### 3. Start Minikube +## 3. Start Minikube Time to launch your local cluster! ```bash @@ -44,7 +42,7 @@ minikube addons enable istio minikube addons enable storage-provisioner-rancher ``` -### 4. Verify Installation +## 4. Verify Installation Let's make sure everything is working correctly! ```bash @@ -60,5 +58,5 @@ kubectl get storageclass ## Next Steps -Once your local environment is set up, you can proceed with the [Dynamo Cloud deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster. +Once your local environment is set up, you can proceed with the [Dynamo Kubernetes Platform deployment guide](./dynamo_cloud.md) to deploy the platform to your local cluster.
diff --git a/docs/guides/dynamo_deploy/model_caching_with_fluid.md b/docs/guides/dynamo_deploy/model_caching_with_fluid.md index cfa376ace5..ccaeed5b0b 100644 --- a/docs/guides/dynamo_deploy/model_caching_with_fluid.md +++ b/docs/guides/dynamo_deploy/model_caching_with_fluid.md @@ -27,7 +27,7 @@ helm install fluid fluid/fluid -n fluid-system ``` For advanced configuration, see the [Fluid Installation Guide](https://fluid-cloudnative.github.io/docs/get-started/installation). -## Quick Start +## Pre-deployment Steps 1. Install Fluid (see [Installation](#installation)). 2. Create a Dataset and Runtime (see [the following example](#webufs-example)). diff --git a/docs/guides/metrics.md b/docs/guides/metrics.md index 488b7024f9..31535ed0db 100644 --- a/docs/guides/metrics.md +++ b/docs/guides/metrics.md @@ -31,7 +31,7 @@ Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also a **Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics. -**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana. +**Kubernetes Integration**: For comprehensive Kubernetes deployment and monitoring setup, see the [Kubernetes Metrics Guide](dynamo_deploy/k8s_metrics.md). This includes Prometheus Operator setup, metrics collection configuration, and visualization in Grafana. ## Metrics Hierarchy diff --git a/examples/runtime/hello_world/README.md b/examples/runtime/hello_world/README.md index c7e6644db9..2063aaa36c 100644 --- a/examples/runtime/hello_world/README.md +++ b/examples/runtime/hello_world/README.md @@ -106,7 +106,7 @@ Hello star! Note that this is a very simple degenerate example which does not demonstrate the standard Dynamo FrontEnd-Backend deployment. The hello-world client is not a web server; it is a one-off function which sends the predefined text "world,sun,moon,star" to the backend. The example is meant to show the HelloWorldWorker. As such you will only see the HelloWorldWorker pod in deployment. The client will run and exit and the pod will not be operational. -Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud. +Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install the Dynamo Kubernetes Platform. Then deploy to Kubernetes using ```bash