docs: Consolidate documentation and fix redundant headings
This commit consolidates and improves the documentation structure based on tech writer feedback:

**New Documentation:**
- Added Grove advanced Kubernetes scheduling guide
- Added comprehensive K8s metrics setup guide with Prometheus/Grafana

**Heading Fixes:**
- Fixed headings that would appear redundant in Sphinx breadcrumbs
- Changed "Architecture" → "Design" in SLA planner docs
- Changed "Core Components" → "Core Services" to avoid repetition
- Removed duplicate H1 headings in component docs

**Quick Start Disambiguation:**
- "Quick Start" → "SGLang Quick Start" in SGLang README
- "Quick Start" → "TensorRT-LLM Quick Start" in TensorRT-LLM README
- "Quick Start" → "vLLM Quick Start" in vLLM README
- "Quick Start" → "KV Router Quick Start" in router docs
- "Quick Start" → "Pre-deployment Steps" in Fluid caching guide

**Platform Naming:**
- Updated references to use consistent "Dynamo Kubernetes Platform" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
athreesh and claude committed Aug 19, 2025
commit 359b4aeeb5b8ba7eaf4d8a032be59ed5a3327857
2 changes: 1 addition & 1 deletion components/README.md
@@ -29,7 +29,7 @@ Dynamo supports multiple inference engines (with a focus on SGLang, vLLM, and Te

Each engine provides launch scripts for different deployment patterns in its `/launch` & `/deploy` directories.

-## Core Components
+## Core Services

### [Backends](backends/)

2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -50,7 +50,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **GB200 Support** | ✅ | |


-## Quick Start
+## SGLang Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/trtllm/README.md
@@ -66,7 +66,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |

-## Quick Start
+## TensorRT-LLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion components/backends/vllm/README.md
@@ -51,7 +51,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |

-## Quick Start
+## vLLM Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

2 changes: 1 addition & 1 deletion docs/architecture/architecture.md
@@ -48,7 +48,7 @@ There are multi-faceted challenges:

To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.

-## High level architecture and key benefits
+## Key benefits

The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

4 changes: 2 additions & 2 deletions docs/architecture/sla_planner.md
@@ -17,7 +17,7 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
* **Performance interpolation**: Leverages results from pre-deployment profiling to make accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data

-## Architecture
+## Design

The SLA planner consists of several key components:

@@ -108,7 +108,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill

For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../guides/dynamo_deploy/sla_planner_deployment.md).

-**Quick Start:**
+**To deploy SLA Planner:**
```bash
cd components/backends/vllm/deploy
kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}
2 changes: 1 addition & 1 deletion docs/components/router/README.md
@@ -9,7 +9,7 @@ SPDX-License-Identifier: Apache-2.0

The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

-## Quick Start
+## KV Router Quick Start

To launch the Dynamo frontend with the KV Router:

171 changes: 171 additions & 0 deletions docs/guides/dynamo_deploy/grove.md
@@ -0,0 +1,171 @@
# Grove: Advanced Kubernetes Scheduling

Grove is an advanced Kubernetes scheduler and batch workload manager built on top of the Dynamo Kubernetes Platform. It enables sophisticated scheduling policies for multi-node GPU workloads, with special support for large-scale LLM inference deployments.

## Overview

Grove extends Kubernetes' default scheduling capabilities with:
- **Gang scheduling**: Ensures all pods in a workload start together or not at all
- **Topology-aware placement**: Optimizes pod placement based on network topology
- **Resource-aware scheduling**: Makes intelligent decisions based on GPU memory, compute capacity, and network bandwidth
- **Priority-based queueing**: Manages workload priorities and preemption policies

## Key Features

### PodGangSet
PodGangSet is Grove's primary scheduling primitive: it groups related pods that must be scheduled together.

```yaml
apiVersion: grove.dynamo.ai/v1
kind: PodGangSet
metadata:
  name: llm-inference-gang
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: worker
        image: dynamo/worker:latest
        resources:
          requests:
            nvidia.com/gpu: 1
  replicas: 8
  minAvailable: 8  # All pods must be schedulable
  scheduling:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values: ["gpu-compute"]
```

### PodClique
PodClique provides fine-grained control over pod co-location and anti-affinity rules within a gang.

```yaml
apiVersion: grove.dynamo.ai/v1
kind: PodClique
metadata:
  name: prefill-decode-clique
spec:
  selector:
    matchLabels:
      app: dynamo-worker
  topology:
    # Prefer pods to be co-located on the same rack
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: prefill
        topologyKey: topology.kubernetes.io/rack
```

## Deployment

### Prerequisites

- Kubernetes cluster with GPU nodes
- NVIDIA GPU Operator installed
- Node topology labels configured

### Install Grove Scheduler

```bash
# Install Grove CRDs and scheduler
kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-crds.yaml
kubectl apply -f https://github.com/ai-dynamo/grove/releases/latest/download/grove-scheduler.yaml
```
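
Before moving on, you may want to confirm that the scheduler components are running. The sketch below assumes the `grove-system` namespace and `grove-scheduler` deployment name used in the troubleshooting section of this guide:

```bash
# Check that the Grove CRDs registered and the scheduler deployment is ready
kubectl get crds | grep grove
kubectl -n grove-system get deployment grove-scheduler
kubectl -n grove-system logs deployment/grove-scheduler --tail=20
```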

### Configure Node Topology

Label your nodes with topology information:

```bash
# Label nodes with rack information
kubectl label node gpu-node-01 topology.kubernetes.io/rack=rack-1
kubectl label node gpu-node-02 topology.kubernetes.io/rack=rack-1
kubectl label node gpu-node-03 topology.kubernetes.io/rack=rack-2

# Label nodes with GPU types
kubectl label node gpu-node-01 accelerator=h100
kubectl label node gpu-node-02 accelerator=h100
kubectl label node gpu-node-03 accelerator=a100
```
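
To confirm the labels before scheduling workloads, standard `kubectl` label columns are enough (no Grove-specific tooling required):

```bash
# Show the rack and accelerator labels for every node
kubectl get nodes -L topology.kubernetes.io/rack -L accelerator
```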

## Integration with Dynamo

Grove integrates seamlessly with Dynamo's disaggregated serving architecture:

### Multi-Node Prefill/Decode Scheduling

```yaml
apiVersion: grove.dynamo.ai/v1
kind: PodGangSet
metadata:
  name: dynamo-multinode-serving
spec:
  template:
    metadata:
      labels:
        app: dynamo-worker
    spec:
      schedulerName: grove-scheduler
      containers:
      - name: dynamo-worker
        image: nvcr.io/nvidia/ai-dynamo/sglang-runtime:latest
        env:
        - name: WORKER_TYPE
          value: "prefill"  # or "decode"
  replicas: 16
  minAvailable: 16
  scheduling:
    # Ensure all workers can communicate efficiently
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: network-tier
            operator: In
            values: ["high-bandwidth"]
```

## Best Practices

### Resource Planning

- Use `minAvailable: replicas` for strict gang scheduling (see the sketch after this list)
- Set appropriate resource requests and limits
- Consider network bandwidth requirements for multi-node workloads
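
As a minimal sketch of the first point, reusing the PodGangSet fields from the examples earlier in this guide (resource values are placeholders):

```yaml
spec:
  replicas: 4
  minAvailable: 4        # equal to replicas, so the gang schedules all-or-nothing
  template:
    spec:
      containers:
      - name: worker
        image: dynamo/worker:latest   # placeholder image from the example above
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: "8"
            memory: 64Gi
          limits:
            nvidia.com/gpu: 1
```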

### Topology Awareness
- Label nodes with rack, zone, and network topology information
- Use PodClique for fine-grained placement control
- Test different affinity rules to optimize for your workload

### Monitoring
Grove provides metrics for scheduling decisions:

```bash
# View Grove scheduler metrics
kubectl port-forward -n grove-system svc/grove-scheduler-metrics 8080:8080
curl localhost:8080/metrics | grep grove_
```

## Troubleshooting

### Common Issues

**Pods stuck in Pending state:**
- Check if sufficient resources are available across required nodes
- Verify node labels match gang affinity requirements
- Review Grove scheduler logs: `kubectl logs -n grove-system deployment/grove-scheduler`

**Gang scheduling not working:**
- Ensure `schedulerName: grove-scheduler` is set in pod specs
- Verify that the PodGangSet controller is running (see the inspection commands below)
- Check for resource conflicts with other scheduled workloads
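
For either case, a few generic inspection commands can help narrow things down. The resource names below are assumptions: the plural `podgangsets` is inferred from the `PodGangSet` kind, and `llm-inference-gang` is the example name used earlier in this guide:

```bash
# Inspect the gang object and recent scheduling events
kubectl get podgangsets -A
kubectl describe podgangset llm-inference-gang
kubectl get events --sort-by=.lastTimestamp | grep -iE 'grove|pending'

# Scheduler-side view
kubectl -n grove-system logs deployment/grove-scheduler --tail=50
```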

For more detailed troubleshooting, see the [Grove Documentation](https://grove.dynamo.ai/docs).