
Commit 3d953e3

Merge branch 'main' into mabdulwahhab/add-crds-to-examples
2 parents 7747765 + 1630f8b commit 3d953e3

File tree

15 files changed: +65475 / -50301 lines changed


ATTRIBUTIONS-Go.md

Lines changed: 26845 additions & 23008 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Python.md

Lines changed: 16904 additions & 8904 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Rust.md

Lines changed: 20889 additions & 18385 deletions
Large diffs are not rendered by default.

Cargo.lock

Lines changed: 2 additions & 1 deletion
Some generated files are not rendered by default.

container/build.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -114,7 +114,7 @@ SGLANG_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
 VLLM_V1_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
 VLLM_V1_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"

-NIXL_COMMIT=16348080f5bdeb9fe6058a23be140cec020ef3f3
+NIXL_COMMIT=3503658e71143b56f9d5b1b440d84a94b9c41af8
 NIXL_REPO=ai-dynamo/nixl.git

 NIXL_UCX_EFA_REF=7ec95b95e524a87e81cac92f5ca8523e3966b16b
```

docs/dynamo_glossary.md

Lines changed: 96 additions & 0 deletions
# NVIDIA Dynamo Glossary

## B

**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.

## C

**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).

**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.

## D

**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.

**depends()** - A Dynamo function that creates dependencies between services, enabling automatic client generation and service discovery.

**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.

**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.

**Dynamo Artifact** - A packaged archive containing an inference graph and its dependencies, created using `dynamo build`. It's the containerized, deployable version of a Graph.

**Dynamo Cloud** - A Kubernetes platform providing a managed deployment experience for Dynamo inference graphs.

**dynamo build** - The CLI command to containerize inference graphs or parts of graphs into Docker containers.

**dynamo deploy** - The CLI command to deploy inference graphs to Kubernetes with Helm charts or custom operators.

**dynamo run** - The CLI command to quickly experiment and test models with various LLM engines.

**dynamo serve** - The CLI command to compose and serve inference graphs locally.

## E

**@endpoint** - A Python decorator used to define service endpoints within a Dynamo component.

**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.

## F

**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.

## G

**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.

## I

**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.

## K

**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.

**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.

**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. Determines routing based on KV cache hit rates and worker metrics.

**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.

**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.

## N

**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, namespaces prevent collisions between different deployments.

**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.

## P

**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.

**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.

**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates the KV cache.

**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.

**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.

## R

**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

## S

**@service** - Python decorator used to define a Dynamo service class.

**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.

## T

**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.

**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.

**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.

## V

**vLLM** - High-throughput LLM serving engine with Ray distributed support and PagedAttention.

## X

**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
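
To show how a few of these terms fit together (Component, Namespace, **@service**, **@endpoint**, `depends()`), here is a minimal, hypothetical sketch of a two-component graph. The import path, decorator arguments, and client call style below are illustrative assumptions, not a verbatim copy of the Dynamo SDK:

```python
# Hypothetical sketch only -- import path, decorator signatures, and the
# client call style are assumed for illustration, not taken from Dynamo source.
from dynamo.sdk import depends, endpoint, service  # assumed import path


@service(dynamo={"namespace": "example"})  # a Component in the "example" Namespace
class Worker:
    @endpoint()  # a network-accessible Endpoint named "generate"
    async def generate(self, request: str):
        # A real worker would call an engine such as vLLM, SGLang, or TensorRT-LLM.
        yield f"echo: {request}"


@service(dynamo={"namespace": "example"})
class Processor:
    # depends() wires Processor to Worker: service discovery resolves running
    # Worker instances and a client for them is generated automatically.
    worker = depends(Worker)

    @endpoint()
    async def generate(self, request: str):
        # Single request in, streamed responses out (single-in, many-out).
        async for token in self.worker.generate(request):
            yield token
```

A graph along these lines would then be served locally with `dynamo serve` or packaged into a Dynamo Artifact with `dynamo build`, per the CLI entries above.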

docs/index.rst

Lines changed: 6 additions & 0 deletions
```diff
@@ -135,5 +135,11 @@ Dive in: Examples
    Multinode Examples <examples/multinode.md>
    LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>

+.. toctree::
+   :hidden:
+   :caption: Reference
+
+   Glossary <dynamo_glossary.md>
+   KVBM Reading <architecture/kvbm_reading.md>
```

Lines changed: 2 additions & 0 deletions

```
logs/*
outputs/*
```

Lines changed: 108 additions & 0 deletions
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates (see the sketch after this list)
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks

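As a rough illustration of the flow in `submit_job_script.py` (not the actual script), the submission step boils down to rendering the Jinja2 template with the chosen values and handing the result to `sbatch`. The variable names below mirror the placeholders in `job_script_template.j2`:

```python
# Illustrative sketch of the submit flow, not the actual submit_job_script.py.
import subprocess
import tempfile

from jinja2 import Template


def render_and_submit(template_path: str, context: dict) -> None:
    """Render the SLURM job script from the Jinja2 template and submit it with sbatch."""
    with open(template_path) as f:
        template = Template(f.read())

    # Context keys mirror the template variables: job_name, account, time_limit,
    # prefill_nodes, decode_nodes, total_nodes, gpus_per_node, model_dir,
    # config_dir, container_image, network_interface.
    job_script = template.render(**context)

    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(job_script)
        script_path = f.name

    subprocess.run(["sbatch", script_path], check=True)


if __name__ == "__main__":
    render_and_submit(
        "job_script_template.j2",
        {
            "job_name": "dynamo_setup",
            "account": "your-slurm-account",
            "time_limit": "01:00:00",
            "prefill_nodes": 2,
            "decode_nodes": 2,
            # Total node count is derived from the prefill/decode split.
            "total_nodes": 2 + 2,
            "gpus_per_node": 8,
            "model_dir": "/path/to/model",
            "config_dir": "/path/to/configs",
            "container_image": "container-image-uri",
            "network_interface": "eth3",
        },
    )
```

The real script likewise derives the total node count from `--prefill-nodes` plus `--decode-nodes`, as noted under Usage below.
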
## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/                                   # Job ID directory
│   ├── log.out                                # Main job output (node allocation, IP addresses, launch commands)
│   ├── log.err                                # Main job errors
│   ├── node0197_prefill.out                   # Prefill node stdout (node0197)
│   ├── node0197_prefill.err                   # Prefill node stderr (node0197)
│   ├── node0200_prefill.out                   # Prefill node stdout (node0200)
│   ├── node0200_prefill.err                   # Prefill node stderr (node0200)
│   ├── node0201_decode.out                    # Decode node stdout (node0201)
│   ├── node0201_decode.err                    # Decode node stderr (node0201)
│   ├── node0204_decode.out                    # Decode node stdout (node0204)
│   ├── node0204_decode.err                    # Decode node stderr (node0204)
│   ├── node0197_prefill_gpu_utilization.log   # GPU utilization monitoring (node0197)
│   ├── node0200_prefill_gpu_utilization.log   # GPU utilization monitoring (node0200)
│   ├── node0201_decode_gpu_utilization.log    # GPU utilization monitoring (node0201)
│   └── node0204_decode_gpu_utilization.log    # GPU utilization monitoring (node0204)
├── 3063137/                                   # Another job ID directory
├── 3062689/                                   # Another job ID directory
└── ...
```

## Setup

For simplicity, this example makes some assumptions about your SLURM cluster:

1. We assume you have access to a SLURM cluster with multiple GPU nodes
   available. For functional testing, most setups should be fine. For performance
   testing, you should aim to allocate groups of nodes that are performantly
   inter-connected, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
   SPANK plugin set up. In particular, the `job_script_template.j2` template in this
   example uses `srun` arguments like `--container-image`,
   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
   If your cluster supports a similar container-based plugin, you may be able to
   modify the template to use it instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
   described [here](../dsr1-wideep.md#instructions).
   This is the image that is passed to the `--container-image` argument in later steps.

## Usage

1. **Submit a benchmark job**:

   ```bash
   python submit_job_script.py \
     --template job_script_template.j2 \
     --model-dir /path/to/model \
     --config-dir /path/to/configs \
     --container-image container-image-uri \
     --account your-slurm-account
   ```

   **Required arguments**:
   - `--template`: Path to the Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
   - `--container-image`: Container image URI (e.g., `registry/repository:tag`)
   - `--account`: SLURM account

   **Optional arguments**:
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

   **Note**: The script automatically calculates the total number of nodes needed based on the `--prefill-nodes` and `--decode-nodes` parameters.

2. **Monitor job progress**:

   ```bash
   squeue -u $USER
   ```

3. **Check logs in real time**:

   ```bash
   tail -f logs/{JOB_ID}/log.out
   ```

4. **Monitor GPU utilization**:

   ```bash
   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
   ```

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.

Lines changed: 98 additions & 0 deletions

```bash
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --ntasks={{ total_nodes }}
#SBATCH --ntasks-per-node=1
#SBATCH --account={{ account }}
#SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j/log.out
#SBATCH --error=logs/%j/log.err

# Constants
PREFILL_NODES={{ prefill_nodes }}
DECODE_NODES={{ decode_nodes }}
TOTAL_NODES=$((PREFILL_NODES + DECODE_NODES))
GPUS_PER_NODE={{ gpus_per_node }}
LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}/"
SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}"

{% raw %}

mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"

nodes=($(scontrol show hostnames $SLURM_NODELIST))
if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
    echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
    exit 1
fi

# Print node information
for i in "${!nodes[@]}"; do
    echo "Node $i: ${nodes[$i]}"
done

PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$PREFILL_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Prefill host IP address: $PREFILL_HOST_IP"

DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$DECODE_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Decode host IP address: $DECODE_HOST_IP"

# Prepare enroot arguments to pass to srun commands
ENROOT_ARGS="\
    --container-image=${CONTAINER_IMAGE} \
    --no-container-entrypoint \
    --container-mount-home \
    --no-container-remap-root \
    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
"

# Launch prefill tasks on the first PREFILL_NODES nodes
for i in $(seq 0 $((PREFILL_NODES - 1))); do
    node=${nodes[$i]}
    rank=$i
    echo "Launching prefill task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err"
    echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \
        python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &
done

# Launch decode tasks on the next DECODE_NODES nodes
for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do
    node=${nodes[$i]}
    rank=$((i - PREFILL_NODES))
    echo "Launching decode task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err"
    echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \
        python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &
done

echo ""
echo "To connect to the host prefill node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"

echo ""
echo "Make sure to cancel the job at the end:"
echo "scancel $SLURM_JOB_ID"

# Wait for all tasks to complete
wait
echo "Script finished at $(date)"

{% endraw %}
```
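
For reference, the `srun` commands in the template assume `scripts/worker_setup.py` accepts the flags echoed above. A minimal `argparse` skeleton consistent with those flags is sketched below; the actual script additionally launches the prefill or decode engine and the GPU utilization monitor, which is omitted here:

```python
# Sketch of the CLI surface that job_script_template.j2 expects from
# scripts/worker_setup.py; the actual launch logic is omitted.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Per-node Dynamo+SGLang worker setup")
    parser.add_argument("--prefill_host_ip", required=True, help="IP of the first prefill node")
    parser.add_argument("--decode_host_ip", required=True, help="IP of the first decode node")
    parser.add_argument("--rank", type=int, required=True, help="Rank of this node within its worker group")
    parser.add_argument("--total_nodes", type=int, required=True, help="Node count of this worker group")
    parser.add_argument("--worker_type", choices=["prefill", "decode"], required=True)
    parser.add_argument("--gpus_per_node", type=int, required=True)
    parser.add_argument("--gpu_utilization_log", help="Path for the GPU utilization log")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    print(f"Setting up {args.worker_type} worker, rank {args.rank}/{args.total_nodes}")
    # The real worker_setup.py would start GPU monitoring and launch the
    # prefill or decode engine here, using the host IPs for coordination.


if __name__ == "__main__":
    main()
```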
