
Commit 3d953e3

Merge branch 'main' into mabdulwahhab/add-crds-to-examples
2 parents 7747765 + 1630f8b commit 3d953e3

File tree

15 files changed: +65475 / -50301 lines changed


ATTRIBUTIONS-Go.md

Lines changed: 26845 additions & 23008 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Python.md

Lines changed: 16904 additions & 8904 deletions
Large diffs are not rendered by default.

ATTRIBUTIONS-Rust.md

Lines changed: 20889 additions & 18385 deletions
Large diffs are not rendered by default.

Cargo.lock

Lines changed: 2 additions & 1 deletion
Some generated files are not rendered by default.

container/build.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -114,7 +114,7 @@ SGLANG_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
 VLLM_V1_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
 VLLM_V1_BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"

-NIXL_COMMIT=16348080f5bdeb9fe6058a23be140cec020ef3f3
+NIXL_COMMIT=3503658e71143b56f9d5b1b440d84a94b9c41af8
 NIXL_REPO=ai-dynamo/nixl.git

 NIXL_UCX_EFA_REF=7ec95b95e524a87e81cac92f5ca8523e3966b16b
```

docs/dynamo_glossary.md

Lines changed: 96 additions & 0 deletions
# NVIDIA Dynamo Glossary

## B

**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.

## C

**Component** - The fundamental deployable unit in Dynamo. A discoverable service entity that can host multiple endpoints and typically maps to a Docker container (such as VllmWorker, Router, Processor).

**Conditional Disaggregation** - Dynamo's intelligent decision-making process within disaggregated serving that determines whether a request is processed locally or sent to a remote prefill engine based on prefill length and queue status.

## D

**Decode Phase** - The second phase of LLM inference that generates output tokens one at a time.

**depends()** - A Dynamo function that creates dependencies between services, enabling automatic client generation and service discovery.

**Disaggregated Serving** - Dynamo's core architecture that separates prefill and decode phases into specialized engines to maximize GPU throughput and improve performance.

**Distributed Runtime** - Dynamo's Rust-based core system that manages service discovery, communication, and component lifecycle across distributed clusters.

**Dynamo** - NVIDIA's high-performance distributed inference framework for Large Language Models (LLMs) and generative AI models, designed for multinode environments with disaggregated serving and cache-aware routing.

**Dynamo Artifact** - A packaged archive containing an inference graph and its dependencies, created using `dynamo build`. It's the containerized, deployable version of a Graph.

**Dynamo Cloud** - A Kubernetes platform providing a managed deployment experience for Dynamo inference graphs.

**dynamo build** - The CLI command to containerize inference graphs or parts of graphs into Docker containers.

**dynamo deploy** - The CLI command to deploy inference graphs to Kubernetes with Helm charts or custom operators.

**dynamo run** - The CLI command to quickly experiment and test models with various LLM engines.

**dynamo serve** - The CLI command to compose and serve inference graphs locally.

## E

**@endpoint** - A Python decorator used to define service endpoints within a Dynamo component.

**Endpoint** - A specific network-accessible API within a Dynamo component, such as `generate` or `load_metrics`.

## F

**Frontend** - Dynamo's API server component that receives user requests and provides OpenAI-compatible HTTP endpoints.

## G

**Graph** - A collection of interconnected Dynamo components that form a complete inference pipeline with request paths (single-in) and response paths (many-out for streaming). A graph can be packaged into a Dynamo Artifact for deployment.

## I

**Instance** - A running process with a unique `instance_id`. Multiple instances can serve the same namespace, component, and endpoint for load balancing.

## K

**KV Block Manager (KVBM)** - Dynamo's scalable runtime component that handles memory allocation, management, and remote sharing of Key-Value blocks across heterogeneous and distributed environments.

**KV Cache** - Key-Value cache that stores computed attention states from previous tokens to avoid recomputation during inference.

**KV Router** - Dynamo's intelligent routing system that directs requests to workers with the highest cache overlap to maximize KV cache reuse. Determines routing based on KV cache hit rates and worker metrics.

**KVIndexer** - Dynamo component that maintains a global view of cached blocks across all workers using a prefix tree structure to calculate cache hit rates.

**KVPublisher** - Dynamo component that emits KV cache events (stored/removed) from individual workers to the global KVIndexer.

## N

**Namespace** - Dynamo's logical grouping mechanism for related components. Similar to directories in a file system, namespaces prevent collisions between different deployments.

**NIXL (NVIDIA Inference tranXfer Library)** - High-performance data transfer library optimized for inference workloads, supporting direct GPU-to-GPU transfers and multiple memory hierarchies.

## P

**PagedAttention** - Memory management technique from vLLM that efficiently manages KV cache by chunking requests into blocks.

**Planner** - Dynamo component that performs dynamic resource scaling based on real-time demand signals and system metrics.

**Prefill Phase** - The first phase of LLM inference that processes the input prompt and generates the KV cache.

**Prefix Caching** - Optimization technique that reuses previously computed KV cache for common prompt prefixes.

**Processor** - Dynamo component that handles request preprocessing, tokenization, and routing decisions.

## R

**RadixAttention** - Technique from SGLang that uses a prefix tree structure for efficient KV cache matching, insertion, and eviction.

**RDMA (Remote Direct Memory Access)** - Technology that allows direct memory access between distributed systems, used for efficient KV cache transfers.

## S

**@service** - Python decorator used to define a Dynamo service class.

**SGLang** - Fast LLM inference framework with native embedding support and RadixAttention.

## T

**Tensor Parallelism (TP)** - Model parallelism technique where model weights are distributed across multiple GPUs.

**TensorRT-LLM** - NVIDIA's optimized LLM inference engine with multinode MPI distributed support.

**Time-To-First-Token (TTFT)** - The latency from receiving a request to generating the first output token.

## V

**vLLM** - High-throughput LLM serving engine with Ray distributed support and PagedAttention.

## X

**xPyD (x Prefill y Decode)** - Dynamo notation describing disaggregated serving configurations where x prefill workers serve y decode workers. Dynamo supports runtime-reconfigurable xPyD.
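
To show how a few of these terms fit together (Component, Namespace, **@service**, **@endpoint**, `depends()`), here is a minimal, hypothetical sketch of a two-component graph. The import path, decorator arguments, and client call style below are illustrative assumptions, not a verbatim copy of the Dynamo SDK:

```python
# Hypothetical sketch only -- import path, decorator signatures, and the
# client call style are assumed for illustration, not taken from Dynamo source.
from dynamo.sdk import depends, endpoint, service  # assumed import path


@service(dynamo={"namespace": "example"})  # a Component in the "example" Namespace
class Worker:
    @endpoint()  # a network-accessible Endpoint named "generate"
    async def generate(self, request: str):
        # A real worker would call an engine such as vLLM, SGLang, or TensorRT-LLM.
        yield f"echo: {request}"


@service(dynamo={"namespace": "example"})
class Processor:
    # depends() wires Processor to Worker: service discovery resolves running
    # Worker instances and a client for them is generated automatically.
    worker = depends(Worker)

    @endpoint()
    async def generate(self, request: str):
        # Single request in, streamed responses out (single-in, many-out).
        async for token in self.worker.generate(request):
            yield token
```

A graph along these lines would then be served locally with `dynamo serve` or packaged into a Dynamo Artifact with `dynamo build`, per the CLI entries above.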

docs/index.rst

Lines changed: 6 additions & 0 deletions
```diff
@@ -135,5 +135,11 @@ Dive in: Examples
    Multinode Examples <examples/multinode.md>
    LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>

+.. toctree::
+   :hidden:
+   :caption: Reference
+
+   Glossary <dynamo_glossary.md>
+   KVBM Reading <architecture/kvbm_reading.md>
```

Lines changed: 2 additions & 0 deletions

```
logs/*
outputs/*
```

Lines changed: 108 additions & 0 deletions
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring to track performance during benchmarks.

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates (see the sketch after this list)
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks

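As a rough illustration of the flow in `submit_job_script.py` (not the actual script), the submission step boils down to rendering the Jinja2 template with the chosen values and handing the result to `sbatch`. The variable names below mirror the placeholders in `job_script_template.j2`:

```python
# Illustrative sketch of the submit flow, not the actual submit_job_script.py.
import subprocess
import tempfile

from jinja2 import Template


def render_and_submit(template_path: str, context: dict) -> None:
    """Render the SLURM job script from the Jinja2 template and submit it with sbatch."""
    with open(template_path) as f:
        template = Template(f.read())

    # Context keys mirror the template variables: job_name, account, time_limit,
    # prefill_nodes, decode_nodes, total_nodes, gpus_per_node, model_dir,
    # config_dir, container_image, network_interface.
    job_script = template.render(**context)

    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(job_script)
        script_path = f.name

    subprocess.run(["sbatch", script_path], check=True)


if __name__ == "__main__":
    render_and_submit(
        "job_script_template.j2",
        {
            "job_name": "dynamo_setup",
            "account": "your-slurm-account",
            "time_limit": "01:00:00",
            "prefill_nodes": 2,
            "decode_nodes": 2,
            # Total node count is derived from the prefill/decode split.
            "total_nodes": 2 + 2,
            "gpus_per_node": 8,
            "model_dir": "/path/to/model",
            "config_dir": "/path/to/configs",
            "container_image": "container-image-uri",
            "network_interface": "eth3",
        },
    )
```

The real script likewise derives the total node count from `--prefill-nodes` plus `--decode-nodes`, as noted under Usage below.
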
## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/                                   # Job ID directory
│   ├── log.out                                # Main job output (node allocation, IP addresses, launch commands)
│   ├── log.err                                # Main job errors
│   ├── node0197_prefill.out                   # Prefill node stdout (node0197)
│   ├── node0197_prefill.err                   # Prefill node stderr (node0197)
│   ├── node0200_prefill.out                   # Prefill node stdout (node0200)
│   ├── node0200_prefill.err                   # Prefill node stderr (node0200)
│   ├── node0201_decode.out                    # Decode node stdout (node0201)
│   ├── node0201_decode.err                    # Decode node stderr (node0201)
│   ├── node0204_decode.out                    # Decode node stdout (node0204)
│   ├── node0204_decode.err                    # Decode node stderr (node0204)
│   ├── node0197_prefill_gpu_utilization.log   # GPU utilization monitoring (node0197)
│   ├── node0200_prefill_gpu_utilization.log   # GPU utilization monitoring (node0200)
│   ├── node0201_decode_gpu_utilization.log    # GPU utilization monitoring (node0201)
│   └── node0204_decode_gpu_utilization.log    # GPU utilization monitoring (node0204)
├── 3063137/                                   # Another job ID directory
├── 3062689/                                   # Another job ID directory
└── ...
```

## Setup

For simplicity, this example makes some assumptions about your SLURM cluster:

1. We assume you have access to a SLURM cluster with multiple GPU nodes
   available. For functional testing, most setups should be fine. For performance
   testing, you should aim to allocate groups of nodes that are performantly
   inter-connected, such as those in an NVL72 setup.
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
   SPANK plugin set up. In particular, the `job_script_template.j2` template in this
   example uses `srun` arguments like `--container-image`,
   `--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
   If your cluster supports a similar container-based plugin, you may be able to
   modify the template to use it instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
   described [here](../dsr1-wideep.md#instructions).
   This is the image that is passed to the `--container-image` argument in later steps.

## Usage

1. **Submit a benchmark job**:

   ```bash
   python submit_job_script.py \
     --template job_script_template.j2 \
     --model-dir /path/to/model \
     --config-dir /path/to/configs \
     --container-image container-image-uri \
     --account your-slurm-account
   ```

   **Required arguments**:
   - `--template`: Path to the Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
   - `--container-image`: Container image URI (e.g., `registry/repository:tag`)
   - `--account`: SLURM account

   **Optional arguments**:
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

   **Note**: The script automatically calculates the total number of nodes needed based on the `--prefill-nodes` and `--decode-nodes` parameters.

2. **Monitor job progress**:

   ```bash
   squeue -u $USER
   ```

3. **Check logs in real time**:

   ```bash
   tail -f logs/{JOB_ID}/log.out
   ```

4. **Monitor GPU utilization**:

   ```bash
   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
   ```

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which is mounted into the container.

Lines changed: 98 additions & 0 deletions

```bash
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --ntasks={{ total_nodes }}
#SBATCH --ntasks-per-node=1
#SBATCH --account={{ account }}
#SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j/log.out
#SBATCH --error=logs/%j/log.err

# Constants
PREFILL_NODES={{ prefill_nodes }}
DECODE_NODES={{ decode_nodes }}
TOTAL_NODES=$((PREFILL_NODES + DECODE_NODES))
GPUS_PER_NODE={{ gpus_per_node }}
LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}/"
SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}"

{% raw %}

mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"

nodes=($(scontrol show hostnames $SLURM_NODELIST))
if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
    echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
    exit 1
fi

# Print node information
for i in "${!nodes[@]}"; do
    echo "Node $i: ${nodes[$i]}"
done

PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$PREFILL_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Prefill host IP address: $PREFILL_HOST_IP"

DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$DECODE_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Decode host IP address: $DECODE_HOST_IP"

# Prepare enroot arguments to pass to srun commands
ENROOT_ARGS="\
    --container-image=${CONTAINER_IMAGE} \
    --no-container-entrypoint \
    --container-mount-home \
    --no-container-remap-root \
    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
"

# Launch prefill tasks on the first PREFILL_NODES nodes
for i in $(seq 0 $((PREFILL_NODES - 1))); do
    node=${nodes[$i]}
    rank=$i
    echo "Launching prefill task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err"
    echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \
        python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &
done

# Launch decode tasks on the next DECODE_NODES nodes
for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do
    node=${nodes[$i]}
    rank=$((i - PREFILL_NODES))
    echo "Launching decode task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err"
    echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \
        python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &
done

echo ""
echo "To connect to the host prefill node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"

echo ""
echo "Make sure to cancel the job at the end:"
echo "scancel $SLURM_JOB_ID"

# Wait for all tasks to complete
wait
echo "Script finished at $(date)"

{% endraw %}
```
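
For reference, the `srun` commands in the template assume `scripts/worker_setup.py` accepts the flags echoed above. A minimal `argparse` skeleton consistent with those flags is sketched below; the actual script additionally launches the prefill or decode engine and the GPU utilization monitor, which is omitted here:

```python
# Sketch of the CLI surface that job_script_template.j2 expects from
# scripts/worker_setup.py; the actual launch logic is omitted.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Per-node Dynamo+SGLang worker setup")
    parser.add_argument("--prefill_host_ip", required=True, help="IP of the first prefill node")
    parser.add_argument("--decode_host_ip", required=True, help="IP of the first decode node")
    parser.add_argument("--rank", type=int, required=True, help="Rank of this node within its worker group")
    parser.add_argument("--total_nodes", type=int, required=True, help="Node count of this worker group")
    parser.add_argument("--worker_type", choices=["prefill", "decode"], required=True)
    parser.add_argument("--gpus_per_node", type=int, required=True)
    parser.add_argument("--gpu_utilization_log", help="Path for the GPU utilization log")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    print(f"Setting up {args.worker_type} worker, rank {args.rank}/{args.total_nodes}")
    # The real worker_setup.py would start GPU monitoring and launch the
    # prefill or decode engine here, using the host IPs for coordination.


if __name__ == "__main__":
    main()
```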
