45 changes: 25 additions & 20 deletions README.md
@@ -21,12 +21,30 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Containers & Helm Charts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamos)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and its capabilities, check out each framework's README and deploy it with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS-first (open source software) development approach.

## The Era of Multi-GPU, Multi-Node

<p align="center">
@@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

# Installation

The following examples require a few system level packages.
@@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```
# Install libnuma
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```
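As a quick sanity check after installation (a minimal sketch, assuming the Python module names match the packages installed above), you can confirm the pieces import cleanly:

```
# Hypothetical post-install check; module names are assumptions based on the packages above
python -c "import sglang, flashinfer; print(sglang.__version__)"
uv pip show ai-dynamo
```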

67 changes: 52 additions & 15 deletions benchmarks/llm/README.md
@@ -1,15 +1,52 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
# LLM Benchmarking Tools

This directory contains tools for benchmarking LLM inference performance in Dynamo deployments.

## Overview

The benchmarking suite includes:
- **`perf.sh`** - Automated performance benchmarking script using GenAI-Perf
- **`plot_pareto.py`** - Results analysis and Pareto efficiency visualization
- **`nginx.conf`** - Load balancer configuration for multi-backend setups
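
A typical flow (sketch only; check each script's `--help` for the actual interface before relying on these flags) is to run the benchmark sweep first and then analyze the results:

```
# Hypothetical two-step flow; flag values are placeholders
./perf.sh --mode aggregated --tp 1 --dp 1
python3 plot_pareto.py --help   # inspect the real arguments for pointing it at the results
```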

## Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--tensor-parallelism, --tp` | Tensor parallelism for aggregated mode | 0 |
| `--data-parallelism, --dp` | Data parallelism for aggregated mode | 0 |
| `--prefill-tp` | Prefill tensor parallelism for disaggregated mode | 0 |
| `--prefill-dp` | Prefill data parallelism for disaggregated mode | 0 |
| `--decode-tp` | Decode tensor parallelism for disaggregated mode | 0 |
| `--decode-dp` | Decode data parallelism for disaggregated mode | 0 |
| `--model` | HuggingFace model ID | `neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic` |
| `--url` | Target inference endpoint | `http://localhost:8000` |
| `--concurrency` | Comma-separated concurrency levels | `1,2,4,8,16,32,64,128,256` |
| `--isl` | Input sequence length | 3000 |
| `--osl` | Output sequence length | 150 |
| `--mode` | Serving mode (`aggregated` or `disaggregated`) | `aggregated` |

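For illustration, a disaggregated run using the flags from the table above might look like the following (parallelism and concurrency values are placeholders; verify against `./perf.sh --help` for your copy of the script):

```
# Hypothetical invocation; adjust values to match your deployment
./perf.sh --mode disaggregated \
  --prefill-tp 2 --prefill-dp 1 \
  --decode-tp 4 --decode-dp 1 \
  --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
  --url http://localhost:8000 \
  --concurrency 1,4,16,64 \
  --isl 3000 --osl 150
```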

## Best Practices

1. **Warm up services** before benchmarking to ensure stable performance
2. **Match parallelism settings** to your actual deployment configuration
3. **Run multiple benchmark iterations** for statistical confidence
4. **Monitor resource utilization** during benchmarks to identify bottlenecks
5. **Compare configurations** using Pareto plots to find optimal settings

## Requirements

- GenAI-Perf tool installed and available in PATH
- Python 3.7+ with matplotlib, pandas, seaborn, numpy
- nginx (for load balancing scenarios)
- Access to target LLM inference service
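
The Python plotting dependencies listed above can be installed in one step (recent versions of each should work for the plots):

```
pip install matplotlib pandas seaborn numpy
```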

## Troubleshooting

- Ensure the target URL is accessible before running benchmarks (see the quick check below)
- Verify model names match those available in your deployment
- Check that parallelism settings align with your hardware configuration
- Monitor system resources to avoid resource contention during benchmarks
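
As a quick reachability check for the first point above (a sketch that assumes an OpenAI-compatible frontend at the target URL; adjust the path if your deployment differs):

```
# Hypothetical check; lists the models served by the target endpoint
curl -s http://localhost:8000/v1/models
```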


2 changes: 1 addition & 1 deletion container/Dockerfile.sglang
@@ -480,4 +480,4 @@ COPY ATTRIBUTION* LICENSE /workspace/
ENV PYTHONPATH=/workspace/examples/sglang/utils:$PYTHONPATH

ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"]
CMD []
CMD []
Contributor

Still some EOF issue

20 changes: 10 additions & 10 deletions examples/README.md
@@ -17,7 +17,16 @@ limitations under the License.

# Dynamo Examples

This directory contains practical examples demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.
## Framework Support

The `/examples` directory gives a broad view of how Dynamo works across various inference engines.

If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations

This directory contains practical examples & tutorials demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.

> **Want to see a specific example?**
> Open a [GitHub issue](https://github.com/ai-dynamo/dynamo/issues) to request an example you'd like to see, or [open a pull request](https://github.com/ai-dynamo/dynamo/pulls) if you'd like to contribute your own!
@@ -67,12 +76,3 @@ Before running any examples, ensure you have:
- **CUDA-compatible GPU** - For LLM inference (except hello_world, which is non-GPU aware)
- **Python 3.9+** - For client scripts and utilities
- **Kubernetes cluster** - For any cloud deployment/K8s examples

## Framework Support

These examples show how Dynamo broadly works using major inference engines.

If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations