45 changes: 25 additions & 20 deletions README.md
@@ -21,12 +21,30 @@ limitations under the License.
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/D92uqZRjCZ)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/ai-dynamo/dynamo)

| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Containers & Helm Charts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamos)** |

# NVIDIA Dynamo

High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and its capabilities, check out each framework's README and deploy it with Dynamo!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS-first (open source software) development approach.

## The Era of Multi-GPU, Multi-Node

<p align="center">
@@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
<img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

## Framework Support Matrix

| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|----------------------|----------------------------|----------------------------------------|
| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |

To learn more about each framework and their capabilities, check out each framework's README!
- **[vLLM](components/backends/vllm/README.md)**
- **[SGLang](components/backends/sglang/README.md)**
- **[TensorRT-LLM](components/backends/trtllm/README.md)**

Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.

# Installation

The following examples require a few system level packages.
@@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.

## SGLang


```
# Install libnuma
# Install libnuma-dev
apt install -y libnuma-dev

# Install flashinfer-python pre-release (required by sglang for optimized inference)
uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow

# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```
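As a quick sanity check after installation (a minimal sketch, assuming the Python module names match the packages installed above), you can confirm the pieces import cleanly:

```
# Hypothetical post-install check; module names are assumptions based on the packages above
python -c "import sglang, flashinfer; print(sglang.__version__)"
uv pip show ai-dynamo
```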

67 changes: 52 additions & 15 deletions benchmarks/llm/README.md
@@ -1,15 +1,52 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
# LLM Benchmarking Tools

This directory contains tools for benchmarking LLM inference performance in Dynamo deployments.

## Overview

The benchmarking suite includes:
- **`perf.sh`** - Automated performance benchmarking script using GenAI-Perf
- **`plot_pareto.py`** - Results analysis and Pareto efficiency visualization
- **`nginx.conf`** - Load balancer configuration for multi-backend setups
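
A typical flow (sketch only; check each script's `--help` for the actual interface before relying on these flags) is to run the benchmark sweep first and then analyze the results:

```
# Hypothetical two-step flow; flag values are placeholders
./perf.sh --mode aggregated --tp 1 --dp 1
python3 plot_pareto.py --help   # inspect the real arguments for pointing it at the results
```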

## Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--tensor-parallelism, --tp` | Tensor parallelism for aggregated mode | 0 |
| `--data-parallelism, --dp` | Data parallelism for aggregated mode | 0 |
| `--prefill-tp` | Prefill tensor parallelism for disaggregated mode | 0 |
| `--prefill-dp` | Prefill data parallelism for disaggregated mode | 0 |
| `--decode-tp` | Decode tensor parallelism for disaggregated mode | 0 |
| `--decode-dp` | Decode data parallelism for disaggregated mode | 0 |
| `--model` | HuggingFace model ID | `neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic` |
| `--url` | Target inference endpoint | `http://localhost:8000` |
| `--concurrency` | Comma-separated concurrency levels | `1,2,4,8,16,32,64,128,256` |
| `--isl` | Input sequence length | 3000 |
| `--osl` | Output sequence length | 150 |
| `--mode` | Serving mode (`aggregated` or `disaggregated`) | `aggregated` |

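For illustration, a disaggregated run using the flags from the table above might look like the following (parallelism and concurrency values are placeholders; verify against `./perf.sh --help` for your copy of the script):

```
# Hypothetical invocation; adjust values to match your deployment
./perf.sh --mode disaggregated \
  --prefill-tp 2 --prefill-dp 1 \
  --decode-tp 4 --decode-dp 1 \
  --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
  --url http://localhost:8000 \
  --concurrency 1,4,16,64 \
  --isl 3000 --osl 150
```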

## Best Practices

1. **Warm up services** before benchmarking to ensure stable performance
2. **Match parallelism settings** to your actual deployment configuration
3. **Run multiple benchmark iterations** for statistical confidence
4. **Monitor resource utilization** during benchmarks to identify bottlenecks
5. **Compare configurations** using Pareto plots to find optimal settings

## Requirements

- GenAI-Perf tool installed and available in PATH
- Python 3.7+ with matplotlib, pandas, seaborn, numpy
- nginx (for load balancing scenarios)
- Access to target LLM inference service
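
The Python plotting dependencies listed above can be installed in one step (recent versions of each should work for the plots):

```
pip install matplotlib pandas seaborn numpy
```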

## Troubleshooting

- Ensure the target URL is accessible before running benchmarks (see the quick check below)
- Verify model names match those available in your deployment
- Check that parallelism settings align with your hardware configuration
- Monitor system resources to avoid resource contention during benchmarks
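
As a quick reachability check for the first point above (a sketch that assumes an OpenAI-compatible frontend at the target URL; adjust the path if your deployment differs):

```
# Hypothetical check; lists the models served by the target endpoint
curl -s http://localhost:8000/v1/models
```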


2 changes: 1 addition & 1 deletion container/Dockerfile.sglang
@@ -480,4 +480,4 @@ COPY ATTRIBUTION* LICENSE /workspace/
ENV PYTHONPATH=/workspace/examples/sglang/utils:$PYTHONPATH

ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"]
CMD []
CMD []
Contributor

Still some EOF issue

20 changes: 10 additions & 10 deletions examples/README.md
@@ -17,7 +17,16 @@ limitations under the License.

# Dynamo Examples

This directory contains practical examples demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.
## Framework Support

The `/examples` directory gives a broad view of how Dynamo works across various inference engines.

If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations

This directory contains practical examples & tutorials demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.

> **Want to see a specific example?**
> Open a [GitHub issue](https://github.com/ai-dynamo/dynamo/issues) to request an example you'd like to see, or [open a pull request](https://github.com/ai-dynamo/dynamo/pulls) if you'd like to contribute your own!
@@ -67,12 +76,3 @@ Before running any examples, ensure you have:
- **CUDA-compatible GPU** - For LLM inference (except hello_world, which is non-GPU aware)
- **Python 3.9+** - For client scripts and utilities
- **Kubernetes cluster** - For any cloud deployment/K8s examples

## Framework Support

These examples show how Dynamo broadly works using major inference engines.

If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations