diff --git a/README.md b/README.md
index 2138f69498..b6a0f4081a 100644
--- a/README.md
+++ b/README.md
@@ -21,12 +21,30 @@ limitations under the License.
[](https://discord.gg/D92uqZRjCZ)
[](https://deepwiki.com/ai-dynamo/dynamo)
-| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** |
+| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Documentation](https://docs.nvidia.com/dynamo/latest/index.html)** | **[Support Matrix](docs/support_matrix.md)** | **[Examples](https://github.com/ai-dynamo/dynamo/tree/main/examples)** | **[Design Proposals](https://github.com/ai-dynamo/enhancements)** | **[Containers & Helm Charts](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamos)** |
# NVIDIA Dynamo
High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
+## Framework Support Matrix
+
+| Feature | vLLM | SGLang | TensorRT-LLM |
+|---------|----------------------|----------------------------|----------------------------------------|
+| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
+| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
+| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
+| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
+| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
+| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |
+
+To learn more about each framework and its capabilities, check out each framework's README and deploy it with Dynamo:
+- **[vLLM](components/backends/vllm/README.md)**
+- **[SGLang](components/backends/sglang/README.md)**
+- **[TensorRT-LLM](components/backends/trtllm/README.md)**
+
+Built in Rust for performance and in Python for extensibility, Dynamo is fully open source and driven by a transparent, OSS-first (Open Source Software) development approach.
+
## The Era of Multi-GPU, Multi-Node
@@ -47,24 +65,6 @@ Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLa
-## Framework Support Matrix
-
-| Feature | vLLM | SGLang | TensorRT-LLM |
-|---------|----------------------|----------------------------|----------------------------------------|
-| [**Disaggregated Serving**](/docs/architecture/disagg_serving.md) | ✅ | ✅ | ✅ |
-| [**Conditional Disaggregation**](/docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | 🚧 | 🚧 |
-| [**KV-Aware Routing**](/docs/architecture/kv_cache_routing.md) | ✅ | ✅ | ✅ |
-| [**SLA-Based Planner**](/docs/architecture/sla_planner.md) | ✅ | 🚧 | 🚧 |
-| [**Load Based Planner**](/docs/architecture/load_planner.md) | ✅ | 🚧 | 🚧 |
-| [**KVBM**](/docs/architecture/kvbm_architecture.md) | 🚧 | 🚧 | 🚧 |
-
-To learn more about each framework and their capabilities, check out each framework's README!
-- **[vLLM](components/backends/vllm/README.md)**
-- **[SGLang](components/backends/sglang/README.md)**
-- **[TensorRT-LLM](components/backends/trtllm/README.md)**
-
-Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach.
-
# Installation
The following examples require a few system level packages.
@@ -167,10 +167,15 @@ To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
## SGLang
+
```
-# Install libnuma
+# Install libnuma-dev
apt install -y libnuma-dev
+# Install flashinfer-python pre-release (required by sglang for optimized inference)
+uv pip install "flashinfer-python==0.2.9rc2" --prerelease=allow
+
+# Install ai-dynamo with sglang support
uv pip install ai-dynamo[sglang]
```
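+
+As an optional sanity check after installing (a minimal sketch; it assumes the packages expose the usual `flashinfer` and `sglang` import names and a `__version__` attribute):
+
+```
+# Optional: confirm the SGLang stack imports cleanly and report versions
+python -c "import flashinfer, sglang; print(flashinfer.__version__, sglang.__version__)"
+```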
diff --git a/benchmarks/llm/README.md b/benchmarks/llm/README.md
index e0cb8e976d..da55a2a0ea 100644
--- a/benchmarks/llm/README.md
+++ b/benchmarks/llm/README.md
@@ -1,15 +1,52 @@
-
-
-[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
+# LLM Benchmarking Tools
+
+This directory contains tools for benchmarking LLM inference performance in Dynamo deployments.
+
+## Overview
+
+The benchmarking suite includes:
+- **`perf.sh`** - Automated performance benchmarking script using GenAI-Perf
+- **`plot_pareto.py`** - Results analysis and Pareto efficiency visualization
+- **`nginx.conf`** - Load balancer configuration for multi-backend setups
+
+## Key Parameters
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--tensor-parallelism, --tp` | Tensor parallelism for aggregated mode | 0 |
+| `--data-parallelism, --dp` | Data parallelism for aggregated mode | 0 |
+| `--prefill-tp` | Prefill tensor parallelism for disaggregated mode | 0 |
+| `--prefill-dp` | Prefill data parallelism for disaggregated mode | 0 |
+| `--decode-tp` | Decode tensor parallelism for disaggregated mode | 0 |
+| `--decode-dp` | Decode data parallelism for disaggregated mode | 0 |
+| `--model` | HuggingFace model ID | `neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic` |
+| `--url` | Target inference endpoint | `http://localhost:8000` |
+| `--concurrency` | Comma-separated concurrency levels | `1,2,4,8,16,32,64,128,256` |
+| `--isl` | Input sequence length (tokens) | 3000 |
+| `--osl` | Output sequence length (tokens) | 150 |
+| `--mode` | Serving mode (`aggregated` or `disaggregated`) | `aggregated` |
+
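+Below is an illustrative invocation combining the flags above; the values are placeholders to adapt to your deployment, and the flag names are taken from the table (double-check them against `perf.sh` itself):
+
+```
+# Example: benchmark a disaggregated deployment behind a local endpoint (placeholder values)
+./perf.sh --mode disaggregated \
+  --model neuralmagic/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic \
+  --prefill-tp 4 --prefill-dp 1 \
+  --decode-tp 4 --decode-dp 1 \
+  --url http://localhost:8000 \
+  --concurrency 1,2,4,8 \
+  --isl 3000 --osl 150
+```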
+
+## Best Practices
+
+1. **Warm up services** before benchmarking to ensure stable performance
+2. **Match parallelism settings** to your actual deployment configuration
+3. **Run multiple benchmark iterations** for statistical confidence
+4. **Monitor resource utilization** during benchmarks to identify bottlenecks
+5. **Compare configurations** using Pareto plots to find optimal settings (see the sketch below)
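+
+As a sketch of step 5, a run of `plot_pareto.py` over collected results might look like the following; the flag shown is a hypothetical placeholder, so check the script's own help output for its actual interface:
+
+```
+# Hypothetical usage: plot Pareto-efficiency curves from saved benchmark results
+python3 plot_pareto.py --results-dir ./artifacts
+```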
+
+## Requirements
+
+- GenAI-Perf tool installed and available in PATH
+- Python 3.7+ with matplotlib, pandas, seaborn, numpy (see the install sketch below)
+- nginx (for load balancing scenarios)
+- Access to target LLM inference service
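+
+One way to set up the Python-side requirements above (a minimal sketch; it assumes GenAI-Perf is installable from PyPI as `genai-perf` — adjust if you obtain it differently):
+
+```
+# Install the GenAI-Perf client plus the plotting/analysis dependencies
+pip install genai-perf matplotlib pandas seaborn numpy
+```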
+
+## Troubleshooting
+
+- Ensure the target URL is accessible before running benchmarks (see the quick check below)
+- Verify model names match those available in your deployment
+- Check that parallelism settings align with your hardware configuration
+- Monitor system resources to avoid resource contention during benchmarks
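+
+For the first check, a quick way to confirm the endpoint is reachable (this assumes an OpenAI-compatible server exposing `/v1/models`, which is typical for Dynamo deployments but not guaranteed for every setup):
+
+```
+# Quick connectivity check against the target endpoint (adjust the URL as needed)
+curl -s http://localhost:8000/v1/models
+```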
+
+
diff --git a/container/Dockerfile.sglang b/container/Dockerfile.sglang
index 329fe0b838..c9d3d0701c 100644
--- a/container/Dockerfile.sglang
+++ b/container/Dockerfile.sglang
@@ -480,4 +480,4 @@ COPY ATTRIBUTION* LICENSE /workspace/
ENV PYTHONPATH=/workspace/examples/sglang/utils:$PYTHONPATH
ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"]
-CMD []
+CMD []
\ No newline at end of file
diff --git a/examples/README.md b/examples/README.md
index afa83c7691..1edf17a175 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -17,7 +17,16 @@ limitations under the License.
# Dynamo Examples
-This directory contains practical examples demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.
+## Framework Support
+
+The `/examples` directory shows, at a high level, how Dynamo works across various inference engines.
+
+If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
+- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
+- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
+- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations
+
+This directory contains practical examples and tutorials demonstrating how to deploy and use Dynamo for distributed LLM inference. Each example includes setup instructions, configuration files, and explanations to help you understand different deployment patterns and use cases.
> **Want to see a specific example?**
> Open a [GitHub issue](https://github.com/ai-dynamo/dynamo/issues) to request an example you'd like to see, or [open a pull request](https://github.com/ai-dynamo/dynamo/pulls) if you'd like to contribute your own!
@@ -67,12 +76,3 @@ Before running any examples, ensure you have:
- **CUDA-compatible GPU** - For LLM inference (except hello_world, which is non-GPU aware)
- **Python 3.9+** - For client scripts and utilities
- **Kubernetes cluster** - For any cloud deployment/K8s examples
-
-## Framework Support
-
-These examples show how Dynamo broadly works using major inference engines.
-
-If you want to see advanced, framework-specific deployment patterns and best practices, check out the [Components Workflows](../components/backends/) directory:
-- **[vLLM](../components/backends/vllm/)** – vLLM-specific deployment and configuration
-- **[SGLang](../components/backends/sglang/)** – SGLang integration examples and workflows
-- **[TensorRT-LLM](../components/backends/trtllm/)** – TensorRT-LLM workflows and optimizations