From 051a8eeca87f0e10f67a4f3fcefb44099fb7b870 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 18 Aug 2025 22:21:43 -0700
Subject: [PATCH 01/17] First draft: simplify conf.py, fix all errors and most
 warnings

conf.py is greatly simplified; the NVIDIA theme and other settings still
need to be added back. All errors and most warnings are fixed. The
remaining warnings are mostly MyST cross-reference (xref) warnings: some
documents link to other documents by relative paths that are correct in
their own context, but myst_parser cannot resolve them.

---
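Note (illustration only, not part of this patch): the remaining
cross-reference warnings described in the commit message could be silenced
through Sphinx's `suppress_warnings` option in docs/conf.py. This is a
minimal sketch, assuming the warning type is `myst.xref_missing`, which is
the name MyST-Parser documents for unresolved cross-references. Fixing the
links may well be preferable to suppressing the warning.

```python
# Sketch of a docs/conf.py fragment; an assumption for illustration,
# not the configuration this patch installs.
# suppress_warnings is a standard Sphinx option: a list of warning types
# ("type" or "type.subtype") that the build should not report.
suppress_warnings = [
    "myst.xref_missing",  # unresolved MyST cross-references
]
```

With this entry in place, the build would no longer report unresolved
Markdown links, so genuinely broken links would also go unnoticed.
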
 components/backends/trtllm/README.md          |  26 +-
 docs/components/backends/llm/README.md        |   1 -
 docs/conf.py                                  | 290 +++---------------
 docs/guides/backend.md                        |   2 +-
 docs/guides/dynamo_deploy/README.md           |   2 +-
 docs/guides/dynamo_deploy/dynamo_cloud.md     |   6 +-
 .../dynamo_deploy/operator_deployment.md      |   1 -
 docs/hidden_toctree.rst                       |   6 +-
 docs/index.rst                                |  17 +-
 9 files changed, 80 insertions(+), 271 deletions(-)
 delete mode 120000 docs/components/backends/llm/README.md
 mode change 100755 => 100644 docs/conf.py
 delete mode 120000 docs/guides/dynamo_deploy/operator_deployment.md

diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
index 7de2c3e610d..df7176be1a2 100644
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -49,22 +49,22 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ### Core Dynamo Features
 
-| Feature | TensorRT-LLM | Notes |
-|---------|--------------|-------|
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
+| Feature                                                                                                   | TensorRT-LLM | Notes             |
+| --------------------------------------------------------------------------------------------------------- | ------------ | ----------------- |
+| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md)                                 | ✅            |                   |
+| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧            | Not supported yet |
+| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md)                                    | ✅            |                   |
+| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md)                                        | 🚧            | Planned           |
+| [**Load Based Planner**](../../../docs/architecture/load_planner.md)                                      | 🚧            | Planned           |
+| [**KVBM**](../../../docs/architecture/kvbm_architecture.md)                                               | 🚧            | Planned           |
 
 ### Large Scale P/D and WideEP Features
 
-| Feature            | TensorRT-LLM | Notes                                                                 |
-|--------------------|--------------|-----------------------------------------------------------------------|
-| **WideEP**         | ✅           |                                                                 |
-| **DP Rank Routing**| ✅           |                                                                 |
-| **GB200 Support**  | ✅           |                                                                 |
+| Feature             | TensorRT-LLM | Notes |
+| ------------------- | ------------ | ----- |
+| **WideEP**          | ✅            |       |
+| **DP Rank Routing** | ✅            |       |
+| **GB200 Support**   | ✅            |       |
 
 ## Quick Start
 
diff --git a/docs/components/backends/llm/README.md b/docs/components/backends/llm/README.md
deleted file mode 120000
index 615da9417bd..00000000000
--- a/docs/components/backends/llm/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/llm/README.md
\ No newline at end of file
diff --git a/docs/conf.py b/docs/conf.py
old mode 100755
new mode 100644
index 3c10e46c2e0..bd29fccfd06
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,269 +1,71 @@
-#!/usr/bin/env python3
-
-# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 # Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-import json
-import os
-import sys
-from datetime import date
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-import httplib2
-from packaging.version import Version
-
-sys.path.insert(0, os.path.abspath("_extensions"))
-
-# -- conf.py setup -----------------------------------------------------------
-
-# conf.py needs to be run in the top level 'docs'
-# directory but the calling build script needs to
-# be called from the current working directory. We
-# change to the 'docs' dir here and then revert back
-# at the end of the file.
-# current_dir = os.getcwd()
-# os.chdir("docs")
 
 # -- Project information -----------------------------------------------------
-
-project = "Dynamo"
-copyright = "2025-{}, NVIDIA Corporation".format(date.today().year)
+project = "NVIDIA Dynamo"
+copyright = "2024-2025, NVIDIA CORPORATION & AFFILIATES"
 author = "NVIDIA"
 
-# Get the version of dynamo this is building.
-version_long = "0.1.0"
-
-version_short = version_long
-version_short_split = version_short.split(".")
-one_before = f"{version_short_split[0]}.{int(version_short_split[1]) - 1}.{version_short_split[2]}"
-
-
 # -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
 extensions = [
-    "ablog",
-    "myst_parser",
-    "sphinx_copybutton",
-    "sphinx_design",
-    "sphinx_prompt",
-    # "sphinxcontrib.bibtex",
-    "sphinx_tabs.tabs",
-    "sphinx_sitemap",
-    "sphinx.ext.autodoc",
-    "sphinx.ext.autosummary",
-    "sphinx.ext.mathjax",
-    "sphinx.ext.napoleon",
-    "sphinx.ext.ifconfig",
-    "sphinx.ext.extlinks",
-    "sphinxcontrib.mermaid",
-    "github_alerts",  # Custom extension for GitHub alert conversion
+    "myst_parser",  # Markdown support
+    "sphinx_design",  # Grid and card directives
+    "sphinx.ext.autodoc",  # Auto-generate docs from docstrings
+    "sphinx.ext.viewcode",  # Add source code links
+    "sphinx.ext.napoleon",  # Google/NumPy style docstrings
+    "sphinxcontrib.mermaid",  # Uncomment after: pip install sphinxcontrib-mermaid
 ]
 
-suppress_warnings = ["myst.domains", "ref.ref", "myst.header"]
+# Handle Mermaid diagrams as code blocks (not directives) to avoid warnings
+myst_fence_as_directive = ["mermaid"]  # Uncomment if sphinxcontrib-mermaid is installed
 
+# File extensions (myst_parser automatically handles .md files)
 source_suffix = [".rst", ".md"]
 
-autodoc_default_options = {
-    "members": True,
-    "undoc-members": True,
-    "private-members": True,
-}
-
-autosummary_generate = True
-autosummary_mock_imports = [
-    "tritonclient.grpc.model_config_pb2",
-    "tritonclient.grpc.service_pb2",
-    "tritonclient.grpc.service_pb2_grpc",
-]
-
-napoleon_include_special_with_doc = True
-
-numfig = True
-
-# final location of docs for seo/sitemap
-html_baseurl = "https://docs.nvidia.com/dynamo/latest/"
-
+# MyST parser configuration
 myst_enable_extensions = [
-    "dollarmath",
-    "amsmath",
-    "deflist",
-    # "html_admonition",
-    "html_image",
-    "colon_fence",
-    # "smartquotes",
-    "replacements",
-    # "linkify",
-    "substitution",
+    "colon_fence",  # ::: code blocks
+    "deflist",  # Definition lists
+    "html_image",  # HTML images
+    "tasklist",  # Task lists
 ]
-myst_heading_anchors = 5
-myst_fence_as_directive = ["mermaid"]
 
-# Add any paths that contain templates here, relative to this directory.
-# templates_path = ["_templates"] # disable it for nvidia-sphinx-theme to show footer
+# Templates path
+templates_path = ["_templates"]
 
+# List of patterns to ignore when looking for source files
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"]
 
 # -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = "nvidia_sphinx_theme"
-
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
+html_theme = "alabaster"
 html_static_path = ["_static"]
-# html_js_files = ["custom.js"]
-# html_css_files = ["custom.css"] # Not needed with new theme
 
+# Theme options
 html_theme_options = {
-    "collapse_navigation": False,
-    "github_url": "https://github.com/ai-dynamo/dynamo",
-    # "switcher": {
-    # use for local testing
-    # "json_url": "http://localhost:8000/_static/switcher.json",
-    # "json_url": "https://docs.nvidia.com/dynamo/latest/_static/switcher.json",
-    # "version_match": one_before if "dev" in version_long else version_short,
-    # },
-    "navbar_start": ["navbar-logo", "version-switcher"],
-    "primary_sidebar_end": [],
-}
-
-# Theme options are theme-specific and customize the look and feel of a theme
-# further.  For a list of options available for each theme, see the
-# documentation.
-#
-html_theme_options.update(
-    {
-        "collapse_navigation": False,
-    }
-)
-
-deploy_ngc_org = "nvidia"
-deploy_ngc_team = "dynamo"
-myst_substitutions = {
-    "VersionNum": version_short,
-    "deploy_ngc_org_team": f"{deploy_ngc_org}/{deploy_ngc_team}"
-    if deploy_ngc_team
-    else deploy_ngc_org,
+    "description": "High-performance, low-latency inference framework",
+    "github_user": "ai-dynamo",
+    "github_repo": "dynamo",
+    "github_button": True,
+    "github_banner": True,
+    "show_related": False,
+    "note_bg": "#FFF59C",
 }
 
+# Document settings
+master_doc = "index"
+html_title = f"{project} Documentation"
+html_short_title = project
+
+# Suppress warnings for external links and missing references
+suppress_warnings = [
+    #'ref.doc',              # External document references
+    #'myst.xref_missing',    # Missing cross-references
+    #'toc.not_readable',     # Unreadable toctree entries
+    #'myst.directive_unknown', # Unknown directives (like mermaid without extension)
+]
 
-def ultimateReplace(app, docname, source):
-    result = source[0]
-    for key in app.config.ultimate_replacements:
-        result = result.replace(key, app.config.ultimate_replacements[key])
-    source[0] = result
-
-
-# this is a necessary hack to allow us to fill in variables that exist in code blocks
-ultimate_replacements = {
-    "{VersionNum}": version_short,
-    "{SamplesVersionNum}": version_short,
-    "{NgcOrgTeam}": f"{deploy_ngc_org}/{deploy_ngc_team}"
-    if deploy_ngc_team
-    else deploy_ngc_org,
-}
-
-# bibtex_bibfiles = ["references.bib"]
-# To test that style looks good with common bibtex config
-# bibtex_reference_style = "author_year"
-# bibtex_default_style = "plain"
-
-### We currently use Myst: https://myst-nb.readthedocs.io/en/latest/use/execute.html
-nb_execution_mode = "off"  # Global execution disable
-# execution_excludepatterns = ['tutorials/tts-python-basics.ipynb']  # Individual notebook disable
-
-###############################
-# SETUP SWITCHER
-###############################
-switcher_path = os.path.join(html_static_path[0], "switcher.json")
-versions = []
-# Triton 2 releases
-correction = -1 if "dev" in version_long else 0
-upper_bound = version_short.split(".")[1]
-for i in range(2, int(version_short.split(".")[1]) + correction):
-    versions.append((f"2.{i}.0", f"dynamo{i}0"))
-
-# Patch releases
-# Add here.
-
-versions = sorted(versions, key=lambda v: Version(v[0]), reverse=True)
-
-# Build switcher data
-json_data = []
-for v in versions:
-    json_data.append(
-        {
-            "name": v[0],
-            "version": v[0],
-            "url": f"https://docs.nvidia.com/dynamo/archives/{v[1]}/user-guide/docs",
-        }
-    )
-if "dev" in version_long:
-    json_data.insert(
-        0,
-        {
-            "name": f"{one_before} (current_release)",
-            "version": f"{one_before}",
-            "url": "https://docs.nvidia.com/dynamo/latest/index.html",
-        },
-    )
-else:
-    json_data.insert(
-        0,
-        {
-            "name": f"{version_short} (current release)",
-            "version": f"{version_short}",
-            "url": "https://docs.nvidia.com/dynamo/latest/index.html",
-        },
-    )
-
-# Trim to last N releases.
-json_data = json_data[0:12]
-
-json_data.append(
-    {
-        "name": "older releases",
-        "version": "archives",
-        "url": "https://docs.nvidia.com/dynamo/archives/",
-    }
-)
-
-# validate the links
-for i, d in enumerate(json_data):
-    h = httplib2.Http()
-    resp = h.request(d["url"], "HEAD")
-    if int(resp[0]["status"]) >= 400:
-        print(d["url"], "NOK", resp[0]["status"])
-        # exit(1)
+# Mermaid diagram support
+myst_enable_extensions.append("html_admonition")
 
-# Write switcher data to file
-with open(switcher_path, "w") as f:
-    json.dump(json_data, f, ensure_ascii=False, indent=4)
+# Additional MyST configuration
+myst_heading_anchors = 3  # Generate anchors for headers
+myst_substitutions = {}  # Custom substitutions
diff --git a/docs/guides/backend.md b/docs/guides/backend.md
index 68b0e984328..e0ed7643372 100644
--- a/docs/guides/backend.md
+++ b/docs/guides/backend.md
@@ -76,7 +76,7 @@ The `model_type` can be:
 
 See `components/backends` for full code examples.
 
-### Component names
+## Component names
 
 A worker needs three names to register itself: namespace.component.endpoint
 
diff --git a/docs/guides/dynamo_deploy/README.md b/docs/guides/dynamo_deploy/README.md
index eb4cd7a7cae..a21ca18b998 100644
--- a/docs/guides/dynamo_deploy/README.md
+++ b/docs/guides/dynamo_deploy/README.md
@@ -35,7 +35,7 @@ We provide a Custom Resource YAML file for many examples under the components/ba
 
 [View TRT-LLM K8s](../../../components/backends/trtllm/deploy/README.md)
 
-### Deploying a particular example
+## Deploying a particular example
 
 ```bash
 # Set your dynamo root directory
diff --git a/docs/guides/dynamo_deploy/dynamo_cloud.md b/docs/guides/dynamo_deploy/dynamo_cloud.md
index 3c549b514d3..4ba8d459901 100644
--- a/docs/guides/dynamo_deploy/dynamo_cloud.md
+++ b/docs/guides/dynamo_deploy/dynamo_cloud.md
@@ -42,7 +42,7 @@ Before getting started with the Dynamo cloud platform, ensure you have:
 > [!TIP]
 > Don't have a Kubernetes cluster? Check out our [Minikube setup guide](../../../docs/guides/dynamo_deploy/minikube.md) to set up a local environment! 🏠
 
-#### 🏗️ Build Dynamo inference runtime.
+### 🏗️ Build Dynamo inference runtime.
 
 [One-time Action]
 Before you could use Dynamo make sure you have setup the Inference Runtime Image.
@@ -70,7 +70,7 @@ docker push <your-registry>/dynamo:${IMAGE_TAG}
 
 Before deploying Dynamo Cloud, ensure your Kubernetes cluster meets the following requirements:
 
-#### 1. 🛡️ Istio Installation
+### 1. 🛡️ Istio Installation
 Dynamo Cloud requires Istio for service mesh capabilities. Verify Istio is installed and running:
 
 ```bash
@@ -81,7 +81,7 @@ kubectl get pods -n istio-system
 # istiod-* pods should be in Running state
 ```
 
-#### 2. 💾 PVC Support with Default Storage Class
+### 2. 💾 PVC Support with Default Storage Class
 Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration:
 
 ```bash
diff --git a/docs/guides/dynamo_deploy/operator_deployment.md b/docs/guides/dynamo_deploy/operator_deployment.md
deleted file mode 120000
index 80ca4341ee4..00000000000
--- a/docs/guides/dynamo_deploy/operator_deployment.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../guides/dynamo_deploy/operator_deployment.md
\ No newline at end of file
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 19a2ee7705b..c9d0d0d06c8 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -37,13 +37,13 @@
    components/backends/sglang/deploy/README.md
    components/backends/sglang/docs/dsr1-wideep-h100.md
    components/backends/sglang/docs/multinode-examples.md
-   components/backends/sglang/docs/sgl-http-server.md
+
    components/backends/sglang/slurm_jobs/README.md
    components/router/README.md
    examples/README.md
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md
-   guides/dynamo_deploy/helm_install.md
+
    guides/dynamo_deploy/gke_setup.md
    guides/dynamo_deploy/README.md
    guides/dynamo_run.md
@@ -51,7 +51,7 @@
    components/backends/trtllm/README.md
    components/backends/trtllm/deploy/README.md
    components/backends/trtllm/llama4_plus_eagle.md
-   components/backends/trtllm/multinode-examples.md
+   components/backends/trtllm/multinode/multinode-examples.md
    components/backends/trtllm/kv-cache-transfer.md
    components/backends/vllm/deploy/README.md
    components/backends/vllm/multi-node.md
diff --git a/docs/index.rst b/docs/index.rst
index 822e96b7bbe..4cb4ef06d09 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,7 +26,7 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
 
    - `Dynamo README <https://github.com/ai-dynamo/dynamo/blob/main/README.md>`_
    - `Architecture and features doc <https://github.com/ai-dynamo/dynamo/blob/main/docs/architecture/>`_
-   - `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/guides>`_
+   - `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/docs/guides>`_
    - `Dynamo examples repo <https://github.com/ai-dynamo/dynamo/tree/main/examples>`_
 
 
@@ -135,6 +135,7 @@ The examples below assume you build the latest image yourself from source. If us
    KV Cache Routing <architecture/kv_cache_routing.md>
    Planner <architecture/planner_intro.rst>
    Dynamo Architecture Flow <architecture/dynamo_flow.md>
+   Request Migration <architecture/request_migration.md>
 
 .. toctree::
    :hidden:
@@ -146,11 +147,12 @@ The examples below assume you build the latest image yourself from source. If us
 
 .. toctree::
    :hidden:
-   :caption: Deployment Guides
+   :caption: Deployment guides
 
    Dynamo Deploy Quickstart <guides/dynamo_deploy/quickstart.md>
    Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
-   Manual Helm Deployment <guides/dynamo_deploy/helm_install.md>
+
+   Multinode Deployment <guides/dynamo_deploy/multinode-deployment.md>
    Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
    Model Caching with Fluid <guides/dynamo_deploy/model_caching_with_fluid.md>
 
@@ -167,8 +169,15 @@ The examples below assume you build the latest image yourself from source. If us
 
 .. toctree::
    :hidden:
-   :caption: Reference
+   :caption: Observability
+
+   Dynamo Metrics <guides/metrics.md>
+   K8s Metrics <guides/deploy/k8s_metrics.md>
+
 
+.. toctree::
+   :hidden:
+   :caption: Reference
 
    Glossary <dynamo_glossary.md>
    NIXL Connect API <API/nixl_connect/README.md>

From 9d46868425ab664dd80117f83c8277c0805a6602 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 18 Aug 2025 22:58:29 -0700
Subject: [PATCH 02/17] Bring back some bits from old conf.py

---
 docs/conf.py | 61 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/docs/conf.py b/docs/conf.py
index bd29fccfd06..b39a27429a9 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,4 +1,6 @@
 # Configuration file for the Sphinx documentation builder.
+import os
+import sys
 
 # -- Project information -----------------------------------------------------
 project = "NVIDIA Dynamo"
@@ -6,15 +8,30 @@
 author = "NVIDIA"
 
 # -- General configuration ---------------------------------------------------
+
+# Standard extensions
 extensions = [
-    "myst_parser",  # Markdown support
-    "sphinx_design",  # Grid and card directives
-    "sphinx.ext.autodoc",  # Auto-generate docs from docstrings
-    "sphinx.ext.viewcode",  # Add source code links
-    "sphinx.ext.napoleon",  # Google/NumPy style docstrings
-    "sphinxcontrib.mermaid",  # Uncomment after: pip install sphinxcontrib-mermaid
+    "ablog",
+    "myst_parser",
+    "sphinx_copybutton",
+    "sphinx_design",
+    "sphinx_prompt",
+    # "sphinxcontrib.bibtex",
+    "sphinx_tabs.tabs",
+    "sphinx_sitemap",
+    "sphinx.ext.autodoc",
+    "sphinx.ext.autosummary",
+    "sphinx.ext.mathjax",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.ifconfig",
+    "sphinx.ext.extlinks",
+    "sphinxcontrib.mermaid",
 ]
 
+# Custom extensions
+sys.path.insert(0, os.path.abspath("_extensions"))
+extensions.append("github_alerts")
+
 # Handle Mermaid diagrams as code blocks (not directives) to avoid warnings
 myst_fence_as_directive = ["mermaid"]  # Uncomment if sphinxcontrib-mermaid is installed
 
@@ -36,36 +53,32 @@
 exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"]
 
 # -- Options for HTML output -------------------------------------------------
-html_theme = "alabaster"
+html_theme = "nvidia_sphinx_theme"
 html_static_path = ["_static"]
-
-# Theme options
 html_theme_options = {
-    "description": "High-performance, low-latency inference framework",
-    "github_user": "ai-dynamo",
-    "github_repo": "dynamo",
-    "github_button": True,
-    "github_banner": True,
-    "show_related": False,
-    "note_bg": "#FFF59C",
+    "collapse_navigation": False,
+    "github_url": "https://github.com/ai-dynamo/dynamo",
+    "navbar_start": ["navbar-logo"],
+    "primary_sidebar_end": [],
 }
 
 # Document settings
 master_doc = "index"
 html_title = f"{project} Documentation"
 html_short_title = project
+html_baseurl = "https://docs.nvidia.com/dynamo/latest/"
 
 # Suppress warnings for external links and missing references
 suppress_warnings = [
-    #'ref.doc',              # External document references
-    #'myst.xref_missing',    # Missing cross-references
+    "myst.xref_missing",  # Missing cross-references
     #'toc.not_readable',     # Unreadable toctree entries
-    #'myst.directive_unknown', # Unknown directives (like mermaid without extension)
 ]
 
-# Mermaid diagram support
-myst_enable_extensions.append("html_admonition")
+# TODO: See if this is needed for rendering mermaid diagrams or not
 
-# Additional MyST configuration
-myst_heading_anchors = 3  # Generate anchors for headers
-myst_substitutions = {}  # Custom substitutions
+## Mermaid diagram support
+# myst_enable_extensions.append("html_admonition")
+#
+## Additional MyST configuration
+# myst_heading_anchors = 3  # Generate anchors for headers
+# myst_substitutions = {}  # Custom substitutions

From 80b408d063b051e3eaee426ccbd25cc696118284 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 18 Aug 2025 23:08:15 -0700
Subject: [PATCH 03/17] Try relative paths for component READMEs, undo
 whitespace changes to trtllm README, move unreferenced files to
 hidden_toctree.rst for now

---
 components/backends/trtllm/README.md | 26 +++++++++++++-------------
 docs/conf.py                         |  3 +++
 docs/hidden_toctree.rst              |  4 ++++
 docs/index.rst                       | 23 +++++++----------------
 4 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
index df7176be1a2..7de2c3e610d 100644
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -49,22 +49,22 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ### Core Dynamo Features
 
-| Feature                                                                                                   | TensorRT-LLM | Notes             |
-| --------------------------------------------------------------------------------------------------------- | ------------ | ----------------- |
-| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md)                                 | ✅            |                   |
-| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧            | Not supported yet |
-| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md)                                    | ✅            |                   |
-| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md)                                        | 🚧            | Planned           |
-| [**Load Based Planner**](../../../docs/architecture/load_planner.md)                                      | 🚧            | Planned           |
-| [**KVBM**](../../../docs/architecture/kvbm_architecture.md)                                               | 🚧            | Planned           |
+| Feature | TensorRT-LLM | Notes |
+|---------|--------------|-------|
+| [**Disaggregated Serving**](../../../docs/architecture/disagg_serving.md) | ✅ |  |
+| [**Conditional Disaggregation**](../../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
+| [**KV-Aware Routing**](../../../docs/architecture/kv_cache_routing.md) | ✅ |  |
+| [**SLA-Based Planner**](../../../docs/architecture/sla_planner.md) | 🚧 | Planned |
+| [**Load Based Planner**](../../../docs/architecture/load_planner.md) | 🚧 | Planned |
+| [**KVBM**](../../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
 
 ### Large Scale P/D and WideEP Features
 
-| Feature             | TensorRT-LLM | Notes |
-| ------------------- | ------------ | ----- |
-| **WideEP**          | ✅            |       |
-| **DP Rank Routing** | ✅            |       |
-| **GB200 Support**   | ✅            |       |
+| Feature            | TensorRT-LLM | Notes                                                                 |
+|--------------------|--------------|-----------------------------------------------------------------------|
+| **WideEP**         | ✅           |                                                                 |
+| **DP Rank Routing**| ✅           |                                                                 |
+| **GB200 Support**  | ✅           |                                                                 |
 
 ## Quick Start
 
diff --git a/docs/conf.py b/docs/conf.py
index b39a27429a9..48febec9426 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,3 +1,6 @@
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 # Configuration file for the Sphinx documentation builder.
 import os
 import sys
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index c9d0d0d06c8..32e45a656b1 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -56,3 +56,7 @@
    components/backends/vllm/deploy/README.md
    components/backends/vllm/multi-node.md
 
+   guides/metrics.md
+   guides/deploy/k8s_metrics.md
+   architecture/request_migration.md
+
diff --git a/docs/index.rst b/docs/index.rst
index 4cb4ef06d09..fd7fc79f30f 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,7 +26,7 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
 
    - `Dynamo README <https://github.com/ai-dynamo/dynamo/blob/main/README.md>`_
    - `Architecture and features doc <https://github.com/ai-dynamo/dynamo/blob/main/docs/architecture/>`_
-   - `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/docs/guides>`_
+   - `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/guides>`_
    - `Dynamo examples repo <https://github.com/ai-dynamo/dynamo/tree/main/examples>`_
 
 
@@ -135,7 +135,6 @@ The examples below assume you build the latest image yourself from source. If us
    KV Cache Routing <architecture/kv_cache_routing.md>
    Planner <architecture/planner_intro.rst>
    Dynamo Architecture Flow <architecture/dynamo_flow.md>
-   Request Migration <architecture/request_migration.md>
 
 .. toctree::
    :hidden:
@@ -147,7 +146,7 @@ The examples below assume you build the latest image yourself from source. If us
 
 .. toctree::
    :hidden:
-   :caption: Deployment guides
+   :caption: Deployment Guides
 
    Dynamo Deploy Quickstart <guides/dynamo_deploy/quickstart.md>
    Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
@@ -160,20 +159,12 @@ The examples below assume you build the latest image yourself from source. If us
    :hidden:
    :caption: Examples
 
-   Hello World <examples/runtime/hello_world/README.md>
-   LLM Deployment Examples using VLLM <components/backends/vllm/README.md>
-   LLM Deployment Examples using SGLang <components/backends/sglang/README.md>
-   Multinode Examples using SGLang <components/backends/sglang/docs/multinode-examples.md>
+   Hello World <../examples/runtime/hello_world/README.md>
+   LLM Deployment Examples using VLLM <../components/backends/vllm/README.md>
+   LLM Deployment Examples using SGLang <../components/backends/sglang/README.md>
+   LLM Deployment Examples using TensorRT-LLM <../components/backends/trtllm/README.md>
+   Multinode Examples using SGLang <../components/backends/sglang/docs/multinode-examples.md>
    Planner Benchmark Example <guides/planner_benchmark/README.md>
-   LLM Deployment Examples using TensorRT-LLM <components/backends/trtllm/README.md>
-
-.. toctree::
-   :hidden:
-   :caption: Observability
-
-   Dynamo Metrics <guides/metrics.md>
-   K8s Metrics <guides/deploy/k8s_metrics.md>
-
 
 .. toctree::
    :hidden:

From 2a147896fa111d8b0f0faee7552759281206e407 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 18 Aug 2025 23:18:10 -0700
Subject: [PATCH 04/17] Add symlink for sglang README, comment out suppressed
 myst warning

---
 docs/components/backends/sglang/README.md | 1 +
 docs/conf.py                              | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)
 create mode 120000 docs/components/backends/sglang/README.md

diff --git a/docs/components/backends/sglang/README.md b/docs/components/backends/sglang/README.md
new file mode 120000
index 00000000000..c481015d877
--- /dev/null
+++ b/docs/components/backends/sglang/README.md
@@ -0,0 +1 @@
+../../../../components/backends/sglang/README.md
\ No newline at end of file
diff --git a/docs/conf.py b/docs/conf.py
index 48febec9426..8e13feedbc8 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -73,7 +73,7 @@
 
 # Suppress warnings for external links and missing references
 suppress_warnings = [
-    "myst.xref_missing",  # Missing cross-references
+    # "myst.xref_missing",  # Missing cross-references
     #'toc.not_readable',     # Unreadable toctree entries
 ]
 

From d103271b0e3effd2b8e59a9eeca4a8af75ad14d3 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Tue, 19 Aug 2025 00:44:07 -0700
Subject: [PATCH 05/17] Add minimal symlinks, and suppress myst.xref_missing
 warnings

---
 docs/components/router/README.md | 146 -------------------------------
 docs/conf.py                     |  14 +--
 docs/examples/README.md          |  93 +-------------------
 docs/hidden_toctree.rst          |  13 ---
 4 files changed, 5 insertions(+), 261 deletions(-)
 delete mode 100644 docs/components/router/README.md
 mode change 100644 => 120000 docs/examples/README.md

diff --git a/docs/components/router/README.md b/docs/components/router/README.md
deleted file mode 100644
index b2f5d7b61bc..00000000000
--- a/docs/components/router/README.md
+++ /dev/null
@@ -1,146 +0,0 @@
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
--->
-
-# KV Router
-
-## Overview
-
-The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
-
-## Quick Start
-
-To launch the Dynamo frontend with the KV Router:
-
-```bash
-python -m dynamo.frontend --router-mode kv --http-port 8080
-```
-
-This command:
-- Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8080 (configurable)
-- Automatically handles all backend workers registered to the Dynamo endpoint
-
-Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
-- Tracks the state of all registered workers
-- Makes routing decisions based on KV cache overlap
-- Balances load across available workers
-
-### Important Arguments
-
-The KV Router supports several key configuration options:
-
-- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
-
-- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
-  - `0.0`: Deterministic selection of the best worker
-  - `> 0.0`: Probabilistic selection using softmax sampling
-  - Higher values increase randomness, helping prevent worker saturation
-
-- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
-  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
-  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
-
-For a complete list of available options:
-```bash
-python -m dynamo.frontend --help
-```
-
-## KV Router Architecture
-
-The KV Router tracks two key metrics for each worker:
-
-1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
-
-2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
-   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
-   - Potential prefill blocks = New prefill tokens / Block size
-
-### Block Tracking Mechanisms
-
-The router maintains block information through two complementary systems:
-
-- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
-  - Incremented when adding a new request
-  - Updated during token generation
-  - Decremented upon request completion
-
-- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
-
-## Cost Function
-
-The KV Router's routing decision is based on a simple cost function:
-
-```
-logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
-```
-
-Where:
-- Lower logit values are better (less computational cost)
-- The router uses softmax sampling with optional temperature to select workers
-
-### Key Parameter: kv-overlap-score-weight
-
-The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
-
-- **Higher values (> 1.0)**: Emphasize reducing prefill cost
-  - Prioritizes routing to workers with better cache hits
-  - Optimizes for Time To First Token (TTFT)
-  - Best for workloads where initial response latency is critical
-
-- **Lower values (< 1.0)**: Emphasize decode performance
-  - Distributes active decoding blocks more evenly
-  - Optimizes for Inter-Token Latency (ITL)
-  - Best for workloads with long generation sequences
-
-## KV Events vs. Approximation Mode
-
-The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
-
-- **With KV Events (default)**:
-  - Calculates overlap accurately using actual cached blocks
-  - Provides higher accuracy with event processing overhead
-  - Recommended for production deployments
-
-- **Without KV Events (--no-kv-events)**:
-  - Uses ApproxKvIndexer to estimate cached blocks from routing decisions
-  - Assumes blocks from recent requests remain cached
-  - Reduces overhead at the cost of routing accuracy
-  - Suitable for testing or when event processing becomes a bottleneck
-
-## Tuning Guidelines
-
-### 1. Understand Your Workload Characteristics
-
-- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
-- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
-
-### 2. Monitor Key Metrics
-
-The router logs the cost calculation for each worker:
-```
-Formula for worker_1: 125.3 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
-```
-
-This shows:
-- Total cost (125.3)
-- Overlap weight × prefill blocks (1.0 × 100.5)
-- Active blocks (25.0)
-- Cached blocks that contribute to overlap (15)
-
-### 3. Temperature-Based Routing
-
-The `router_temperature` parameter controls routing randomness:
-- **0.0 (default)**: Deterministic selection of the best worker
-- **> 0.0**: Probabilistic selection, higher values increase randomness
-- Useful for preventing worker saturation and improving load distribution
-
-### 4. Iterative Optimization
-
-1. Begin with default settings
-2. Monitor TTFT and ITL metrics
-3. Adjust `kv-overlap-score-weight` to meet your performance goals:
-   - To reduce TTFT: Increase the weight
-   - To reduce ITL: Decrease the weight
-4. If you observe severe load imbalance, increase the temperature setting
\ No newline at end of file
diff --git a/docs/conf.py b/docs/conf.py
index 8e13feedbc8..546b8c3ad06 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -73,15 +73,9 @@
 
 # Suppress warnings for external links and missing references
 suppress_warnings = [
-    # "myst.xref_missing",  # Missing cross-references
-    #'toc.not_readable',     # Unreadable toctree entries
+    "myst.xref_missing",  # Missing cross-references of relative links outside docs folder
 ]
 
-# TODO: See if this is needed for rendering mermaid diagrams or not
-
-## Mermaid diagram support
-# myst_enable_extensions.append("html_admonition")
-#
-## Additional MyST configuration
-# myst_heading_anchors = 3  # Generate anchors for headers
-# myst_substitutions = {}  # Custom substitutions
+# Additional MyST configuration
+myst_heading_anchors = 7  # Generate anchors for headers
+myst_substitutions = {}  # Custom substitutions
diff --git a/docs/examples/README.md b/docs/examples/README.md
deleted file mode 100644
index 560360cd626..00000000000
--- a/docs/examples/README.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# Examples of using Dynamo Platform
-
-## Serving examples locally
-
-Follow individual examples under components/backends/ to serve models locally.
-
-For example follow the [vLLM Backend Example](../../components/backends/vllm/README.md)
-
-For a basic GPU - unaware example see the [Hello World Example](../../examples/runtime/hello_world/README.md)
-
-## Deploying Examples to Kubernetes
-
-First you need to install the Dynamo Cloud Platform. Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you.
-Before you can deploy your graphs, you need to deploy the Dynamo Runtime and Dynamo Cloud images. This is a one-time action, only necessary the first time you deploy a DynamoGraph.
-
-### Instructions for Dynamo User
-If you are a **👤 Dynamo User** first follow the [Quickstart Guide](../guides/dynamo_deploy/quickstart.md) first.
-
-### Instructions for Dynamo Contributor
-If you are a **🧑‍💻 Dynamo Contributor** you may have to rebuild the dynamo platform images as the code evolves.
-For more details read the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)
-Read more on deploying Dynamo Cloud read [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md).
-
-
-### Deploying a particular example
-
-```bash
-# Set your dynamo root directory
-cd <root-dynamo-folder>
-export PROJECT_ROOT=$(pwd)
-export NAMESPACE=<your-namespace> # the namespace you used to deploy Dynamo cloud to.
-```
-
-Deploying an example consists of the simple `kubectl apply -f ... -n ${NAMESPACE}` command. For example:
-
-```bash
-kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
-```
-
-You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your deployment.
-You can use `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete the deployment.
-
-We provide a Custom Resource yaml file for many examples under the `components/backends/<backend-name>/deploy/`folder.
-Consult the examples below for the CRs for your specific inference backend.
-
-[View SGLang k8s](../../components/backends/sglang/deploy/README.md)
-
-[View vLLM K8s](../../components/backends/vllm/deploy/README.md)
-
-[View TRTLLM k8s](../../components/backends/trtllm/deploy/README.md)
-
-**Note 1** Example Image
-
-The examples use a prebuilt image from the `nvcr.io` registry.
-You can build your own image and update the image location in your CR file prior to applying.
-You could build your own image using
-
-```bash
-./container/build.sh --framework <your-inference-framework>
-```
-
-For example for the `sglang` run
-```bash
-./container/build.sh --framework sglang
-```
-
-Then you would need to overwrite the image in the examples.
-
-```bash
-extraPodSpec:
-        mainContainer:
-          image: <image-in-your-$DYNAMO_IMAGE>
-```
-
-**Note 2**
-Setup port forward if needed when deploying to Kubernetes.
-
-List the services in your namespace:
-
-```bash
-kubectl get svc -n ${NAMESPACE}
-```
-Look for one that ends in `-frontend` and use it for port forward.
-
-```bash
-SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1)
-kubectl port-forward svc/${SERVICE_NAME}-frontend 8080:8080 -n ${NAMESPACE}
-```
-
-Consult the [Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
-
-
diff --git a/docs/examples/README.md b/docs/examples/README.md
new file mode 120000
index 00000000000..6fa53604d90
--- /dev/null
+++ b/docs/examples/README.md
@@ -0,0 +1 @@
+../../examples/README.md
\ No newline at end of file
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 32e45a656b1..ad4f87b7e0b 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -34,12 +34,7 @@
    API/nixl_connect/writable_operation.md
    API/nixl_connect/read_operation.md
    API/nixl_connect/write_operation.md
-   components/backends/sglang/deploy/README.md
-   components/backends/sglang/docs/dsr1-wideep-h100.md
-   components/backends/sglang/docs/multinode-examples.md
 
-   components/backends/sglang/slurm_jobs/README.md
-   components/router/README.md
    examples/README.md
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md
@@ -47,14 +42,6 @@
    guides/dynamo_deploy/gke_setup.md
    guides/dynamo_deploy/README.md
    guides/dynamo_run.md
-   components/backends/vllm/README.md
-   components/backends/trtllm/README.md
-   components/backends/trtllm/deploy/README.md
-   components/backends/trtllm/llama4_plus_eagle.md
-   components/backends/trtllm/multinode/multinode-examples.md
-   components/backends/trtllm/kv-cache-transfer.md
-   components/backends/vllm/deploy/README.md
-   components/backends/vllm/multi-node.md
 
    guides/metrics.md
    guides/deploy/k8s_metrics.md

From efbbfc86a6da3cb48237d64385c028d343ccfbc8 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Tue, 19 Aug 2025 01:18:47 -0700
Subject: [PATCH 06/17] Restore docs/components/router/README.md

---
 docs/components/router/README.md | 146 +++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 docs/components/router/README.md

diff --git a/docs/components/router/README.md b/docs/components/router/README.md
new file mode 100644
index 00000000000..b2f5d7b61bc
--- /dev/null
+++ b/docs/components/router/README.md
@@ -0,0 +1,146 @@
+<!--
+SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+SPDX-License-Identifier: Apache-2.0
+-->
+
+# KV Router
+
+## Overview
+
+The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
+
+## Quick Start
+
+To launch the Dynamo frontend with the KV Router:
+
+```bash
+python -m dynamo.frontend --router-mode kv --http-port 8080
+```
+
+This command:
+- Launches the Dynamo frontend service with KV routing enabled
+- Exposes the service on port 8080 (configurable)
+- Automatically handles all backend workers registered to the Dynamo endpoint
+
+Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
+- Tracks the state of all registered workers
+- Makes routing decisions based on KV cache overlap
+- Balances load across available workers
+
+### Important Arguments
+
+The KV Router supports several key configuration options:
+
+- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
+
+- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
+  - `0.0`: Deterministic selection of the best worker
+  - `> 0.0`: Probabilistic selection using softmax sampling
+  - Higher values increase randomness, helping prevent worker saturation
+
+- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
+  - `--kv-events`: Uses real-time events from workers for accurate cache tracking
+  - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
+
+For a complete list of available options:
+```bash
+python -m dynamo.frontend --help
+```
+
+## KV Router Architecture
+
+The KV Router tracks two key metrics for each worker:
+
+1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
+
+2. **Potential New Prefill Blocks**: The number of tokens that need to be computed from scratch on a worker, calculated as:
+   - New prefill tokens = Total input tokens - (Overlap blocks × Block size)
+   - Potential prefill blocks = New prefill tokens / Block size
+
+### Block Tracking Mechanisms
+
+The router maintains block information through two complementary systems:
+
+- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
+  - Incremented when adding a new request
+  - Updated during token generation
+  - Decremented upon request completion
+
+- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
+
+## Cost Function
+
+The KV Router's routing decision is based on a simple cost function:
+
+```
+logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
+```
+
+Where:
+- Lower logit values are better (less computational cost)
+- The router uses softmax sampling with optional temperature to select workers
+
+### Key Parameter: kv-overlap-score-weight
+
+The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
+
+- **Higher values (> 1.0)**: Emphasize reducing prefill cost
+  - Prioritizes routing to workers with better cache hits
+  - Optimizes for Time To First Token (TTFT)
+  - Best for workloads where initial response latency is critical
+
+- **Lower values (< 1.0)**: Emphasize decode performance
+  - Distributes active decoding blocks more evenly
+  - Optimizes for Inter-Token Latency (ITL)
+  - Best for workloads with long generation sequences
+
+## KV Events vs. Approximation Mode
+
+The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
+
+- **With KV Events (default)**:
+  - Calculates overlap accurately using actual cached blocks
+  - Provides higher accuracy with event processing overhead
+  - Recommended for production deployments
+
+- **Without KV Events (--no-kv-events)**:
+  - Uses ApproxKvIndexer to estimate cached blocks from routing decisions
+  - Assumes blocks from recent requests remain cached
+  - Reduces overhead at the cost of routing accuracy
+  - Suitable for testing or when event processing becomes a bottleneck
+
+## Tuning Guidelines
+
+### 1. Understand Your Workload Characteristics
+
+- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
+- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
+
+### 2. Monitor Key Metrics
+
+The router logs the cost calculation for each worker:
+```
+Formula for worker_1: 125.3 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
+```
+
+This shows:
+- Total cost (125.3)
+- Overlap weight × prefill blocks (1.0 × 100.5)
+- Active blocks (25.0)
+- Cached blocks that contribute to overlap (15)
+
+### 3. Temperature-Based Routing
+
+The `router_temperature` parameter controls routing randomness:
+- **0.0 (default)**: Deterministic selection of the best worker
+- **> 0.0**: Probabilistic selection, higher values increase randomness
+- Useful for preventing worker saturation and improving load distribution
+
+### 4. Iterative Optimization
+
+1. Begin with default settings
+2. Monitor TTFT and ITL metrics
+3. Adjust `kv-overlap-score-weight` to meet your performance goals:
+   - To reduce TTFT: Increase the weight
+   - To reduce ITL: Decrease the weight
+4. If you observe severe load imbalance, increase the temperature setting
\ No newline at end of file

From f5a6e4998a096497b677ed76f89264373b366743 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Sun, 24 Aug 2025 21:25:29 -0700
Subject: [PATCH 07/17] Fix new warnings of unused docs after sync with main

---
 docs/hidden_toctree.rst | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index ad4f87b7e0b..5a0f72784df 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -35,15 +35,18 @@
    API/nixl_connect/read_operation.md
    API/nixl_connect/write_operation.md
 
-   examples/README.md
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md
-
    guides/dynamo_deploy/gke_setup.md
+   guides/dynamo_deploy/grove.md
+   guides/dynamo_deploy/k8s_metrics.md
    guides/dynamo_deploy/README.md
    guides/dynamo_run.md
-
    guides/metrics.md
-   guides/deploy/k8s_metrics.md
+   guides/run_kvbm_in_vllm.md
+
    architecture/request_migration.md
 
+   examples/README.md
+
+   components/router/README.md

From c18e5b44f617a593637c05f305dccade9f056d16 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 01:15:48 -0700
Subject: [PATCH 08/17] docs: Refactor left side table of contents for docs
 website - v1

---
 docs/_includes/dive_in_examples.rst           |  32 ++++
 docs/_includes/install.rst                    |  44 +++++
 docs/_includes/quick_start_local.rst          |  42 +++++
 docs/_sections/architecture.rst               |  10 ++
 docs/_sections/backends.rst                   |  42 +++++
 docs/_sections/examples.rst                   |   8 +
 docs/_sections/installation.rst               |  10 ++
 docs/_sections/kubernetes.rst                 |  12 ++
 docs/_sections/quickstart.rst                 |  10 ++
 docs/_sections/slurm.rst                      |  12 ++
 .../trtllm/multinode/multinode-examples.md    |   1 +
 docs/hidden_toctree.rst                       |  25 ++-
 docs/index.rst                                | 159 +++---------------
 13 files changed, 260 insertions(+), 147 deletions(-)
 create mode 100644 docs/_includes/dive_in_examples.rst
 create mode 100644 docs/_includes/install.rst
 create mode 100644 docs/_includes/quick_start_local.rst
 create mode 100644 docs/_sections/architecture.rst
 create mode 100644 docs/_sections/backends.rst
 create mode 100644 docs/_sections/examples.rst
 create mode 100644 docs/_sections/installation.rst
 create mode 100644 docs/_sections/kubernetes.rst
 create mode 100644 docs/_sections/quickstart.rst
 create mode 100644 docs/_sections/slurm.rst
 create mode 120000 docs/components/backends/trtllm/multinode/multinode-examples.md

diff --git a/docs/_includes/dive_in_examples.rst b/docs/_includes/dive_in_examples.rst
new file mode 100644
index 00000000000..60eb9048fb7
--- /dev/null
+++ b/docs/_includes/dive_in_examples.rst
@@ -0,0 +1,32 @@
+The examples below assume you build the latest image yourself from source. If you are using a prebuilt image, follow the examples from the corresponding branch.
+
+.. grid:: 1 2 2 2
+    :gutter: 3
+    :margin: 0
+    :padding: 3 4 0 0
+
+    .. grid-item-card:: :doc:`Hello World <../examples/runtime/hello_world/README>`
+        :link: ../examples/runtime/hello_world/README
+        :link-type: doc
+
+        Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph.
+
+    .. grid-item-card:: :doc:`vLLM <../components/backends/vllm/README>`
+        :link: ../components/backends/vllm/README
+        :link-type: doc
+
+        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with vLLM.
+
+    .. grid-item-card:: :doc:`SGLang <../components/backends/sglang/README>`
+        :link: ../components/backends/sglang/README
+        :link-type: doc
+
+        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with SGLang.
+
+    .. grid-item-card:: :doc:`TensorRT-LLM <../components/backends/trtllm/README>`
+        :link: ../components/backends/trtllm/README
+        :link-type: doc
+
+        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with TensorRT-LLM.
+
+
diff --git a/docs/_includes/install.rst b/docs/_includes/install.rst
new file mode 100644
index 00000000000..7d7309763ff
--- /dev/null
+++ b/docs/_includes/install.rst
@@ -0,0 +1,44 @@
+Pip (PyPI)
+----------
+
+Install a pre-built wheel from PyPI.
+
+.. code-block:: bash
+
+   # Create a virtual environment and activate it
+   uv venv venv
+   source venv/bin/activate
+
+   # Install Dynamo from PyPI (choose one backend extra)
+   uv pip install "ai-dynamo[sglang]==0.4.1"  # or [vllm], [trtllm]
+
+
+Pip from source
+---------------
+
+Install directly from a local checkout for development.
+
+.. code-block:: bash
+
+   # Clone the repository
+   git clone https://github.com/ai-dynamo/dynamo.git
+   cd dynamo
+
+   # Create a virtual environment and activate it
+   uv venv venv
+   source venv/bin/activate
+   uv pip install ".[sglang]"  # or [vllm], [trtllm]
+
+
+Docker
+------
+
+Pull and run prebuilt images from NVIDIA NGC (`nvcr.io`).
+
+.. code-block:: bash
+
+   # Run a container (mount your workspace if needed)
+   docker run --rm -it \
+     --gpus all \
+     --network host \
+     nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1  # or vllm, tensorrtllm
diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst
new file mode 100644
index 00000000000..984bb91810e
--- /dev/null
+++ b/docs/_includes/quick_start_local.rst
@@ -0,0 +1,42 @@
+Get started with Dynamo locally in just a few commands:
+
+**1. Install Dynamo**
+
+.. code-block:: bash
+
+   # Install uv (recommended Python package manager)
+   curl -LsSf https://astral.sh/uv/install.sh | sh
+
+   # Create virtual environment and install Dynamo
+   uv venv venv
+   source venv/bin/activate
+   uv pip install "ai-dynamo[sglang]==0.4.1"  # or [vllm], [trtllm]
+
+**2. Start etcd/NATS**
+
+.. code-block:: bash
+
+   # Start etcd and NATS using Docker Compose
+   docker compose -f deploy/docker-compose.yml up -d
+
+**3. Run Dynamo**
+
+.. code-block:: bash
+
+   # Start the OpenAI-compatible frontend
+   python -m dynamo.frontend
+
+   # In another terminal, start an SGLang worker
+   python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
+
+**4. Test your deployment**
+
+.. code-block:: bash
+
+   curl localhost:8080/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model": "Qwen/Qwen3-0.6B",
+          "messages": [{"role": "user", "content": "Hello!"}],
+          "max_tokens": 50}'
+
+
diff --git a/docs/_sections/architecture.rst b/docs/_sections/architecture.rst
new file mode 100644
index 00000000000..75c730d17f3
--- /dev/null
+++ b/docs/_sections/architecture.rst
@@ -0,0 +1,10 @@
+Overview
+============
+
+.. include:: ../architecture/architecture.md
+   :parser: myst_parser.sphinx_
+
+.. toctree::
+   :hidden:
+
+   Disaggregated Serving <../architecture/disagg_serving>
diff --git a/docs/_sections/backends.rst b/docs/_sections/backends.rst
new file mode 100644
index 00000000000..4b6b294b711
--- /dev/null
+++ b/docs/_sections/backends.rst
@@ -0,0 +1,42 @@
+..
+    SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+
+Backends
+========
+
+NVIDIA Dynamo supports multiple inference backends to provide flexibility and performance optimization for different use cases and model architectures. Backends are the underlying engines that execute AI model inference, each optimized for specific scenarios, hardware configurations, and performance requirements.
+
+Overview
+--------
+
+Dynamo's multi-backend architecture allows you to:
+
+* **Choose the optimal engine** for your specific workload and hardware
+* **Switch between backends** without changing your application code
+* **Leverage specialized optimizations** from each backend
+* **Scale flexibly** across different deployment scenarios
+
+Supported Backends
+------------------
+
+Dynamo currently supports the following high-performance inference backends:
+
+.. toctree::
+   :maxdepth: 1
+
+   vLLM <../components/backends/vllm/README>
+   SGLang <../components/backends/sglang/README>
+   TensorRT-LLM <../components/backends/trtllm/README>
diff --git a/docs/_sections/examples.rst b/docs/_sections/examples.rst
new file mode 100644
index 00000000000..30258a46bee
--- /dev/null
+++ b/docs/_sections/examples.rst
@@ -0,0 +1,8 @@
+..
+    Examples Page (left sidebar target)
+..
+
+Examples
+========
+
+.. include:: ../_includes/dive_in_examples.rst
\ No newline at end of file
diff --git a/docs/_sections/installation.rst b/docs/_sections/installation.rst
new file mode 100644
index 00000000000..b9543fb5586
--- /dev/null
+++ b/docs/_sections/installation.rst
@@ -0,0 +1,10 @@
+..
+    Installation Page (left sidebar target)
+..
+
+Installation
+============
+
+.. include:: ../_includes/install.rst
+
+
diff --git a/docs/_sections/kubernetes.rst b/docs/_sections/kubernetes.rst
new file mode 100644
index 00000000000..a8d610b0eb5
--- /dev/null
+++ b/docs/_sections/kubernetes.rst
@@ -0,0 +1,12 @@
+Kubernetes
+============
+
+.. toctree::
+   :hidden:
+
+   Quickstart <../guides/dynamo_deploy/dynamo_cloud.md>
+   Dynamo Operator <../guides/dynamo_deploy/dynamo_operator.md>
+   Metrics <../guides/dynamo_deploy/k8s_metrics.md>
+   Model Caching <../guides/dynamo_deploy/model_caching_with_fluid.md>
+   Multinode <../guides/dynamo_deploy/multinode-deployment.md>
+   Minikube Setup <../guides/dynamo_deploy/minikube.md>
\ No newline at end of file
diff --git a/docs/_sections/quickstart.rst b/docs/_sections/quickstart.rst
new file mode 100644
index 00000000000..c57380203f8
--- /dev/null
+++ b/docs/_sections/quickstart.rst
@@ -0,0 +1,10 @@
+..
+    Quickstart Page (left sidebar target)
+..
+
+Quickstart
+==========
+
+.. include:: ../_includes/quick_start_local.rst
+
+
diff --git a/docs/_sections/slurm.rst b/docs/_sections/slurm.rst
new file mode 100644
index 00000000000..0eff33af055
--- /dev/null
+++ b/docs/_sections/slurm.rst
@@ -0,0 +1,12 @@
+Slurm
+=====
+
+While Slurm is not common for production deployments, it is a popular choice for
+research and development. This section provides some examples for deploying
+Dynamo on Slurm.
+
+.. toctree::
+   :hidden:
+
+   TRTLLM <../../components/backends/trtllm/multinode/multinode-examples.md>
+   SGLang <../../components/backends/sglang/docs/multinode-examples.md>
\ No newline at end of file
diff --git a/docs/components/backends/trtllm/multinode/multinode-examples.md b/docs/components/backends/trtllm/multinode/multinode-examples.md
new file mode 120000
index 00000000000..495f44690b0
--- /dev/null
+++ b/docs/components/backends/trtllm/multinode/multinode-examples.md
@@ -0,0 +1 @@
+../../../../../components/backends/trtllm/multinode/multinode-examples.md
\ No newline at end of file
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 5a0f72784df..04e09727e1e 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -4,18 +4,6 @@
     SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
     SPDX-License-Identifier: Apache-2.0
 
-    Licensed under the Apache License, Version 2.0 (the "License");
-    you may not use this file except in compliance with the License.
-    You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-    Unless required by applicable law or agreed to in writing, software
-    distributed under the License is distributed on an "AS IS" BASIS,
-    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-    See the License for the specific language governing permissions and
-    limitations under the License.
-
 .. This hidden toctree includes readmes etc that aren't meant to be in the main table of contents but should be accounted for in the sphinx project structure
 
 
@@ -49,4 +37,15 @@
 
    examples/README.md
 
-   components/router/README.md
+   API/nixl_connect/README.md
+   architecture/kv_cache_routing.md
+   examples/runtime/hello_world/README.md
+   guides/dynamo_deploy/quickstart.md
+   guides/planner_benchmark/README.md
+
+   architecture/distributed_runtime.md
+   architecture/dynamo_flow.md
+
+
+..   TODO: architecture/distributed_runtime.md and architecture/dynamo_flow.md
+     have some outdated names/references and need a refresh.
diff --git a/docs/index.rst b/docs/index.rst
index fd7fc79f30f..60911612866 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -14,6 +14,10 @@
     See the License for the specific language governing permissions and
     limitations under the License.
 
+..
+   Main Page
+..
+
 Welcome to NVIDIA Dynamo
 ========================
 
@@ -22,156 +26,43 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
 .. admonition:: 💎 Discover the latest developments!
    :class: seealso
 
-   This guide is a snapshot of the `Dynamo GitHub Repository <https://github.com/ai-dynamo/dynamo>`_ at a specific point in time. For the latest information and examples, see:
-
-   - `Dynamo README <https://github.com/ai-dynamo/dynamo/blob/main/README.md>`_
-   - `Architecture and features doc <https://github.com/ai-dynamo/dynamo/blob/main/docs/architecture/>`_
-   - `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/guides>`_
-   - `Dynamo examples repo <https://github.com/ai-dynamo/dynamo/tree/main/examples>`_
-
-
-Quick Start
------------------
-
-Local Deployment
-~~~~~~~~~~~~~~~~
-
-Get started with Dynamo locally in just a few commands:
-
-**1. Install Dynamo**
-
-.. code-block:: bash
-
-   # Install uv (recommended Python package manager)
-   curl -LsSf https://astral.sh/uv/install.sh | sh
-
-   # Create virtual environment and install Dynamo
-   uv venv venv
-   source venv/bin/activate
-   uv pip install "ai-dynamo[sglang]"  # or [vllm], [trtllm]
-
-**2. Start etcd/NATS**
-
-.. code-block:: bash
-
-   # Start etcd and NATS using Docker Compose
-   docker compose -f deploy/docker-compose.yml up -d
-
-**3. Run Dynamo**
-
-.. code-block:: bash
-
-   # Start the OpenAI compatible frontend
-   python -m dynamo.frontend
-
-   # In another terminal, start an SGLang worker
-   python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
-
-**4. Test your deployment**
-
-.. code-block:: bash
-
-   curl localhost:8080/v1/chat/completions \
-     -H "Content-Type: application/json" \
-     -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
-          "messages": [{"role": "user", "content": "Hello!"}],
-          "max_tokens": 50}'
-
-Kubernetes Deployment
-~~~~~~~~~~~~~~~~~~~~~
-
-For deployments on Kubernetes, follow the :doc:`Dynamo Platform Quickstart Guide <guides/dynamo_deploy/quickstart>`.
-
-
-Dive in: Examples
------------------
-
-The examples below assume you build the latest image yourself from source. If using a prebuilt image follow the examples from the corresponding branch.
-
-.. grid:: 1 2 2 2
-    :gutter: 3
-    :margin: 0
-    :padding: 3 4 0 0
-
-    .. grid-item-card:: :doc:`Hello World <examples/runtime/hello_world/README>`
-        :link: examples/runtime/hello_world/README
-        :link-type: doc
-
-        Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph
-
-    .. grid-item-card:: :doc:`LLM Serving with VLLM <components/backends/vllm/README>`
-        :link: components/backends/vllm/README
-        :link-type: doc
-
-        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM.
-
-    .. grid-item-card:: :doc:`Multinode with SGLang <components/backends/sglang/docs/multinode-examples>`
-        :link: components/backends/sglang/docs/multinode-examples
-        :link-type: doc
-
-        Demonstrates disaggregated serving on several nodes.
-
-    .. grid-item-card:: :doc:`TensorRT-LLM <components/backends/trtllm/README>`
-        :link: components/backends/trtllm/README
-        :link-type: doc
-
-        Presents TensorRT-LLM examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
+   This guide is a snapshot at a specfic point in time. For the latest information and examples, see the `Dynamo GitHub repository <https://github.com/ai-dynamo/dynamo>`_.
 
+..
+   Sidebar
+..
 
 .. toctree::
    :hidden:
+   :caption: Getting Started
 
-   Welcome to Dynamo <self>
+   Overview <self>
+   Quickstart <_sections/quickstart>
+   Installation <_sections/installation>
+   Examples <_sections/examples>
    Support Matrix <support_matrix.md>
 
 .. toctree::
    :hidden:
-   :caption: Architecture & Features
+   :caption: Deployment
 
-   High Level Architecture <architecture/architecture.md>
-   Distributed Runtime <architecture/distributed_runtime.md>
-   Disaggregated Serving <architecture/disagg_serving.md>
-   KV Block Manager <architecture/kvbm_intro.rst>
-   KV Cache Routing <architecture/kv_cache_routing.md>
-   Planner <architecture/planner_intro.rst>
-   Dynamo Architecture Flow <architecture/dynamo_flow.md>
+   Kubernetes <_sections/kubernetes>
+   Slurm <_sections/slurm>
 
 .. toctree::
    :hidden:
-   :caption: Using Dynamo
+   :caption: Architecture
 
-   Writing Python Workers in Dynamo <guides/backend.md>
-   Disaggregation and Performance Tuning <guides/disagg_perf_tuning.md>
-   Working with Dynamo Kubernetes Operator <guides/dynamo_deploy/dynamo_operator.md>
+   Overview <_sections/architecture>
+   Backends <_sections/backends>
+   Router <components/router/README>
+   Planner <architecture/planner_intro>
+   KVBM <architecture/kvbm_intro>
 
 .. toctree::
    :hidden:
-   :caption: Deployment Guides
-
-   Dynamo Deploy Quickstart <guides/dynamo_deploy/quickstart.md>
-   Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
-
-   Multinode Deployment <guides/dynamo_deploy/multinode-deployment.md>
-   Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
-   Model Caching with Fluid <guides/dynamo_deploy/model_caching_with_fluid.md>
-
-.. toctree::
-   :hidden:
-   :caption: Examples
-
-   Hello World <../examples/runtime/hello_world/README.md>
-   LLM Deployment Examples using VLLM <../components/backends/vllm/README.md>
-   LLM Deployment Examples using SGLang <../components/backends/sglang/README.md>
-   LLM Deployment Examples using TensorRT-LLM <../components/backends/trtllm/README.md>
-   Multinode Examples using SGLang <../components/backends/sglang/docs/multinode-examples.md>
-   Planner Benchmark Example <guides/planner_benchmark/README.md>
-
-.. toctree::
-   :hidden:
-   :caption: Reference
+   :caption: Developer Guide
 
+   Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>
+   Writing Python Workers in Dynamo <guides/backend.md>
    Glossary <dynamo_glossary.md>
-   NIXL Connect API <API/nixl_connect/README.md>
-   KVBM Reading <architecture/kvbm_reading.md>
-
-

From fddd1e3c00e2cde2a21f150358fb9ff478157f20 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 14:18:51 -0700
Subject: [PATCH 09/17] docs: Apply Harry's feedback, condense
 overview/quickstart, remove slurm, add planner benchmark, add lmcache - v2

---
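This patch moves entries between toctrees and deletes several `_sections/*.rst` pages, so a quick check that every toctree entry still resolves may be useful. The sketch below is an illustration under stated assumptions, not Sphinx's own resolution logic: it handles only top-level `.. toctree::` directives, treats each target as relative to the file that lists it, and tries `.md` and `.rst` suffixes for extensionless entries. `toctree_targets` and `missing_targets` are hypothetical helper names, not part of the Dynamo repository.

```python
import re
from pathlib import Path

# Matches "Title <target>" or a bare "target" on its own line.
ENTRY_RE = re.compile(r"^(?:.*<(?P<explicit>[^<>]+)>|(?P<bare>\S+))$")


def toctree_targets(rst_text):
    """Return the targets listed in every top-level toctree of an RST string."""
    targets = []
    in_toctree = False
    for line in rst_text.splitlines():
        if line.startswith(".. toctree::"):
            in_toctree = True
            continue
        if not in_toctree or not line.strip():
            continue
        if not line[0].isspace():
            in_toctree = False  # a dedented line ends the directive body
            continue
        entry = line.strip()
        if entry.startswith(":"):
            continue  # directive option such as :hidden: or :caption:
        match = ENTRY_RE.match(entry)
        if match:
            targets.append(match.group("explicit") or match.group("bare"))
    return targets


def missing_targets(rst_path):
    """List toctree targets of rst_path that do not resolve to a file."""
    rst_path = Path(rst_path)
    missing = []
    for target in toctree_targets(rst_path.read_text(encoding="utf-8")):
        if target == "self" or "://" in target:
            continue  # self-references and external URLs are not files
        base = rst_path.parent / target
        # Assumption: extensionless entries map to a .md or .rst source file.
        candidates = [base, Path(f"{base}.md"), Path(f"{base}.rst")]
        if not any(c.is_file() for c in candidates):
            missing.append(target)
    return missing
```

Calling `missing_targets("docs/index.rst")` from the repository root would return the unresolved entries. Under these assumptions, an entry such as `../guides/dynamo_deploy/dynamo_cloud.md` listed directly in `docs/index.rst` is looked up outside `docs/`, so entries moved between files at different directory depths are worth re-checking.
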
 docs/_sections/architecture.rst               |  1 +
 docs/_sections/kubernetes.rst                 | 12 ----------
 docs/_sections/quickstart.rst                 | 10 --------
 docs/_sections/slurm.rst                      | 12 ----------
 docs/architecture/kvbm_intro.rst              |  5 ++--
 docs/architecture/planner_intro.rst           |  7 +++---
 .../backends/vllm/LMCache_Integration.md      |  1 +
 docs/hidden_toctree.rst                       | 11 +++++----
 docs/index.rst                                | 23 ++++++++++++-------
 9 files changed, 28 insertions(+), 54 deletions(-)
 delete mode 100644 docs/_sections/kubernetes.rst
 delete mode 100644 docs/_sections/quickstart.rst
 delete mode 100644 docs/_sections/slurm.rst
 create mode 120000 docs/components/backends/vllm/LMCache_Integration.md

diff --git a/docs/_sections/architecture.rst b/docs/_sections/architecture.rst
index 75c730d17f3..13e6dad0a58 100644
--- a/docs/_sections/architecture.rst
+++ b/docs/_sections/architecture.rst
@@ -7,4 +7,5 @@ Overview
 .. toctree::
    :hidden:
 
+   Overview <self>
    Disaggregated Serving <../architecture/disagg_serving>
diff --git a/docs/_sections/kubernetes.rst b/docs/_sections/kubernetes.rst
deleted file mode 100644
index a8d610b0eb5..00000000000
--- a/docs/_sections/kubernetes.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Kubernetes
-============
-
-.. toctree::
-   :hidden:
-
-   Quickstart <../guides/dynamo_deploy/dynamo_cloud.md>
-   Dynamo Operator <../guides/dynamo_deploy/dynamo_operator.md>
-   Metrics <../guides/dynamo_deploy/k8s_metrics.md>
-   Model Caching <../guides/dynamo_deploy/model_caching_with_fluid.md>
-   Multinode <../guides/dynamo_deploy/multinode-deployment.md>
-   Minikube Setup <../guides/dynamo_deploy/minikube.md>
\ No newline at end of file
diff --git a/docs/_sections/quickstart.rst b/docs/_sections/quickstart.rst
deleted file mode 100644
index c57380203f8..00000000000
--- a/docs/_sections/quickstart.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-..
-    Quickstart Page (left sidebar target)
-..
-
-Quickstart
-==========
-
-.. include:: ../_includes/quick_start_local.rst
-
-
diff --git a/docs/_sections/slurm.rst b/docs/_sections/slurm.rst
deleted file mode 100644
index 0eff33af055..00000000000
--- a/docs/_sections/slurm.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-Slurm
-=====
-
-While Slurm is not common for production deployments, it is a popular choice for
-research and development. This section provides some examples for deploying
-Dynamo on Slurm.
-
-.. toctree::
-   :hidden:
-
-   TRTLLM <../../components/backends/trtllm/multinode/multinode-examples.md>
-   SGLang <../../components/backends/sglang/docs/multinode-examples.md>
\ No newline at end of file
diff --git a/docs/architecture/kvbm_intro.rst b/docs/architecture/kvbm_intro.rst
index 39b096a5c0d..4c6cb0d2275 100644
--- a/docs/architecture/kvbm_intro.rst
+++ b/docs/architecture/kvbm_intro.rst
@@ -48,9 +48,6 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
    * -
      - ❌
      - SGLang
-   * -
-     - ❌
-     - llama.cpp
    * - **Serving Type**
      - ✅
      - Aggregated
@@ -61,7 +58,9 @@ The Dynamo KV Block Manager serves as a reference implementation that emphasizes
 .. toctree::
    :hidden:
 
+   Overview <self>
    Motivation <kvbm_motivation.md>
    KVBM Architecture <kvbm_architecture.md>
    Understanding KVBM components <kvbm_components.md>
    KVBM Further Reading <kvbm_reading>
+   LMCache Integration <../components/backends/vllm/LMCache_Integration.md>
diff --git a/docs/architecture/planner_intro.rst b/docs/architecture/planner_intro.rst
index e9c2e1eaf44..52d31df2e62 100644
--- a/docs/architecture/planner_intro.rst
+++ b/docs/architecture/planner_intro.rst
@@ -49,9 +49,6 @@ Key features include:
    * -
      - ❌
      - SGLang
-   * -
-     - ❌
-     - llama.cpp
    * - **Serving Type**
      - ✅
      - Aggregated
@@ -73,6 +70,8 @@ Key features include:
 .. toctree::
    :hidden:
 
+   Overview <self>
    Pre-Deployment Profiling <pre_deployment_profiling.md>
    Load-based Planner <load_planner.md>
-   SLA-based Planner <sla_planner.md>
\ No newline at end of file
+   SLA-based Planner <sla_planner.md>
+   Planner Benchmark <../guides/planner_benchmark/README.md>
\ No newline at end of file
diff --git a/docs/components/backends/vllm/LMCache_Integration.md b/docs/components/backends/vllm/LMCache_Integration.md
new file mode 120000
index 00000000000..117bf4be15b
--- /dev/null
+++ b/docs/components/backends/vllm/LMCache_Integration.md
@@ -0,0 +1 @@
+../../../../components/backends/vllm/LMCache_Integration.md
\ No newline at end of file
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 04e09727e1e..2544d51f4d3 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -22,6 +22,7 @@
    API/nixl_connect/writable_operation.md
    API/nixl_connect/read_operation.md
    API/nixl_connect/write_operation.md
+   API/nixl_connect/README.md
 
    guides/dynamo_deploy/create_deployment.md
    guides/dynamo_deploy/sla_planner_deployment.md
@@ -29,19 +30,19 @@
    guides/dynamo_deploy/grove.md
    guides/dynamo_deploy/k8s_metrics.md
    guides/dynamo_deploy/README.md
+   guides/dynamo_deploy/quickstart.md
    guides/dynamo_run.md
    guides/metrics.md
    guides/run_kvbm_in_vllm.md
 
+   architecture/kv_cache_routing.md
    architecture/request_migration.md
 
-   examples/README.md
+   components/backends/trtllm/multinode/multinode-examples.md
+   components/backends/sglang/docs/multinode-examples.md
 
-   API/nixl_connect/README.md
-   architecture/kv_cache_routing.md
+   examples/README.md
    examples/runtime/hello_world/README.md
-   guides/dynamo_deploy/quickstart.md
-   guides/planner_benchmark/README.md
 
    architecture/distributed_runtime.md
    architecture/dynamo_flow.md
diff --git a/docs/index.rst b/docs/index.rst
index 60911612866..0e6bd9f59bb 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -28,6 +28,10 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
 
    This guide is a snapshot at a specfic point in time. For the latest information and examples, see the `Dynamo GitHub repository <https://github.com/ai-dynamo/dynamo>`_.
 
+Quickstart
+==========
+.. include:: _includes/quick_start_local.rst
+
 ..
    Sidebar
 ..
@@ -36,24 +40,27 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
    :hidden:
    :caption: Getting Started
 
-   Overview <self>
-   Quickstart <_sections/quickstart>
+   Quickstart <self>
    Installation <_sections/installation>
-   Examples <_sections/examples>
    Support Matrix <support_matrix.md>
+   Architecture <_sections/architecture>
+   Examples <_sections/examples>
 
 .. toctree::
    :hidden:
-   :caption: Deployment
+   :caption: Kubernetes Deployment
 
-   Kubernetes <_sections/kubernetes>
-   Slurm <_sections/slurm>
+   Quickstart (K8s) <../guides/dynamo_deploy/dynamo_cloud.md>
+   Dynamo Operator <../guides/dynamo_deploy/dynamo_operator.md>
+   Metrics <../guides/dynamo_deploy/k8s_metrics.md>
+   Model Caching <../guides/dynamo_deploy/model_caching_with_fluid.md>
+   Multinode <../guides/dynamo_deploy/multinode-deployment.md>
+   Minikube Setup <../guides/dynamo_deploy/minikube.md>
 
 .. toctree::
    :hidden:
-   :caption: Architecture
+   :caption: Components
 
-   Overview <_sections/architecture>
    Backends <_sections/backends>
    Router <components/router/README>
    Planner <architecture/planner_intro>

From eb6ff81aed4aa4f13946b72108d2a676e319c5a9 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 14:29:26 -0700
Subject: [PATCH 10/17] docs: Address CodeRabbit feedback

---
 docs/_includes/quick_start_local.rst | 2 +-
 docs/index.rst                       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst
index 984bb91810e..02bc5dd1693 100644
--- a/docs/_includes/quick_start_local.rst
+++ b/docs/_includes/quick_start_local.rst
@@ -23,7 +23,7 @@ Get started with Dynamo locally in just a few commands:
 
 .. code-block:: bash
 
-   # Start the OpenAI compatible frontend
+   # Start the OpenAI compatible frontend (default port is 8080)
    python -m dynamo.frontend
 
    # In another terminal, start an SGLang worker
diff --git a/docs/index.rst b/docs/index.rst
index 0e6bd9f59bb..b3afc745210 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,7 +26,7 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor
 .. admonition:: 💎 Discover the latest developments!
    :class: seealso
 
-   This guide is a snapshot at a specfic point in time. For the latest information and examples, see the `Dynamo GitHub repository <https://github.com/ai-dynamo/dynamo>`_.
+   This guide is a snapshot at a specific point in time. For the latest information and examples, see the `Dynamo GitHub repository <https://github.com/ai-dynamo/dynamo>`_.
 
 Quickstart
 ==========

From ba773bff247ac21880a9d0b80cca024423d95e90 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 14:41:46 -0700
Subject: [PATCH 11/17] Point k8s quickstart at dynamo_deploy/README.md
 instead of dynamo_deploy/dynamo_cloud.md

---
 docs/index.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/index.rst b/docs/index.rst
index b3afc745210..2c68409d349 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -50,7 +50,7 @@ Quickstart
    :hidden:
    :caption: Kubernetes Deployment
 
-   Quickstart (K8s) <../guides/dynamo_deploy/dynamo_cloud.md>
+   Quickstart (K8s) <../guides/dynamo_deploy/README.md>
    Dynamo Operator <../guides/dynamo_deploy/dynamo_operator.md>
    Metrics <../guides/dynamo_deploy/k8s_metrics.md>
    Model Caching <../guides/dynamo_deploy/model_caching_with_fluid.md>

From 97be098df8ffeee872c6e98040e02a637b6ad887 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 14:45:27 -0700
Subject: [PATCH 12/17] Fix QA bug: add missing 'import os' in multinode doc

---
 examples/basics/multinode/README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md
index b2258ac7093..8f8d4a975ac 100644
--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
@@ -315,6 +315,7 @@ Send multiple new conversations to see them distributed across replicas:
 ```python
 import asyncio
 from openai import AsyncOpenAI
+import os
 
 if os.environ.get("DYN_FRONTEND_IP"):
     frontend_ip=os.environ.get("DYN_FRONTEND_IP")

From bc1d9409f9138c287b9cf38768cd6a87bde8ec43 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 15:01:52 -0700
Subject: [PATCH 13/17] Anish's feedback - use dynamo kubernetes platform doc
 as quickstart, remove model caching with fluid doc

---
 docs/architecture/planner_intro.rst           | 1 -
 docs/architecture/pre_deployment_profiling.md | 2 +-
 docs/guides/dynamo_deploy/dynamo_cloud.md     | 8 ++++----
 docs/guides/dynamo_deploy/quickstart.md       | 2 +-
 docs/hidden_toctree.rst                       | 2 ++
 docs/index.rst                                | 3 +--
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/architecture/planner_intro.rst b/docs/architecture/planner_intro.rst
index 52d31df2e62..8c8dbccb5ac 100644
--- a/docs/architecture/planner_intro.rst
+++ b/docs/architecture/planner_intro.rst
@@ -72,6 +72,5 @@ Key features include:
 
    Overview <self>
    Pre-Deployment Profiling <pre_deployment_profiling.md>
-   Load-based Planner <load_planner.md>
    SLA-based Planner <sla_planner.md>
    Planner Benchmark <../guides/planner_benchmark/README.md>
\ No newline at end of file
diff --git a/docs/architecture/pre_deployment_profiling.md b/docs/architecture/pre_deployment_profiling.md
index e8d0fcf76ae..9e655b0528d 100644
--- a/docs/architecture/pre_deployment_profiling.md
+++ b/docs/architecture/pre_deployment_profiling.md
@@ -96,7 +96,7 @@ Use the default pre-built image and inject custom configurations via PVC:
 
 1. **Set the container image:**
    ```bash
-   export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 # or any existing image tag
+   export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 # or any existing image tag
    ```
 
 2. **Inject your custom disagg configuration:**
diff --git a/docs/guides/dynamo_deploy/dynamo_cloud.md b/docs/guides/dynamo_deploy/dynamo_cloud.md
index 0264e6d0568..4cc3753024b 100644
--- a/docs/guides/dynamo_deploy/dynamo_cloud.md
+++ b/docs/guides/dynamo_deploy/dynamo_cloud.md
@@ -39,7 +39,7 @@ helm version             # v3.0+
 docker version           # Running daemon
 
 # Set your inference runtime image
-export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
+export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
 # Also available: sglang-runtime, tensorrtllm-runtime
 ```
 
@@ -53,7 +53,7 @@ Install from [NGC published artifacts](https://catalog.ngc.nvidia.com/orgs/nvidi
 ```bash
 # 1. Set environment
 export NAMESPACE=dynamo-kubernetes
-export RELEASE_VERSION=0.4.0 # any version of Dynamo 0.3.2+
+export RELEASE_VERSION=0.4.1 # any version of Dynamo 0.3.2+
 
 # 2. Install CRDs
 helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
@@ -79,7 +79,7 @@ export NAMESPACE=dynamo-cloud
 export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/  # or your registry
 export DOCKER_USERNAME='$oauthtoken'
 export DOCKER_PASSWORD=<YOUR_NGC_CLI_API_KEY>
-export IMAGE_TAG=0.4.0
+export IMAGE_TAG=0.4.1
 
 # 2. Build operator
 cd deploy/cloud/operator
@@ -178,4 +178,4 @@ kubectl create secret generic hf-token-secret \
 
 - [GKE-specific setup](gke_setup.md)
 - [Create custom deployments](create_deployment.md)
-- [Dynamo Operator details](dynamo_operator.md)
\ No newline at end of file
+- [Dynamo Operator details](dynamo_operator.md)
diff --git a/docs/guides/dynamo_deploy/quickstart.md b/docs/guides/dynamo_deploy/quickstart.md
index 28910e98dde..5e2cf03efe2 100644
--- a/docs/guides/dynamo_deploy/quickstart.md
+++ b/docs/guides/dynamo_deploy/quickstart.md
@@ -14,7 +14,7 @@ Use this approach when installing from pre-built helm charts and docker images p
 
 ```bash
 export NAMESPACE=dynamo-cloud
-export RELEASE_VERSION=0.4.0
+export RELEASE_VERSION=0.4.1
 ```
 
 Install `envsubst`, `kubectl`, `helm`
diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 2544d51f4d3..22a467cc14a 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -29,6 +29,7 @@
    guides/dynamo_deploy/gke_setup.md
    guides/dynamo_deploy/grove.md
    guides/dynamo_deploy/k8s_metrics.md
+   guides/dynamo_deploy/model_caching_with_fluid.md
    guides/dynamo_deploy/README.md
    guides/dynamo_deploy/quickstart.md
    guides/dynamo_run.md
@@ -36,6 +37,7 @@
    guides/run_kvbm_in_vllm.md
 
    architecture/kv_cache_routing.md
+   architecture/load_planner.md
    architecture/request_migration.md
 
    components/backends/trtllm/multinode/multinode-examples.md
diff --git a/docs/index.rst b/docs/index.rst
index 2c68409d349..daac85fdbd1 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -50,10 +50,9 @@ Quickstart
    :hidden:
    :caption: Kubernetes Deployment
 
-   Quickstart (K8s) <../guides/dynamo_deploy/README.md>
+   Quickstart (K8s) <../guides/dynamo_deploy/dynamo_cloud.md>
    Dynamo Operator <../guides/dynamo_deploy/dynamo_operator.md>
    Metrics <../guides/dynamo_deploy/k8s_metrics.md>
-   Model Caching <../guides/dynamo_deploy/model_caching_with_fluid.md>
    Multinode <../guides/dynamo_deploy/multinode-deployment.md>
    Minikube Setup <../guides/dynamo_deploy/minikube.md>
 

From c373219076f4f5f4c334b5c6c2cfc77ee5c5c4eb Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 15:02:33 -0700
Subject: [PATCH 14/17] Anish's feedback - remove duplicate deploy quickstart,
 we already have quickstarts in README and dynamo kubernetes platform docs

---
 docs/guides/dynamo_deploy/quickstart.md | 196 ------------------------
 1 file changed, 196 deletions(-)
 delete mode 100644 docs/guides/dynamo_deploy/quickstart.md

diff --git a/docs/guides/dynamo_deploy/quickstart.md b/docs/guides/dynamo_deploy/quickstart.md
deleted file mode 100644
index 5e2cf03efe2..00000000000
--- a/docs/guides/dynamo_deploy/quickstart.md
+++ /dev/null
@@ -1,196 +0,0 @@
-# Quickstart
-
-Your onboarding includes 2 steps.
-1. Before deploying your inference graphs you need to install the Dynamo Inference Platform and the Dynamo Cloud.
-Dynamo Cloud acts as an orchestration layer between the end user and Kubernetes, handling the complexity of deploying your graphs for you.
-You could install from [Published Artifacts](#1-installing-dynamo-cloud-from-published-artifacts) or [Source](#2-installing-dynamo-cloud-from-source)
-2. Once you install the Dynamo Cloud, proceed to the [Examples](../../examples/README.md) to deploy an inference graph.
-
-## 1. Installing Dynamo Cloud from Published Artifacts
-
-Use this approach when installing from pre-built helm charts and docker images published to NGC.
-
-### Prerequisites
-
-```bash
-export NAMESPACE=dynamo-cloud
-export RELEASE_VERSION=0.4.1
-```
-
-Install `envsubst`, `kubectl`, `helm`
-
-### Authenticate with NGC
-
-Go to  https://ngc.nvidia.com/org to get your NGC_CLI_API_KEY.
-
-```bash
-helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --username='$oauthtoken' --password=<YOUR_NGC_CLI_API_KEY>
-```
-
-### Fetch Helm Charts
-
-```bash
-# Fetch the CRDs helm chart
-helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
-
-# Fetch the platform helm chart
-helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
-```
-
-### Install Dynamo Cloud
-
-**Step 1: Install Custom Resource Definitions (CRDs)**
-
-```bash
-helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
-  --namespace default \
-  --wait \
-  --atomic
-```
-
-**Step 2: Install Dynamo Platform**
-
-```bash
-kubectl create namespace ${NAMESPACE}
-
-helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
-```
-
-## 2. Installing Dynamo Cloud from Source
-
-Use this approach when developing or customizing Dynamo as a contributor, or using local helm charts from the source repository.
-
-### Prerequisites
-
-Ensure you have the source code checked out and are in the `dynamo` directory:
-
-
-### Set Environment Variables
-
-Our examples use the [`nvcr.io`](https://catalog.ngc.nvidia.com) but you can setup your own values if you use another docker registry.
-
-```bash
-export NAMESPACE=dynamo-cloud # or whatever you prefer.
-export DOCKER_SERVER=nvcr.io/nvidia/ai-dynamo/  # your-docker-registry.com
-export DOCKER_USERNAME='$oauthtoken'  # your-username if not using nvcr.io
-export DOCKER_PASSWORD=YOUR_NGC_CLI_API_KEY  # your-password if not using nvcr.io
-```
-
-### Pick the Dynamo Inference Image
-
-Export the tag of the Dynamo Runtime Image.
-If you are using a pre-defined release:
-
-```bash
-export IMAGE_TAG=RELEASE_VERSION # i.e. 0.3.2 - the release you are using
-```
-
-Or build your own image first and tag it with IMAGE_TAG
-
-```bash
-export IMAGE_TAG=<your-pick>
-./container/build.sh
-docker tag dynamo:latest-vllm <your-registry>/dynamo-base:$IMAGE_TAG
-docker login <your-registry>
-docker push <your-registry>/dynamo-base:latest-vllm
-```
-
-### Install Dynamo Cloud
-
-You need to build and push the Dynamo Cloud Operator Image by running
-
-```bash
-cd deploy/cloud/operator
-earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
-```
-
-The  Nvidia Cloud Operator image will be pulled from the `$DOCKER_SERVER/dynamo-operator:$IMAGE_TAG`.
-
-You could run the `deploy.sh` or use the manual commands under Step 1 and Step 2.
-
-**Installing with a script (alternative to the Step 1 and Step 2)**
-
-Create the namespace and the docker registry secret.
-
-```bash
-kubectl create namespace ${NAMESPACE}
-kubectl create secret docker-registry docker-imagepullsecret \
-  --docker-server=${DOCKER_SERVER} \
-  --docker-username=${DOCKER_USERNAME} \
-  --docker-password=${DOCKER_PASSWORD} \
-  --namespace=${NAMESPACE}
-```
-
-You need to add the bitnami helm repository by running:
-
-```bash
-helm repo add bitnami https://charts.bitnami.com/bitnami
-```
-
-```bash
-./deploy.sh --crds
-```
-
-if you want guidance during the process, run the deployment script with the `--interactive` flag:
-
-```bash
-./deploy.sh --crds --interactive
-```
-
-**Installing CRDs manually  (alternative to the script deploy.sh)**
-
-***Step 1: Install Custom Resource Definitions (CRDs)**
-
-```bash
-helm install dynamo-crds ./crds/ \
-  --namespace default \
-  --wait \
-  --atomic
-```
-
-***Step 2: Build Dependencies and Install Platform**
-
-```bash
-cd deploy/cloud/helm
-helm dep build ./platform/
-
-kubectl create namespace ${NAMESPACE}
-
-# Create docker registry secret
-kubectl create secret docker-registry docker-imagepullsecret \
-  --docker-server=${DOCKER_SERVER} \
-  --docker-username=${DOCKER_USERNAME} \
-  --docker-password=${DOCKER_PASSWORD} \
-  --namespace=${NAMESPACE}
-
-# Install platform
-helm install dynamo-platform ./platform/ \
-  --namespace ${NAMESPACE} \
-  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
-  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
-  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
-```
-
-[More on Deploying to Dynamo Cloud](./dynamo_cloud.md)
-
-## Uninstall CRDs for a clean start
-
-We provide a script to uninstall CRDs should you need a clean start.
-
-```bash
-./uninstall.sh
-```
-
-## Explore Examples
-
-If deploying to Kubernetes, create a Kubernetes secret containing your sensitive values if needed:
-
-```bash
-export HF_TOKEN=your_hf_token
-kubectl create secret generic hf-token-secret \
-  --from-literal=HF_TOKEN=${HF_TOKEN} \
-  -n ${NAMESPACE}
-```
-
-Follow the [Examples](../../examples/README.md)
-For more details on how to create your own deployments follow [Create Deployment Guide](create_deployment.md)

From d6db2fac0ef1db6c4a160a4db258c1345dbe4a5f Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 15:06:04 -0700
Subject: [PATCH 15/17] fix broken links from deleted dynamo deploy quickstart

---
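A minimal sketch of how dangling links like the ones fixed in this patch could be caught before a doc is deleted. It assumes plain inline Markdown links (`[text](target)`) and resolves each target relative to the file that contains it. `find_broken_links` and `LINK_RE` are hypothetical names, not part of the Dynamo repository or its tooling.

```python
import re
from pathlib import Path

# Inline Markdown link: captures the target without any "#anchor" suffix.
# Pure in-page anchors such as "(#section)" do not match and are skipped.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")


def find_broken_links(root):
    """Return (file, line number, target) for relative links that do not exist."""
    broken = []
    for md in sorted(Path(root).rglob("*.md")):
        if not md.is_file():
            continue  # skips directories and broken symlinks
        text = md.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for target in LINK_RE.findall(line):
                if "://" in target or target.startswith("mailto:"):
                    continue  # external links are out of scope for this sketch
                if not (md.parent / target).exists():
                    broken.append((str(md), lineno, target))
    return broken
```

Running `find_broken_links(".")` from the repository root after deleting a doc lists every file and line that still points at it. Links with titles, reference-style links, and Sphinx-specific roles are deliberately ignored by this sketch.
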
 components/backends/sglang/deploy/README.md  | 4 ++--
 components/backends/trtllm/deploy/README.md  | 6 +++---
 components/backends/vllm/deploy/README.md    | 6 +++---
 deploy/inference-gateway/README.md           | 2 +-
 docs/guides/dynamo_deploy/dynamo_operator.md | 2 +-
 examples/runtime/hello_world/README.md       | 4 ++--
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/components/backends/sglang/deploy/README.md b/components/backends/sglang/deploy/README.md
index 4929eeb9711..cd7715bc814 100644
--- a/components/backends/sglang/deploy/README.md
+++ b/components/backends/sglang/deploy/README.md
@@ -145,7 +145,7 @@ All templates use **DeepSeek-R1-Distill-Llama-8B** as the default model. But you
 ## Further Reading
 
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
-- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/README.md)
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Kubernetes CRDs**: [Custom Resources Documentation](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
@@ -159,4 +159,4 @@ Common issues and solutions:
 3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
 4. **Out of memory**: Increase memory limits or reduce model batch size
 
-For additional support, refer to the [deployment guide](../../../../docs/guides/dynamo_deploy/quickstart.md).
+For additional support, refer to the [deployment guide](../../../../docs/guides/dynamo_deploy/README.md).
diff --git a/components/backends/trtllm/deploy/README.md b/components/backends/trtllm/deploy/README.md
index a02b7afe62d..cde9bd02bca 100644
--- a/components/backends/trtllm/deploy/README.md
+++ b/components/backends/trtllm/deploy/README.md
@@ -81,7 +81,7 @@ extraPodSpec:
 
 Before using these templates, ensure you have:
 
-1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
+1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/README.md)
 2. **Kubernetes cluster with GPU support**
 3. **Container registry access** for TensorRT-LLM runtime images
 4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
@@ -257,7 +257,7 @@ Configure the `model` name and `host` based on your deployment.
 ## Further Reading
 
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
-- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/README.md)
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
 - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
@@ -277,4 +277,4 @@ Common issues and solutions:
 6. **Git LFS issues**: Ensure git-lfs is installed before building containers
 7. **ARM deployment**: Use `--platform linux/arm64` when building on ARM machines
 
-For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/README.md).
diff --git a/components/backends/vllm/deploy/README.md b/components/backends/vllm/deploy/README.md
index a720036a909..db43a7801fb 100644
--- a/components/backends/vllm/deploy/README.md
+++ b/components/backends/vllm/deploy/README.md
@@ -82,7 +82,7 @@ extraPodSpec:
 
 Before using these templates, ensure you have:
 
-1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/quickstart.md)
+1. **Dynamo Cloud Platform installed** - See [Quickstart Guide](../../../../docs/guides/dynamo_deploy/README.md)
 2. **Kubernetes cluster with GPU support**
 3. **Container registry access** for vLLM runtime images
 4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
@@ -236,7 +236,7 @@ args:
 ## Further Reading
 
 - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/guides/dynamo_deploy/create_deployment.md)
-- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/quickstart.md)
+- **Quickstart**: [Deployment Quickstart](../../../../docs/guides/dynamo_deploy/README.md)
 - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
 - **SLA Planner**: [SLA Planner Deployment Guide](../../../../docs/guides/dynamo_deploy/sla_planner_deployment.md)
 - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
@@ -252,4 +252,4 @@ Common issues and solutions:
 4. **Out of memory**: Increase memory limits or reduce model batch size
 5. **Port forwarding issues**: Ensure correct pod UUID in port-forward command
 
-For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
\ No newline at end of file
+For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/README.md).
diff --git a/deploy/inference-gateway/README.md b/deploy/inference-gateway/README.md
index 2ef635c9464..ada2af2293e 100644
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -20,7 +20,7 @@ Currently, these setups are only supported with the kGateway based Inference Gat
 
 1. **Install Dynamo Platform**
 
-[See Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud.
+See the [Quickstart Guide](../../docs/guides/dynamo_deploy/README.md) to install Dynamo Cloud.
 
 
 2. **Deploy Inference Gateway**
diff --git a/docs/guides/dynamo_deploy/dynamo_operator.md b/docs/guides/dynamo_deploy/dynamo_operator.md
index 4d3c2a04eb9..960719f3a67 100644
--- a/docs/guides/dynamo_deploy/dynamo_operator.md
+++ b/docs/guides/dynamo_deploy/dynamo_operator.md
@@ -93,7 +93,7 @@ The GitOps workflow for Dynamo deployments consists of three main steps:
 
 ### Step 1: Build and Push Dynamo Cloud Operator
 
-First, follow to [See Install Dynamo Cloud](quickstart.md#install-dynamo-cloud).
+First, follow the [Install Dynamo Cloud](README.md) instructions.
 
 ### Step 2: Create Initial Deployment
 
diff --git a/examples/runtime/hello_world/README.md b/examples/runtime/hello_world/README.md
index 2063aaa36cb..97a363a8687 100644
--- a/examples/runtime/hello_world/README.md
+++ b/examples/runtime/hello_world/README.md
@@ -106,7 +106,7 @@ Hello star!
 Note that this a very simple degenerate example which does not demonstrate the standard Dynamo FrontEnd-Backend deployment. The hello-world client is not a web server, it is a one-off function which sends the predefined text "world,sun,moon,star" to the backend. The example is meant to show the HelloWorldWorker. As such you will only see the HelloWorldWorker pod in deployment. The client will run and exit and the pod will not be operational.
 
 
-Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Kubernetes Platform.
+Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/README.md) to install the Dynamo Kubernetes Platform.
 Then deploy to kubernetes using
 
 ```bash
@@ -119,4 +119,4 @@ to delete your deployment:
 
 ```bash
 kubectl delete dynamographdeployment hello-world -n ${NAMESPACE}
-```
\ No newline at end of file
+```

From 41ed5c65438f3e0f98675b5b56394e582142a5f2 Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 15:11:23 -0700
Subject: [PATCH 16/17] CodeRabbit feedback: download docker compose file since
 no assumed git repo in these steps

---
 docs/_includes/quick_start_local.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/_includes/quick_start_local.rst b/docs/_includes/quick_start_local.rst
index 02bc5dd1693..8d74d3d2ba1 100644
--- a/docs/_includes/quick_start_local.rst
+++ b/docs/_includes/quick_start_local.rst
@@ -16,8 +16,9 @@ Get started with Dynamo locally in just a few commands:
 
 .. code-block:: bash
 
-   # Start etcd and NATS using Docker Compose
-   docker compose -f deploy/docker-compose.yml up -d
+   # Fetch and start etcd and NATS using Docker Compose
+   curl -fsSL -o docker-compose.yml https://raw.githubusercontent.com/ai-dynamo/dynamo/release/0.4.1/deploy/docker-compose.yml
+   docker compose -f docker-compose.yml up -d
 
 **3. Run Dynamo**
 

From 050fdc21910a8303848a95d86d1b448fc22fea0e Mon Sep 17 00:00:00 2001
From: Ryan McCormick <rmccormick@nvidia.com>
Date: Mon, 25 Aug 2025 15:12:34 -0700
Subject: [PATCH 17/17] Remove outdated deploy quickstart from hidden_toctree

---
 docs/hidden_toctree.rst | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/hidden_toctree.rst b/docs/hidden_toctree.rst
index 22a467cc14a..dd6b1f17013 100644
--- a/docs/hidden_toctree.rst
+++ b/docs/hidden_toctree.rst
@@ -31,7 +31,6 @@
    guides/dynamo_deploy/k8s_metrics.md
    guides/dynamo_deploy/model_caching_with_fluid.md
    guides/dynamo_deploy/README.md
-   guides/dynamo_deploy/quickstart.md
    guides/dynamo_run.md
    guides/metrics.md
    guides/run_kvbm_in_vllm.md