diff --git a/benchmarks/llm/README.md b/benchmarks/llm/README.md
index e0cb8e976d..614dbd9be4 100644
--- a/benchmarks/llm/README.md
+++ b/benchmarks/llm/README.md
@@ -12,4 +12,3 @@
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
-[../../examples/llm/benchmarks/README.md](../../examples/llm/benchmarks/README.md)
diff --git a/components/README.md b/components/README.md
index 2c5677eae7..3f638f5371 100644
--- a/components/README.md
+++ b/components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
 4. **Run deployment scripts** from the engine's launch directory
 5. **Monitor performance** using the metrics component
 
-For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
+For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
diff --git a/components/backends/llama_cpp/README.md b/components/backends/llama_cpp/README.md
index f7c9e6520e..78a553c0c1 100644
--- a/components/backends/llama_cpp/README.md
+++ b/components/backends/llama_cpp/README.md
@@ -13,7 +13,7 @@ python -m dynamo.llama_cpp --model-path /data/models/Qwen3-0.6B-Q8_0.gguf [args]
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 
diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md
index e1d71516d5..705c65d3a8 100644
--- a/components/backends/sglang/README.md
+++ b/components/backends/sglang/README.md
@@ -52,8 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ## Quick Start
 
-Below we provide a guide that lets you run all of our the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.
-
+Below we provide a guide that lets you run all of the common deployment patterns on a single node.
 ### Start NATS and ETCD in the background
 
 Start using [Docker Compose](../../../deploy/docker-compose.yml)
@@ -141,7 +140,7 @@ cd $DYNAMO_ROOT/components/backends/sglang
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 
@@ -164,7 +163,6 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 
 ### Large scale P/D disaggregation with WideEP
 - **[Run DeepSeek-R1 on 104+ H100s](docs/dsr1-wideep-h100.md)**
-- **[Run DeepSeek-R1 on GB200s](docs/dsr1-wideep-gb200.md)**
 
 ### Speculative Decoding
 - **[Deploying DeepSeek-R1 with MTP - coming soon!](.)**
diff --git a/components/backends/sglang/docs/dsr1-wideep-h100.md b/components/backends/sglang/docs/dsr1-wideep-h100.md
index d766bc3edf..ecfb7a9f60 100644
--- a/components/backends/sglang/docs/dsr1-wideep-h100.md
+++ b/components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0
 
 # Running DeepSeek-R1 Disaggregated with WideEP on H100s
 
-Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
+Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
 
 ## Instructions
 
diff --git a/components/backends/sglang/slurm_jobs/README.md b/components/backends/sglang/slurm_jobs/README.md
index 19f7c27ada..7fa454f39c 100644
--- a/components/backends/sglang/slurm_jobs/README.md
+++ b/components/backends/sglang/slurm_jobs/README.md
@@ -1,10 +1,10 @@
 # Example: Deploy Multi-node SGLang with Dynamo on SLURM
 
-This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.
+This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.
 
 ## Overview
 
-The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
+The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
 The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
 
 ## Scripts
@@ -56,7 +56,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
    If your cluster supports similar container based plugins, you may be able to
    modify the template to use that instead.
 3. We assume you have already built a recent Dynamo+SGLang container image as
-   described [here](../dsr1-wideep.md#instructions).
+   described [here](../docs/dsr1-wideep-h100.md#instructions).
    This is the image that can be passed to the `--container-image` argument in later steps.
 
 ## Usage
diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md
index aa38ea0cf6..452b8f1f6b 100644
--- a/components/backends/trtllm/README.md
+++ b/components/backends/trtllm/README.md
@@ -263,7 +263,7 @@ Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disag
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 
@@ -279,7 +279,7 @@ The migrated request will continue responding to the original request, allowing
 
 ## Client
 
-See [client](../llm/README.md#client) section to learn how to send request to the deployment.
+See the [quickstart guide](../../../examples/basics/quickstart/README.md#3-send-requests) to learn how to send a request to the deployment.
 
 NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend `.
 
diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md
index 6ff95160bb..cd4de036a3 100644
--- a/components/backends/vllm/README.md
+++ b/components/backends/vllm/README.md
@@ -235,7 +235,7 @@ The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html
 
 ## Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the Frontend and the Backend.
+In a Distributed System, a request may fail due to connectivity issues between the Frontend and the Backend.
 
 The Frontend will automatically track which Backends are having connectivity issues with it and avoid routing new requests to the Backends with known connectivity issues.
 
diff --git a/deploy/cloud/README.md b/deploy/cloud/README.md
index 0f4ad5635e..dfbb10f392 100644
--- a/deploy/cloud/README.md
+++ b/deploy/cloud/README.md
@@ -21,6 +21,6 @@ This directory contains the infrastructure components required for the Dynamo cl
 For detailed documentation on setting up and using the Dynamo Cloud Platform, please refer to:
 
 - [Dynamo Cloud Platform Guide](../../docs/guides/dynamo_deploy/dynamo_cloud.md)
-- [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md)
+- [Operator Deployment Guide](../../docs/guides/dynamo_deploy/dynamo_operator.md)
 
-For a quick start example, see [examples/hello_world/README.md#deploying-to-kubernetes-using-dynamo-cloud-and-dynamo-deploy-cli](../../examples/hello_world/README.md#deploying-to-kubernetes-using-dynamo-cloud-and-dynamo-deploy-cli)
\ No newline at end of file
+For a quick start example, see [examples/runtime/hello_world/README.md#deployment-to-kubernetes](../../examples/runtime/hello_world/README.md#deployment-to-kubernetes)
\ No newline at end of file
diff --git a/deploy/inference-gateway/README.md b/deploy/inference-gateway/README.md
index 7787d57b64..0476e978bc 100644
--- a/deploy/inference-gateway/README.md
+++ b/deploy/inference-gateway/README.md
@@ -18,8 +18,7 @@ Currently, this setup is only kgateway based Inference Gateway.
 
 1. **Install Dynamo Platform**
 
-[See Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud.
-
+[See Quickstart Guide](../../docs/guides/dynamo_deploy/quickstart.md) to install Dynamo Cloud.
 2. **Deploy Inference Gateway**
 
diff --git a/deploy/metrics/README.md b/deploy/metrics/README.md
index ce3b8e6aef..e23a13263f 100644
--- a/deploy/metrics/README.md
+++ b/deploy/metrics/README.md
@@ -87,7 +87,7 @@ Grafana is pre-configured with:
 ## Required Files
 
 The following configuration files should be present in this directory:
-- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
+- [docker-compose.yml](../docker-compose.yml): Defines the Prometheus and Grafana services
 - [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
 - [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
 - [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
diff --git a/docs/API/nixl_connect/connector.md b/docs/API/nixl_connect/connector.md
index 7b8b1fa611..99bc81fc5b 100644
--- a/docs/API/nixl_connect/connector.md
+++ b/docs/API/nixl_connect/connector.md
@@ -28,7 +28,7 @@ The connector provides two methods of moving data between workers:
 
   - Preparing local memory to be read by a remote worker.
 
-In both cases, local memory is registered with the NIXL-based RDMA subsystem via the [`Descriptor`](#descriptor) class and provided to the connector.
+In both cases, local memory is registered with the NIXL-based RDMA subsystem via the [`Descriptor`](descriptor.md) class and provided to the connector.
 The connector then configures the RDMA subsystem to expose the memory for the requested operation and returns an operation control object.
 
 The operation control object, either a [`ReadableOperation`](readable_operation.md) or a [`WritableOperation`](writable_operation.md), provides RDMA metadata ([RdmaMetadata](rdma_metadata.md)) via its `.metadata()` method, functionality to query the operation's current state, as well as the ability to cancel the operation prior to its completion.
diff --git a/docs/architecture/dynamo_flow.md b/docs/architecture/dynamo_flow.md
index 32146e1188..a17a7b11ec 100644
--- a/docs/architecture/dynamo_flow.md
+++ b/docs/architecture/dynamo_flow.md
@@ -17,7 +17,7 @@ limitations under the License.
 
 # Dynamo Architecture Flow
 
-This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm). Color-coded flows indicate different types of operations:
+This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/v0.3.2/examples/llm). Color-coded flows indicate different types of operations:
 
 ## 🔵 Main Request Flow (Blue)
 The primary user journey through the system:
diff --git a/docs/components/backends/llm/README.md b/docs/components/backends/llm/README.md
deleted file mode 120000
index 615da9417b..0000000000
--- a/docs/components/backends/llm/README.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../../components/backends/llm/README.md
\ No newline at end of file
diff --git a/docs/guides/dynamo_deploy/README.md b/docs/guides/dynamo_deploy/README.md
index 516162d911..44155e46ca 100644
--- a/docs/guides/dynamo_deploy/README.md
+++ b/docs/guides/dynamo_deploy/README.md
@@ -38,5 +38,4 @@ Users who need more control over their deployments can use the manual deployment
 - Provides full control over deployment parameters
 - Requires manual management of infrastructure components
 - Documentation:
-  - [Using the Deployment Script](manual_helm_deployment.md#using-the-deployment-script): all-in-one script for manual deployment
-  - [Helm Deployment Guide](manual_helm_deployment.md#helm-deployment-guide): detailed instructions for manual deployment
+  - [Helm Deployment Guide](../../../deploy/helm/README.md): detailed instructions for manual deployment
diff --git a/docs/guides/dynamo_deploy/operator_deployment.md b/docs/guides/dynamo_deploy/operator_deployment.md
deleted file mode 120000
index 80ca4341ee..0000000000
--- a/docs/guides/dynamo_deploy/operator_deployment.md
+++ /dev/null
@@ -1 +0,0 @@
-../../../guides/dynamo_deploy/operator_deployment.md
\ No newline at end of file
diff --git a/docs/guides/dynamo_deploy/quickstart.md b/docs/guides/dynamo_deploy/quickstart.md
index 5639b92f87..fd49463a43 100644
--- a/docs/guides/dynamo_deploy/quickstart.md
+++ b/docs/guides/dynamo_deploy/quickstart.md
@@ -67,7 +67,7 @@ Ensure you have the source code checked out and are in the `dynamo` directory:
 
 ### Set Environment Variables
 
-Our examples use the [`nvcr.io`](https://nvcr.io/nvidia/ai-dynamo/) but you can setup your own values if you use another docker registry.
+Our examples use `nvcr.io`, but you can set up your own values if you use another docker registry.
 
 ```bash
 export NAMESPACE=dynamo-cloud # or whatever you prefer.
diff --git a/docs/guides/dynamo_run.md b/docs/guides/dynamo_run.md
index 0453fc7ccd..9a30270dea 100644
--- a/docs/guides/dynamo_run.md
+++ b/docs/guides/dynamo_run.md
@@ -211,7 +211,7 @@ The KV-aware routing arguments:
 
 ### Request Migration
 
-In a [Distributed System](#distributed-system), a request may fail due to connectivity issues between the HTTP Server and the Worker Engine.
+In a Distributed System, a request may fail due to connectivity issues between the HTTP Server and the Worker Engine.
 
 The HTTP Server will automatically track which Worker Engines are having connectivity issues with it and avoid routing new requests to the Engines with known connectivity issues.
 
@@ -482,11 +482,11 @@ The trtllm engine requires [etcd](https://etcd.io/) and [nats](https://nats.io/)
 
 ##### Step 1: Build the environment
 
-See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#build-docker) to build the dynamo container with TensorRT-LLM.
+See instructions [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-container) to build the dynamo container with TensorRT-LLM.
 
 ##### Step 2: Run the environment
 
-See instructions [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/README.md#run-container) to run the built environment.
+See instructions [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#run-container) to run the built environment.
 
 ##### Step 3: Execute `dynamo-run` command
 
@@ -679,10 +679,6 @@ Here are some example engines:
 - Chat:
   * [sglang](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/server_sglang_tok.py)
 
-More fully-featured Backend engines (used by `dynamo-run`):
-- [vllm](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/vllm_inc.py)
-- [sglang](https://github.com/ai-dynamo/dynamo/blob/main/launch/dynamo-run/src/subprocess/sglang_inc.py)
-
 ### Debugging
 
 `dynamo-run` and `dynamo-runtime` support [tokio-console](https://github.com/tokio-rs/console). Build with the feature to enable:
diff --git a/examples/basics/disaggregated_serving/README.md b/examples/basics/disaggregated_serving/README.md
index dee80fcb0f..ba501c43be 100644
--- a/examples/basics/disaggregated_serving/README.md
+++ b/examples/basics/disaggregated_serving/README.md
@@ -37,8 +37,8 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
 ## Components
 
 - [Frontend](../../../components/frontend/README) - HTTP API endpoint that receives requests and forwards them to the decode worker
-- [vLLM Prefill Worker](../../../components/backends/vllm/README) - Specialized worker for prefill phase execution
-- [vLLM Decode Worker](../../../components/backends/vllm/README) - Specialized worker that handles requests and decides between local/remote prefill
+- [vLLM Prefill Worker](../../../components/backends/vllm/README.md) - Specialized worker for prefill phase execution
+- [vLLM Decode Worker](../../../components/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill
 
 ```mermaid
 ---
diff --git a/examples/basics/multimodal/README.md b/examples/basics/multimodal/README.md
index 693bfdeb98..0425ede473 100644
--- a/examples/basics/multimodal/README.md
+++ b/examples/basics/multimodal/README.md
@@ -35,7 +35,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 
 ### Components
 
-- workers: For aggregated serving, we have two workers, [encode_worker](components/encode_worker.py) for encoding and [decode_worker](components/decode_worker.py) for prefilling and decoding.
+- workers: For aggregated serving, we have two workers, [encode_worker](../../../components/encode_worker.py) for encoding and [decode_worker](components/decode_worker.py) for prefilling and decoding.
 - processor: Tokenizes the prompt and passes it to the decode worker.
 - frontend: HTTP endpoint to handle incoming requests.
 
diff --git a/examples/basics/multinode/README.md b/examples/basics/multinode/README.md
index 9959899648..fadd8af294 100644
--- a/examples/basics/multinode/README.md
+++ b/examples/basics/multinode/README.md
@@ -85,7 +85,7 @@ Install Dynamo with [SGLang](https://docs.sglang.ai/) support:
 pip install ai-dynamo[sglang]
 ```
 
-For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../components/backends/sglang/README.md).
+For more information about the SGLang backend and its integration with Dynamo, see the [SGLang Backend Documentation](../../../components/backends/sglang/README.md).
 
 ### 3. Network Requirements
 
diff --git a/examples/basics/quickstart/README.md b/examples/basics/quickstart/README.md
index 694243d5d6..99dc405a0f 100644
--- a/examples/basics/quickstart/README.md
+++ b/examples/basics/quickstart/README.md
@@ -18,7 +18,7 @@ docker compose -f deploy/metrics/docker-compose.yml up -d
 ## Components
 
 - [Frontend](../../../components/frontend/README) - A built-in component that launches an OpenAI compliant HTTP server, a pre-processor, and a router in a single process
-- [vLLM Backend](../../../components/backends/vllm/README) - A built-in component that runs vLLM within the Dynamo runtime
+- [vLLM Backend](../../../components/backends/vllm/README.md) - A built-in component that runs vLLM within the Dynamo runtime
 
 ```mermaid
 ---