Merged

Changes from 1 commit

21 commits
c83ca40
cp(#2351): Move backend READMEs to docs folder and fix relative path …
rmccorm4 Aug 16, 2025
5bb296a
cp(#2346): Move hello_world example README to docs, swap symlinks, fi…
rmccorm4 Aug 16, 2025
ce14f57
docs: Copy over index.rst and hidden_toctree.rst from v0.4.0 to main
rmccorm4 Aug 17, 2025
5a8d52e
Revert "cp(#2351): Move backend READMEs to docs folder and fix relati…
rmccorm4 Aug 17, 2025
016d2c1
Revert "cp(#2346): Move hello_world example README to docs, swap syml…
rmccorm4 Aug 17, 2025
915847c
Rename multimodal_v1 to multimodal, and fix sglang link
rmccorm4 Aug 17, 2025
201c7f9
Bring back missing benchmark README
rmccorm4 Aug 17, 2025
241c614
Fix all broken links caught by lychee
rmccorm4 Aug 17, 2025
eac4cc5
Add github action for link validation (lychee)
rmccorm4 Aug 17, 2025
fc63b57
Update RELEASE_VERSION to 0.4.0 in dynamo_deploy quickstart
rmccorm4 Aug 17, 2025
ced1a73
Merge branch 'main' into rmccormick/cp_anish_docs_to_main
rmccorm4 Aug 17, 2025
8d6be0a
Remove benchmark README for easier review - will restore in a separat…
rmccorm4 Aug 17, 2025
f7f9350
Merge branch 'rmccormick/cp_anish_docs_to_main' of github.com:ai-dyna…
rmccorm4 Aug 17, 2025
a8e5396
Remove unused env var from link check action
rmccorm4 Aug 17, 2025
8d0e50b
Add WAR for lychee cert error
rmccorm4 Aug 17, 2025
c3ec608
Address CodeRabbit feedback
rmccorm4 Aug 17, 2025
4381aa8
Address CodeRabbit feedback - add TODO in workflow for lychee install
rmccorm4 Aug 17, 2025
ef0b231
Try installing ca-certs for cert errors
rmccorm4 Aug 17, 2025
fdfd807
Set GITHUB_TOKEN to avoid github rate limits on URL checks
rmccorm4 Aug 17, 2025
6b7c690
Add lychee result caching
rmccorm4 Aug 17, 2025
3e60e42
Add lychee result caching docs reference
rmccorm4 Aug 17, 2025

Fix all broken links caught by lychee
rmccorm4 committed Aug 17, 2025
commit 241c614b94f4e89100f3e86f32e4d2bbde27f831
2 changes: 1 addition & 1 deletion README.md
@@ -183,7 +183,7 @@ Run the backend/worker like this:
python -m dynamo.sglang.worker --help
```

You can pass any sglang flags directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html for the full list, including the options for using multiple GPUs.
You can pass any sglang flags directly to this worker; see https://docs.sglang.ai/advanced_features/server_arguments.html for the full list, including the options for using multiple GPUs.
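
As a rough sketch of passing such flags through (the model path and tensor-parallel size below are illustrative assumptions, not values prescribed by this README):

```bash
# Sketch only: --model-path and --tp-size are standard sglang server arguments;
# the model name here is an assumption for illustration.
python -m dynamo.sglang.worker \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --tp-size 2
```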

## TensorRT-LLM

19 changes: 3 additions & 16 deletions benchmarks/llm/README.md
@@ -40,7 +40,7 @@ This guide provides detailed steps on benchmarking Large Language Models (LLMs)
3. Start NATS and ETCD

```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
docker compose -f deploy/docker-compose.yml up -d
```

> [!NOTE]
@@ -312,27 +312,14 @@ With the Pareto Frontiers of the baseline and the disaggregated results plotted
greatest increase in throughput (along the y-axis) between the baseline and the disaggregated result Pareto Frontier,
over different latencies (along the x-axis).

For example, at 45 tokens/s/user, the increase in tokens/s/gpu is `145 - 80 = 65`, from the orange baseline to the
blue disaggregated line, so the improvement is roughly a 1.8x speedup:
![Example Pareto Plot](./example_plots/single_node_pareto_plot.png)
Note: The above example was collected over a single benchmarking run; actual numbers may vary between runs, configurations, and hardware.

## Supporting Additional Models

The instructions above can be used for nearly any model desired.
More complex setup instructions might be required for certain models.
The above instructions regarding ETCD, NATS, nginx, dynamo-serve, and GenAI-Perf still apply and can be reused.
The specifics of deploying with different hardware, in a unique environment, or using another model framework can be adapted using the links below.

Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](../../../benchmarks/llm/perf.sh) script to automate the measurement of model throughput and latency against multiple request concurrencies.

### Deployment Examples

- [Dynamo Multinode Deployments](../../../docs/examples/multinode.md)
- [Dynamo TensorRT LLM Deployments](../../../docs/examples/trtllm.md)
- [Aggregated Deployment of Very Large Models](../../../docs/examples/multinode.md#aggregated-deployment)
- [Dynamo vLLM Deployments](../../../docs/examples/llm_deployment.md)

Regardless of the deployment mechanism, the GenAI-Perf tool will report the same metrics and measurements so long as an accessible endpoint is available for it to interact with. Use the provided [perf.sh](perf.sh) script to automate the measurement of model throughput and latency against multiple request concurrencies.

## Monitor Benchmark Startup Status

@@ -388,7 +375,7 @@ when the script is invoked, it will:
## Metrics and Visualization

For instructions on how to acquire per worker metrics and visualize them using Grafana,
please see the provided [Visualization with Prometheus and Grafana](../../../deploy/metrics/README.md).
please see the provided [Visualization with Prometheus and Grafana](../../docs/guides/metrics.md).

## Troubleshooting

2 changes: 1 addition & 1 deletion components/README.md
@@ -77,4 +77,4 @@ To get started with Dynamo components:
4. **Run deployment scripts** from the engine's launch directory
5. **Monitor performance** using the metrics component

For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../../docs/).
For detailed instructions, see the README files in each component directory and the main [Dynamo documentation](../docs/).
2 changes: 1 addition & 1 deletion components/backends/sglang/README.md
@@ -52,7 +52,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

## Quick Start

Below we provide a guide that lets you run all of the common deployment patterns on a single node. See our different [architectures](../llm/README.md#deployment-architectures) for a high level overview of each pattern and the architecture diagram for each.
Below we provide a guide that lets you run all of the common deployment patterns on a single node.

### Start NATS and ETCD in the background

4 changes: 2 additions & 2 deletions components/backends/sglang/deploy/README.md
@@ -74,7 +74,7 @@ extraPodSpec:

Before using these templates, ensure you have:

1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md)
1. **Dynamo Cloud Platform installed** - See [Installing Dynamo Cloud](../../../../docs/guides/dynamo_deploy/dynamo_cloud.md)
2. **Kubernetes cluster with GPU support**
3. **Container registry access** for SGLang runtime images
4. **HuggingFace token secret** (referenced as `envFromSecret: hf-token-secret`)
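
As a minimal sketch of item 4 (the key name `HF_TOKEN` is an assumption; only the secret name `hf-token-secret` comes from the templates), such a secret could be created with:

```bash
# Sketch only: assumes the templates expect the token under a key named HF_TOKEN.
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token> \
  -n <your-namespace>
```
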
@@ -159,4 +159,4 @@ Common issues and solutions:
3. **Health check failures**: Review model loading logs and increase `initialDelaySeconds`
4. **Out of memory**: Increase memory limits or reduce model batch size

For additional support, refer to the [deployment troubleshooting guide](../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
For additional support, refer to the [deployment troubleshooting guide](../../../../docs/guides/dynamo_deploy/quickstart.md#troubleshooting).
2 changes: 1 addition & 1 deletion components/backends/sglang/docs/dsr1-wideep-h100.md
@@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0

# Running DeepSeek-R1 Disaggregated with WideEP on H100s

Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://www.nvidia.com/en-us/technologies/ai/deepseek-r1-large-scale-p-d-with-wide-expert-parallelism/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
Dynamo supports SGLang's implementation of wide expert parallelism and large scale P/D for DeepSeek-R1! You can read their blog post [here](https://lmsys.org/blog/2025-05-05-large-scale-ep/) for more details. We provide a Dockerfile for this in `container/Dockerfile.sglang-deepep` and configurations to deploy this at scale. In this example, we will run 1 prefill worker on 4 H100 nodes and 1 decode worker on 9 H100 nodes (104 total GPUs).
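
As a rough sketch of building that image directly with Docker (the tag name is an assumption, and the repository's own container build tooling may be the intended route; see the instructions below):

```bash
# Sketch only: image tag is illustrative; run from the repository root where container/ lives.
docker build -f container/Dockerfile.sglang-deepep -t dynamo-sglang-deepep:latest .
```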

## Instructions

6 changes: 3 additions & 3 deletions components/backends/sglang/slurm_jobs/README.md
@@ -1,10 +1,10 @@
# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) on a SLURM cluster.
This folder implements the example of [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../dsr1-wideep.md) example, with separate nodes handling prefill and decode.
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](../docs/dsr1-wideep-h100.md) example, with separate nodes handling prefill and decode.
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.

## Scripts
@@ -57,7 +57,7 @@ For simplicity of the example, we will make some assumptions about your SLURM cl
If your cluster supports similar container based plugins, you may be able to
modify the template to use that instead.
3. We assume you have already built a recent Dynamo+SGLang container image as
described [here](../dsr1-wideep.md#instructions).
described [here](../docs/dsr1-wideep-h100.md#instructions).
This is the image that can be passed to the `--container-image` argument in later steps.

## Usage
4 changes: 2 additions & 2 deletions components/backends/trtllm/README.md
@@ -193,7 +193,7 @@ For complete Kubernetes deployment instructions, configurations, and troubleshoo

### Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.
See [client](../vllm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

@@ -233,7 +233,7 @@ This allows a request to be migrated up to 3 times before failing. See the [Requ

## Client

See [client](../llm/README.md#client) section to learn how to send request to the deployment.
See [client](../vllm/README.md#client) section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.

2 changes: 1 addition & 1 deletion components/backends/trtllm/deploy/README.md
@@ -211,7 +211,7 @@ envs:

## Testing the Deployment

Send a test request to verify your deployment. See the [client section](../../../../components/backends/llm/README.md#client) for detailed instructions.
Send a test request to verify your deployment. See the [client section](../../../../components/backends/vllm/README.md#client) for detailed instructions.

**Note:** For multi-node deployments, target the node running `python3 -m dynamo.frontend <args>`.
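
As a hedged sketch of such a test request (the host, port, model name, and prompt below are assumptions, not values taken from this guide):

```bash
# Sketch only: adjust the address to your frontend and the model to the one you deployed.
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'
```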

2 changes: 1 addition & 1 deletion components/backends/trtllm/gpt-oss.md
@@ -345,7 +345,7 @@ flowchart TD

## Next Steps

- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../examples/basics/multinode/README.md)
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use GenAI-Perf to measure and optimize your deployment performance
2 changes: 1 addition & 1 deletion components/backends/vllm/LMCache_Integration.md
@@ -166,5 +166,5 @@ lmcache_config = {
## References and Additional Resources

- [LMCache Documentation](https://docs.lmcache.ai/index.html) - Comprehensive guide and API reference
- [Configuration Reference](https://docs.lmcache.ai/api_reference/config.html) - Detailed configuration options
- [Configuration Reference](https://docs.lmcache.ai/api_reference/configurations.html) - Detailed configuration options

File renamed without changes.
4 changes: 1 addition & 3 deletions components/metrics/README.md
@@ -18,7 +18,6 @@ The deprecated `metrics` component is a utility for collecting, aggregating, and

**Note**: This is a demo implementation. The deprecated `metrics` component is no longer under active development.
- In this demo the metrics names use the prefix "llm", but in production they will be prefixed with "dynamo" (e.g., the HTTP `/metrics` endpoint will serve metrics with "dynamo" prefixes)
- This demo will only work when using examples/llm/configs/agg.yml-- other configurations will not work

<div align="center">
<img src="images/dynamo_metrics_grafana.png" alt="Dynamo Metrics Dashboard"/>
@@ -81,8 +80,7 @@ metrics --component MyComponent --endpoint my_endpoint

### Real Worker

To run a more realistic deployment to gathering metrics from,
see the examples in [examples/llm](../../examples/llm).
To run a more realistic deployment to gather metrics:

```bash
python -m dynamo.frontend &
2 changes: 1 addition & 1 deletion deploy/metrics/README.md
@@ -68,7 +68,7 @@ When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), th
### Required Files

The following configuration files should be present in this directory:
- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
- [docker-compose.yml](../docker-compose.yml): Defines the Prometheus and Grafana services
- [prometheus.yml](./prometheus.yml): Contains Prometheus scraping configuration
- [grafana-datasources.yml](./grafana-datasources.yml): Contains Grafana datasource configuration
- [grafana_dashboards/grafana-dashboard-providers.yml](./grafana_dashboards/grafana-dashboard-providers.yml): Contains Grafana dashboard provider configuration
10 changes: 4 additions & 6 deletions docs/API/nixl_connect/README.md
@@ -103,7 +103,7 @@ flowchart LR

### Multimodal Example

In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md):
In the case of the [Dynamo Multimodal Disaggregated Example](../../../examples/multimodal/README.md):

1. The HTTP frontend accepts a text prompt and a URL to an image.

@@ -153,11 +153,11 @@ flowchart LR

#### Code Examples

See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example,
See [prefill_worker](../../../examples/multimodal/components/worker.py) or [decode_worker](../../../examples/multimodal/components/worker.py) from our Multimodal example,
for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md),
sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.

See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
See [encode_worker](../../..//examples/multimodal/components/encode_worker.py#L190) from our Multimodal example,
for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md),
a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker,
and the worker awaits for the data transfer to complete for yielding a response.
@@ -170,15 +170,13 @@ and the worker awaits for the data transfer to complete for yielding a response.
- [Device](device.md)
- [ReadOperation](read_operation.md)
- [ReadableOperation](readable_operation.md)
- [SerializedRequest](serialized_request.md)
- [WritableOperation](writable_operation.md)
- [WriteOperation](write_operation.md)


## References

- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
- [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect)
- [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
- [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
- [Dynamo Multimodal Example](../../..//examples/multimodal)
- [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
2 changes: 1 addition & 1 deletion docs/architecture/dynamo_flow.md
@@ -17,7 +17,7 @@ limitations under the License.

# Dynamo Architecture Flow

This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [examples/llm](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm). Color-coded flows indicate different types of operations:
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in [components/backends/vllm](../../components/backends/vllm). Color-coded flows indicate different types of operations:

## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
4 changes: 2 additions & 2 deletions docs/guides/deploy/k8s_metrics.md
@@ -39,7 +39,7 @@ This will create two components:
- A Worker component exposing metrics on its system port

Both components expose a `/metrics` endpoint following the OpenMetrics format, but with different metrics appropriate to their roles. For details about:
- Deployment configuration: See the [vLLM README](../../../../components/backends/vllm/README.md)
- Deployment configuration: See the [vLLM README](../../components/backends/vllm/README.md)
- Available metrics: See the [metrics guide](../metrics.md)
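
A quick way to eyeball one of these endpoints, sketched with illustrative names (the Service name, ports, and namespace are assumptions, not values from this guide):

```bash
# Sketch only: substitute your actual Service name, namespace, and port.
kubectl port-forward svc/<frontend-service> 8080:8000 -n <your-namespace> &
curl -s localhost:8080/metrics | head -n 20
```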

### Validate the Deployment
@@ -62,7 +62,7 @@ curl localhost:8080/v1/chat/completions \
}'
```

For more information about validating the deployment, see the [vLLM README](../../../../components/backends/vllm/README.md).
For more information about validating the deployment, see the [vLLM README](../../components/backends/vllm/README.md).

## Set Up Metrics Collection

2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/README.md
@@ -54,7 +54,7 @@ You can use `kubectl get dynamoGraphDeployment -n ${NAMESPACE}` to view your dep
You can use `kubectl delete dynamoGraphDeployment <your-dep-name> -n ${NAMESPACE}` to delete the deployment.

We provide a Custom Resource YAML file for many examples under the `deploy/` folder.
Use [VLLM YAML](../../components/backends/vllm/deploy/agg.yaml) for an example.
Use [VLLM YAML](../../../components/backends/vllm/deploy/agg.yaml) for an example.
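
For instance, a minimal sketch of creating and then inspecting a deployment from that example CR (assuming the file path referenced above and a namespace set up as described in this guide):

```bash
# Sketch only: the YAML path follows the example linked above.
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```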

**Note 1** Example Image

2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/model_caching_with_fluid.md
@@ -318,7 +318,7 @@ spec:

- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://min.io/docs/)
- [MinIO Documentation](https://docs.min.io/)
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
12 changes: 6 additions & 6 deletions docs/guides/dynamo_deploy/multinode-deployment.md
@@ -50,8 +50,8 @@ These systems provide enhanced scheduling capabilities including topology-aware

LWS is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes.

- **LWS**: [LWS Installation](https://github.com/NVIDIA/LWS#installation)
- **Volcano**: [Volcano Installation](https://volcano.sh/docs/installation/install-volcano/)
- **LWS**: [LWS Installation](https://github.com/kubernetes-sigs/lws#installation)
- **Volcano**: [Volcano Installation](https://volcano.sh/en/docs/installation/)

Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.

@@ -110,8 +110,8 @@ args:

For additional support and examples, see the working multinode configurations in:

- **SGLang**: [components/backends/sglang/deploy/](../../components/backends/sglang/deploy/)
- **TensorRT-LLM**: [components/backends/trtllm/deploy/](../../components/backends/trtllm/deploy/)
- **vLLM**: [components/backends/vllm/deploy/](../../components/backends/vllm/deploy/)
- **SGLang**: [components/backends/sglang/deploy/](../../../components/backends/sglang/deploy/)
- **TensorRT-LLM**: [components/backends/trtllm/deploy/](../../../components/backends/trtllm/deploy/)
- **vLLM**: [components/backends/vllm/deploy/](../../../components/backends/vllm/deploy/)

These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
These examples demonstrate proper usage of the `multinode` section with corresponding `gpu` limits and correct `tp-size` configuration.
2 changes: 1 addition & 1 deletion docs/guides/dynamo_deploy/quickstart.md
@@ -67,7 +67,7 @@ Ensure you have the source code checked out and are in the `dynamo` directory:

### Set Environment Variables

Our examples use the [`nvcr.io`](https://nvcr.io/nvidia/ai-dynamo/) registry, but you can set up your own values if you use another docker registry.
Our examples use the [`nvcr.io`](https://catalog.ngc.nvidia.com) registry, but you can set up your own values if you use another docker registry.

```bash
export NAMESPACE=dynamo-cloud # or whatever you prefer.
6 changes: 3 additions & 3 deletions examples/basics/disaggregated_serving/README.md
@@ -36,9 +36,9 @@ docker compose -f deploy/metrics/docker-compose.yml up -d

## Components

- [Frontend](../../../components/frontend/README) - HTTP API endpoint that receives requests and forwards them to the decode worker
- [vLLM Prefill Worker](../../../components/backends/vllm/README) - Specialized worker for prefill phase execution
- [vLLM Decode Worker](../../../components/backends/vllm/README) - Specialized worker that handles requests and decides between local/remote prefill
- [Frontend](../../../components/frontend/README.md) - HTTP API endpoint that receives requests and forwards them to the decode worker
- [vLLM Prefill Worker](../../../components/backends/vllm/README.md) - Specialized worker for prefill phase execution
- [vLLM Decode Worker](../../../components/backends/vllm/README.md) - Specialized worker that handles requests and decides between local/remote prefill

```mermaid
---