diff --git a/components/backends/sglang/README.md b/components/backends/sglang/README.md index 9272d2e91e..46a6b039f1 100644 --- a/components/backends/sglang/README.md +++ b/components/backends/sglang/README.md @@ -46,7 +46,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | Feature | SGLang | Notes | |---------------------|--------|--------------------------------------------------------------| | **WideEP** | ✅ | Full support on H100s/GB200 | -| **DP Rank Routing** | 🚧 | Direct routing supported. Dynamo KV router does not router to DP worker | +| **Attention DP** | 🚧 | Direct routing supported. Dynamo KV router does not route to DP workers | | **GB200 Support** | ✅ | | diff --git a/components/backends/trtllm/README.md b/components/backends/trtllm/README.md index 6b710cfee3..e5cf0eb172 100644 --- a/components/backends/trtllm/README.md +++ b/components/backends/trtllm/README.md @@ -61,7 +61,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | Feature | TensorRT-LLM | Notes | |--------------------|--------------|-----------------------------------------------------------------------| | **WideEP** | ✅ | | -| **DP Rank Routing**| ✅ | | +| **Attention DP** | ✅ | | | **GB200 Support** | ✅ | | ## Quick Start diff --git a/components/backends/vllm/README.md b/components/backends/vllm/README.md index f7d4019bc5..74d82e6db2 100644 --- a/components/backends/vllm/README.md +++ b/components/backends/vllm/README.md @@ -47,7 +47,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) | Feature | vLLM | Notes | |--------------------|------|-----------------------------------------------------------------------| | **WideEP** | ✅ | Support for PPLX / DeepEP not verified | -| **DP Rank Routing**| ✅ | Supported via external control of DP ranks | +| **Attention DP** | ✅ | Supported via external control of DP ranks | | **GB200 Support** | 🚧 | Container functional on main | ## Quick Start diff --git a/docs/API/nixl_connect/README.md b/docs/API/nixl_connect/README.md index 741b943847..8952548eff 100644 --- a/docs/API/nixl_connect/README.md +++ b/docs/API/nixl_connect/README.md @@ -85,68 +85,6 @@ flowchart LR e2@{ animate: true; } ``` -### Multimodal Example - -In the case of the [Dynamo Multimodal Disaggregated Example](../../examples/multimodal/README.md): - - 1. The HTTP frontend accepts a text prompt and a URL to an image. - - 2. The prompt and URL are then enqueued with the Processor before being dispatched to the first available Decode Worker. - - 3. Decode Worker then requests a Prefill Worker to provide key-value data for the LLM powering the Decode Worker. - - 4. Prefill Worker then requests that the image be processed and provided as embeddings by the Encode Worker. - - 5. Encode Worker acquires the image, processes it, performs inference on the image using a specialized vision model, and finally provides the embeddings to Prefill Worker. - - 6. Prefill Worker receives the embeddings from Encode Worker and generates a key-value cache (KV$) update for Decode Worker's LLM and writes the update directly to the GPU memory reserved for the data. - - 7. Finally, Decode Worker performs the requested inference.
- -```mermaid ---- -title: Multimodal Disaggregated Workflow ---- -flowchart LR - p0[HTTP Frontend] i0@--"text prompt"-->p1[Processor] - p0 i1@--"url"-->p1 - p1 i2@--"prompt"-->dw[Decode Worker] - p1 i3@--"url"-->dw - dw i4@--"prompt"-->pw[Prefill Worker] - dw i5@--"url"-->pw - pw i6@--"url"-->ew[Encode Worker] - ew o0@=="image embeddings"==>pw - pw o1@=="kv_cache updates"==>dw - dw o2@--"inference results"-->p0 - - i0@{ animate: true; } - i1@{ animate: true; } - i2@{ animate: true; } - i3@{ animate: true; } - i4@{ animate: true; } - i5@{ animate: true; } - i6@{ animate: true; } - o0@{ animate: true; } - o1@{ animate: true; } - o2@{ animate: true; } -``` - -> [!Note] -> In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library. -> The KV Cache transfer between Decode Worker and Prefill Worker utilizes the NIXL base RDMA subsystem directly without using the Dynamo NIXL Connect library. - -#### Code Examples - -See [prefill_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/prefill_worker.py#L199) or [decode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/decode_worker.py#L239) from our Multimodal example, -for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable_operation.md), -sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data. - -See [encode_worker](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal/components/encode_worker.py#L190) from our Multimodal example, -for how the resulting embeddings are registered with the RDMA subsystem by creating a [`Descriptor`](descriptor.md), -a [`WriteOperation`](write_operation.md) is created using the metadata provided by the requesting worker, -and the worker awaits for the data transfer to complete for yielding a response. - - ## Python Classes - [Connector](connector.md) @@ -154,7 +92,6 @@ and the worker awaits for the data transfer to complete for yielding a response. - [Device](device.md) - [ReadOperation](read_operation.md) - [ReadableOperation](readable_operation.md) - - [SerializedRequest](serialized_request.md) - [WritableOperation](writable_operation.md) - [WriteOperation](write_operation.md) @@ -164,5 +101,4 @@ and the worker awaits for the data transfer to complete for yielding a response. 
- [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo) - [NVIDIA Dynamo NIXL Connect](https://github.com/ai-dynamo/dynamo/tree/main/docs/runtime/nixl_connect) - [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl) - - [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal) - [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect) diff --git a/docs/architecture/dynamo_flow.md b/docs/architecture/dynamo_flow.md index 32146e1188..e4d060340d 100644 --- a/docs/architecture/dynamo_flow.md +++ b/docs/architecture/dynamo_flow.md @@ -67,7 +67,7 @@ Coordination and messaging support: ## Technical Implementation Details -### NIXL (NVIDIA Interchange Library): +### NIXL (NVIDIA Inference Xfer Library): - Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe - Decode Worker publishes GPU metadata to ETCD for coordination - PrefillWorker loads metadata to establish direct communication channels diff --git a/docs/architecture/kv_cache_routing.md b/docs/architecture/kv_cache_routing.md index a78feef9f5..35e5095b59 100644 --- a/docs/architecture/kv_cache_routing.md +++ b/docs/architecture/kv_cache_routing.md @@ -21,7 +21,7 @@ The KV-aware routing arguments: - `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked. -- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events. +- `--kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events. ## Architecture diff --git a/docs/architecture/planner_intro.rst b/docs/architecture/planner_intro.rst index 07d91b1132..dfafe2af69 100644 --- a/docs/architecture/planner_intro.rst +++ b/docs/architecture/planner_intro.rst @@ -19,13 +19,13 @@ Planner The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. 
-Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size: +Currently, the planner can scale the number of vLLM workers up and down based on the kv cache load and prefill queue size. Key features include: * **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions * **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets -* **Multi-backend support** for both local (Circus) and Kubernetes environments +* **Multi-backend support** for Kubernetes environments * **Graceful scaling** that ensures no requests are dropped during scale-down operations .. list-table:: @@ -50,9 +50,6 @@ Key features include: * - - ❌ - SGLang * - - - ❌ - - llama.cpp * - **Serving Type** - ✅ - Aggregated diff --git a/docs/examples/README.md b/docs/examples/README.md index 560360cd62..16091a26e4 100644 --- a/docs/examples/README.md +++ b/docs/examples/README.md @@ -45,9 +45,9 @@ Consult the examples below for the CRs for your specific inference backend. [View SGLang k8s](../../components/backends/sglang/deploy/README.md) -[View vLLM K8s](../../components/backends/vllm/deploy/README.md) +[View vLLM K8s](../../components/backends/vllm/README.md#kubernetes-deployment) -[View TRTLLM k8s](../../components/backends/trtllm/deploy/README.md) +[View TRTLLM k8s](../../components/backends/trtllm/README.md#kubernetes-deployment) **Note 1** Example Image diff --git a/docs/guides/dynamo_deploy/dynamo_operator.md b/docs/guides/dynamo_deploy/dynamo_operator.md index 4d3c2a04eb..9e52384da9 100644 --- a/docs/guides/dynamo_deploy/dynamo_operator.md +++ b/docs/guides/dynamo_deploy/dynamo_operator.md @@ -75,7 +75,7 @@ spec: ## GitOps Deployment with FluxCD -This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../components/backends/vllm/README.md) to demonstrate the workflow. +This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../components/backends/vllm/README.md) to demonstrate the workflow. ### Prerequisites diff --git a/docs/guides/metrics.md b/docs/guides/metrics.md index 9ced98cc86..df6b3b8a39 100644 --- a/docs/guides/metrics.md +++ b/docs/guides/metrics.md @@ -25,11 +25,11 @@ Dynamo provides built-in metrics capabilities through the `MetricsRegistry` trai Dynamo automatically exposes metrics with the `dynamo_` name prefixes. It also adds the following labels `dynamo_namespace`, `dynamo_component`, and `dynamo_endpoint` to indicate which component is providing the metric. -**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of frontend metrics.
+**Frontend Metrics**: When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name. These cover request handling, token processing, and latency measurements. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of frontend metrics. -**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for the complete list of component metrics. +**Component Metrics**: The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework. These include request counts, processing times, byte transfers, and system uptime metrics. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for the complete list of component metrics. -**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](../../deploy/metrics/README.md#available-metrics) for details on specialized component metrics. +**Specialized Component Metrics**: Components can also expose additional metrics specific to their functionality. For example, a `preprocessor` component exposes metrics with the `dynamo_preprocessor_*` prefix. See the [Available Metrics section](./metrics_deployment.md#available-metrics) for details on specialized component metrics. ## Coming Soon @@ -49,7 +49,7 @@ This hierarchical structure allows you to create metrics at the appropriate leve ## Getting Started -For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](../../deploy/metrics/README.md#getting-started) in the deploy metrics documentation. +For a complete setup guide including Docker Compose configuration, Prometheus setup, and Grafana dashboards, see the [Getting Started section](./metrics_deployment.md#getting-started) in the metrics deployment guide. The quick start includes: - Docker Compose setup for Prometheus and Grafana @@ -59,7 +59,7 @@ The quick start includes: ## Implementation Examples -See [Implementation Examples](../../deploy/metrics/README.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels. +See [Implementation Examples](./metrics_deployment.md#implementation-examples) for detailed examples of creating metrics at different hierarchy levels and using dynamic labels. 
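+
+For illustration, a scrape of one of these endpoints returns samples in Prometheus text format. The metric and label names below are those documented in the Available Metrics section; the label values and counts are made up for this sketch:
+
+```text
+# Frontend metric with the automatically added `model` label (illustrative values)
+dynamo_frontend_requests_total{model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"} 42
+# Component metric with the automatically added namespace/component/endpoint labels
+dynamo_component_requests_total{dynamo_namespace="my_namespace",dynamo_component="backend",dynamo_endpoint="generate"} 42
+```
+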
### Grafana Dashboards @@ -99,5 +99,5 @@ The metrics system includes a pre-configured Grafana dashboard for visualizing s - [Distributed Runtime Architecture](../architecture/distributed_runtime.md) - [Dynamo Architecture Overview](../architecture/architecture.md) - [Backend Guide](backend.md) -- [Metrics Implementation Examples](../../deploy/metrics/README.md#implementation-examples) -- [Complete Metrics Setup Guide](../../deploy/metrics/README.md) \ No newline at end of file +- [Metrics Implementation Examples](./metrics_deployment.md#implementation-examples) +- [Complete Metrics Setup Guide](./metrics_deployment.md) \ No newline at end of file diff --git a/docs/guides/metrics_deployment.md b/docs/guides/metrics_deployment.md new file mode 100644 index 0000000000..7ee5f286eb --- /dev/null +++ b/docs/guides/metrics_deployment.md @@ -0,0 +1,366 @@ + + +# Metrics Visualization with Prometheus and Grafana + +This guide contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana. + +> [!NOTE] +> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](./metrics.md). + +## Overview + +### Components + +- **Prometheus Server**: Collects and stores metrics from Dynamo services and other components. +- **Grafana**: Provides dashboards by querying the Prometheus Server. + +### Topology + +Default Service Relationship Diagram: +```mermaid +graph TD + BROWSER[Browser] -->|:3001| GRAFANA[Grafana :3001] + subgraph DockerComposeNetwork [Network inside Docker Compose] + NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222] + PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380] + PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401] + PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP + PROMETHEUS -->|:8080/metrics| DYNAMOFE[Dynamo HTTP FE :8080] + PROMETHEUS -->|:8081/metrics| DYNAMOBACKEND[Dynamo backend :8081] + DYNAMOFE --> DYNAMOBACKEND + GRAFANA -->|:9090/query API| PROMETHEUS + end +``` + +The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This adjustment is made to avoid port conflicts with other dcgm-exporter instances that may be running simultaneously. Such a configuration is typical in distributed systems like SLURM. + +As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build containers with `--framework VLLM` or `--framework TENSORRTLLM`. 
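+
+As a concrete illustration of the topology above, a minimal Prometheus scrape configuration for the two Dynamo endpoints might look like the sketch below. The shipped configuration lives in `deploy/metrics/prometheus.yml`; the job names, target hostnames, and scrape interval here are illustrative assumptions rather than the repository defaults:
+
+```yaml
+scrape_configs:
+  - job_name: dynamo-frontend          # Dynamo HTTP FE (:8080/metrics in the diagram)
+    scrape_interval: 15s
+    static_configs:
+      - targets: ["localhost:8080"]
+  - job_name: dynamo-backend           # Dynamo backend (:8081/metrics in the diagram)
+    scrape_interval: 15s
+    static_configs:
+      - targets: ["localhost:8081"]
+```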
+ +## Available Metrics + +### Component Metrics + +The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework: + +- `dynamo_component_concurrent_requests`: Requests currently being processed (gauge) +- `dynamo_component_request_bytes_total`: Total bytes received in requests (counter) +- `dynamo_component_request_duration_seconds`: Request processing time (histogram) +- `dynamo_component_requests_total`: Total requests processed (counter) +- `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter) +- `dynamo_component_system_uptime_seconds`: DistributedRuntime uptime (gauge) + +### Specialized Component Metrics + +Some components expose additional metrics specific to their functionality: + +- `dynamo_preprocessor_*`: Metrics specific to preprocessor components + +### Frontend Metrics + +When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name: + +- `dynamo_frontend_inflight_requests`: Inflight requests (gauge) +- `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram) +- `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram) +- `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram) +- `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram) +- `dynamo_frontend_requests_total`: Total LLM requests (counter) +- `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram) + +### Required Files + +The following configuration files should be present in the deploy/metrics directory: +- `docker-compose.yml`: Defines the Prometheus and Grafana services +- `prometheus.yml`: Contains Prometheus scraping configuration +- `grafana-datasources.yml`: Contains Grafana datasource configuration +- `grafana_dashboards/grafana-dashboard-providers.yml`: Contains Grafana dashboard provider configuration +- `grafana_dashboards/grafana-dynamo-dashboard.json`: A general Dynamo Dashboard for both SW and HW metrics. +- `grafana_dashboards/grafana-dcgm-metrics.json`: Contains Grafana dashboard configuration for DCGM GPU metrics +- `grafana_dashboards/grafana-llm-metrics.json`: This file, which is being phased out, contains the Grafana dashboard configuration for LLM-specific metrics. It requires an additional `metrics` component to operate concurrently. A new version is under development. + +## Getting Started + +### Prerequisites + +1. Make sure Docker and Docker Compose are installed on your system + +### Quick Start + +1. Start Dynamo dependencies. Assume you're at the root dynamo path: + + ```bash + # Start the basic services (etcd & natsd), along with Prometheus and Grafana + docker compose -f deploy/docker-compose.yml --profile metrics up -d + + # Minimum components for Dynamo (will not have Prometheus and Grafana): etcd/nats/dcgm-exporter + docker compose -f deploy/docker-compose.yml up -d + ``` + + Optional: To target specific GPU(s), export the variable below before running Docker Compose + ```bash + export CUDA_VISIBLE_DEVICES=0,2 + ``` + +2. Web servers started. The ones that end in /metrics are in Prometheus format: + - Grafana: `http://localhost:3001` (default login: dynamo/dynamo) + - Prometheus Server: `http://localhost:9090` + - NATS Server: `http://localhost:8222` (monitoring endpoints: /varz, /healthz, etc.) 
+ - NATS Prometheus Exporter: `http://localhost:7777/metrics` + - etcd Server: `http://localhost:2379/metrics` + - DCGM Exporter: `http://localhost:9401/metrics` + + + - Start the components/metrics application to begin monitoring for metric events from dynamo workers and aggregating them on a Prometheus metrics endpoint: `http://localhost:9091/metrics`. + - Uncomment the appropriate lines in prometheus.yml to poll port 9091. + - Start worker(s) that publishes KV Cache metrics. + +### Configuration + +#### Prometheus + +The Prometheus configuration is specified in `deploy/metrics/prometheus.yml`. This file is set up to collect metrics from the metrics aggregation service endpoint. + +Please be aware that you might need to modify the target settings to align with your specific host configuration and network environment. + +After making changes to prometheus.yml, it is necessary to reload the configuration using the command below. Simply sending a kill -HUP signal will not suffice due to the caching of the volume that contains the prometheus.yml file. + +``` +docker compose -f deploy/docker-compose.yml up prometheus -d --force-recreate +``` + +#### Grafana + +Grafana is pre-configured with: +- Prometheus datasource +- Sample dashboard for visualizing service metrics +![grafana image](../../deploy/metrics/grafana-dynamo-composite.png) + +### Troubleshooting + +1. Verify services are running: + ```bash + docker compose ps + ``` + +2. Check logs: + ```bash + docker compose logs prometheus + docker compose logs grafana + ``` + +3. For issues with the legacy metrics component (being phased out), see components/metrics/README.md for details on the exposed metrics and troubleshooting steps. + +## Implementation Examples + +### Creating Metrics at Different Hierarchy Levels + +#### Runtime-Level Metrics + +```rust +use dynamo_runtime::DistributedRuntime; + +let runtime = DistributedRuntime::new()?; +let namespace = runtime.namespace("my_namespace")?; +let component = namespace.component("my_component")?; +let endpoint = component.endpoint("my_endpoint")?; + +// Create endpoint-level counters (this is a Prometheus Counter type) +let total_requests = endpoint.create_counter( + "total_requests", + "Total requests across all namespaces", + &[] +)?; + +let active_connections = endpoint.create_gauge( + "active_connections", + "Number of active client connections", + &[] +)?; +``` + +#### Namespace-Level Metrics + +```rust +let namespace = runtime.namespace("my_model")?; + +// Namespace-scoped metrics +let model_requests = namespace.create_counter( + "model_requests", + "Requests for this specific model", + &[] +)?; + +let model_latency = namespace.create_histogram( + "model_latency_seconds", + "Model inference latency", + &[], + &[0.001, 0.01, 0.1, 1.0, 10.0] +)?; +``` + +#### Component-Level Metrics + +```rust +let component = namespace.component("backend")?; + +// Component-specific metrics +let backend_requests = component.create_counter( + "backend_requests", + "Requests handled by this backend component", + &[] +)?; + +let gpu_memory_usage = component.create_gauge( + "gpu_memory_bytes", + "GPU memory usage in bytes", + &[] +)?; +``` + +#### Endpoint-Level Metrics + +```rust +let endpoint = component.endpoint("generate")?; + +// Endpoint-specific metrics +let generate_requests = endpoint.create_counter( + "generate_requests", + "Generate endpoint requests", + &[] +)?; + +let generate_latency = endpoint.create_histogram( + "generate_latency_seconds", + "Generate endpoint latency", + &[], + &[0.001, 
0.01, 0.1, 1.0, 10.0] +)?; +``` + +### Creating Vector Metrics with Dynamic Labels + +Use vector metrics when you need to track metrics with different label values: + +```rust +// Counter with labels +let requests_by_model = endpoint.create_counter_vec( + "requests_by_model", + "Requests by model type", + &["model_type", "model_size"] +)?; + +// Increment with specific labels +requests_by_model.with_label_values(&["llama", "7b"]).inc(); +requests_by_model.with_label_values(&["gpt", "13b"]).inc(); + +// Gauge with labels +let memory_by_gpu = component.create_gauge_vec( + "gpu_memory_bytes", + "GPU memory usage by device", + &["gpu_id", "memory_type"] +)?; + +memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0); +memory_by_gpu.with_label_values(&["0", "cached"]).set(4096.0); +``` + +### Creating Histograms + +Histograms are useful for measuring distributions of values like latency: + +```rust +let latency_histogram = endpoint.create_histogram( + "request_latency_seconds", + "Request latency distribution", + &[], + &[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0] +)?; + +// Record latency values +latency_histogram.observe(0.023); // 23ms +latency_histogram.observe(0.156); // 156ms +``` + +### Transitioning from Plain Prometheus + +If you're currently using plain Prometheus metrics, transitioning to Dynamo's `MetricsRegistry` is straightforward: + +#### Before (Plain Prometheus) + +```rust +use prometheus::{Counter, Opts, Registry}; + +// Create a registry to hold metrics +let registry = Registry::new(); +let counter_opts = Opts::new("my_counter", "My custom counter"); +let counter = Counter::with_opts(counter_opts).unwrap(); +registry.register(Box::new(counter.clone())).unwrap(); + +// Use the counter +counter.inc(); + +// To expose metrics, you'd need to set up an HTTP server manually +// and implement the /metrics endpoint yourself +``` + +#### After (Dynamo MetricsRegistry) + +```rust +let counter = endpoint.create_counter( + "my_counter", + "My custom counter", + &[] +)?; + +counter.inc(); +``` + +**Note:** The metric is automatically registered when created via the endpoint's `create_counter` factory method. + +**Benefits of Dynamo's approach:** +- **Automatic registration**: Metrics created via endpoint's `create_*` factory methods are automatically registered with the system +- Automatic labeling with namespace, component, and endpoint information +- Consistent metric naming with `dynamo_` prefix +- Built-in HTTP metrics endpoint when enabled with `DYN_SYSTEM_ENABLED=true` +- Hierarchical metric organization + +### Advanced Features + +#### Custom Buckets for Histograms + +```rust +// Define custom buckets for your use case +let custom_buckets = vec![0.001, 0.01, 0.1, 1.0, 10.0]; +let latency = endpoint.create_histogram( + "api_latency_seconds", + "API latency in seconds", + &[], + &custom_buckets +)?; +``` + +#### Metric Aggregation + +```rust +// Aggregate metrics across multiple endpoints +let total_requests = namespace.create_counter( + "total_requests", + "Total requests across all endpoints", + &[] +)?; +``` \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index c751f0d819..c16f70be58 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -17,7 +17,9 @@ Welcome to NVIDIA Dynamo ======================== -The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale. 
+The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve generative AI and reasoning models—across any framework, architecture, or deployment scale. Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach. + +Large language models are quickly outgrowing the memory and compute budget of any single GPU. Tensor-parallelism solves the capacity problem by spreading each layer across many GPUs—and sometimes many servers—but it creates a new one: how do you coordinate those shards, route requests, and share KV cache fast enough to feel like one accelerator? This orchestration gap is exactly what NVIDIA Dynamo is built to close. .. admonition:: 💎 Discover the latest developments! :class: seealso @@ -25,20 +27,66 @@ The NVIDIA Dynamo Platform is a high-performance, low-latency inference framewor This guide is a snapshot of the `Dynamo GitHub Repository `_ at a specific point in time. For the latest information and examples, see: - `Dynamo README `_ - - `Architecture and features doc `_ - - `Usage guides `_ - - `Dynamo examples repo `_ + - `Architecture and Features `_ + - `Usage Guides `_ + - `Dynamo Examples `_ Quick Start ----------------- -Follow the :doc:`Quick Guide to install Dynamo Platform `. +Local Deployment +~~~~~~~~~~~~~~~~ -Dive in: Examples ------------------ +Get started with Dynamo locally in just a few commands: + +**1. Install Dynamo** + +.. code-block:: bash + + # Install uv (recommended Python package manager) + curl -LsSf https://astral.sh/uv/install.sh | sh + + # Create virtual environment and install Dynamo + uv venv venv + source venv/bin/activate + uv pip install "ai-dynamo[sglang]" # or [vllm], [trtllm] + +**2. Start etcd/NATS** + +.. code-block:: bash + + # Start etcd and NATS using Docker Compose + docker compose -f deploy/docker-compose.yml up -d + +**3. Run Dynamo** -The examples below assume you build the latest image yourself from source. If using a prebuilt image follow the examples from the corresponding branch. +.. code-block:: bash + + # Start the OpenAI compatible frontend + python -m dynamo.frontend + + # In another terminal, start an SGLang worker + python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B + +**4. Test your deployment** + +.. code-block:: bash + + curl localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", + "messages": [{"role": "user", "content": "Hello!"}], + "max_tokens": 50}' + +Kubernetes Deployment +~~~~~~~~~~~~~~~~~~~~~ + +For deployments on Kubernetes, follow the :doc:`Dynamo Platform Quickstart Guide `. + + +Dive in: Dynamo Examples +----------------- .. grid:: 1 2 2 2 :gutter: 3 @@ -49,25 +97,25 @@ The examples below assume you build the latest image yourself from source. If us :link: examples/runtime/hello_world/README :link-type: doc - Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph + Demonstrates the basic concepts of Dynamo by creating a simple GPU-unaware graph using Python bindings. - .. grid-item-card:: :doc:`LLM Serving with VLLM ` + .. grid-item-card:: :doc:`LLM Serving with vLLM ` :link: components/backends/vllm/README :link-type: doc - Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations with VLLM. 
+      Examples and reference implementations for deploying LLM inference workflows in various configurations with vLLM. - .. grid-item-card:: :doc:`Multinode with SGLang ` - :link: components/backends/sglang/docs/multinode-examples - :link-type: doc + .. grid-item-card:: :doc:`Deploy DeepSeek R1 Disaggregated with WideEP using SGLang ` + :link: components/backends/sglang/docs/dsr1-wideep-gb200 + :link-type: doc - Demonstrates disaggregated serving on several nodes. + Demonstrates disaggregated serving of DeepSeek R1 with Wide Expert Parallelism using SGLang. - .. grid-item-card:: :doc:`TensorRT-LLM ` + .. grid-item-card:: :doc:`Deploy with TensorRT-LLM ` :link: components/backends/trtllm/README :link-type: doc - Presents TensorRT-LLM examples and reference implementations for deploying Large Language Models (LLMs) in various configurations. + Presents TensorRT-LLM examples and reference implementations for deploying LLMs in various configurations. .. toctree:: @@ -92,42 +140,21 @@ The examples below assume you build the latest image yourself from source. If us :hidden: :caption: Using Dynamo - Running Inference Graphs Locally (dynamo-run) - Deploying Inference Graphs - -.. toctree:: - :hidden: - :caption: Usage Guides - Writing Python Workers in Dynamo Disaggregation and Performance Tuning KV Cache Router Performance Tuning Working with Dynamo Kubernetes Operator + Configuring Metrics for Observability .. toctree:: :hidden: :caption: Deployment Guides - Dynamo Deploy Quickstart - Dynamo Cloud Kubernetes Platform + Deploying Dynamo on Kubernetes Manual Helm Deployment - GKE Setup Guide Minikube Setup Guide Model Caching with Fluid -.. toctree:: - :hidden: - :caption: Benchmarking - - Planner Benchmark Example - - -.. toctree:: - :hidden: - :caption: API - - NIXL Connect API - .. toctree:: :hidden: :caption: Examples LLM Deployment Examples using VLLM Multinode Examples using SGLang LLM Deployment Examples using TensorRT-LLM + Planner Benchmark Example .. toctree:: :hidden: :caption: Reference - Glossary + NIXL Connect API KVBM Reading