 This directory contains configuration for visualizing metrics from the metrics aggregation service using Prometheus and Grafana.

-## Components
+> [!NOTE]
+> For detailed information about Dynamo's metrics system, including hierarchical metrics, automatic labeling, and usage examples, see the [Metrics Guide](../../docs/guides/metrics.md).
+
+## Overview
+
+### Components

 - **Prometheus Server**: Collects and stores metrics from Dynamo services and other components.
 - **Grafana**: Provides dashboards by querying the Prometheus Server.

-## Topology
+### Topology

 Default Service Relationship Diagram:
 ```mermaid
@@ -29,17 +34,63 @@ The dcgm-exporter service in the Docker Compose network is configured to use por

 As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build containers with `--framework VLLM` or `--framework TENSORRTLLM`.

+### Available Metrics
+
+#### Component Metrics
+
+The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:
+
+- `dynamo_component_concurrent_requests`: Requests currently being processed (gauge)
+- `dynamo_component_request_bytes_total`: Total bytes received in requests (counter)
+- `dynamo_component_request_duration_seconds`: Request processing time (histogram)
+- `dynamo_component_requests_total`: Total requests processed (counter)
+- `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
+Some components expose additional metrics specific to their functionality:
+
+- `dynamo_preprocessor_*`: Metrics specific to preprocessor components
+
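The prefix convention described above makes it easy to pick one metric family out of a scrape by hand. The following is a minimal sketch, assuming the metrics arrive on stdin in the standard Prometheus text exposition format. The helper name `filter_metrics` is hypothetical and not part of Dynamo.

```bash
# Hypothetical helper (not part of Dynamo): print only the sample lines whose
# metric name starts with the given prefix, skipping "# HELP" / "# TYPE" lines.
filter_metrics() {
  prefix="$1"
  awk -v p="$prefix" '
    /^#/ { next }                 # skip comment lines
    index($0, p) == 1 { print }   # keep lines that begin with the prefix
  '
}
```

For example, the output of `curl` against any `/metrics` endpoint could be piped into `filter_metrics dynamo_component_` to see only the component metrics listed above.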
+#### Frontend Metrics
+
+When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TENSORRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name:
+- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
+- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
+- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): This file, which is being phased out, contains the Grafana dashboard configuration for LLM-specific metrics. It requires an additional `metrics` component to operate concurrently. A new version is under development.
+
 ## Getting Started

+### Prerequisites
+
 1. Make sure Docker and Docker Compose are installed on your system

-2. Start Dynamo dependencies. Assume you're at the root dynamo path:
+### Quick Start
+
+1. Start Dynamo dependencies. Assume you're at the root dynamo path:

 ```bash
 # Start the basic services (etcd & natsd), along with Prometheus and Grafana
 docker compose -f deploy/docker-compose.yml --profile metrics up -d

-# Minimum components for Dynamo: etcd/nats/dcgm-exporter
+# Minimum components for Dynamo (will not have Prometheus and Grafana): etcd/nats/dcgm-exporter
 docker compose -f deploy/docker-compose.yml up -d
 ```

@@ -48,24 +99,22 @@ As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build container
 export CUDA_VISIBLE_DEVICES=0,2
 ```

-3. Web servers started. The ones that end in /metrics are in Prometheus format:
+2. Web servers started. The ones that end in /metrics are in Prometheus format:
 4. Optionally, if you want to experiment further, look through components/metrics/README.md for more details on launching a metrics server (subscribes to nats), mock_worker (publishes to nats), and real workers.

 - Start the [components/metrics](../../components/metrics/README.md) application to begin monitoring for metric events from dynamo workers and aggregating them on a Prometheus metrics endpoint: `http://localhost:9091/metrics`.
 - Uncomment the appropriate lines in prometheus.yml to poll port 9091.
 - Start worker(s) that publishes KV Cache metrics: [lib/runtime/examples/service_metrics/README.md](../../lib/runtime/examples/service_metrics/README.md) can populate dummy KV Cache metrics.

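One way to confirm that the port 9091 lines in prometheus.yml were really uncommented is a small text check. This is only a sketch: the helper name `has_active_target` is hypothetical, and it assumes the scrape target appears on a single line containing `:9091`. The exact layout of prometheus.yml is not shown here.

```bash
# Hypothetical helper: succeed if stdin (e.g. prometheus.yml) contains an
# uncommented line that mentions the given port, such as ":9091".
has_active_target() {
  port="$1"
  # The first non-blank character must not be "#", and the port must appear
  # before any "#" on the line and must not be the prefix of a longer number.
  grep -Eq "^[[:space:]]*[^#[:space:]][^#]*:${port}([^0-9]|\$)"
}
```

A possible use is `has_active_target 9091 < prometheus.yml && echo "port 9091 is being polled"`, run from the directory that holds the file.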
+### Configuration

-## Configuration
-
-### Prometheus
+#### Prometheus

 The Prometheus configuration is specified in [prometheus.yml](./prometheus.yml). This file is set up to collect metrics from the metrics aggregation service endpoint.

@@ -77,29 +126,233 @@ After making changes to prometheus.yml, it is necessary to reload the configurat
 docker compose -f deploy/docker-compose.yml up prometheus -d --force-recreate
 ```

-### Grafana
+#### Grafana

 Grafana is pre-configured with:
 - Prometheus datasource
 - Sample dashboard for visualizing service metrics


-## Required Files
+### Troubleshooting

-The following configuration files should be present in this directory:
-- [docker-compose.yml](./docker-compose.yml): Defines the Prometheus and Grafana services
-- [grafana_dashboards/grafana-dynamo-dashboard.json](./grafana_dashboards/grafana-dynamo-dashboard.json): A general Dynamo Dashboard for both SW and HW metrics.
-- [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
-- [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): This file, which is being phased out, contains the Grafana dashboard configuration for LLM-specific metrics. It requires an additional `metrics` component to operate concurrently. A new version is under development.
+1. Verify services are running:
+```bash
+docker compose ps
+```

-## Running the deprecated `metrics` component
+2. Check logs:
+```bash
+docker compose logs prometheus
+docker compose logs grafana
+```
+
+3. For issues with the legacy metrics component (being phased out), see [components/metrics/README.md](../../components/metrics/README.md) for details on the exposed metrics and troubleshooting steps.
+
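The "verify services are running" step above can also be scripted. The following is a rough sketch: `missing_services` is a hypothetical helper that reads the names of running services from stdin, one per line, and prints each expected service that is absent. The service names `prometheus` and `grafana` are taken from the log commands above.

```bash
# Hypothetical helper: read the names of running services from stdin (one per
# line) and print every expected service that is not among them.
missing_services() {
  running="$(cat)"
  for svc in "$@"; do
    # -x: match the whole line, -F: treat the name as a fixed string
    printf '%s\n' "$running" | grep -qxF -- "$svc" || printf '%s\n' "$svc"
  done
}
```

Assuming a Docker Compose v2 release whose `ps` subcommand accepts `--services --status running`, the list could come from `docker compose -f deploy/docker-compose.yml ps --services --status running | missing_services prometheus grafana`. Empty output would mean both services are up.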
+## Developer Guide
+
+### Creating Metrics at Different Hierarchy Levels
+## Running the deprecated `components/metrics` program

 ⚠️ **DEPRECATION NOTICE** ⚠️

-When you run the example [components/metrics](../../components/metrics/README.md) component, it exposes a Prometheus /metrics endpoint with the following metrics (defined in [components/metrics/src/lib.rs](../../components/metrics/src/lib.rs)):
+When you run the example [components/metrics](../../components/metrics/README.md) program, it exposes a Prometheus /metrics endpoint with the following metrics (defined in [components/metrics/src/lib.rs](../../components/metrics/src/lib.rs)):

 **⚠️ The following `llm_kv_*` metrics are deprecated:**

@@ -123,3 +376,5 @@ When you run the example [components/metrics](../../components/metrics/README.md
 docker compose logs prometheus
 docker compose logs grafana
 ```
+
+3. For issues with the legacy metrics component (being phased out), see [components/metrics/README.md](../../components/metrics/README.md) for details on the exposed metrics and troubleshooting steps.