Update the links, more consistent naming.

keivenchang committed Aug 1, 2025
commit 758c57a25e0945158b92651a99173477f53182f3

docs/guides/metrics.md: 180 additions, 116 deletions

See the License for the specific language governing permissions and
limitations under the License.
-->

# Dynamo `MetricsRegistry`

## Overview

Dynamo provides built-in metrics capabilities through the `MetricsRegistry` trait, which is automatically available whenever you use the `DistributedRuntime` framework. This guide explains how to use metrics for observability and monitoring across all Dynamo components and services.

The `MetricsRegistry` trait is implemented by `DistributedRuntime`, `Namespace`, `Component`, and `Endpoint`, providing a hierarchical approach to metric collection that matches Dynamo's distributed architecture:

- `DistributedRuntime`: Global metrics across the entire runtime
- `Namespace`: Metrics scoped to a specific namespace
- `Component`: Metrics for a specific component within a namespace
- `Endpoint`: Metrics for individual endpoints within a component

This hierarchical structure allows you to create metrics at the appropriate level of granularity for your monitoring needs.
### Automatic Metrics

When you enable the metrics HTTP endpoint with `DYN_SYSTEM_ENABLED=true`, Dynamo automatically adds:

- `dynamo_system_uptime_seconds`: System uptime counter
- HTTP server metrics for the metrics endpoint itself
## Environment Configuration

Enable the metrics HTTP endpoint:

```bash
export DYN_SYSTEM_ENABLED=true
export DYN_SYSTEM_PORT=8081  # Use 0 for random port assignment
```

Setting `DYN_SYSTEM_PORT=0` assigns a random available port, which is useful in integration tests to avoid port conflicts.

## Creating Metrics at Different Hierarchy Levels

### Runtime-Level Metrics

```rust
use dynamo_runtime::MetricsRegistry;
use dynamo_runtime::DistributedRuntime;

let runtime = DistributedRuntime::new()?;
let namespace = runtime.namespace("my_namespace")?;
let component = namespace.component("my_component")?;
let endpoint = component.endpoint("my_endpoint")?;

// Create endpoint-level counters (this is a Prometheus Counter type)
let total_requests = endpoint.create_counter(
    "total_requests",
    "Total requests across all namespaces",
    &[]
)?;

let active_connections = endpoint.create_gauge(
    "active_connections",
    "Number of active client connections",
    &[]
)?;
```
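
The handles returned by these constructors behave like ordinary Prometheus metrics, so existing instrumentation code keeps working unchanged. A minimal usage sketch, assuming the returned handles follow the prometheus crate's `Counter` and `Gauge` API (the same pattern applies at every hierarchy level shown below):

```rust
// Assumption: these handles wrap prometheus::Counter / prometheus::Gauge.
total_requests.inc();          // counters only go up
active_connections.set(12.0);  // gauges can be set directly...
active_connections.inc();      // ...or nudged up and down
active_connections.dec();
```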

### Namespace-Level Metrics

```rust
let namespace = runtime.namespace("my_model")?;

// Namespace-scoped metrics
let model_requests = namespace.create_counter(
    "model_requests",
    "Requests for this specific model",
    &[]
)?;

let model_latency = namespace.create_histogram(
    "model_latency_seconds",
    "Model inference latency",
    &[],
    &[0.001, 0.01, 0.1, 1.0, 10.0]
)?;
```

### Component-Level Metrics

```rust
let component = namespace.component("backend")?;

// Component-specific metrics
let backend_requests = component.create_counter(
    "backend_requests",
    "Requests handled by this backend component",
    &[]
)?;

let gpu_memory_usage = component.create_gauge(
    "gpu_memory_bytes",
    "GPU memory usage in bytes",
    &[]
)?;
```

### Endpoint-Level Metrics

```rust
let endpoint = component.endpoint("generate")?;

// Endpoint-specific metrics
let generate_requests = endpoint.create_counter(
    "generate_requests",
    "Generate endpoint requests",
    &[]
)?;

let generate_latency = endpoint.create_histogram(
    "generate_latency_seconds",
    "Generate endpoint latency",
    &[],
    &[0.001, 0.01, 0.1, 1.0, 10.0]
)?;
```

## Creating Vector Metrics with Dynamic Labels

Use vector metrics when you need to track metrics with different label values:

```rust
// Counter with labels
let requests_by_model = endpoint.create_counter_vec(
    "requests_by_model",
    "Requests by model type",
    &["model_type", "model_size"]
)?;

// Increment with specific labels
requests_by_model.with_label_values(&["llama", "7b"]).inc();
requests_by_model.with_label_values(&["gpt", "13b"]).inc();

// Gauge with labels
let memory_by_gpu = component.create_gauge_vec(
    "gpu_memory_bytes",
    "GPU memory usage by device",
    &["gpu_id", "memory_type"]
)?;

memory_by_gpu.with_label_values(&["0", "allocated"]).set(8192.0);
memory_by_gpu.with_label_values(&["0", "cached"]).set(4096.0);
```
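
If a particular label combination is updated on a hot path, the label lookup can be hoisted out of the loop. A small sketch, assuming the vector types follow the prometheus crate's `CounterVec` API, where `with_label_values` returns a cheap child handle:

```rust
// Resolve the child counter once instead of hashing label values per call.
let llama_7b = requests_by_model.with_label_values(&["llama", "7b"]);
for _ in 0..1024 {
    llama_7b.inc(); // no per-increment label lookup
}
```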

## Creating Histograms

Histograms are useful for measuring distributions of values like latency:

```rust
let latency_histogram = endpoint.create_histogram(
    "request_latency_seconds",
    "Request latency distribution",
    &[],
    &[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)?;

// Record latency values
latency_histogram.observe(0.023);  // 23ms
latency_histogram.observe(0.156);  // 156ms
```
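
A common pattern is to time a piece of work and record the elapsed seconds. A minimal sketch using `std::time::Instant`, assuming the handle follows the prometheus crate's `Histogram` API (`handle_request` is a hypothetical stand-in for the work being measured):

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the work being measured.
fn handle_request() {
    std::thread::sleep(Duration::from_millis(25));
}

let start = Instant::now();
handle_request();
// The buckets above are in seconds, so record elapsed time as f64 seconds.
latency_histogram.observe(start.elapsed().as_secs_f64());
```

If the handle exposes the prometheus crate's full API, `Histogram::start_timer()` achieves the same thing and records automatically when the timer is dropped.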

## Transitioning from Plain Prometheus

If you're currently using plain Prometheus metrics, transitioning to Dynamo's `MetricsRegistry` is straightforward:

### Before (Plain Prometheus)

```rust
use prometheus::{Counter, Opts, Registry};

// Create a registry to hold metrics
let registry = Registry::new();
let counter_opts = Opts::new("my_counter", "My custom counter");
let counter = Counter::with_opts(counter_opts).unwrap();
registry.register(Box::new(counter.clone())).unwrap();

// Use the counter
counter.inc();

// To expose metrics, you'd need to set up an HTTP server manually
// and implement the /metrics endpoint yourself
```

### After (Dynamo MetricsRegistry)

```rust
let counter = endpoint.create_counter(
    "my_counter",
    "My custom counter",
    &[]
)?;

counter.inc();
```

**Note:** The metric is automatically registered when created via the endpoint's `create_counter` factory method.

**Benefits of Dynamo's approach:**

- **Automatic registration**: Metrics created via the endpoint's `create_*` factory methods are automatically registered with the system
- **Automatic labeling**: Namespace, component, and endpoint labels are attached for you
- **Consistent naming**: All metrics carry the `dynamo_` prefix
- **Built-in HTTP endpoint**: Metrics are served over HTTP when enabled with `DYN_SYSTEM_ENABLED=true`
- **Hierarchical organization**: Metrics follow the runtime, namespace, component, and endpoint hierarchy

## Prometheus Output Example

To enable metrics, launch your Dynamo service with the required environment variables:

```bash
# Launch dynamo.vllm with metrics enabled (example):
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model-path /path/to/model
```

Then query the metrics endpoint:

```bash
curl http://localhost:8081/metrics
```

You'll see output like this:

```prometheus
# HELP dynamo_my_counter My custom counter
# TYPE dynamo_my_counter counter
dynamo_my_counter{namespace="dynamo",component="backend",endpoint="generate"} 42

# HELP dynamo_system_uptime_seconds System uptime
# TYPE dynamo_system_uptime_seconds counter
dynamo_system_uptime_seconds{namespace="dynamo"} 42
```

## Monitoring and Visualization

3. **Test metrics endpoint:**
```bash
curl http://localhost:8081/metrics
```

## Advanced Features

## Related Documentation

- [Distributed Runtime Architecture](../architecture/distributed_runtime.md)
- [Dynamo Architecture Overview](../architecture/architecture.md)
- [Dynamo Flow](../architecture/dynamo_flow.md)
- [Backend Guide](backend.md)
- [Dynamo Run Guide](dynamo_run.md)
- [Performance Tuning Guides](kv_router_perf_tuning.md)