Database and Cache Metrics Quick Reference

Quick Start

View All Metrics

curl http://localhost:8080/metrics | grep -E "(db_|cache_)"

View Specific Metric

curl http://localhost:8080/metrics | grep "db_query_duration_seconds"

Available Metrics

Database Metrics

db_query_duration_seconds (Histogram)

Measures query execution time in seconds.

Labels:

operation: get, list, create, update, delete
table: models, users, nodes, model_replicas, audit_log_entries

Buckets: 0.001s, 0.002s, 0.004s, 0.008s, 0.016s, 0.032s, 0.064s, 0.128s, 0.256s, 0.512s, 1.024s, +Inf
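
The boundaries above double at each step, starting at 1 ms. As a minimal sketch, assuming the histogram uses an exponential bucket layout (this reference does not say how the buckets were configured), the list can be reproduced in Python; `exponential_buckets` is a hypothetical helper, not part of the application:

```python
def exponential_buckets(start: float, factor: float, count: int) -> list[float]:
    """Return `count` upper bounds, starting at `start` and multiplying by `factor` each step."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1, count >= 1")
    return [start * factor**i for i in range(count)]


# Finite boundaries for db_query_duration_seconds as listed above.
# The +Inf bucket is not generated here; it is always present in exposed histograms.
db_buckets = exponential_buckets(0.001, 2, 11)
print([round(bound, 4) for bound in db_buckets])
```

Calling `exponential_buckets(0.0001, 2, 11)` gives the finer layout listed for cache_operation_duration_seconds.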

Example:

db_query_duration_seconds_bucket{operation="get",table="models",le="0.001"} 45
db_query_duration_seconds_bucket{operation="get",table="models",le="0.002"} 48
db_query_duration_seconds_sum{operation="get",table="models"} 0.156
db_query_duration_seconds_count{operation="get",table="models"} 50
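
Histogram buckets are cumulative, so the sample above already says that 48 of 50 queries finished within 2 ms, and `_sum / _count` gives the lifetime average. The following Python sketch works this out from the four lines above. The parser is a deliberately simplified assumption (no escaped quotes, no timestamps) and `parse_samples` is a hypothetical helper:

```python
import re

SAMPLE = """\
db_query_duration_seconds_bucket{operation="get",table="models",le="0.001"} 45
db_query_duration_seconds_bucket{operation="get",table="models",le="0.002"} 48
db_query_duration_seconds_sum{operation="get",table="models"} 0.156
db_query_duration_seconds_count{operation="get",table="models"} 50
"""

LINE_RE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_samples(SAMPLE)
total_seconds = next(v for name, _, v in samples if name.endswith("_sum"))
query_count = next(v for name, _, v in samples if name.endswith("_count"))
within_2ms = next(
    v for name, labels, v in samples
    if name.endswith("_bucket") and labels.get("le") == "0.002"
)

mean_seconds = total_seconds / query_count  # lifetime average, not a windowed rate
share_within_2ms = within_2ms / query_count
print(f"mean: {mean_seconds * 1000:.2f} ms, within 2 ms: {share_within_2ms:.0%}")
```

These are averages since process start; the PromQL queries later in this reference use rate() to restrict the calculation to a time window.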

db_queries_total (Counter)

Total number of database queries executed.

Labels:

operation: get, list, create, update, delete
table: models, users, nodes, model_replicas, audit_log_entries
status: success, error

Example:

db_queries_total{operation="get",status="success",table="models"} 50
db_queries_total{operation="get",status="error",table="models"} 2
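
Because each status has its own series, an error ratio has to add the series together before dividing. A minimal Python sketch using the two sample values above; `error_ratio` is a hypothetical helper:

```python
def error_ratio(counts_by_status: dict[str, float]) -> float:
    """Share of queries that ended with status="error"; 0.0 when there is no traffic."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    return counts_by_status.get("error", 0.0) / total


# Values copied from the example lines for operation="get", table="models".
get_models = {"success": 50, "error": 2}
ratio = error_ratio(get_models)
print(f"error ratio: {ratio:.2%}")
```

The denominator is 52 (success plus error), not 50, which is why PromQL error-rate expressions need to aggregate away the status label.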

Cache Metrics

cache_hits_total (Counter)

Number of successful cache retrievals.

Labels:

cache_type: redis
table: models

Example:

cache_hits_total{cache_type="redis",table="models"} 35

cache_misses_total (Counter)

Number of cache lookups that found no entry (cache misses). A miss is not an error.

Labels:

cache_type: redis
table: models

Example:

cache_misses_total{cache_type="redis",table="models"} 15

cache_operation_duration_seconds (Histogram)

Measures cache operation execution time in seconds.

Labels:

operation: get, set, delete
cache_type: redis
table: models

Buckets: 0.0001s, 0.0002s, 0.0004s, 0.0008s, 0.0016s, 0.0032s, 0.0064s, 0.0128s, 0.0256s, 0.0512s, 0.1024s, +Inf

Example:

cache_operation_duration_seconds_bucket{cache_type="redis",operation="get",table="models",le="0.0008"} 48
cache_operation_duration_seconds_sum{cache_type="redis",operation="get",table="models"} 0.012
cache_operation_duration_seconds_count{cache_type="redis",operation="get",table="models"} 50

Common PromQL Queries

Query Performance

Average Query Duration

rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m])

Average Query Duration by Table

sum(rate(db_query_duration_seconds_sum[5m])) by (table) / sum(rate(db_query_duration_seconds_count[5m])) by (table)

Average Query Duration by Operation

sum(rate(db_query_duration_seconds_sum[5m])) by (operation) / sum(rate(db_query_duration_seconds_count[5m])) by (operation)

95th Percentile Query Latency

histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m]))

99th Percentile Query Latency

histogram_quantile(0.99, rate(db_query_duration_seconds_bucket[5m]))
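
histogram_quantile estimates a percentile by finding the bucket that contains the target rank and interpolating linearly inside it. The Python sketch below is a simplified illustration of that idea, not the Prometheus implementation. The counts for the 0.004 and +Inf buckets are assumed values, added to complete the db_query_duration_seconds sample shown earlier:

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs."""
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank is in +Inf: report the highest finite bound
            if count == prev_count:
                return bound  # empty first bucket with q == 0
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 45 and 48 come from the example output; 50 for le=0.004 and +Inf are assumed.
sample_buckets = [(0.001, 45), (0.002, 48), (0.004, 50), (math.inf, 50)]
p95 = histogram_quantile(0.95, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(f"p95 = {p95 * 1000:.2f} ms, p99 = {p99 * 1000:.2f} ms")
```

The estimate can never be finer than the bucket layout: with these buckets, every p99 between 2 ms and 4 ms is reported as a point on a straight line between the two bounds.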

Top 5 Slowest Operations

topk(5, histogram_quantile(0.99, sum(rate(db_query_duration_seconds_bucket[5m])) by (operation, table, le)))

Query Throughput

Queries per Second

rate(db_queries_total[5m])

Queries per Second by Table

sum(rate(db_queries_total[5m])) by (table)

Queries per Second by Operation

sum(rate(db_queries_total[5m])) by (operation)

Error Monitoring

Query Error Rate

sum(rate(db_queries_total{status="error"}[5m])) / sum(rate(db_queries_total[5m]))

Query Error Rate by Table

sum(rate(db_queries_total{status="error"}[5m])) by (table) / sum(rate(db_queries_total[5m])) by (table)

Total Errors in Last Hour

increase(db_queries_total{status="error"}[1h])

Cache Performance

Cache Hit Rate

rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))

Cache Hit Rate Percentage

(rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))) * 100

Cache Miss Rate

rate(cache_misses_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
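
The counters are cumulative, so a windowed hit rate depends on how much each counter grew between two scrapes, not on its absolute value. A minimal Python sketch of that calculation follows. The first scrape uses the example values from this reference, the second scrape is hypothetical, and `window_hit_rate` is a hypothetical helper that, unlike rate(), does not handle counter resets:

```python
def window_hit_rate(first, second):
    """Hit rate between two scrapes of the cumulative hit and miss counters."""
    hits = second["hits"] - first["hits"]
    misses = second["misses"] - first["misses"]
    if hits < 0 or misses < 0:
        raise ValueError("counter decreased (process restart?); not handled in this sketch")
    lookups = hits + misses
    return hits / lookups if lookups else None  # None: no cache traffic in the window


earlier = {"hits": 35, "misses": 15}   # values from the examples in this reference
later = {"hits": 125, "misses": 25}    # hypothetical later scrape

windowed = window_hit_rate(earlier, later)
lifetime = later["hits"] / (later["hits"] + later["misses"])
print(f"windowed: {windowed:.0%}, lifetime: {lifetime:.1%}")
```

The windowed figure reflects only recent traffic, which is why the two numbers differ and why dashboards and alerts are normally built on rate() rather than on raw counter values.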

Average Cache Operation Latency

rate(cache_operation_duration_seconds_sum[5m]) / rate(cache_operation_duration_seconds_count[5m])

Average Cache Latency by Operation

sum(rate(cache_operation_duration_seconds_sum[5m])) by (operation) / sum(rate(cache_operation_duration_seconds_count[5m])) by (operation)

Cache Operations per Second

sum(rate(cache_operation_duration_seconds_count[5m])) by (operation)

Grafana Panel Examples

Query Duration Heatmap

sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)

Visualization: Heatmap
X-axis: Time
Y-axis: Latency buckets

Cache Hit Rate Gauge

(sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))) * 100

Visualization: Gauge
Min: 0
Max: 100
Thresholds: Red < 70%, Yellow < 85%, Green >= 85%

Query Throughput Graph

sum(rate(db_queries_total[5m])) by (operation)

Visualization: Time series graph
Legend: {{operation}}

Top Slow Queries Table

topk(10, histogram_quantile(0.99, sum(rate(db_query_duration_seconds_bucket[5m])) by (operation, table, le)))

Visualization: Table
Columns: Operation, Table, P99 Latency

Error Rate Graph

rate(db_queries_total{status="error"}[5m])

Visualization: Time series graph
Y-axis: Errors per second

Alert Rule Examples

High Query Error Rate

- alert: HighDatabaseErrorRate
  expr: sum(rate(db_queries_total{status="error"}[5m])) by (table) / sum(rate(db_queries_total[5m])) by (table) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database error rate above 5%"
    description: "{{ $labels.table }} has {{ $value | humanizePercentage }} error rate"

Slow Database Queries

- alert: SlowDatabaseQueries
  expr: histogram_quantile(0.95, rate(db_query_duration_seconds_bucket[5m])) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile query latency above 1s"
    description: "{{ $labels.operation }} on {{ $labels.table }} is slow: {{ $value }}s"

Very Slow Database Queries

- alert: VerySlowDatabaseQueries
  expr: histogram_quantile(0.99, rate(db_query_duration_seconds_bucket[5m])) > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "99th percentile query latency above 5s"
    description: "{{ $labels.operation }} on {{ $labels.table }} is very slow: {{ $value }}s"

Low Cache Hit Rate

- alert: LowCacheHitRate
  expr: (rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))) < 0.7
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Cache hit rate below 70%"
    description: "{{ $labels.table }} cache hit rate is {{ $value | humanizePercentage }}"

Very Low Cache Hit Rate

- alert: VeryLowCacheHitRate
  expr: (rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))) < 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Cache hit rate below 50%"
    description: "{{ $labels.table }} cache is ineffective: {{ $value | humanizePercentage }}"
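
The two cache rules use strict comparisons, so a hit rate of exactly 70% does not trigger the warning. The Python sketch below maps a hit rate to the most severe matching level to make the boundaries explicit. `cache_hit_rate_severity` is a hypothetical helper; it ignores the `for:` durations, and in Prometheus both rules would fire together below 50% unless one is inhibited:

```python
from typing import Optional

WARNING_BELOW = 0.7   # threshold from LowCacheHitRate
CRITICAL_BELOW = 0.5  # threshold from VeryLowCacheHitRate


def cache_hit_rate_severity(hit_rate: float) -> Optional[str]:
    """Return the most severe level whose threshold the hit rate falls below."""
    if hit_rate < CRITICAL_BELOW:
        return "critical"
    if hit_rate < WARNING_BELOW:
        return "warning"
    return None  # healthy: neither rule matches


# 0.70 is the lifetime ratio of the 35 hits / 15 misses example in this reference.
for rate in (0.70, 0.69, 0.50, 0.49):
    print(rate, cache_hit_rate_severity(rate))
```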

High Query Volume

- alert: HighQueryVolume
  expr: sum(rate(db_queries_total[5m])) > 1000
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Query volume above 1000 qps"
    description: "Current rate: {{ $value }} queries/second"

Dashboard Layout Suggestion

Row 1: Overview

Total QPS (Stat panel)
Average Latency (Stat panel)
Error Rate (Stat panel)
Cache Hit Rate (Gauge)

Row 2: Query Performance

Query Latency Heatmap (Heatmap)
P95/P99 Latency (Time series)

Row 3: Throughput

Queries by Operation (Time series)
Queries by Table (Time series)

Row 4: Cache Performance

Cache Hit/Miss Rate (Time series)
Cache Operation Latency (Time series)

Row 5: Errors

Error Rate by Table (Time series)
Top Errors (Table)

Common Analysis Scenarios

1. Identify Slow Queries

# Find operations with P99 > 100ms
histogram_quantile(0.99, rate(db_query_duration_seconds_bucket[5m])) > 0.1

2. Find Most Frequent Operations

# Top 5 operation/table pairs by query rate
topk(5, sum(rate(db_queries_total[5m])) by (operation, table))

3. Analyze Cache Effectiveness

# Cache hit rate for each table
sum(rate(cache_hits_total[5m])) by (table) / (sum(rate(cache_hits_total[5m])) by (table) + sum(rate(cache_misses_total[5m])) by (table))

4. Detect Performance Degradation

# Ratio of current average latency to the average 1 hour ago (> 1 means slower now)
(rate(db_query_duration_seconds_sum[5m]) / rate(db_query_duration_seconds_count[5m]))
/
(rate(db_query_duration_seconds_sum[5m] offset 1h) / rate(db_query_duration_seconds_count[5m] offset 1h))
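
The comparison is a ratio of two averages, so each average has to be computed before the outer division. A small Python sketch with hypothetical rates shows the intended arithmetic next to a plain left-to-right division of the same four terms; all numbers and helper names are assumptions for illustration:

```python
def mean_latency(sum_rate: float, count_rate: float) -> float:
    """Average seconds per query: rate of _sum divided by rate of _count."""
    return sum_rate / count_rate


def degradation_ratio(current: tuple[float, float], baseline: tuple[float, float]) -> float:
    """Current average latency divided by the baseline average (> 1 means slower)."""
    return mean_latency(*current) / mean_latency(*baseline)


# Hypothetical per-second rates as (rate of _sum, rate of _count).
current = (0.30, 100.0)   # 3 ms average now
baseline = (0.20, 100.0)  # 2 ms average one hour ago

ratio = degradation_ratio(current, baseline)
chained = 0.30 / 100.0 / 0.20 / 100.0  # the four terms divided left to right
print(f"ratio of averages: {ratio:.2f}, chained division: {chained:.5f}")
```

Division associates left to right in PromQL as well, so grouping each average in parentheses is what makes the result a dimensionless slowdown factor.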

5. Calculate Total Database Time

# Total seconds spent in database queries per second
sum(rate(db_query_duration_seconds_sum[5m]))

Exporting Metrics

Prometheus Configuration

scrape_configs:
  - job_name: 'ollamamax'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

cURL Export

# Export all metrics
curl -s http://localhost:8080/metrics > metrics.txt

# Export database metrics only
curl -s http://localhost:8080/metrics | grep "^db_" > db_metrics.txt

# Export cache metrics only
curl -s http://localhost:8080/metrics | grep "^cache_" > cache_metrics.txt
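
Where curl and grep are not available, the same export can be sketched with the Python standard library. The helper names below are hypothetical, and the block only exercises the filter on inline sample lines; `export_metrics` is defined but not called, since it needs the server from this reference to be running:

```python
from urllib.request import urlopen

METRICS_URL = "http://localhost:8080/metrics"  # endpoint used throughout this reference


def filter_metrics(text: str, prefix: str) -> list[str]:
    """Keep lines that start with `prefix`, like grep "^prefix"."""
    return [line for line in text.splitlines() if line.startswith(prefix)]


def export_metrics(path: str, prefix: str = "", url: str = METRICS_URL) -> int:
    """Fetch the endpoint, write matching lines to `path`, and return the line count."""
    with urlopen(url, timeout=5) as response:
        body = response.read().decode("utf-8")
    lines = filter_metrics(body, prefix)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)


# Offline check of the filter using lines taken from the examples in this reference.
sample = "\n".join([
    "# TYPE db_queries_total counter",
    'db_queries_total{operation="get",status="success",table="models"} 50',
    'cache_hits_total{cache_type="redis",table="models"} 35',
])
print(filter_metrics(sample, "db_"))
```

With the server running, `export_metrics("db_metrics.txt", prefix="db_")` would mirror the second curl command above.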

Troubleshooting

No Metrics Appearing

Check server is running: curl http://localhost:8080/health
Verify metrics endpoint: curl http://localhost:8080/metrics
Check logs for errors

Metrics Not Updating

Trigger database operations
Wait for the next scrape (15s in the configuration above; Prometheus defaults to 1m when scrape_interval is not set)
Check Prometheus targets page

High Latency Values

Check database connection pool
Review query plans
Analyze cache hit rates
Check for lock contention

Low Cache Hit Rate

Review cache TTL settings
Check Redis memory usage
Analyze access patterns
Consider increasing cache size

Best Practices

Set appropriate scrape intervals: 15-30s for production
Use recording rules: Pre-aggregate expensive queries
Set retention policies: Balance storage vs. historical data
Create dashboards: Visualize key metrics
Configure alerts: Proactive monitoring
Regular reviews: Weekly metric analysis
Document baselines: Know normal behavior

Support

For issues or questions:

Check Prometheus documentation: https://prometheus.io/docs/
Review Grafana guides: https://grafana.com/docs/
Check application logs

Version

Implementation Date: 2025-10-27
Metrics Version: 1.0
Compatible Prometheus Version: 2.0+
