
feat: surface circuit-breaker state and per-domain cooldown counts as Prometheus metrics#513

Merged
oten91 merged 6 commits into main from feat/circuit-breaker-and-cooldown-metrics
May 6, 2026

Conversation

Contributor

oten91 commented May 1, 2026

Summary

Two metrics that were previously only observable via /ready introspection or direct Redis access are now first-class Prometheus gauges:

  • path_circuit_breaker_state{service_id, domain} — 1 if domain is currently locked out, 0 otherwise
  • path_endpoints_in_cooldown{domain, rpc_type, service_id} — count of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during PR #512 work — operators could see broken-domain effects in error-rate graphs but couldn't directly answer "which domains are circuit-broken right now?" without Redis access.

What this enables on dashboards

Once Grafana picks these up:

  • "Currently Broken Domains" table — list-of-domains view with service_id + domain + state, sortable by service.
  • "Broken Domains" stat tile — single number for at-a-glance health (0 = healthy, >5 = widespread infra issues).
  • "Cooldown" column on the Supplier Quality table — operators can see "how many of my endpoints are in cooldown right now?" alongside RPS, success%, etc.

These dashboard panels are not in this PR (kept to the metric-only diff for review clarity); they'll go in a small follow-up commit that updates local/observability/dashboards/*.json once this metric is deployed.

Implementation notes

path_circuit_breaker_state

State transitions in gateway/domain_circuit_breaker.go:

Trigger → gauge value:

  • MarkBroken → 1
  • ClearService → 0 (for every cleared domain)
  • refreshLocal finds expired entries → 0 (for each expired domain)
  • refreshFromRedis finds expired entries → 0 (for each), and re-asserts 1 for currently-broken domains so fresh pods that lazily pick up Redis state stay consistent without going through MarkBroken locally

All gauge sets happen outside cb.mu to avoid taking the metrics lock under the cache mutex.

There is a small inherent staleness window: a circuit-breaker entry whose TTL just expired remains at gauge=1 until the cache TTL elapses (5s default) and refreshLocal runs. That's bounded and fine for a dashboard.
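For concreteness, here is a minimal sketch of the gauge wiring described above, assuming a simple in-memory cache keyed by service and domain; the brokenEntry type, map layout, and refreshLocal signature are illustrative rather than the PR's actual code, and gauge registration is omitted:

```go
package gateway

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: the real DomainCircuitBreaker also carries Redis state, TTL
// configuration, and the refresh loop; only the gauge wiring is shown here.
var circuitBreakerState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "path_circuit_breaker_state",
		Help: "1 if the domain is currently locked out for the service, 0 otherwise.",
	},
	[]string{"service_id", "domain"},
)

type brokenEntry struct {
	serviceID string
	domain    string
	expiresAt time.Time
}

type DomainCircuitBreaker struct {
	mu     sync.Mutex
	broken map[string]brokenEntry // keyed by serviceID + "/" + domain
}

func NewDomainCircuitBreaker() *DomainCircuitBreaker {
	return &DomainCircuitBreaker{broken: make(map[string]brokenEntry)}
}

// MarkBroken records the lockout under cb.mu, then sets the gauge after
// releasing the lock so the metrics mutex is never taken while the cache
// mutex is held.
func (cb *DomainCircuitBreaker) MarkBroken(serviceID, domain string, ttl time.Duration) {
	cb.mu.Lock()
	cb.broken[serviceID+"/"+domain] = brokenEntry{serviceID, domain, time.Now().Add(ttl)}
	cb.mu.Unlock()

	circuitBreakerState.WithLabelValues(serviceID, domain).Set(1)
}

// refreshLocal collects expired entries under the lock, then zeroes their
// gauges outside it. A just-expired entry therefore stays at 1 until the
// next refresh tick, which is the bounded staleness window noted above.
func (cb *DomainCircuitBreaker) refreshLocal(now time.Time) {
	var expired []brokenEntry
	cb.mu.Lock()
	for key, e := range cb.broken {
		if now.After(e.expiresAt) {
			expired = append(expired, e)
			delete(cb.broken, key)
		}
	}
	cb.mu.Unlock()

	for _, e := range expired {
		circuitBreakerState.WithLabelValues(e.serviceID, e.domain).Set(0)
	}
}
```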

path_endpoints_in_cooldown

A new LeaderboardDataProvider.GetCooldownCountData(ctx) method, implemented on Shannon's Protocol, walks active sessions, fetches each endpoint's reputation score, and increments the per-(domain, service_id, rpc_type) count whenever score.IsInCooldown() returns true.

The gauge is published every 10s alongside the existing leaderboard / mean score / supplier score metrics, and is reset between snapshots so a domain dropping to zero endpoints in cooldown shows zero (rather than sticking at its last value through Prometheus' 5-minute staleness window).
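A rough sketch of the provider and publisher shape, under stated assumptions: the interfaces and free-function form stand in for Shannon's actual Protocol, session, and score types, and only the grouping by (domain, service_id, rpc_type), the IsInCooldown check, and the reset-before-publish behavior come from the description above:

```go
package metrics

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Stand-in interfaces; Shannon's real types are richer, and the cooldown
// check would go through the fetched reputation score rather than the
// endpoint itself.
type endpoint interface {
	Domain() string
	RPCType() string
	IsInCooldown() bool
}

type session interface {
	ServiceID() string
	Endpoints() []endpoint
}

type cooldownKey struct {
	domain    string
	serviceID string
	rpcType   string
}

// getCooldownCountData counts endpoints currently in strike cooldown,
// grouped by (domain, service_id, rpc_type).
func getCooldownCountData(ctx context.Context, sessions []session) map[cooldownKey]int {
	counts := make(map[cooldownKey]int)
	for _, s := range sessions {
		for _, ep := range s.Endpoints() {
			if ep.IsInCooldown() {
				counts[cooldownKey{ep.Domain(), s.ServiceID(), ep.RPCType()}]++
			}
		}
	}
	return counts
}

// publishCooldownCounts runs on the 10s publisher tick. Reset comes first so
// label combinations that dropped to zero read zero on the next scrape
// instead of lingering at their last value.
func publishCooldownCounts(gauge *prometheus.GaugeVec, counts map[cooldownKey]int) {
	gauge.Reset()
	for k, n := range counts {
		gauge.WithLabelValues(k.domain, k.rpcType, k.serviceID).Set(float64(n))
	}
}
```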

Test plan

  • go test ./... — all green
  • go vet ./... — clean
  • New test: TestDomainCircuitBreaker_MetricGaugeTransitions covers MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
  • Canary deploy: verify path_circuit_breaker_state and path_endpoints_in_cooldown series appear in Prometheus
  • Trigger a circuit break (or use /admin/circuit-breaker/clear/{serviceId} to test the clear path) and confirm the gauge transitions
  • Verify staleness behavior: entry expires → gauge drops to 0 within ~5s

Cardinality

Both new metrics are bounded by services × domains — already low cardinality (~50-200 unique domains × ~80 services on mainnet). No risk of cardinality explosion.

🤖 Generated with Claude Code

… Prometheus metrics

Two metrics that were previously only observable via /ready introspection or
direct Redis access are now first-class gauges:

  path_circuit_breaker_state{service_id, domain}
    1 = domain currently locked out, 0 = healthy/recovered.
    Set on MarkBroken; dropped to 0 on ClearService and on TTL-expiry refresh.
    Both refreshLocal and refreshFromRedis now drop the gauge for expired
    entries; refreshFromRedis additionally re-asserts gauge=1 for currently-
    broken domains so a fresh pod that lazily picks up Redis state stays
    consistent without going through MarkBroken locally.

  path_endpoints_in_cooldown{domain, rpc_type, service_id}
    Per-domain count of endpoints currently in strike cooldown
    (Score.IsInCooldown() == true). Cooldown is a transient state imposed
    by accumulated critical strikes — independent from "score below
    threshold" which is already covered by path_reputation_endpoint_leaderboard
    with tier_threshold="0".

  Published every 10s via the leaderboard publisher; Reset between snapshots
  so a domain dropping to zero cooldown'd endpoints actually shows zero
  instead of sticking at its last value via Prometheus' staleness window.

New LeaderboardDataProvider method GetCooldownCountData implemented on
Shannon's Protocol.

Test coverage in gateway/domain_circuit_breaker_test.go:
  - MarkBroken → gauge=1
  - ClearService → gauge=0
  - TTL expiry + refresh → gauge=0

Closes the metric gap operators have been hitting when asking "which
domains are circuit-broken right now?" — answer used to require Redis
access; now it's a Prometheus query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
oten91 force-pushed the feat/circuit-breaker-and-cooldown-metrics branch from 3aba77d to bfe2477 on May 1, 2026 13:56
Pairs with the existing path_circuit_breaker_state gauge (current 0/1 view)
to give operators decomposed visibility into WHY domains are breaking.

New metric:
  path_circuit_breaker_events_total{service_id, domain, reason_category, event}
    Counter incremented on each broken/recovered transition.
    event ∈ {"broken", "recovered"}
    reason_category is a bounded prefix bucket extracted from the free-text
    reason passed to MarkBroken (which contains response snippets and error
    messages — too high cardinality for direct labelling). Categories:
      retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown

Wiring:
  - MarkBroken: emit "broken" event with classified reason
  - refreshLocal / refreshFromRedis: emit "recovered" for entries dropping out
    of the local cache (TTL-driven natural recovery)
  - ClearService: emit "recovered" for admin-cleared domains
  - refreshFromRedis re-asserts gauge=1 for currently-broken (idempotent state
    resync) but deliberately does NOT emit a "broken" event — those weren't
    new transitions, just metric resyncs of state already counted at the
    originating MarkBroken call site.

Helper classifyCircuitBreakReason covers all current MarkBroken call sites in
gateway/http_request_context_handle_request.go. Order-sensitive prefix matching
(parallel_retry checked before retry).
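A sketch of what the classifier could look like; the literal prefixes carried by each reason string are assumptions, and only the six category names and the parallel_retry-before-retry ordering constraint come from this commit:

```go
package gateway

import "strings"

// classifyCircuitBreakReason buckets the free-text MarkBroken reason into a
// bounded label value. Ordering matters: "parallel_retry" must be tested
// before "retry", or it would be swallowed by the shorter prefix.
func classifyCircuitBreakReason(reason string) string {
	ordered := []struct {
		prefix   string
		category string
	}{
		{"parallel_retry", "parallel_retry"},
		{"batch_transport", "batch_transport"},
		{"batch_heuristic", "batch_heuristic"},
		{"retry", "retry"},
		{"heuristic", "heuristic"},
	}
	for _, p := range ordered {
		if strings.HasPrefix(reason, p.prefix) {
			return p.category
		}
	}
	return "unknown"
}
```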

Tests:
  - TestClassifyCircuitBreakReason — exhaustive prefix mapping
  - TestCircuitBreakerEventsCounter — verifies broken+recovered increments
    on MarkBroken → ClearService transitions, with correct reason_category
  - Existing TestDomainCircuitBreaker_MetricGaugeTransitions continues to
    pass (gauge wiring unchanged)

Cardinality: services × domains × ~6 reasons × 2 events ≈ 50K series upper
bound across all pods. Safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

oten91 commented May 1, 2026

Update: added break/recovery reason visibility

Building on the gauge-only first commit, this commit (06f3767) adds:

New metric path_circuit_breaker_events_total{service_id, domain, reason_category, event}

  • Counter incremented on each broken/recovered transition
  • reason_category is a bounded bucket: retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown
  • Extracted via classifyCircuitBreakReason from the free-text reason that already flows into MarkBroken (no new logging or call-site changes — same reasons, just bucketed for label use)

Wiring

  • MarkBroken → emits broken event with classified reason
  • refreshLocal / refreshFromRedis → emits recovered for natural TTL-driven recovery
  • ClearService → emits recovered for admin-cleared transitions
  • refreshFromRedis resync path deliberately does NOT emit broken events (those would be double-counted with the originating pod's MarkBroken call)

Tests

  • TestClassifyCircuitBreakReason — prefix-mapping correctness, order-sensitive
  • TestCircuitBreakerEventsCounter — verifies counter increments on MarkBroken → ClearService transitions
  • All existing tests still pass

Dashboard panels added (local files only, not in this PR)

  1. "Circuit-Breaker Breaks per Second (by Reason)" — stacked timeseries showing rate of breaks decomposed by reason_category. Spike in batch_transport = upstream connection issues; spike in heuristic = bad response patterns from suppliers.
  2. "Top Broken Domains (last 1h, by reason)"topk(15) table over increase(events_total[1h]) for diagnostic attribution.

Cardinality ≈ services × domains × 6 reasons × 2 events ≈ 50K series upper bound. Bounded.

The promotion checklist from earlier still applies; this just adds the events counter alongside the gauge. Same risk profile, and the same canary soak gives equivalent coverage. Tests are green, vet is clean.

oten91 and others added 4 commits May 5, 2026 16:08
…n REST

Before: any JSON-RPC envelope received in response to a REST-shaped request
was classified as `rest_protocol_mismatch` with confidence 0.95 and routed
through `isDeceptiveResponsePattern` → CRITICAL signal → strike accumulation.

That is correct for canned successes like `{"jsonrpc":"2.0","result":[]}`
returned regardless of request shape, but it falsely punishes operators
whose backends only speak JSON-RPC: when PATH routes a REST request to
them they correctly reply with their native error format (e.g. -32601
Method not found). That is a capability mismatch, not gaming, yet the
heuristic treats both identically.

Operationally this surfaced as repeated 5-minute cooldowns across ~16
services on operators with JSON-RPC-only nodes — five honest-error events
per session was enough to cross the critical-strike threshold even though
their endpoints were otherwise healthy.

Fix: split the detection into two reasons:
- `rest_protocol_mismatch` — has `result` field (and not `result:null`):
  canned success, still deceptive, still triggers a critical signal.
- `rest_protocol_mismatch_error` — has `error` only (or `result:null`
  alongside `error`, the Geth/Bor/Erigon spec quirk): honest capability
  mismatch, NOT in the deceptive-pattern list, routed to major signal
  (no strike accumulation, no cooldown).

Both still ShouldRetry against a REST-capable peer.
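A hedged sketch of the result/error split, with illustrative names; the treatment of an envelope carrying neither field is a guess, not something this commit specifies:

```go
package gateway

import "encoding/json"

// jsonRPCEnvelope models just the fields the split cares about.
type jsonRPCEnvelope struct {
	JSONRPC string          `json:"jsonrpc"`
	Result  json.RawMessage `json:"result"`
	Error   json.RawMessage `json:"error"`
}

// classifyRESTMismatch decides which reason a JSON-RPC envelope returned for
// a REST-shaped request should carry, and whether it counts as deceptive
// (critical signal) or an honest capability mismatch (major signal).
func classifyRESTMismatch(body []byte) (reason string, deceptive bool) {
	var env jsonRPCEnvelope
	if err := json.Unmarshal(body, &env); err != nil || env.JSONRPC == "" {
		return "", false // not a JSON-RPC envelope; out of scope here
	}

	hasResult := len(env.Result) > 0 && string(env.Result) != "null"
	hasError := len(env.Error) > 0 && string(env.Error) != "null"

	switch {
	case hasResult:
		// Canned success regardless of request shape: still deceptive.
		return "rest_protocol_mismatch", true
	case hasError:
		// Honest error (e.g. -32601 Method not found), including the
		// result:null-alongside-error quirk: major signal, no strikes.
		return "rest_protocol_mismatch_error", false
	default:
		// Envelope with neither field: treated as the deceptive variant
		// here purely for illustration.
		return "rest_protocol_mismatch", true
	}
}
```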

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The strike threshold (5 critical errors → cooldown) was paired with a
1-strike-per-success decay, which is harsh for high-traffic suppliers:
their absolute critical-error count grows with volume, and a transient
burst (e.g. a ~1-minute 5xx wave on otherwise healthy nodes) can push a
99%-success endpoint over the threshold because each intervening success
only erases a single strike. Sequence matters more than average error
rate under that rule, so a few-second burst at high error rate during
otherwise-clean traffic still produces a 5–60 minute cooldown.

Bump the per-success decay from 1 to 3:
- Strikes only persist when error rate exceeds ~25%, so genuinely failing
  endpoints still accumulate and trip the threshold.
- Burst tolerance roughly doubles: a sustained run of failures still
  trips the cooldown, but transient flickers wash out.
- Deceptive suppliers that pass some requests still cannot instantly
  wipe their strike history with a single success — recovery still
  requires a sustained success rate.

The detection side is unchanged; only the recovery curve is gentler.
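A minimal sketch of the accumulate/decay bookkeeping, under stated assumptions: the counter type and method names are invented, and only the threshold of 5 and the per-success decay bump from 1 to 3 come from this commit:

```go
package reputation

const (
	strikeThreshold = 5 // critical strikes before the endpoint enters cooldown
	decayPerSuccess = 3 // strikes erased by each successful request (previously 1)
)

type strikeCounter struct {
	strikes int
}

// onCriticalError is called when a critical signal is recorded.
func (s *strikeCounter) onCriticalError() {
	s.strikes++
}

// onSuccess decays the strike count, clamped at zero. With a decay of 3, a
// transient burst interleaved with healthy traffic washes out quickly, while
// an endpoint failing continuously still reaches the threshold after five
// consecutive critical errors.
func (s *strikeCounter) onSuccess() {
	s.strikes -= decayPerSuccess
	if s.strikes < 0 {
		s.strikes = 0
	}
}

func (s *strikeCounter) inCooldown() bool {
	return s.strikes >= strikeThreshold
}
```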

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier commit split rest_protocol_mismatch into a separate honest-error
reason and removed it from the deceptive-pattern list, so the strike
system stops accumulating critical strikes for it. But the domain-level
circuit breaker lives on a separate path: shouldCircuitBreak only
exempts responses whose MatchedPattern is "capability_limitation", and
the new reason had an empty MatchedPattern, so the circuit breaker still
broke the domain on every retry.

Canary observation after the previous fix: rm01.kalorius.tech still
getting circuit-broken on tron with
reason="...rest_protocol_mismatch_error..." despite the strike-system
change.

Tag MatchedPattern="capability_limitation" on the new reason. Mirrors
how non_json_capability_limitation (Tron lite fullnodes returning plain
text "API closed") is already handled — both branches of the existing
guard then exempt it: reputation penalty skipped, circuit breaker skipped,
ShouldRetry preserved so the request still rolls to a REST-capable peer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary soak after the previous commit showed 5/33 honest-error events
still triggering circuit breaks. Cause: shouldCircuitBreak has two code
paths — one with the structured heuristicResult, one with only the
wrapped lastErr string. The hedge_failed retry path goes through the
second; the heuristicResult is dropped during error propagation, and
the lastErr fallback only matches archival/over-serviced patterns by
substring. The new rest_protocol_mismatch_error reason had no entry in
that substring list, so it fell through to circuit-break.

Add "rest_protocol_mismatch_error" to capabilityLimitationSubstrings.
Belt-and-braces with the previous MatchedPattern tag: structured-result
path catches it via the matched pattern, hedge_failed path catches it
via the substring. Substring is the literal reason name and includes
the "_error" suffix, so it cannot match the gaming variant
"rest_protocol_mismatch" (which still must circuit-break).

Strike-decay change is meanwhile working as intended — zero new
cooldowns observed in a 30-min soak window post-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
oten91 merged commit 64f1fa9 into main on May 6, 2026
10 of 13 checks passed
oten91 deleted the feat/circuit-breaker-and-cooldown-metrics branch on May 6, 2026 20:01