
feat: surface circuit-breaker state and per-domain cooldown counts as Prometheus metrics#513

Merged
oten91 merged 6 commits into main from feat/circuit-breaker-and-cooldown-metrics
May 6, 2026

Conversation

Contributor

oten91 commented May 1, 2026

Summary

Two metrics that were previously only observable via /ready introspection or direct Redis access are now first-class Prometheus gauges:

  • path_circuit_breaker_state{service_id, domain} — 1 if domain is currently locked out, 0 otherwise
  • path_endpoints_in_cooldown{domain, rpc_type, service_id} — count of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during PR #512 work — operators could see broken-domain effects in error-rate graphs but couldn't directly answer "which domains are circuit-broken right now?" without Redis access.

What this enables on dashboards

Once Grafana picks these up:

  • "Currently Broken Domains" table — list-of-domains view with service_id + domain + state, sortable by service.
  • "Broken Domains" stat tile — single number for at-a-glance health (0 = healthy, >5 = widespread infra issues).
  • "Cooldown" column on the Supplier Quality table — operators can see "how many of my endpoints are in cooldown right now?" alongside RPS, success%, etc.

These dashboard panels are not in this PR (kept to the metric-only diff for review clarity); they'll go in a small follow-up commit that updates local/observability/dashboards/*.json once this metric is deployed.

Implementation notes

path_circuit_breaker_state

State transitions in gateway/domain_circuit_breaker.go:

Trigger → gauge value:

  • MarkBroken → 1
  • ClearService → 0 (for every cleared domain)
  • refreshLocal finds expired entries → 0 (for each expired domain)
  • refreshFromRedis finds expired entries → 0 (for each), and re-asserts 1 for currently-broken domains so fresh pods that lazily pick up Redis state stay consistent without going through MarkBroken locally

All gauge sets happen outside cb.mu to avoid taking the metrics lock under the cache mutex.

There is a small inherent staleness window: a circuit-breaker entry whose TTL just expired remains at gauge=1 until the cache TTL elapses (5s default) and refreshLocal runs. That's bounded and fine for a dashboard.
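For concreteness, here is a minimal sketch of the gauge wiring described above, assuming a simple in-memory cache keyed by service and domain; the brokenEntry type, map layout, and refreshLocal signature are illustrative rather than the PR's actual code, and gauge registration is omitted:

```go
package gateway

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: the real DomainCircuitBreaker also carries Redis state, TTL
// configuration, and the refresh loop; only the gauge wiring is shown here.
var circuitBreakerState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "path_circuit_breaker_state",
		Help: "1 if the domain is currently locked out for the service, 0 otherwise.",
	},
	[]string{"service_id", "domain"},
)

type brokenEntry struct {
	serviceID string
	domain    string
	expiresAt time.Time
}

type DomainCircuitBreaker struct {
	mu     sync.Mutex
	broken map[string]brokenEntry // keyed by serviceID + "/" + domain
}

func NewDomainCircuitBreaker() *DomainCircuitBreaker {
	return &DomainCircuitBreaker{broken: make(map[string]brokenEntry)}
}

// MarkBroken records the lockout under cb.mu, then sets the gauge after
// releasing the lock so the metrics mutex is never taken while the cache
// mutex is held.
func (cb *DomainCircuitBreaker) MarkBroken(serviceID, domain string, ttl time.Duration) {
	cb.mu.Lock()
	cb.broken[serviceID+"/"+domain] = brokenEntry{serviceID, domain, time.Now().Add(ttl)}
	cb.mu.Unlock()

	circuitBreakerState.WithLabelValues(serviceID, domain).Set(1)
}

// refreshLocal collects expired entries under the lock, then zeroes their
// gauges outside it. A just-expired entry therefore stays at 1 until the
// next refresh tick, which is the bounded staleness window noted above.
func (cb *DomainCircuitBreaker) refreshLocal(now time.Time) {
	var expired []brokenEntry
	cb.mu.Lock()
	for key, e := range cb.broken {
		if now.After(e.expiresAt) {
			expired = append(expired, e)
			delete(cb.broken, key)
		}
	}
	cb.mu.Unlock()

	for _, e := range expired {
		circuitBreakerState.WithLabelValues(e.serviceID, e.domain).Set(0)
	}
}
```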

path_endpoints_in_cooldown

A new LeaderboardDataProvider.GetCooldownCountData(ctx) method, implemented on Shannon's Protocol, walks active sessions, fetches each endpoint's reputation score, and increments the per-(domain, service_id, rpc_type) count whenever score.IsInCooldown() returns true.

The gauge is published every 10s alongside the existing leaderboard / mean score / supplier score metrics, and is reset between snapshots so a domain dropping to zero endpoints in cooldown shows zero (rather than sticking at its last value through Prometheus' 5-minute staleness window).
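A rough sketch of the provider and publisher shape, under stated assumptions: the interfaces and free-function form stand in for Shannon's actual Protocol, session, and score types, and only the grouping by (domain, service_id, rpc_type), the IsInCooldown check, and the reset-before-publish behavior come from the description above:

```go
package metrics

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Stand-in interfaces; Shannon's real types are richer, and the cooldown
// check would go through the fetched reputation score rather than the
// endpoint itself.
type endpoint interface {
	Domain() string
	RPCType() string
	IsInCooldown() bool
}

type session interface {
	ServiceID() string
	Endpoints() []endpoint
}

type cooldownKey struct {
	domain    string
	serviceID string
	rpcType   string
}

// getCooldownCountData counts endpoints currently in strike cooldown,
// grouped by (domain, service_id, rpc_type).
func getCooldownCountData(ctx context.Context, sessions []session) map[cooldownKey]int {
	counts := make(map[cooldownKey]int)
	for _, s := range sessions {
		for _, ep := range s.Endpoints() {
			if ep.IsInCooldown() {
				counts[cooldownKey{ep.Domain(), s.ServiceID(), ep.RPCType()}]++
			}
		}
	}
	return counts
}

// publishCooldownCounts runs on the 10s publisher tick. Reset comes first so
// label combinations that dropped to zero read zero on the next scrape
// instead of lingering at their last value.
func publishCooldownCounts(gauge *prometheus.GaugeVec, counts map[cooldownKey]int) {
	gauge.Reset()
	for k, n := range counts {
		gauge.WithLabelValues(k.domain, k.rpcType, k.serviceID).Set(float64(n))
	}
}
```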

Test plan

  • go test ./... — all green
  • go vet ./... — clean
  • New test: TestDomainCircuitBreaker_MetricGaugeTransitions covers MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
  • Canary deploy: verify path_circuit_breaker_state and path_endpoints_in_cooldown series appear in Prometheus
  • Trigger a circuit break (or use /admin/circuit-breaker/clear/{serviceId} to test the clear path) and confirm the gauge transitions
  • Verify staleness behavior: entry expires → gauge drops to 0 within ~5s

Cardinality

Both new metrics are bounded by services × domains — already low cardinality (~50-200 unique domains × ~80 services on mainnet). No risk of cardinality explosion.

🤖 Generated with Claude Code

… Prometheus metrics

Two metrics that were previously only observable via /ready introspection or
direct Redis access are now first-class gauges:

  path_circuit_breaker_state{service_id, domain}
    1 = domain currently locked out, 0 = healthy/recovered.
    Set on MarkBroken; dropped to 0 on ClearService and on TTL-expiry refresh.
    Both refreshLocal and refreshFromRedis now drop the gauge for expired
    entries; refreshFromRedis additionally re-asserts gauge=1 for currently-
    broken domains so a fresh pod that lazily picks up Redis state stays
    consistent without going through MarkBroken locally.

  path_endpoints_in_cooldown{domain, rpc_type, service_id}
    Per-domain count of endpoints currently in strike cooldown
    (Score.IsInCooldown() == true). Cooldown is a transient state imposed
    by accumulated critical strikes — independent from "score below
    threshold" which is already covered by path_reputation_endpoint_leaderboard
    with tier_threshold="0".

  Published every 10s via the leaderboard publisher; Reset between snapshots
  so a domain dropping to zero cooldown'd endpoints actually shows zero
  instead of sticking at its last value via Prometheus' staleness window.

New LeaderboardDataProvider method GetCooldownCountData implemented on
Shannon's Protocol.

Test coverage in gateway/domain_circuit_breaker_test.go:
  - MarkBroken → gauge=1
  - ClearService → gauge=0
  - TTL expiry + refresh → gauge=0

Closes the metric gap operators have been hitting when asking "which
domains are circuit-broken right now?" — answer used to require Redis
access; now it's a Prometheus query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
oten91 force-pushed the feat/circuit-breaker-and-cooldown-metrics branch from 3aba77d to bfe2477 on May 1, 2026 13:56
Pairs with the existing path_circuit_breaker_state gauge (current 0/1 view)
to give operators decomposed visibility into WHY domains are breaking.

New metric:
  path_circuit_breaker_events_total{service_id, domain, reason_category, event}
    Counter incremented on each broken/recovered transition.
    event ∈ {"broken", "recovered"}
    reason_category is a bounded prefix bucket extracted from the free-text
    reason passed to MarkBroken (which contains response snippets and error
    messages — too high cardinality for direct labelling). Categories:
      retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown

Wiring:
  - MarkBroken: emit "broken" event with classified reason
  - refreshLocal / refreshFromRedis: emit "recovered" for entries dropping out
    of the local cache (TTL-driven natural recovery)
  - ClearService: emit "recovered" for admin-cleared domains
  - refreshFromRedis re-asserts gauge=1 for currently-broken (idempotent state
    resync) but deliberately does NOT emit a "broken" event — those weren't
    new transitions, just metric resyncs of state already counted at the
    originating MarkBroken call site.

Helper classifyCircuitBreakReason covers all current MarkBroken call sites in
gateway/http_request_context_handle_request.go. Order-sensitive prefix matching
(parallel_retry checked before retry).
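A sketch of what the classifier could look like; the literal prefixes carried by each reason string are assumptions, and only the six category names and the parallel_retry-before-retry ordering constraint come from this commit:

```go
package gateway

import "strings"

// classifyCircuitBreakReason buckets the free-text MarkBroken reason into a
// bounded label value. Ordering matters: "parallel_retry" must be tested
// before "retry", or it would be swallowed by the shorter prefix.
func classifyCircuitBreakReason(reason string) string {
	ordered := []struct {
		prefix   string
		category string
	}{
		{"parallel_retry", "parallel_retry"},
		{"batch_transport", "batch_transport"},
		{"batch_heuristic", "batch_heuristic"},
		{"retry", "retry"},
		{"heuristic", "heuristic"},
	}
	for _, p := range ordered {
		if strings.HasPrefix(reason, p.prefix) {
			return p.category
		}
	}
	return "unknown"
}
```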

Tests:
  - TestClassifyCircuitBreakReason — exhaustive prefix mapping
  - TestCircuitBreakerEventsCounter — verifies broken+recovered increments
    on MarkBroken → ClearService transitions, with correct reason_category
  - Existing TestDomainCircuitBreaker_MetricGaugeTransitions continues to
    pass (gauge wiring unchanged)

Cardinality: services × domains × ~6 reasons × 2 events ≈ 50K series upper
bound across all pods. Safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor Author

oten91 commented May 1, 2026

Update: added break/recovery reason visibility

Building on the gauge-only first commit, this commit (06f3767) adds:

New metric path_circuit_breaker_events_total{service_id, domain, reason_category, event}

  • Counter incremented on each broken/recovered transition
  • reason_category is a bounded bucket: retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown
  • Extracted via classifyCircuitBreakReason from the free-text reason that already flows into MarkBroken (no new logging or call-site changes — same reasons, just bucketed for label use)

Wiring

  • MarkBroken → emits broken event with classified reason
  • refreshLocal / refreshFromRedis → emits recovered for natural TTL-driven recovery
  • ClearService → emits recovered for admin-cleared transitions
  • refreshFromRedis resync path deliberately does NOT emit broken events (those would be double-counted with the originating pod's MarkBroken call)

Tests

  • TestClassifyCircuitBreakReason — prefix-mapping correctness, order-sensitive
  • TestCircuitBreakerEventsCounter — verifies counter increments on MarkBroken → ClearService transitions
  • All existing tests still pass

Dashboard panels added (local files only, not in this PR)

  1. "Circuit-Breaker Breaks per Second (by Reason)" — stacked timeseries showing rate of breaks decomposed by reason_category. Spike in batch_transport = upstream connection issues; spike in heuristic = bad response patterns from suppliers.
  2. "Top Broken Domains (last 1h, by reason)"topk(15) table over increase(events_total[1h]) for diagnostic attribution.

Cardinality ≈ services × domains × 6 reasons × 2 events ≈ 50K series upper bound. Bounded.

The promotion checklist from earlier still applies; this just adds the events counter alongside the gauge. Same risk profile, and the same canary soak gives equivalent coverage. Tests are green, vet is clean.

oten91 and others added 4 commits May 5, 2026 16:08
…n REST

Before: any JSON-RPC envelope received in response to a REST-shaped request
was classified as `rest_protocol_mismatch` with confidence 0.95 and routed
through `isDeceptiveResponsePattern` → CRITICAL signal → strike accumulation.

That is correct for canned successes like `{"jsonrpc":"2.0","result":[]}`
returned regardless of request shape, but it falsely punishes operators
whose backends only speak JSON-RPC: when PATH routes a REST request to
them they correctly reply with their native error format (e.g. -32601
Method not found). That is a capability mismatch, not gaming, yet the
heuristic treats both identically.

Operationally this surfaced as repeated 5-minute cooldowns across ~16
services on operators with JSON-RPC-only nodes — five honest-error events
per session was enough to cross the critical-strike threshold even though
their endpoints were otherwise healthy.

Fix: split the detection into two reasons:
- `rest_protocol_mismatch` — has `result` field (and not `result:null`):
  canned success, still deceptive, still triggers a critical signal.
- `rest_protocol_mismatch_error` — has `error` only (or `result:null`
  alongside `error`, the Geth/Bor/Erigon spec quirk): honest capability
  mismatch, NOT in the deceptive-pattern list, routed to major signal
  (no strike accumulation, no cooldown).

Both still ShouldRetry against a REST-capable peer.
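A hedged sketch of the result/error split, with illustrative names; the treatment of an envelope carrying neither field is a guess, not something this commit specifies:

```go
package gateway

import "encoding/json"

// jsonRPCEnvelope models just the fields the split cares about.
type jsonRPCEnvelope struct {
	JSONRPC string          `json:"jsonrpc"`
	Result  json.RawMessage `json:"result"`
	Error   json.RawMessage `json:"error"`
}

// classifyRESTMismatch decides which reason a JSON-RPC envelope returned for
// a REST-shaped request should carry, and whether it counts as deceptive
// (critical signal) or an honest capability mismatch (major signal).
func classifyRESTMismatch(body []byte) (reason string, deceptive bool) {
	var env jsonRPCEnvelope
	if err := json.Unmarshal(body, &env); err != nil || env.JSONRPC == "" {
		return "", false // not a JSON-RPC envelope; out of scope here
	}

	hasResult := len(env.Result) > 0 && string(env.Result) != "null"
	hasError := len(env.Error) > 0 && string(env.Error) != "null"

	switch {
	case hasResult:
		// Canned success regardless of request shape: still deceptive.
		return "rest_protocol_mismatch", true
	case hasError:
		// Honest error (e.g. -32601 Method not found), including the
		// result:null-alongside-error quirk: major signal, no strikes.
		return "rest_protocol_mismatch_error", false
	default:
		// Envelope with neither field: treated as the deceptive variant
		// here purely for illustration.
		return "rest_protocol_mismatch", true
	}
}
```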

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The strike threshold (5 critical errors → cooldown) was paired with a
1-strike-per-success decay, which is harsh for high-traffic suppliers:
their absolute critical-error count grows with volume, and a transient
burst (e.g. a ~1-minute 5xx wave on otherwise healthy nodes) can push a
99%-success endpoint over the threshold because each intervening success
only erases a single strike. Sequence matters more than average error
rate under that rule, so a few-second burst at high error rate during
otherwise-clean traffic still produces a 5–60 minute cooldown.

Bump the per-success decay from 1 to 3:
- Strikes only persist when error rate exceeds ~25%, so genuinely failing
  endpoints still accumulate and trip the threshold.
- Burst tolerance roughly doubles: a sustained run of failures still
  trips the cooldown, but transient flickers wash out.
- Deceptive suppliers that pass some requests still cannot instantly
  wipe their strike history with a single success — recovery still
  requires a sustained success rate.

The detection side is unchanged; only the recovery curve is gentler.
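A minimal sketch of the accumulate/decay bookkeeping, under stated assumptions: the counter type and method names are invented, and only the threshold of 5 and the per-success decay bump from 1 to 3 come from this commit:

```go
package reputation

const (
	strikeThreshold = 5 // critical strikes before the endpoint enters cooldown
	decayPerSuccess = 3 // strikes erased by each successful request (previously 1)
)

type strikeCounter struct {
	strikes int
}

// onCriticalError is called when a critical signal is recorded.
func (s *strikeCounter) onCriticalError() {
	s.strikes++
}

// onSuccess decays the strike count, clamped at zero. With a decay of 3, a
// transient burst interleaved with healthy traffic washes out quickly, while
// an endpoint failing continuously still reaches the threshold after five
// consecutive critical errors.
func (s *strikeCounter) onSuccess() {
	s.strikes -= decayPerSuccess
	if s.strikes < 0 {
		s.strikes = 0
	}
}

func (s *strikeCounter) inCooldown() bool {
	return s.strikes >= strikeThreshold
}
```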

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier commit split rest_protocol_mismatch into a separate honest-error
reason and removed it from the deceptive-pattern list, so the strike
system stops accumulating critical strikes for it. But the domain-level
circuit breaker lives on a separate path: shouldCircuitBreak only
exempts responses whose MatchedPattern is "capability_limitation", and
the new reason had an empty MatchedPattern, so the circuit breaker still
broke the domain on every retry.

Canary observation after the previous fix: rm01.kalorius.tech still
getting circuit-broken on tron with
reason="...rest_protocol_mismatch_error..." despite the strike-system
change.

Tag MatchedPattern="capability_limitation" on the new reason. Mirrors
how non_json_capability_limitation (Tron lite fullnodes returning plain
text "API closed") is already handled — both branches of the existing
guard then exempt it: reputation penalty skipped, circuit breaker skipped,
ShouldRetry preserved so the request still rolls to a REST-capable peer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canary soak after the previous commit showed 5/33 honest-error events
still triggering circuit breaks. Cause: shouldCircuitBreak has two code
paths — one with the structured heuristicResult, one with only the
wrapped lastErr string. The hedge_failed retry path goes through the
second; the heuristicResult is dropped during error propagation, and
the lastErr fallback only matches archival/over-serviced patterns by
substring. The new rest_protocol_mismatch_error reason had no entry in
that substring list, so it fell through to circuit-break.

Add "rest_protocol_mismatch_error" to capabilityLimitationSubstrings.
Belt-and-braces with the previous MatchedPattern tag: structured-result
path catches it via the matched pattern, hedge_failed path catches it
via the substring. Substring is the literal reason name and includes
the "_error" suffix, so it cannot match the gaming variant
"rest_protocol_mismatch" (which still must circuit-break).

Strike-decay change is meanwhile working as intended — zero new
cooldowns observed in a 30-min soak window post-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
oten91 merged commit 64f1fa9 into main on May 6, 2026
10 of 13 checks passed
oten91 deleted the feat/circuit-breaker-and-cooldown-metrics branch on May 6, 2026 20:01