… Prometheus metrics
Two metrics that were previously only observable via /ready introspection or
direct Redis access are now first-class gauges:
path_circuit_breaker_state{service_id, domain}
1 = domain currently locked out, 0 = healthy/recovered.
Set on MarkBroken; dropped to 0 on ClearService and on TTL-expiry refresh.
Both refreshLocal and refreshFromRedis now drop the gauge for expired
entries; refreshFromRedis additionally re-asserts gauge=1 for currently-
broken domains so a fresh pod that lazily picks up Redis state stays
consistent without going through MarkBroken locally.
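A minimal sketch of how a 0/1 gauge like this can be wired with prometheus/client_golang, for readers outside the codebase. The metric name and labels come from this PR; the helper names, registration, and call structure are illustrative assumptions, not PATH's actual MarkBroken / ClearService / refresh call sites.

```go
package gateway

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: the metric name and labels match the PR, the rest is a stand-in.
var circuitBreakerState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "path_circuit_breaker_state",
		Help: "1 if the domain is currently locked out for the service, 0 if healthy/recovered.",
	},
	[]string{"service_id", "domain"},
)

func init() { prometheus.MustRegister(circuitBreakerState) }

// markBrokenMetric would be called from MarkBroken: the domain is now locked out.
func markBrokenMetric(serviceID, domain string) {
	circuitBreakerState.WithLabelValues(serviceID, domain).Set(1)
}

// clearBrokenMetric would be called from ClearService and the TTL-expiry refresh paths.
func clearBrokenMetric(serviceID, domain string) {
	circuitBreakerState.WithLabelValues(serviceID, domain).Set(0)
}
```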
path_endpoints_in_cooldown{domain, rpc_type, service_id}
Per-domain count of endpoints currently in strike cooldown
(Score.IsInCooldown() == true). Cooldown is a transient state imposed
by accumulated critical strikes — independent from "score below
threshold" which is already covered by path_reputation_endpoint_leaderboard
with tier_threshold="0".
Published every 10s via the leaderboard publisher; Reset between snapshots
so a domain dropping to zero cooldown'd endpoints actually shows zero
instead of sticking at its last value via Prometheus' staleness window.
New LeaderboardDataProvider method GetCooldownCountData implemented on
Shannon's Protocol.
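For illustration, a hedged sketch of the publish-and-Reset pattern this relies on. The counts callback stands in for GetCooldownCountData (the real provider walks active sessions and checks Score.IsInCooldown()); everything else is assumed plumbing, not Shannon's actual Protocol code.

```go
package gateway

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var endpointsInCooldown = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "path_endpoints_in_cooldown",
		Help: "Endpoints currently in strike cooldown, per domain/rpc_type/service_id.",
	},
	[]string{"domain", "rpc_type", "service_id"},
)

type cooldownKey struct{ domain, rpcType, serviceID string }

// publishCooldownCounts mirrors the 10s publisher described above: Reset first,
// so a domain that drops to zero cooldown'd endpoints reports 0 rather than
// holding its last value through Prometheus' staleness window.
func publishCooldownCounts(counts func() map[cooldownKey]int, stop <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			endpointsInCooldown.Reset()
			for k, n := range counts() {
				endpointsInCooldown.
					WithLabelValues(k.domain, k.rpcType, k.serviceID).
					Set(float64(n))
			}
		}
	}
}
```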
Test coverage in gateway/domain_circuit_breaker_test.go:
- MarkBroken → gauge=1
- ClearService → gauge=0
- TTL expiry + refresh → gauge=0
Closes the metric gap operators have been hitting when asking "which
domains are circuit-broken right now?" — answer used to require Redis
access; now it's a Prometheus query.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with the existing path_circuit_breaker_state gauge (current 0/1 view)
to give operators decomposed visibility into WHY domains are breaking.
New metric:
path_circuit_breaker_events_total{service_id, domain, reason_category, event}
Counter incremented on each broken/recovered transition.
event ∈ {"broken", "recovered"}
reason_category is a bounded prefix bucket extracted from the free-text
reason passed to MarkBroken (which contains response snippets and error
messages — too high cardinality for direct labelling). Categories:
retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown
Wiring:
- MarkBroken: emit "broken" event with classified reason
- refreshLocal / refreshFromRedis: emit "recovered" for entries dropping out
of the local cache (TTL-driven natural recovery)
- ClearService: emit "recovered" for admin-cleared domains
- refreshFromRedis re-asserts gauge=1 for currently-broken (idempotent state
resync) but deliberately does NOT emit a "broken" event — those weren't
new transitions, just metric resyncs of state already counted at the
originating MarkBroken call site.
Helper classifyCircuitBreakReason covers all current MarkBroken call sites in
gateway/http_request_context_handle_request.go. Order-sensitive prefix matching
(parallel_retry checked before retry).
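To make the ordering constraint concrete, a hedged sketch of a classifier like classifyCircuitBreakReason. Only the category names and the parallel_retry-before-retry ordering are confirmed by this PR; the exact match strings, and whether the real helper matches prefixes or substrings of the free-text reason, are assumptions (substring matching is used here so the ordering visibly matters, since "parallel_retry" contains "retry" and "batch_heuristic" contains "heuristic").

```go
package gateway

import "strings"

// reasonCategories is checked in order: "parallel_retry" must come before "retry",
// and "batch_heuristic" before "heuristic", because the later strings are
// substrings of the earlier ones.
var reasonCategories = []struct{ match, category string }{
	{"parallel_retry", "parallel_retry"},
	{"retry", "retry"},
	{"batch_transport", "batch_transport"},
	{"batch_heuristic", "batch_heuristic"},
	{"heuristic", "heuristic"},
}

// classifyCircuitBreakReason buckets a free-text MarkBroken reason (which may
// contain response snippets and error messages) into a bounded label value.
func classifyCircuitBreakReason(reason string) string {
	for _, rc := range reasonCategories {
		if strings.Contains(reason, rc.match) {
			return rc.category
		}
	}
	return "unknown"
}
```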
Tests:
- TestClassifyCircuitBreakReason — exhaustive prefix mapping
- TestCircuitBreakerEventsCounter — verifies broken+recovered increments
on MarkBroken → ClearService transitions, with correct reason_category
- Existing TestDomainCircuitBreaker_MetricGaugeTransitions continues to
pass (gauge wiring unchanged)
Cardinality: services × domains × ~6 reasons × 2 events ≈ 50K series upper
bound across all pods. Safe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update: added break/recovery reason visibility

Building on the gauge-only first commit, this commit (06f3767) adds:
- New metric
- Wiring
- Tests
- Dashboard panels added (local files only, not in this PR)

Cardinality ≈ services × domains × 6 reasons × 2 events ≈ 50K series upper bound. Bounded. Promotion checklist from earlier still applies; this just adds the events counter alongside the gauge. Same risk profile, same canary soak gives equal coverage. Tests are green, vet clean.
…n REST
Before: any JSON-RPC envelope received in response to a REST-shaped request
was classified as `rest_protocol_mismatch` with confidence 0.95 and routed
through `isDeceptiveResponsePattern` → CRITICAL signal → strike accumulation.
That is correct for canned successes like `{"jsonrpc":"2.0","result":[]}`
returned regardless of request shape, but it falsely punishes operators
whose backends only speak JSON-RPC: when PATH routes a REST request to
them they correctly reply with their native error format (e.g. -32601
Method not found). That is a capability mismatch, not gaming, yet the
heuristic treats both identically.
Operationally this surfaced as repeated 5-minute cooldowns across ~16
services on operators with JSON-RPC-only nodes — five honest-error events
per session was enough to cross the critical-strike threshold even though
their endpoints were otherwise healthy.
Fix: split the detection into two reasons:
- `rest_protocol_mismatch` — has `result` field (and not `result:null`):
canned success, still deceptive, still triggers a critical signal.
- `rest_protocol_mismatch_error` — has `error` only (or `result:null`
alongside `error`, the Geth/Bor/Erigon spec quirk): honest capability
mismatch, NOT in the deceptive-pattern list, routed to major signal
(no strike accumulation, no cooldown).
Both still ShouldRetry against a REST-capable peer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
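For readers skimming the diff, a minimal sketch of the result/error split described in this commit, assuming the response has already been parsed as a JSON-RPC envelope. The reason strings are the PR's; the type and function names are hypothetical, not the actual heuristic code.

```go
package gateway

import "encoding/json"

// jsonRPCEnvelope is a stand-in shape for the parsed response; the real
// heuristic's types and field handling may differ.
type jsonRPCEnvelope struct {
	Result json.RawMessage `json:"result"`
	Error  json.RawMessage `json:"error"`
}

// classifyRESTMismatch mirrors the split for a JSON-RPC envelope received in
// response to a REST-shaped request.
func classifyRESTMismatch(env jsonRPCEnvelope) string {
	hasResult := len(env.Result) > 0 && string(env.Result) != "null"
	hasError := len(env.Error) > 0 && string(env.Error) != "null"

	switch {
	case hasResult:
		// Canned success regardless of request shape: deceptive, stays critical.
		return "rest_protocol_mismatch"
	case hasError:
		// Honest native error (e.g. -32601), possibly with result:null alongside
		// (the Geth/Bor/Erigon quirk): capability mismatch, major signal only.
		return "rest_protocol_mismatch_error"
	default:
		// Neither field populated; outside the two cases this sketch models.
		return ""
	}
}
```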
The strike threshold (5 critical errors → cooldown) was paired with a 1-strike-per-success decay, which is harsh for high-traffic suppliers: their absolute critical-error count grows with volume, and a transient burst (e.g. a ~1-minute 5xx wave on otherwise healthy nodes) can push a 99%-success endpoint over the threshold because each intervening success only erases a single strike. Sequence matters more than average error rate under that rule, so a few-second burst at high error rate during otherwise-clean traffic still produces a 5–60 minute cooldown.

Bump the per-success decay from 1 to 3:
- Strikes only persist when error rate exceeds ~25%, so genuinely failing endpoints still accumulate and trip the threshold.
- Burst tolerance roughly doubles: a sustained run of failures still trips the cooldown, but transient flickers wash out.
- Deceptive suppliers that pass some requests still cannot instantly wipe their strike history with a single success — recovery still requires a sustained success rate.

The detection side is unchanged; only the recovery curve is gentler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
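A toy model of the strike bookkeeping this commit tunes, with assumed names (the real Score type and its counters live in PATH's reputation code). It only illustrates the 5-strike threshold and the per-success decay bump from 1 to 3.

```go
package gateway

const (
	criticalStrikeThreshold = 5 // critical strikes that trigger a cooldown
	strikeDecayPerSuccess   = 3 // bumped from 1 by this change
)

// strikeCounter is a hypothetical stand-in for the per-endpoint strike state.
type strikeCounter struct {
	strikes int
}

// OnCriticalError adds a strike and reports whether the cooldown threshold is hit.
func (s *strikeCounter) OnCriticalError() bool {
	s.strikes++
	return s.strikes >= criticalStrikeThreshold
}

// OnSuccess decays strikes, so transient bursts wash out under healthy traffic
// while sustained failure runs still accumulate toward the threshold.
func (s *strikeCounter) OnSuccess() {
	s.strikes -= strikeDecayPerSuccess
	if s.strikes < 0 {
		s.strikes = 0
	}
}
```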
Earlier commit split rest_protocol_mismatch into a separate honest-error reason and removed it from the deceptive-pattern list, so the strike system stops accumulating critical strikes for it. But the domain-level circuit breaker lives on a separate path: shouldCircuitBreak only exempts responses whose MatchedPattern is "capability_limitation", and the new reason had an empty MatchedPattern, so the circuit breaker still broke the domain on every retry.

Canary observation after the previous fix: rm01.kalorius.tech still getting circuit-broken on tron with reason="...rest_protocol_mismatch_error..." despite the strike-system change.

Tag MatchedPattern="capability_limitation" on the new reason. Mirrors how non_json_capability_limitation (Tron lite fullnodes returning plain text "API closed") is already handled — both branches of the existing guard then exempt it: reputation penalty skipped, circuit breaker skipped, ShouldRetry preserved so the request still rolls to a REST-capable peer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
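Sketch of the structured-result branch of that guard, with hypothetical names; the commit only confirms that shouldCircuitBreak exempts results whose MatchedPattern is "capability_limitation".

```go
package gateway

// heuristicResult is an assumed shape based on the commit text, not the real type.
type heuristicResult struct {
	Reason         string
	MatchedPattern string
	ShouldRetry    bool
}

// exemptFromCircuitBreak models the structured-result branch: honest capability
// limitations skip the reputation penalty and the domain circuit breaker while
// ShouldRetry stays true so the request rolls to another peer.
func exemptFromCircuitBreak(res *heuristicResult) bool {
	return res != nil && res.MatchedPattern == "capability_limitation"
}
```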
Canary soak after the previous commit showed 5/33 honest-error events still triggering circuit breaks. Cause: shouldCircuitBreak has two code paths — one with the structured heuristicResult, one with only the wrapped lastErr string. The hedge_failed retry path goes through the second; the heuristicResult is dropped during error propagation, and the lastErr fallback only matches archival/over-serviced patterns by substring. The new rest_protocol_mismatch_error reason had no entry in that substring list, so it fell through to circuit-break.

Add "rest_protocol_mismatch_error" to capabilityLimitationSubstrings. Belt-and-braces with the previous MatchedPattern tag: structured-result path catches it via the matched pattern, hedge_failed path catches it via the substring. Substring is the literal reason name and includes the "_error" suffix, so it cannot match the gaming variant "rest_protocol_mismatch" (which still must circuit-break).

Strike-decay change is meanwhile working as intended — zero new cooldowns observed in a 30-min soak window post-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
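And a sketch of the second, lastErr-string branch this commit extends. The variable name capabilityLimitationSubstrings is from the commit; only the new entry is shown, and the surrounding function is illustrative rather than the real fallback code.

```go
package gateway

import "strings"

// capabilityLimitationSubstrings: the real list also carries the existing
// archival/over-serviced patterns; only the entry added by this commit is shown.
var capabilityLimitationSubstrings = []string{
	"rest_protocol_mismatch_error",
}

// lastErrIsCapabilityLimitation models the hedge_failed fallback, where only the
// wrapped lastErr string survives error propagation. The entry keeps its "_error"
// suffix, so a plain "rest_protocol_mismatch" reason (the gaming variant, which
// must still circuit-break) never matches it.
func lastErrIsCapabilityLimitation(lastErr error) bool {
	if lastErr == nil {
		return false
	}
	msg := lastErr.Error()
	for _, s := range capabilityLimitationSubstrings {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}
```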
Summary
Two metrics that were previously only observable via /ready introspection or direct Redis access are now first-class Prometheus gauges:
- path_circuit_breaker_state{service_id, domain} — 1 if domain is currently locked out, 0 otherwise
- path_endpoints_in_cooldown{domain, rpc_type, service_id} — count of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during PR #512 work — operators could see broken-domain effects in error-rate graphs but couldn't directly answer "which domains are circuit-broken right now?" without Redis access.
What this enables on dashboards
Once Grafana picks these up:
These dashboard panels are not in this PR (kept to the metric-only diff for review clarity); they'll go in a small follow-up commit that updates local/observability/dashboards/*.json once this metric is deployed.

Implementation notes
path_circuit_breaker_state

State transitions in gateway/domain_circuit_breaker.go:
- MarkBroken
- ClearService
- refreshLocal finds expired entries
- refreshFromRedis finds expired entries

All gauge sets happen outside cb.mu to avoid taking the metrics lock under the cache mutex. There is a small inherent staleness window: a circuit-breaker entry whose TTL just expired remains at gauge=1 until the cache TTL elapses (5s default) and refreshLocal runs. That's bounded and fine for a dashboard.

path_endpoints_in_cooldown

New LeaderboardDataProvider.GetCooldownCountData(ctx) method, implemented on Shannon's Protocol. Walks active sessions, fetches each endpoint's reputation score, increments per-(domain, service_id, rpc_type) when score.IsInCooldown() returns true.

Published every 10s alongside the existing leaderboard / mean score / supplier score metrics. Resets between snapshots so a domain dropping to zero cooldown'd endpoints shows zero (rather than sticking at its last value via Prometheus' 5-min staleness window).
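As a hedged illustration of the locking note above (gauge sets happen outside cb.mu), the pattern is: snapshot state under the cache mutex, release it, then update the metric. Type and method names below are stand-ins, not the real domain_circuit_breaker.go code.

```go
package gateway

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// domainCircuitBreaker is a cut-down stand-in for the real type.
type domainCircuitBreaker struct {
	mu     sync.Mutex
	broken map[string]map[string]bool // serviceID -> domain -> currently broken
}

// refreshGauges copies state while holding cb.mu, then touches the metric only
// after the cache mutex is released, so the metrics lock is never taken under it.
func (cb *domainCircuitBreaker) refreshGauges(state *prometheus.GaugeVec) {
	type entry struct {
		serviceID, domain string
		value             float64
	}
	var snapshot []entry

	cb.mu.Lock()
	for svc, domains := range cb.broken {
		for dom, isBroken := range domains {
			v := 0.0
			if isBroken {
				v = 1.0
			}
			snapshot = append(snapshot, entry{svc, dom, v})
		}
	}
	cb.mu.Unlock()

	// Metric updates happen outside the mutex.
	for _, e := range snapshot {
		state.WithLabelValues(e.serviceID, e.domain).Set(e.value)
	}
}
```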
Test plan
- go test ./... — all green
- go vet ./... — clean
- TestDomainCircuitBreaker_MetricGaugeTransitions covers MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
- path_circuit_breaker_state and path_endpoints_in_cooldown series appear in Prometheus
- Hit /admin/circuit-breaker/clear/{serviceId} to test the clear path and confirm the gauge transitions

Cardinality
Both new metrics are bounded by services × domains — already low cardinality (~50-200 unique domains × ~80 services on mainnet). No risk of cardinality explosion.

🤖 Generated with Claude Code