feat: surface circuit-breaker state and per-domain cooldown counts as Prometheus metrics #513
… Prometheus metrics

Two metrics that were previously only observable via /ready introspection or direct Redis access are now first-class gauges:

path_circuit_breaker_state{service_id, domain}
  1 = domain currently locked out, 0 = healthy/recovered. Set on MarkBroken; dropped to 0 on ClearService and on TTL-expiry refresh. Both refreshLocal and refreshFromRedis now drop the gauge for expired entries; refreshFromRedis additionally re-asserts gauge=1 for currently-broken domains so a fresh pod that lazily picks up Redis state stays consistent without going through MarkBroken locally.

path_endpoints_in_cooldown{domain, rpc_type, service_id}
  Per-domain count of endpoints currently in strike cooldown (Score.IsInCooldown() == true). Cooldown is a transient state imposed by accumulated critical strikes. It is independent from "score below threshold", which is already covered by path_reputation_endpoint_leaderboard with tier_threshold="0". Published every 10s via the leaderboard publisher; Reset between snapshots so a domain dropping to zero cooldown'd endpoints actually shows zero instead of sticking at its last value via Prometheus' staleness window. New LeaderboardDataProvider method GetCooldownCountData implemented on Shannon's Protocol.

Test coverage in gateway/domain_circuit_breaker_test.go:
- MarkBroken → gauge=1
- ClearService → gauge=0
- TTL expiry + refresh → gauge=0

Closes the metric gap operators have been hitting when asking "which domains are circuit-broken right now?" The answer used to require Redis access; now it's a Prometheus query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
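For reviewers who want to see why the Reset between snapshots matters for path_endpoints_in_cooldown, here is a minimal, stdlib-only sketch. It is not the code in this PR and does not use the Prometheus client library: `gaugeVec`, `labels`, and `publish` are hypothetical stand-ins, and the label values are made up. It also assumes the provider only returns domains that currently have at least one endpoint in cooldown, which the commit message does not state.

```go
package main

import "fmt"

// labels identifies one gauge series. The fields mirror the label set named
// in the commit message; the type itself is hypothetical.
type labels struct {
	Domain    string
	RPCType   string
	ServiceID string
}

// gaugeVec is a stand-in for a labelled gauge family: one float per label set.
type gaugeVec struct {
	series map[labels]float64
}

func newGaugeVec() *gaugeVec {
	return &gaugeVec{series: map[labels]float64{}}
}

// Set writes the value for one series, creating it if needed.
func (g *gaugeVec) Set(l labels, v float64) { g.series[l] = v }

// Reset drops every series, so only series written afterwards are exported.
func (g *gaugeVec) Reset() { g.series = map[labels]float64{} }

// Value reports the current value and whether the series exists at all.
func (g *gaugeVec) Value(l labels) (float64, bool) {
	v, ok := g.series[l]
	return v, ok
}

// publish writes one snapshot of cooldown counts. With reset=false, a series
// that is missing from the snapshot silently keeps its previous value.
func publish(g *gaugeVec, snapshot map[labels]int, reset bool) {
	if reset {
		g.Reset()
	}
	for l, n := range snapshot {
		g.Set(l, float64(n))
	}
}

func main() {
	// Hypothetical label values, for illustration only.
	l := labels{Domain: "example.com", RPCType: "json_rpc", ServiceID: "eth"}

	for _, reset := range []bool{false, true} {
		g := newGaugeVec()
		publish(g, map[labels]int{l: 3}, reset) // tick 1: three endpoints in cooldown
		publish(g, map[labels]int{}, reset)     // tick 2: the domain has recovered
		v, ok := g.Value(l)
		fmt.Printf("reset=%v value=%v present=%v\n", reset, v, ok)
	}
}
```

Without the reset, the recovered domain still reports 3 after the second tick. With the reset, the stale series is gone, so dashboards and alerts no longer see the old count.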