feat: surface circuit-breaker state and per-domain cooldown counts as Prometheus metrics #513
Pairs with the existing path_circuit_breaker_state gauge (current 0/1 view) to give operators decomposed visibility into WHY domains are breaking.

New metric:
  path_circuit_breaker_events_total{service_id, domain, reason_category, event}

Counter incremented on each broken/recovered transition.
- event ∈ {"broken", "recovered"}
- reason_category is a bounded prefix bucket extracted from the free-text reason passed to MarkBroken (which contains response snippets and error messages, so it is too high cardinality for direct labelling).
- Categories: retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown

Wiring:
- MarkBroken: emit "broken" event with classified reason
- refreshLocal / refreshFromRedis: emit "recovered" for entries dropping out of the local cache (TTL-driven natural recovery)
- ClearService: emit "recovered" for admin-cleared domains
- refreshFromRedis re-asserts gauge=1 for currently-broken (idempotent state resync) but deliberately does NOT emit a "broken" event. Those weren't new transitions, just metric resyncs of state already counted at the originating MarkBroken call site.

Helper classifyCircuitBreakReason covers all current MarkBroken call sites in gateway/http_request_context_handle_request.go. Order-sensitive prefix matching (parallel_retry checked before retry).

Tests:
- TestClassifyCircuitBreakReason: exhaustive prefix mapping
- TestCircuitBreakerEventsCounter: verifies broken+recovered increments on MarkBroken → ClearService transitions, with correct reason_category
- Existing TestDomainCircuitBreaker_MetricGaugeTransitions continues to pass (gauge wiring unchanged)

Cardinality: services × domains × ~6 reasons × 2 events ≈ 50K series upper bound across all pods. Safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
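For readers who want a concrete picture of the bounded, order-sensitive prefix bucketing described above, here is a minimal sketch. It is not the PR's classifyCircuitBreakReason: the function name classifyReasonSketch and the prefix strings ("retry/parallel", "batch/transport", and so on) are invented for illustration, since the actual reason formats at the MarkBroken call sites are not shown here. Only the six category names come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// reasonPrefixes maps a HYPOTHETICAL reason prefix to one of the six
// categories named in the PR description. The slice is ordered: a more
// specific prefix ("retry/parallel") must precede a shorter prefix that
// it also starts with ("retry"), otherwise it would be misbucketed.
var reasonPrefixes = []struct{ prefix, category string }{
	{"retry/parallel", "parallel_retry"},
	{"batch/transport", "batch_transport"},
	{"batch/heuristic", "batch_heuristic"},
	{"retry", "retry"},
	{"heuristic", "heuristic"},
}

// classifyReasonSketch reduces a free-text reason to a bounded label
// value. Anything that matches no known prefix falls into "unknown",
// so the label can never take more than six distinct values.
func classifyReasonSketch(reason string) string {
	r := strings.ToLower(strings.TrimSpace(reason))
	for _, p := range reasonPrefixes {
		if strings.HasPrefix(r, p.prefix) {
			return p.category
		}
	}
	return "unknown"
}

func main() {
	for _, reason := range []string{
		"retry/parallel: upstream timeout",
		"retry: 502 from endpoint",
		"something unexpected",
	} {
		fmt.Printf("%q -> %s\n", reason, classifyReasonSketch(reason))
	}
}
```

Keeping the table as an ordered slice rather than a map makes the match order explicit and deterministic, which is what the "parallel_retry checked before retry" note requires.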
Check warning on line 348 in metrics/metrics.go
[misspell] metrics/metrics.go#L348