feat: surface circuit-breaker break/recovery reasons via events counter
Pairs with the existing path_circuit_breaker_state gauge (current 0/1 view)
to give operators decomposed visibility into WHY domains are breaking.

New metric:
  path_circuit_breaker_events_total{service_id, domain, reason_category, event}
    Counter incremented on each broken/recovered transition.
    event ∈ {"broken", "recovered"}
    reason_category is a bounded prefix bucket extracted from the free-text
reason passed to MarkBroken (which contains response snippets and error
messages, too high cardinality for direct labeling). Categories:
      retry, batch_transport, batch_heuristic, parallel_retry, heuristic, unknown
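    Example series as exposed on /metrics (hypothetical label values and
    counts, for illustration only):
      path_circuit_breaker_events_total{service_id="svc-a",domain="rpc.example.com",reason_category="retry",event="broken"} 3
      path_circuit_breaker_events_total{service_id="svc-a",domain="rpc.example.com",reason_category="retry",event="recovered"} 2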

Wiring:
  - MarkBroken: emit "broken" event with classified reason
  - refreshLocal / refreshFromRedis: emit "recovered" for entries dropping out
    of the local cache (TTL-driven natural recovery)
  - ClearService: emit "recovered" for admin-cleared domains
  - refreshFromRedis re-asserts gauge=1 for currently-broken (idempotent state
    resync) but deliberately does NOT emit a "broken" event — those weren't
    new transitions, just metric resyncs of state already counted at the
    originating MarkBroken call site.

Helper classifyCircuitBreakReason covers all current MarkBroken call sites in
gateway/http_request_context_handle_request.go. Order-sensitive prefix matching
(parallel_retry checked before retry).

Tests:
  - TestClassifyCircuitBreakReason — exhaustive prefix mapping
  - TestCircuitBreakerEventsCounter — verifies broken+recovered increments
    on MarkBroken → ClearService transitions, with correct reason_category
  - Existing TestDomainCircuitBreaker_MetricGaugeTransitions continues to
    pass (gauge wiring unchanged)

Cardinality: services × domains × ~6 reasons × 2 events ≈ 50K series upper
bound across all pods. Safe.
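For example, at a hypothetical 60 services and 70 domains, the worst case is
60 × 70 × 6 reason categories × 2 events = 50,400 series, consistent with the
≈50K upper bound.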

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
oten91 and claude committed May 1, 2026
commit 06f37679d91b59ef82ad064e52751e1abc95d733
96 changes: 78 additions & 18 deletions gateway/domain_circuit_breaker.go
@@ -17,6 +17,34 @@ import (

const defaultMaxTTL = 30 * time.Minute

// classifyCircuitBreakReason maps the free-text reason string passed to
// MarkBroken into a bounded category, suitable for use as a Prometheus label.
// The raw reason contains response snippets, error messages, and status codes
// (high cardinality) — for metrics we only need the prefix bucket.
//
// Keep in sync with the metrics.CircuitBreakReason* constants and with the
// reason strings produced by MarkBroken call sites in
// gateway/http_request_context_handle_request.go.
func classifyCircuitBreakReason(reason string) string {
// Match by prefix; reasons are typically formatted as "<category>: <details>"
// or "<category>_<detail>: ...". Order matters for prefixes that share leading
// substrings (parallel_retry vs retry).
switch {
case strings.HasPrefix(reason, "parallel_retry"):
return metrics.CircuitBreakReasonParallelRetry
case strings.HasPrefix(reason, "batch_transport"):
return metrics.CircuitBreakReasonBatchTransport
case strings.HasPrefix(reason, "batch_heuristic"):
return metrics.CircuitBreakReasonBatchHeuristic
case strings.HasPrefix(reason, "retry"):
return metrics.CircuitBreakReasonRetry
case strings.HasPrefix(reason, "heuristic"):
return metrics.CircuitBreakReasonHeuristic
default:
return metrics.CircuitBreakReasonUnknown
}
}

// DomainCircuitBreaker tracks broken domains across pods via Redis.
// When any pod discovers a domain is returning errors, it marks it broken
// so all pods skip that domain on initial attempts for the TTL window.
@@ -117,6 +145,13 @@ func (cb *DomainCircuitBreaker) MarkBroken(ctx context.Context, serviceID, domai
// domains" without needing Redis access. Idempotent — re-setting to 1 is fine.
metrics.SetCircuitBreakerState(serviceID, domain, true)

// Record a "broken" event with reason category so dashboards can show rate
// of breaks decomposed by cause (retry / batch_transport / batch_heuristic /
// parallel_retry / heuristic / unknown). Counter increments on every call
// — including hit-count escalations — which is the right semantic: each
// MarkBroken is a discrete trigger event.
metrics.RecordCircuitBreakerEvent(serviceID, domain, classifyCircuitBreakReason(reason), metrics.CircuitBreakerEventBroken)

// Always log circuit break events at error level for production visibility.
// This is critical for diagnosing why domains get locked out.
cb.logger.Error().
@@ -184,13 +219,17 @@ func (cb *DomainCircuitBreaker) refreshLocal(serviceID string) map[string]bool {
return nil
}

// Remove expired entries; collect their names so we can drop the gauge to 0
// outside the lock (avoid taking metrics lock under cb.mu).
// Remove expired entries; collect their names + reasons so we can drop the
// gauge to 0 and record a "recovered" event outside the lock.
now := time.Now()
var expired []string
type expiredEntry struct {
domain string
reason string
}
var expired []expiredEntry
for domain, state := range entry.domains {
if !state.expiry.After(now) {
expired = append(expired, domain)
expired = append(expired, expiredEntry{domain: domain, reason: state.reason})
delete(entry.domains, domain)
}
}
@@ -202,8 +241,9 @@ func (cb *DomainCircuitBreaker) refreshLocal(serviceID string) map[string]bool {
}
cb.mu.Unlock()

for _, d := range expired {
metrics.SetCircuitBreakerState(serviceID, d, false)
for _, e := range expired {
metrics.SetCircuitBreakerState(serviceID, e.domain, false)
metrics.RecordCircuitBreakerEvent(serviceID, e.domain, classifyCircuitBreakReason(e.reason), metrics.CircuitBreakerEventRecovered)
}
return result
}
@@ -281,7 +321,11 @@ func (cb *DomainCircuitBreaker) refreshFromRedis(ctx context.Context, serviceID
// Capture pre-merge cache contents so we can drop the gauge to 0 for any
// domain that was previously broken but is no longer present after merge.
cb.mu.Lock()
var previouslyBroken []string
type expiredLocal struct {
domain string
reason string
}
var previouslyBroken []expiredLocal
existingEntry, ok := cb.cache[serviceID]
if ok {
for domain, state := range existingEntry.domains {
@@ -290,7 +334,7 @@ func (cb *DomainCircuitBreaker) refreshFromRedis(ctx context.Context, serviceID
domains[domain] = state
}
} else {
previouslyBroken = append(previouslyBroken, domain)
previouslyBroken = append(previouslyBroken, expiredLocal{domain: domain, reason: state.reason})
}
}
}
@@ -302,15 +346,23 @@ func (cb *DomainCircuitBreaker) refreshFromRedis(ctx context.Context, serviceID

// Drop gauge for previously-broken-but-now-expired domains, plus the explicit
// expiredFields list we already collected from Redis. Done outside the lock.
// We don't have a reason for expiredFields entries (parsed Redis value gave us
// one, but for compactness we don't carry it through here) — they get the
// "unknown" reason bucket on recovery, which is acceptable.
for _, d := range expiredFields {
metrics.SetCircuitBreakerState(serviceID, d, false)
metrics.RecordCircuitBreakerEvent(serviceID, d, metrics.CircuitBreakReasonUnknown, metrics.CircuitBreakerEventRecovered)
}
for _, d := range previouslyBroken {
metrics.SetCircuitBreakerState(serviceID, d, false)
for _, e := range previouslyBroken {
metrics.SetCircuitBreakerState(serviceID, e.domain, false)
metrics.RecordCircuitBreakerEvent(serviceID, e.domain, classifyCircuitBreakReason(e.reason), metrics.CircuitBreakerEventRecovered)
}
// Re-assert gauge=1 for currently-broken domains. Idempotent and ensures the
// metric is correct on a fresh pod that lazily picks up Redis state without
// going through MarkBroken locally.
// going through MarkBroken locally. Note: we deliberately do NOT record a
// "broken" event here — these aren't new transitions, just a metric resync
// for a state that was already counted at MarkBroken time on whichever pod
// originally fired it.
for d := range domains {
metrics.SetCircuitBreakerState(serviceID, d, true)
}
@@ -330,12 +382,16 @@ func (cb *DomainCircuitBreaker) ClearService(ctx context.Context, serviceID stri
cb.mu.Lock()
entry, ok := cb.cache[serviceID]
count := 0
var clearedDomains []string
type clearedEntry struct {
domain string
reason string
}
var cleared []clearedEntry
if ok {
count = len(entry.domains)
clearedDomains = make([]string, 0, count)
for d := range entry.domains {
clearedDomains = append(clearedDomains, d)
cleared = make([]clearedEntry, 0, count)
for d, state := range entry.domains {
cleared = append(cleared, clearedEntry{domain: d, reason: state.reason})
}
delete(cb.cache, serviceID)
}
@@ -345,9 +401,13 @@ func (cb *DomainCircuitBreaker) ClearService(ctx context.Context, serviceID stri
cb.redisClient.Del(ctx, cb.keyPrefix+serviceID)
}

// Drop the gauge for every domain we cleared so the dashboard reflects reality.
for _, d := range clearedDomains {
metrics.SetCircuitBreakerState(serviceID, d, false)
// Drop the gauge AND record a recovered event for every cleared domain. Each
// admin-clear represents an explicit "operator forced these domains back to
// healthy" action, distinct from natural TTL-driven recovery — the recovery
// counter still captures it because operationally it's the same transition.
for _, c := range cleared {
metrics.SetCircuitBreakerState(serviceID, c.domain, false)
metrics.RecordCircuitBreakerEvent(serviceID, c.domain, classifyCircuitBreakReason(c.reason), metrics.CircuitBreakerEventRecovered)
}

cb.logger.Info().
63 changes: 63 additions & 0 deletions gateway/domain_circuit_breaker_test.go
@@ -436,6 +436,69 @@ func TestParseRedisValue_InvalidFormat(t *testing.T) {
}
}

// TestClassifyCircuitBreakReason verifies that the free-text reason strings
// passed to MarkBroken get bucketed into the correct bounded category for use
// as a Prometheus label. Order-sensitive prefix matches must be stable —
// "parallel_retry" must NOT match the "retry" branch.
func TestClassifyCircuitBreakReason(t *testing.T) {
cases := []struct {
raw string
expected string
}{
{"retry: heuristic_html | status=502 | response=<html>...", metrics.CircuitBreakReasonRetry},
{"batch_transport_error: connection refused", metrics.CircuitBreakReasonBatchTransport},
{"batch_heuristic: empty_array | status=200 | response=[]", metrics.CircuitBreakReasonBatchHeuristic},
{"parallel_retry: timeout after 5s", metrics.CircuitBreakReasonParallelRetry},
{"heuristic: structured error response", metrics.CircuitBreakReasonHeuristic},
{"completely unknown reason format", metrics.CircuitBreakReasonUnknown},
{"", metrics.CircuitBreakReasonUnknown},
}
for _, c := range cases {
got := classifyCircuitBreakReason(c.raw)
if got != c.expected {
t.Errorf("classifyCircuitBreakReason(%q) = %q, want %q", c.raw, got, c.expected)
}
}
}

// TestCircuitBreakerEventsCounter verifies that broken/recovered transitions
// increment the events counter with the correct reason_category bucket.
func TestCircuitBreakerEventsCounter(t *testing.T) {
cb := NewDomainCircuitBreaker(nil, testCircuitBreakerLogger())
cb.defaultTTL = 50 * time.Millisecond
ctx := context.Background()
const serviceID = "events-test-svc"
const domain = "events-test.example.com"

// Capture pre-state of the counter for the expected (service, domain, retry, broken)
// label combination so the test is hermetic against other tests' increments.
preBroken := testutil.ToFloat64(metrics.DomainCircuitBreakerEventsTotal.WithLabelValues(
serviceID, domain, metrics.CircuitBreakReasonRetry, metrics.CircuitBreakerEventBroken,
))
preRecovered := testutil.ToFloat64(metrics.DomainCircuitBreakerEventsTotal.WithLabelValues(
serviceID, domain, metrics.CircuitBreakReasonRetry, metrics.CircuitBreakerEventRecovered,
))

// MarkBroken with a "retry: ..." reason should record one broken event with
// reason_category="retry"
cb.MarkBroken(ctx, serviceID, domain, "retry: heuristic_html | status=502 | response=oops")
postBroken := testutil.ToFloat64(metrics.DomainCircuitBreakerEventsTotal.WithLabelValues(
serviceID, domain, metrics.CircuitBreakReasonRetry, metrics.CircuitBreakerEventBroken,
))
if postBroken-preBroken != 1 {
t.Fatalf("expected broken counter to increment by 1, got delta=%v", postBroken-preBroken)
}

// ClearService should record one recovered event with the same reason_category
cb.ClearService(ctx, serviceID)
postRecovered := testutil.ToFloat64(metrics.DomainCircuitBreakerEventsTotal.WithLabelValues(
serviceID, domain, metrics.CircuitBreakReasonRetry, metrics.CircuitBreakerEventRecovered,
))
if postRecovered-preRecovered != 1 {
t.Fatalf("expected recovered counter to increment by 1, got delta=%v", postRecovered-preRecovered)
}
}

// TestDomainCircuitBreaker_MetricGaugeTransitions verifies that the circuit-breaker
// state gauge moves between 0 and 1 correctly: set to 1 on MarkBroken, dropped to 0
// on ClearService, and dropped to 0 by refreshLocal once a TTL expires. We test
55 changes: 55 additions & 0 deletions metrics/metrics.go
@@ -336,6 +336,61 @@
DomainCircuitBreakerState.WithLabelValues(serviceID, domain).Set(v)
}

// =============================================================================
// Domain Circuit Breaker Events (Counter)
// Labels: service_id, domain, reason_category, event
// Purpose: Surface WHY domains break and at what rate. Pairs with
// path_circuit_breaker_state (current 0/1 view) — events_total gives the
// historical decomposition by reason and event type (broken vs recovered).
//
// reason_category is a low-cardinality bucket extracted from the free-text
// reason passed to MarkBroken (which itself contains response snippets and
// error messages, too high-cardinality for direct labeling). See the
// CircuitBreakReason* constants below.
//
// event ∈ {"broken", "recovered"} so a single counter captures both transitions.
// =============================================================================

const (
LabelReasonCategory = "reason_category"
LabelCircuitBreakerEvent = "event"
CircuitBreakerEventBroken = "broken"
CircuitBreakerEventRecovered = "recovered"
)

// The CircuitBreakReason* constants enumerate the bounded set of reason buckets
// surfaced via the events counter. Keeps cardinality bounded regardless of
// how variable the underlying reason strings are. Add new categories here
// when MarkBroken call sites grow new reason prefixes.
const (
CircuitBreakReasonRetry = "retry" // failure during retry path
CircuitBreakReasonBatchTransport = "batch_transport" // batch path: transport-level error
CircuitBreakReasonBatchHeuristic = "batch_heuristic" // batch path: heuristic flagged response
CircuitBreakReasonParallelRetry = "parallel_retry" // parallel-retry path failure
CircuitBreakReasonHeuristic = "heuristic" // top-level heuristic break
CircuitBreakReasonUnknown = "unknown" // anything we can't classify
)

var DomainCircuitBreakerEventsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: MetricPrefix + "circuit_breaker_events_total",
Help: "Domain circuit-breaker transitions over time. event ∈ {broken, recovered}. reason_category is a bounded prefix bucket extracted from the raw MarkBroken reason. Pairs with path_circuit_breaker_state for current snapshot.",
},
[]string{LabelServiceID, LabelDomain, LabelReasonCategory, LabelCircuitBreakerEvent},
)
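
// A typical dashboard query over this counter (PromQL sketch; the full metric
// name path_circuit_breaker_events_total follows from MetricPrefix, per the
// commit message) charts the rate of breaks decomposed by cause:
//
//	sum by (reason_category) (
//	  rate(path_circuit_breaker_events_total{event="broken"}[5m])
//	)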

// RecordCircuitBreakerEvent increments the per-(service, domain, reason, event)
// counter. Skipped silently when domain is empty.
func RecordCircuitBreakerEvent(serviceID, domain, reasonCategory, event string) {
if domain == "" {
return
}
if reasonCategory == "" {
reasonCategory = CircuitBreakReasonUnknown
}
DomainCircuitBreakerEventsTotal.WithLabelValues(serviceID, domain, reasonCategory, event).Inc()
}

// =============================================================================
// Endpoints In Cooldown (Gauge, published every 10s via leaderboard publisher)
// Labels: domain, rpc_type, service_id