Description
Thanos, Prometheus and Golang version used:
Thanos version: 0.21.1
Prometheus version: 2.27.1
Alertmanager: 0.22.2
Object Storage Provider:
GCS
What happened:
We run our Thanos Ruler in HA with some recording rules, and we have alerts defined for some of those recording rules. We are experiencing issues with alerting: we see alerts being closed and then instantly re-created.
For debugging we created a watchdog alert:
  name: ruler-watchdog
  rules:
  - alert: ruler-watchdog
    expr: sum(up{namespace="thanos",app_kubernetes_io_component="query"}) by (app_kubernetes_io_component) > 0
    for: 1m
    annotations:
      summary: Watchdog for Thanos Ruler
      message: Watchdog for Thanos Ruler
    labels:
      severity: warning

This alert is always firing, and we see the same flapping happen with it.
We don't see a clear pattern in when these alerts open and close. We first thought it was caused by restarts or something similar. The Alertmanager we use for Thanos Ruler is also used by Prometheus on the same cluster, and Prometheus does not show this behavior.
What you expected to happen:
I would expect that once an alert is created, it stays open until the underlying condition is resolved.
How to reproduce it (as minimally and precisely as possible):
- Ruler in HA with 3 instances
- Alertmanager in HA with 3 instances
- Create a watchdog alert
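One way to confirm the flapping after setting this up (a debugging sketch, not part of the original report) is to periodically poll the watchdog alert from the Alertmanager v2 API (GET /api/v2/alerts) and record its startsAt timestamp: since the watchdog condition is always true, startsAt should never change, so every change between two polls means the alert was closed and re-created. The sample timestamps below are made up for illustration.

```python
def count_recreations(starts_at_samples):
    """Count how often an alert's startsAt changed between consecutive
    polls. For an always-firing watchdog alert, any change means the
    alert was closed and instantly re-created instead of staying open."""
    changes = 0
    for prev, cur in zip(starts_at_samples, starts_at_samples[1:]):
        if cur != prev:
            changes += 1
    return changes

# Example: three polls of the ruler-watchdog alert's startsAt field;
# the timestamp jumps once, so the alert flapped once.
samples = [
    "2021-06-10T09:00:00Z",
    "2021-06-10T09:00:00Z",
    "2021-06-10T09:07:30Z",  # new startsAt: the alert was re-created
]
print(count_recreations(samples))  # 1
```

In a healthy setup this count should stay at zero for the whole observation window.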
Anything else we need to know:
The Ruler is created via the Prometheus Operator, but it is deployed in the same namespace as Thanos on the same cluster. The recording rules are working properly, and we don't see any gaps in their metrics.