Description
Thanos, Prometheus and Golang version used
thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5
What happened
In one of the k8s clusters where we run thanos-query, it crashes every couple of minutes with either "fatal error: concurrent map iteration and map write" or "fatal error: concurrent map writes".
What you expected to happen
No crash :-)
How to reproduce it (as minimally and precisely as possible):
I have no idea. I haven't managed to find anything that triggers it. The same problem was observed in 0.4.0; I'm not sure about 0.3.0.
Thanos runs in a GCP GKE cluster, and query is deployed via our own Helm chart. The crashing containers run:
thanos query
--log.level=debug
--query.replica-label=prometheus_replica
--grpc-server-tls-cert=/etc/certs/tls.crt
--grpc-server-tls-key=/etc/certs/tls.key
--store=dnssrv+_grpc._tcp.thanos-sidecars-prometheus.monitoring.svc
--selector-label=location="REDACTED"
--selector-label=stack="REDACTED"
--selector-label=REDACTED
The same deployment (differing only in selector-label values) crashes less often in a second GKE cluster and almost never in a third, even though all of them receive similar (very low) traffic via gRPC.
These query instances serve as gRPC endpoints for a global thanos-query (which runs in another, "observability" cluster and does not crash) and return recent data; older data is served from the bucket. They sit behind a GCP load balancer, which uses HTTP/2 for communication between the LB and Thanos in GKE.
Full logs to relevant components
An example post-crash dump is here: https://gist.github.com/bjakubski/18a98f6f1fc2922e5056df3106fe1477