thanos-query crashes with "concurrent map iteration and map write" #1272

@bjakubski

Thanos, Prometheus and Golang version used

thanos, version 0.5.0 (branch: HEAD, revision: 72820b3)
build user: circleci@eeac5eb36061
build date: 20190606-10:53:12
go version: go1.12.5

What happened

In one of the k8s clusters where we run thanos-query, it crashes every couple of minutes with either "fatal error: concurrent map iteration and map write" or "fatal error: concurrent map writes".

What you expected to happen

No crash :-)

How to reproduce it (as minimally and precisely as possible):

I have no idea. I have not managed to find anything that triggers it. The same problem was observed in 0.4.0; I'm not sure about 0.3.0.
Thanos runs in a GCP GKE cluster, and query is deployed via our own Helm chart. The crashing containers run:

  thanos query
      --log.level=debug
      --query.replica-label=prometheus_replica
      --grpc-server-tls-cert=/etc/certs/tls.crt
      --grpc-server-tls-key=/etc/certs/tls.key
      --store=dnssrv+_grpc._tcp.thanos-sidecars-prometheus.monitoring.svc
      --selector-label=location="REDACTED"
      --selector-label=stack="REDACTED"
      --selector-label=REDACTED

The same deployment (differing only in selector-label values) crashes less often in a second GKE cluster and almost never in a third, even though all of them receive similar (very low) traffic via gRPC.

Those query instances serve as gRPC endpoints for a global thanos-query (which runs in another, "observability" cluster and does not crash) and return recent data (older data is served from the bucket). They sit behind a GCP load balancer, which uses HTTP/2 to communicate between the LB and Thanos in GKE.

Full logs to relevant components

An example post-crash dump is here: https://gist.github.com/bjakubski/18a98f6f1fc2922e5056df3106fe1477
