query: Add store.response-timeout#928
Conversation
87bc810 to
b36db84
Compare
mjd95
left a comment
There was a problem hiding this comment.
👍 looks like a sensible mitigation. Would be great to look in to the underlying store issue as follow-up.
4c822b4 to
2207f9d
Compare
a69a9c5 to
c230d23
Compare
bwplotka
left a comment
There was a problem hiding this comment.
LGTM just:
- flag name suggestion
- CI test suggestion
8221f9c to
cea8d0e
Compare
|
@bwplotka I've introduced |
fb7c0bd to
ca89c08
Compare
bwplotka
left a comment
There was a problem hiding this comment.
All good and I would be happy to merge if not the confusion around gRPC metric we add here. See my comment.
|
Also I am happy to add this periodic test in CircleCi for this env, can you add Github issue on me for that? |
|
Still before merging we need to remove the metrics from here I guess? |
d2bba2e to
0519103
Compare
Yup, also made a PR for metrics grpc-ecosystem/go-grpc-prometheus#71 |
|
I've squashed it |
Co-Authored-By: povilasv <p.versockas@gmail.com>
Co-Authored-By: povilasv <p.versockas@gmail.com>
0519103 to
e4b3530
Compare
Co-Authored-By: povilasv <p.versockas@gmail.com>
bwplotka
left a comment
There was a problem hiding this comment.
Nice, just one minor nit, not a blocker though. LGTM, thanks!
Changes
Improves handling of timed out data/
Adds
--store.response-timeoutin order to have a timeout on a single stream operation.Some times GRPC streams get stuck and don't send any data for long periods of time
E.G. Simple GRPC Store API, which just sleeps.
Right now the only option to time them out in GRPC is to set
context.WithTimeout()The issue is that it is a global timeout,
if we set it to 10s, we won't ever get results from Thanos Store (backed by object store)
if we set it to 120s, we will always be waiting 120s for the HTTP queries to finish, which is :/
As we will be waiting for the slowest one to finish
This fix adds a timeout per single data receive operation
Ref: #928
Thanos Query will log if timeout was reached:
Verification
Tested locally on our cluster, both turned on and turned off (default).
Unit tests.