Query: wait for initial endpoint discovery before becoming ready#8334
Merged
GiedriusS merged 4 commits intothanos-io:mainfrom Jul 31, 2025
Merged
Conversation
1c44041 to
b75bfdc
Compare
2 tasks
Contributor
Author
GiedriusS
previously approved these changes
Jul 15, 2025
70c0a5f to
aa151d5
Compare
Contributor
Author
|
We were testing this in our staging (and then production) env, and we found out we need to also trigger a DNS update before we start the store resolution, otherwise it is pointless. We should invest some effort into refactoring these things to share a single update loop, since they are so closely related. Added a new commit to fix that, this represents what we have now in production, and so far it has been working well for us. |
e91dc02 to
5d680e2
Compare
GiedriusS
reviewed
Jul 17, 2025
Member
There was a problem hiding this comment.
Should we do the same with the resolver that is created in dns.RegisterGRPCResolver(...)? Basically, call r.resolve() and do all the things (move all that
func() {
if err := r.resolve(); err != nil {
...
}Into a separate function and call that inside of Build()?
… ready Problem: We observed a race condition where Thanos Query components were marking themselves as ready before discovering any endpoints. This created a timing gap that could lead to query failures: - Query pods become ready immediately upon startup - Endpoint discovery happens asynchronously in the background - Queries arriving between readiness and endpoint discovery fail Solution: This commit modifies the Thanos Query readiness behavior to wait for the initial endpoint discovery to complete before marking the pod as ready. This ensures that when a Query pod reports ready, it has already attempted to discover and connect to available endpoints. Changes: 1. Added synchronization to EndpointSet: - Added firstUpdateOnce flag and firstUpdateChan channel to track first update completion - Added WaitForFirstUpdate() method to block until initial discovery completes 2. Modified Query startup sequence: - gRPC server now waits for WaitForFirstUpdate() before calling statusProber.Ready() - Leverages existing runutil.Repeat behavior which runs the update function immediately 3. Timeout protection: - Uses store response timeout or 30 seconds as default timeout - Logs warning if timeout occurs but still proceeds to ready state 4. Added comprehensive tests for the new WaitForFirstUpdate functionality Impact: - Positive: Eliminates the race condition where queries could be routed to Query pods that haven't discovered any endpoints yet - Negative: Slightly increases startup time as pods won't be ready until endpoint discovery completes (typically <1s in normal conditions) Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Add initial DNS resolution phase before starting periodic endpoint updates to fix race condition where Query could become ready with zero discovered endpoints. Previously, the first endpoint update could run before DNS resolution completed (both use runutil.Repeat which runs immediately), causing Query to be ready but unable to serve requests for up to 5 seconds. Now DNS resolution happens synchronously on startup, ensuring addresses are available when the first endpoint update runs. This eliminates the window where Query reports ready but has no endpoints discovered. Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Extract resolution logic into updateResolver() and call it synchronously in Build() to ensure endpoint groups are resolved on startup, not just during periodic updates. This prevents potential connection delays when the query component starts. Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
5d680e2 to
f4ee5cb
Compare
GiedriusS
previously approved these changes
Jul 31, 2025
…oints-on-startup Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
GiedriusS
approved these changes
Jul 31, 2025
GiedriusS
added a commit
to vinted/thanos
that referenced
this pull request
Aug 1, 2025
…-endpoints-on-startup Query: wait for initial endpoint discovery before becoming ready Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Problem
We observed a race condition where Thanos Query components were marking themselves as ready before discovering any endpoints. This created a timing gap that could lead to query reporting incomplete results:
Solution:
This commit modifies the query readiness behavior to wait for the initial endpoint discovery to complete before marking the pod as ready. This ensures that when a query pod reports ready, it has already attempted to discover and connect to available endpoints.
Impact:
Verification
New tests.
Relates to #8332