Skip to content

Query: wait for initial endpoint discovery before becoming ready#8334

Merged
GiedriusS merged 4 commits intothanos-io:mainfrom
pedro-stanaka:feat/query-check-endpoints-on-startup
Jul 31, 2025
Merged

Query: wait for initial endpoint discovery before becoming ready#8334
GiedriusS merged 4 commits intothanos-io:mainfrom
pedro-stanaka:feat/query-check-endpoints-on-startup

Conversation

@pedro-stanaka
Copy link
Contributor

@pedro-stanaka pedro-stanaka commented Jun 25, 2025

Problem

We observed a race condition where Thanos Query components were marking themselves as ready before discovering any endpoints. This created a timing gap that could lead to query reporting incomplete results:

  • Query pods become ready immediately upon startup
  • Endpoint discovery happens asynchronously in the background
  • Queries arriving between readiness and endpoint discovery will report result from part of the endpoints

Solution:
This commit modifies the query readiness behavior to wait for the initial endpoint discovery to complete before marking the pod as ready. This ensures that when a query pod reports ready, it has already attempted to discover and connect to available endpoints.

Impact:

  • Positive: Eliminates the race condition where queries could be routed to Query pods that haven't discovered any endpoints yet
  • Negative: Slightly increases startup time as pods won't be ready until endpoint discovery completes (typically up to a few seconds in normal conditions)
  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Verification

New tests.

Relates to #8332

@pedro-stanaka pedro-stanaka changed the title Make Thanos Query wait for initial endpoint discovery before becoming ready Make Query wait for initial endpoint discovery before becoming ready Jun 25, 2025
@pedro-stanaka pedro-stanaka changed the title Make Query wait for initial endpoint discovery before becoming ready Query: wait for initial endpoint discovery before becoming ready Jun 25, 2025
@pedro-stanaka pedro-stanaka force-pushed the feat/query-check-endpoints-on-startup branch 4 times, most recently from 1c44041 to b75bfdc Compare June 25, 2025 18:45
@pedro-stanaka pedro-stanaka marked this pull request as ready for review June 25, 2025 18:52
@pedro-stanaka
Copy link
Contributor Author

Failed test seems a flake.
image

GiedriusS
GiedriusS previously approved these changes Jul 15, 2025
@pedro-stanaka pedro-stanaka force-pushed the feat/query-check-endpoints-on-startup branch from 70c0a5f to aa151d5 Compare July 15, 2025 12:19
@pedro-stanaka
Copy link
Contributor Author

pedro-stanaka commented Jul 15, 2025

We were testing this in our staging (and then production) env, and we found out we need to also trigger a DNS update before we start the store resolution, otherwise it is pointless. We should invest some effort into refactoring these things to share a single update loop, since they are so closely related.

Added a new commit to fix that, this represents what we have now in production, and so far it has been working well for us.

@pedro-stanaka pedro-stanaka force-pushed the feat/query-check-endpoints-on-startup branch 2 times, most recently from e91dc02 to 5d680e2 Compare July 15, 2025 12:28
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do the same with the resolver that is created in dns.RegisterGRPCResolver(...)? Basically, call r.resolve() and do all the things (move all that

func() {
			if err := r.resolve(); err != nil {
			...
}

Into a separate function and call that inside of Build()?

… ready

Problem:
We observed a race condition where Thanos Query components were marking themselves as ready before discovering any endpoints. This created a timing gap that could lead to query failures:

- Query pods become ready immediately upon startup
- Endpoint discovery happens asynchronously in the background
- Queries arriving between readiness and endpoint discovery fail

Solution:
This commit modifies the Thanos Query readiness behavior to wait for the initial endpoint discovery to complete before marking the pod as ready. This ensures that when a Query pod reports ready, it has already attempted to discover and connect to available endpoints.

Changes:
1. Added synchronization to EndpointSet:
   - Added firstUpdateOnce flag and firstUpdateChan channel to track first update completion
   - Added WaitForFirstUpdate() method to block until initial discovery completes

2. Modified Query startup sequence:
   - gRPC server now waits for WaitForFirstUpdate() before calling statusProber.Ready()
   - Leverages existing runutil.Repeat behavior which runs the update function immediately

3. Timeout protection:
   - Uses store response timeout or 30 seconds as default timeout
   - Logs warning if timeout occurs but still proceeds to ready state

4. Added comprehensive tests for the new WaitForFirstUpdate functionality

Impact:
- Positive: Eliminates the race condition where queries could be routed to Query pods that haven't discovered any endpoints yet
- Negative: Slightly increases startup time as pods won't be ready until endpoint discovery completes (typically <1s in normal conditions)

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Add initial DNS resolution phase before starting periodic endpoint
updates to fix race condition where Query could become ready with
zero discovered endpoints.

Previously, the first endpoint update could run before DNS resolution
completed (both use runutil.Repeat which runs immediately), causing
Query to be ready but unable to serve requests for up to 5 seconds.

Now DNS resolution happens synchronously on startup, ensuring addresses
are available when the first endpoint update runs. This eliminates the
window where Query reports ready but has no endpoints discovered.

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Extract resolution logic into updateResolver() and call it synchronously
in Build() to ensure endpoint groups are resolved on startup, not just
during periodic updates. This prevents potential connection delays when
the query component starts.

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
@pedro-stanaka pedro-stanaka force-pushed the feat/query-check-endpoints-on-startup branch from 5d680e2 to f4ee5cb Compare July 23, 2025 14:21
GiedriusS
GiedriusS previously approved these changes Jul 31, 2025
…oints-on-startup

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
@GiedriusS GiedriusS enabled auto-merge July 31, 2025 07:46
@GiedriusS GiedriusS merged commit 0acf54d into thanos-io:main Jul 31, 2025
21 of 22 checks passed
GiedriusS added a commit to vinted/thanos that referenced this pull request Aug 1, 2025
…-endpoints-on-startup

Query: wait for initial endpoint discovery before becoming ready
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants