Add proposal for store instance high availability #404
bwplotka merged 2 commits into thanos-io:master
Conversation
The format of this proposal mirrors an existing draft proposal: thanos-io#387
> ## Open issues
>
> ### Querying from multiple buckets or object stores
I think I personally prefer this approach, as it would allow us to run queries that span multiple buckets.
Example: we collect info from two tenants, each of which gets uploaded to its own bucket, tenantA and tenantB. If we wanted to query the metrics for both, adding an identifier to the InfoResponse would let us put the bucket name on there, and it would be valid to wait for both responses.
A further use case could be having some buckets that hold historical/archive data and some that hold data from the last x days; store nodes for the recent bucket could be spun up more quickly as they would require less overhead at start.
Agree here with @domgreen. Maybe a slight change from `gateway: bool` into something like `gateway: string` with the bucket name would allow us basic functional sharding of stores in addition to HA. What do you think @mattbostock ?
It makes sense to have a string instead of a bool.
I agree, I think using a string identifier is the way forward.
The Protobuf definition could be:

    bool deduplicated = 4;
    string store_group_id = 5;
The `deduplicated` boolean would determine how the query instance handles retrieving the data (i.e. whether to wait for all results). I prefer `deduplicated` over `gateway` as it's more specific to the functionality we care about in this case. Sidecar instances would have `deduplicated` set to false; bucket stores would have `deduplicated` set to true.
It'd be up to the specific implementation to determine what the `store_group_id` string is. For bucket stores, I think concatenating the object store URL and bucket name would make sense. For all other stores, I'd leave it empty for now until we have a good reason to use it.
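The convention above could be sketched as follows; this is a minimal Go example, and the function name and URL layout are my own illustration, not the actual Thanos implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// storeGroupID builds the proposed group identifier: for bucket stores,
// the object store URL concatenated with the bucket name; for all other
// store types the ID stays empty until there is a reason to set it.
func storeGroupID(objStoreURL, bucket string) string {
	if bucket == "" {
		return "" // non-bucket store: empty ID for now
	}
	return strings.TrimSuffix(objStoreURL, "/") + "/" + bucket
}

func main() {
	fmt.Println(storeGroupID("https://s3.eu-west-1.amazonaws.com", "tenant-a"))
	// prints: https://s3.eu-west-1.amazonaws.com/tenant-a
}
```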
The logic would work like this:
    if deduplicated:
        query_and_use_fastest_response_per_store_group
    else:
        query_and_wait_for_all_responses
Stores would be grouped by store_group_id. Stores having an empty store_group_id field would be considered all part of the same group.
Since boolean fields are false by default in Protobufs, deduplicated would be false and existing instances would be treated as before for backwards compatibility (i.e. the query instance will wait for all responses, subject to timeouts).
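The fan-out decision described above can be sketched in Go; the `StoreInfo` struct and `planQueries` helper are hypothetical names for illustration, not Thanos APIs:

```go
package main

import "fmt"

// StoreInfo mirrors the two proposed Info fields.
type StoreInfo struct {
	Addr         string
	Deduplicated bool   // proposed field 4; false by default, as in proto3
	StoreGroupID string // proposed field 5; "" groups all unlabelled stores together
}

// planQueries splits stores into deduplicated groups (where only the
// fastest response per group is needed) and a list that must all respond.
func planQueries(stores []StoreInfo) (fastestPerGroup map[string][]string, waitForAll []string) {
	fastestPerGroup = map[string][]string{}
	for _, s := range stores {
		if s.Deduplicated {
			fastestPerGroup[s.StoreGroupID] = append(fastestPerGroup[s.StoreGroupID], s.Addr)
		} else {
			waitForAll = append(waitForAll, s.Addr)
		}
	}
	return fastestPerGroup, waitForAll
}

func main() {
	groups, all := planQueries([]StoreInfo{
		{Addr: "sidecar-1:10901"}, // Deduplicated defaults to false: wait for it
		{Addr: "store-1:10901", Deduplicated: true, StoreGroupID: "s3/tenant-a"},
		{Addr: "store-2:10901", Deduplicated: true, StoreGroupID: "s3/tenant-a"},
	})
	fmt.Println(len(groups["s3/tenant-a"]), len(all))
	// prints: 2 1
}
```

Note how the proto3 default keeps existing sidecars on the wait-for-all path without any change on their side.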
If you agree, I'll update the proposal to match.
Do we really need a redundant `deduplicated` option? Isn't it enough to have `store_group_id` specified as object store + bucket name as the identifier? Then we don't have to wait for all responses if stores have the same `store_group_id`.
We could account only a non-empty string in `store_group_id` for deduplication, hence older versions would work as before.
I think relying on an empty string is too implicit as it alters the behaviour of how responses are treated.
I don't think this is a disadvantage if you describe the behaviour in the documentation and the help text. Otherwise you have to bear in mind both options and ensure each works as expected.
Sorry for the massive delay in reviews and thanks for this (:
`store_group_id` makes sense to me. `deduplicated` is tricky: I don't like the name because both store gateways (with `query_and_use_fastest_response_per_store_group`) and sidecars (with `query_and_wait_for_all_responses` logic) perform a kind of deduplication, so we can definitely work on naming.
However, what about another idea:
In `Info` let's add:
- `store_group_id`
- `peer_type` (like it is in gossip `cluster.PeerState.Type`)
I am proposing this because I am trying to figure out how to make deduplication flow consistent across ALL components. Including not yet solved: Ruler deduplication!
Basically, with this we can deduce the correct behaviour everywhere:
- Store gateway: `store_group_id` would be as @mattbostock suggested, and `peer_type=store`, so the query layer can perform the `query_and_use_fastest_response_per_store_group` logic.
- Sidecars: `store_group_id` would be all external labels, as they are the ID of those. `peer_type=source`, so the query layer can assume the `query_and_wait_for_all_responses` flow with some `replica` label deduplication.
- Ruler: the tricky part is that we might want these in HA, but it's hard to define external labels for them, so having some arbitrary `store_group_id` would be perfect. Plus the ruler is `peer_type=source`, so the same replica deduplication might apply here as well.
I am not saying we should solve the ruler within this proposal, just that we might want to reuse the same mechanism and move ALL components to `store_group_id` for the deduplication logic, for consistency.
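The per-peer-type behaviour listed above could be deduced as in this Go sketch; the strategy names and `strategyFor` function are illustrative, not the real Thanos API:

```go
package main

import "fmt"

// Hypothetical names for the two query flows under discussion.
const (
	fastestPerStoreGroup = "query_and_use_fastest_response_per_store_group"
	waitForAllResponses  = "query_and_wait_for_all_responses"
)

// strategyFor deduces the deduplication flow from the peer type
// advertised in Info: stores (gateways) race within a store group,
// while sources (sidecars, rulers) are all waited for and then
// deduplicated via the replica label.
func strategyFor(peerType string) string {
	if peerType == "store" {
		return fastestPerStoreGroup
	}
	return waitForAllResponses
}

func main() {
	fmt.Println(strategyFor("store"))  // query_and_use_fastest_response_per_store_group
	fmt.Println(strategyFor("source")) // query_and_wait_for_all_responses
}
```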
The tricky part here: with 4 rulers (in 2 different groups) this would not be trivial to implement, because the `query_and_wait_for_all_responses` + dedup is done truly by removing (or not) the `replica` label. But that is not something for this proposal anyway.
What do you think?
Thanks for the comments all.
@Bplotka: I think your proposal makes sense, I'll update this PR to use that.
> failed but similar data was recorded by another Prometheus server in the same
> failure domain. To support deduplication, Thanos must wait for all Thanos
> sidecar servers to return their data (subject to timeouts) before returning a
> response to a client.
It is worth mentioning that we do NOT ignore these timeouts: even in the HA sidecar scenario we treat one replica being inaccessible as an "error case", which might be wrong or confusing, and is definitely different from the store gateway case (:
Good shout, I'll make that explicit.
On second thoughts, I'm not sure if we need to expand on this as I want to avoid confusing the reader by giving too much detail. Maybe I should remove the 'subject to timeouts' part?
> that do not explicitly set a value for `gateway` will not be affected.

> If a store instance is a gateway, query instances will treat each store
> instance in a label group as having access to the same data. Query instances
How do we tell whether stores are from the same label group if they do not set any labels in the current logic?
I think this is related to the alternative mentioned below. will comment there
Updated based on feedback, comments welcome.
@mattbostock I don't know why we did not ask about this, but... for HA, is it not enough to just put multiple replicas of store gateways behind some load balancer (for example, a 2-replica StatefulSet behind a k8s Service)? (: This will not work for gossip (as in gossip, pods talk to pods), but will work for static
Thanks for your work here, but I think we need to merge it as rejected. However, this proposal might be a good base for other features in the near future:
Merged, just to move it to the rejected state in the next PR and have it for future reference.
The format of this proposal mirrors an existing draft proposal:
#387
We should arrive at a consensus on the open issues described in the proposal (under the 'Open issues' section) before merging.
/cc @xjewer @fabxc @Bplotka