store/bucket: make getFor() work with interleaved resolutions#1146
store/bucket: make getFor() work with interleaved resolutions#1146bwplotka merged 25 commits intothanos-io:masterfrom
Conversation
Without this, we get max resolutions like 1, 2, 3 which do not mean anything to getFor(). With this, we get proper data from Thanos Store.
Makes a querier via queryableCreator and checks if the maxSourceResolution was passed properly.
domgreen
left a comment
There was a problem hiding this comment.
Looks legit to me :D just nit on naming
Makes it clearer that it's just a temporary variable for the loop.
|
Since this touches a critical part of the code I think it would be best to wait for @bwplotka to review this (and, of course, anyone else who wants :P) |
bwplotka
left a comment
There was a problem hiding this comment.
Thanks @GiedriusS I think this is our bug found & fixed!
Generally good, but let's revert the previous change #1154 and leave or better document the maxResolution tranformation?
I need to dive into this between thing, it's 1AM and want to ensure we don't miss anything further. For example not sure if divide and conquer recurrency here helps with readability or not.
pkg/query/querier.go
Outdated
| // Querier returns a new storage querier against the underlying proxy store API. | ||
| func (q *queryable) Querier(ctx context.Context, mint, maxt int64) (storage.Querier, error) { | ||
| return newQuerier(ctx, q.logger, mint, maxt, q.replicaLabel, q.proxy, q.deduplicate, int64(q.maxSourceResolution/time.Millisecond), q.partialResponse, q.warningReporter), nil | ||
| return newQuerier(ctx, q.logger, mint, maxt, q.replicaLabel, q.proxy, q.deduplicate, int64(q.maxSourceResolution), q.partialResponse, q.warningReporter), nil |
There was a problem hiding this comment.
Quite confusing - hard to tell for me if this makes sense (:
This has the same unit as mint, maxt int64 right? Maybe making the underlying variable maxSourceResolutionMiliseonconds or keeping it time.Duration as long as possible makes sense?
There was a problem hiding this comment.
Hm.. am I right that is PR was actually wrong? #1144 And we moved to miliseconds there but we are doing exactly the same here and we missed it? 🤔
If yes, let's remove https://github.com/improbable-eng/thanos/blob/2c4f64b1b96907295a7f8e99d8fd64697f0eb12a/pkg/query/api/v1.go#L227 I think it's wrong. All is in time.Duration so moving to miliseconds there does not make sense. It's here that we transition from duration to integer having arbitrary unit (miliseconds) so I think this should stay and last PR should be reverted.
There was a problem hiding this comment.
And yea, master is broken now in terms of downsampling if I am not wrong?
There was a problem hiding this comment.
Perhaps we should move the whole function to return int64 and in a form appropriate to the caller since as we can see it is hard to make a mental connection between such distant places in the code? Because that's what parsing should do IMHO - we shouldn't do more parsing down the road.
There was a problem hiding this comment.
And yea, master is broken now in terms of downsampling if I am not wrong?
I think that's a bit of an overstatement if it has never worked properly 😄
There was a problem hiding this comment.
Sure but now it does not work even more, so there is regression (:
Anyway, I just want to avoid confusion, sorry for being quite strict here ):
Now once we had no regression we can actually make that work and nice. Yes, I am fine with:
- sticking to
maxResolutionMilisec int64name everywhere. - parseResolution returning int64 straight away.
Thoughts? (:
There was a problem hiding this comment.
Indeed that's what I've done. Hopefully it is much more clearer now about what kind of units are being used and that no one in the future will run into this problem like me when I was trying to hunt down one problem but accidentally got caught up in this.
pkg/store/bucket.go
Outdated
| } | ||
|
|
||
| until := len(bs) - 1 | ||
| for j := 0; j < until; j++ { |
There was a problem hiding this comment.
It's tricky, need to dive more but generally makes sense. Explaining what we are seeing now. Thanks for tests catching this as well!
All of this assumes that blocks within different resolutions are aligned ideally right? Can we comment this somehow?
Also be aware of this: #1104 Hope you algorthim would be nicely extendable I guess.
cc @mjd95
There was a problem hiding this comment.
They always should be aligned ideally or on the same timestamps because in other cases we would get a problem known as "overlapping blocks".
This change should be easy to integrate into that PR.
There was a problem hiding this comment.
OK, It makes total sense BUT to me left := s.getFor(mint, bs[0].meta.MinTime, s.resolutions[i]) and right are just between 0 and first min and lastMax and maxt, right?
Can we remove this left and right and treat them properly in this loop as between? I think it would simplify flow and algorithm.
There was a problem hiding this comment.
The idea sounds good and I played around with it a bit but I doubt that this function can be made more succinct because we have three distinct cases of the append call: at the beginning of the array, between two elements, and at the end. Please correct me if I'm wrong.
There was a problem hiding this comment.
Hm, is it really too complex? I will propose something to your PR
There was a problem hiding this comment.
Proposed: GiedriusS#1
IMO it's way simpler, what do you think? @GiedriusS
// getFor returns a time-ordered list of blocks that cover date between mint and maxt.
// Blocks with the biggest resolution possible but not bigger than the given max resolution are returned.
func (s *bucketBlockSet) getFor(mint, maxt, maxResolutionMillis int64) (bs []*bucketBlock) {
if mint == maxt {
return nil
}
s.mtx.RLock()
defer s.mtx.RUnlock()
// Find first matching resolution.
i := 0
for ; i < len(s.resolutions) && s.resolutions[i] > maxResolutionMillis; i++ {
}
// Fill the given interval with the blocks for the current resolution.
// Our current resolution might not cover all data, so recursively fill the gaps with higher resolution blocks if there is any.
start := mint
for _, b := range s.blocks[i] {
if b.meta.MaxTime <= mint {
continue
}
if b.meta.MinTime >= maxt {
break
}
if i+1 < len(s.resolutions) {
bs = append(bs, s.getFor(start, b.meta.MinTime, s.resolutions[i+1])...)
}
bs = append(bs, b)
start = b.meta.MaxTime
}
if i+1 < len(s.resolutions) {
bs = append(bs, s.getFor(start, maxt, s.resolutions[i+1])...)
}
return bs
}* Convert parseDownsamplingParam() into parseDownsamplingParamMillis() which properly returns int64 for use in the querier code * Add parseDownsamplingMillis() tests
bwplotka
left a comment
There was a problem hiding this comment.
Thanks 👍 Some suggestions, but overall thanks for spotting this bug.
| } | ||
|
|
||
| return maxSourceResolution, nil | ||
| return int64(maxSourceResolution / time.Millisecond), nil |
| mint, maxt int64 | ||
| window int64 | ||
| } | ||
| input := []resBlock{ |
There was a problem hiding this comment.
What do you think about joinging those together in form of table test? essentiall add those as test cases to TestBucketBlockSet_addGet? might be more consistent and easier to read just "cases" as test itself feels similar TBH?
There was a problem hiding this comment.
These ones are a bit special because we are doing a property-based test instead of an ad-hoc one. I have removed the other test cases that I've added before and only left this one in - this already covers the "interleaved resolutions" case.
pkg/store/bucket.go
Outdated
| } | ||
|
|
||
| until := len(bs) - 1 | ||
| for j := 0; j < until; j++ { |
There was a problem hiding this comment.
OK, It makes total sense BUT to me left := s.getFor(mint, bs[0].meta.MinTime, s.resolutions[i]) and right are just between 0 and first min and lastMax and maxt, right?
Can we remove this left and right and treat them properly in this loop as between? I think it would simplify flow and algorithm.
This doesn't really matter as the tests show.
Only leave the property tests in place since they catch all of the errors.
Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
bwplotka
left a comment
There was a problem hiding this comment.
LGTM for tests, but proposed bit simpler algorithm for implementation, WDYT: GiedriusS#1 @GiedriusS ?
Simplified goFor implementation.
bwplotka
left a comment
There was a problem hiding this comment.
LGTM, thanks for finding and fixing this bug, was so hard to find out this (: The property testing is something awesome to talk about in some Golang meetup!
Add interleaved resolutions test for getFor(). Ordinary users could run into such a case when, for example, Thanos Compact would downsample metrics but also the metrics with the better resolution would still be available. Fix this by making getFor() look for blocks "in the middle" as well.
Fix the getDownsamplingParam() interface (and rename it to getDownsamplingParamMillis()) so that it would return a value which will be ready to use by the querier code.
Add unit tests for all of that.
Verification:
go testand Grafana dashboards actually started showing all of the data.