-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-24013][SQL] Remove unneeded compress in ApproximatePercentile #21133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #89742 has finished for PR 21133 at commit
|
juliuszsompolski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
|
||
| test("SPARK-24013: unneeded compress can cause performance issues with sorted input") { | ||
| failAfter(20 seconds) { | ||
| assert(sql("select approx_percentile(id, array(0.1)) from range(10000000)").count() == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you do .count(), column pruning removes the approx_percentile from the query, so the test does not execute approx_percentile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, thanks, I started using collect during my tests than I moved to count but it was a mistake, I am fixing it, thanks.
| // which may cause QuantileSummaries to occupy unbounded memory. We have to hack around here | ||
| // to make sure QuantileSummaries doesn't occupy infinite memory. | ||
| // TODO: Figure out why QuantileSummaries ignores construction parameter compressThresHold | ||
| if (summaries.sampled.length >= compressThresHoldBufferLength) compress() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested if this change doesn't cause compress() to not be called at all, and memory consumption to go ubounded, but it appears to be working good - the mem usage through jmap -histo:live when running sql("select approx_percentile(id, array(0.1)) from range(10000000000L)").collect() remains stable.
The compress() is being called from QuantileSummaries.insert(), so it seems that the above TODO got resolved at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the TODO was resolved in SPARK-17439. I thought I clearly stated it in the description, but if this is not the case or you have any suggestion about how to improve the description, I am happy to improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's my fault of not reading the description attentively :-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no problem at all, thanks for checking this :) I addressed you comment on the test. Any more comments?
| test("SPARK-24013: unneeded compress can cause performance issues with sorted input") { | ||
| failAfter(30 seconds) { | ||
| checkAnswer(sql("select approx_percentile(id, array(0.1)) from range(10000000)"), | ||
| Row(Array(999160))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
With the approx nature of the algorithm, could the exact answer not get flakty through some small changes in code or config? (like e.g. the split of range into tasks, and then different merging of partial aggrs producing slightly different results)
maybe just asserting on collect().length == 1 would do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not the only place where it is checked with an exact answer, so I don't think it is an issue, a small change would anyway require to change many test cases answers. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Yeah, looking at the other tests in this suite it's definitely fine :-).
|
Test build #89921 has finished for PR 21133 at commit
|
|
cc @cloud-fan |
| def add(value: Double): Unit = { | ||
| summaries = summaries.insert(value) | ||
| // The result of QuantileSummaries.insert is un-compressed | ||
| isCompressed = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove the following call of compress(), will this flag be still valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, since we still compress in many places: in merge, getPercentiles and in quantileSummaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that possible insert can return whether it is compressed or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try and add a flag in the underlying class, in order to make it return whether it is compressed or not. I think this is the cleanest way.
| } | ||
|
|
||
| test("SPARK-24013: unneeded compress can cause performance issues with sorted input") { | ||
| failAfter(30 seconds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test looks pretty weird. Can we add some kind of unit test and move this test to PR description and say the perf has improved a lot after this patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is not the best UT, but I couldn't find any better way to test this. If anybody has any idea of a better test, I am happy to follow your right suggestion...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a UT for ApproximatePercentile, and check that after calling add, isCompressed is still false.
|
Test build #89966 has finished for PR 21133 at commit
|
|
Maybe we could add the former test as a benchmark to |
|
@juliuszsompolski I am not sure. This is actually not a performance improvement (strictly speaking that would mean changing an algorithm/code block in order to perform better). Here we are just removing a useless statement which has been wrongly there for legacy reasons. Moreover it is also quite hard to get the benchmark data, since I have not been able to see the query finish without the fix... |
|
Above is my major comment #21133 (comment) cc @juliuszsompolski @cloud-fan Please see whether it makes sense. |
|
LGTM |
|
retest this please |
|
Test build #90060 has finished for PR 21133 at commit
|
|
Since the SparkR failure is not related to this PR, I merge it to master. Thanks! |
## What changes were proposed in this pull request? `ApproximatePercentile` contains a workaround logic to compress the samples since at the beginning `QuantileSummaries` was ignoring the compression threshold. This problem was fixed in SPARK-17439, but the workaround logic was not removed. So we are compressing the samples many more times than needed: this could lead to critical performance degradation. This can create serious performance issues in queries like: ``` select approx_percentile(id, array(0.1)) from range(10000000) ``` ## How was this patch tested? added UT Author: Marco Gaido <[email protected]> Closes apache#21133 from mgaido91/SPARK-24013.
What changes were proposed in this pull request?
ApproximatePercentilecontains a workaround logic to compress the samples since at the beginningQuantileSummarieswas ignoring the compression threshold. This problem was fixed in SPARK-17439, but the workaround logic was not removed. So we are compressing the samples many more times than needed: this could lead to critical performance degradation.This can create serious performance issues in queries like:
How was this patch tested?
added UT