
Conversation

@mgaido91
Contributor

@mgaido91 commented Apr 23, 2018

What changes were proposed in this pull request?

`ApproximatePercentile` contains workaround logic to compress the samples, because `QuantileSummaries` originally ignored its compression threshold. That problem was fixed in SPARK-17439, but the workaround logic was never removed, so we compress the samples many more times than needed, which can lead to critical performance degradation.

This can create serious performance issues in queries like:

```
select approx_percentile(id, array(0.1)) from range(10000000)
```
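The cost can be illustrated with a small, purely hypothetical Python sketch (not Spark's actual `QuantileSummaries`; the class and field names are invented for illustration): one summary compresses only when its internal buffer passes the threshold, the other also forces a compress after every insert, as the leftover workaround effectively did.

```python
# Hypothetical sketch contrasting threshold-based compression with the
# redundant compress-on-every-insert behavior the workaround caused.
class Summary:
    def __init__(self, compress_threshold=1000):
        self.compress_threshold = compress_threshold
        self.sampled = []        # retained (value, weight) samples
        self.head = []           # uncompressed insert buffer
        self.compress_calls = 0

    def insert(self, v, extra_compress=False):
        self.head.append(v)
        if len(self.head) >= self.compress_threshold:
            self.compress()      # internal, threshold-based compression
        if extra_compress:
            self.compress()      # the redundant external call

    def compress(self):
        self.compress_calls += 1
        merged = sorted(self.sampled + [(v, 1) for v in self.head])
        if len(merged) > self.compress_threshold:
            merged = merged[::2]  # crude stand-in for real compression
        self.sampled = merged
        self.head = []

fast, slow = Summary(), Summary()
for i in range(10000):
    fast.insert(i)
    slow.insert(i, extra_compress=True)

print(fast.compress_calls)   # 10: roughly n / threshold
print(slow.compress_calls)   # 10000: one compress per insert
```

With the redundant call, compression work is done on every single insert instead of once per threshold-sized batch, which is where the degradation comes from.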

How was this patch tested?

added UT

@SparkQA

SparkQA commented Apr 24, 2018

Test build #89742 has finished for PR 21133 at commit 0ac3b4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@juliuszsompolski left a comment

Thanks!


```
test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
  failAfter(20 seconds) {
    assert(sql("select approx_percentile(id, array(0.1)) from range(10000000)").count() == 1)
```
Contributor

When you do .count(), column pruning removes the approx_percentile from the query, so the test does not execute approx_percentile.

Contributor Author

Nice catch, thanks. I was using collect during my tests, then I moved to count, but that was a mistake. I am fixing it, thanks.

```
// which may cause QuantileSummaries to occupy unbounded memory. We have to hack around here
// to make sure QuantileSummaries doesn't occupy infinite memory.
// TODO: Figure out why QuantileSummaries ignores construction parameter compressThresHold
if (summaries.sampled.length >= compressThresHoldBufferLength) compress()
```
Contributor

I tested whether this change could cause compress() to never be called at all, with memory consumption growing unbounded, but it appears to work fine: the memory usage shown by jmap -histo:live while running sql("select approx_percentile(id, array(0.1)) from range(10000000000L)").collect() remains stable.
compress() is called from QuantileSummaries.insert(), so it seems the TODO above got resolved at some point.

Contributor Author

Yes, the TODO was resolved in SPARK-17439. I thought I had stated that clearly in the description; if not, or if you have any suggestion for improving the description, I am happy to update it.

Contributor

Sorry, my fault for not reading the description attentively :-).

Contributor Author

No problem at all, thanks for checking this :) I addressed your comment on the test. Any more comments?

```
test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
  failAfter(30 seconds) {
    checkAnswer(sql("select approx_percentile(id, array(0.1)) from range(10000000)"),
      Row(Array(999160)))
```
Contributor

nit:
Given the approximate nature of the algorithm, couldn't the exact answer become flaky under small changes in code or config (e.g. the split of the range into tasks, and then a different merge order of partial aggregates producing slightly different results)?
Maybe just asserting on collect().length == 1 would do?
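For context, `approx_percentile` with Spark's default accuracy of 10000 guarantees a relative rank error of 1/accuracy = 1e-4, so for `range(10000000)` the result should sit within 1000 ranks of the exact 0.1 percentile. A quick Python check of that bound (a sketch of the guarantee, not Spark code) against the asserted value 999160:

```python
n = 10_000_000
accuracy = 10_000              # Spark's documented default for approx_percentile
eps = 1.0 / accuracy           # relative rank error guaranteed by the algorithm
target_rank = int(0.1 * n)     # exact rank of the 0.1 percentile
observed = 999_160             # value asserted in the test

# for an input of range(n), each value equals its rank, so the rank error
# is simply the difference between the observed value and the target rank
rank_error = abs(observed - target_rank)
print(rank_error)              # 840, within the eps * n == 1000 bound
assert rank_error <= eps * n
```

So the asserted value sits well inside the guaranteed error band, though nothing in the guarantee pins it to exactly 999160 across code changes.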

Contributor Author

This is not the only place where an exact answer is checked, so I don't think it is an issue; any small change would require updating the answers of many test cases anyway. What do you think?

Contributor

Ok. Yeah, looking at the other tests in this suite it's definitely fine :-).

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89921 has finished for PR 21133 at commit 2fa8da7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

cc @cloud-fan

```
def add(value: Double): Unit = {
  summaries = summaries.insert(value)
  // The result of QuantileSummaries.insert is un-compressed
  isCompressed = false
```
Member

If we remove the following call of compress(), will this flag be still valid?

Contributor Author

I think so, since we still compress in many places: in merge, getPercentiles and in quantileSummaries.

Member

Is it possible for insert to return whether the result is compressed or not?

Contributor Author

I'll try adding a flag to the underlying class so it can report whether it is compressed. I think this is the cleanest way.
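A minimal Python sketch of that flag idea (hypothetical class and names, not the actual Scala code): `insert` clears a `compressed` flag and `compress` sets it, so callers can read the flag instead of guessing whether an insert triggered internal compression.

```python
class QuantileSummariesSketch:
    """Toy stand-in for the real QuantileSummaries; names are invented."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.buffer = []
        self.compressed = True     # an empty summary counts as compressed

    def insert(self, v):
        self.buffer.append(v)
        self.compressed = False    # insert always leaves new raw samples...
        if len(self.buffer) >= self.threshold:
            self.compress()        # ...unless the threshold triggers compression

    def compress(self):
        self.buffer = self.buffer[::2]  # crude stand-in for real compression
        self.compressed = True

s = QuantileSummariesSketch()
s.insert(1)
s.insert(2)
print(s.compressed)   # False: below threshold, insert did not compress
s.insert(3)
print(s.compressed)   # True: threshold hit, insert compressed internally
```

This keeps the compression decision entirely inside the summary class, which is what makes the external workaround unnecessary.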

```
}

test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
  failAfter(30 seconds) {
```
Contributor

This test looks pretty weird. Can we add some kind of unit test instead, and move this query to the PR description with a note that performance has improved a lot after this patch?

Contributor Author

I agree this is not the best UT, but I couldn't find a better way to test this. If anybody has an idea for a better test, I am happy to follow the suggestion.

Contributor

We can add a UT for ApproximatePercentile and check that isCompressed is still false after calling add.

@SparkQA

SparkQA commented Apr 29, 2018

Test build #89966 has finished for PR 21133 at commit d47d9bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski
Contributor

Maybe we could add the former test as a benchmark to AggregateBenchmark?

@mgaido91
Contributor Author

@juliuszsompolski I am not sure. Strictly speaking, this is not a performance improvement (that would mean changing an algorithm or code block to perform better); here we are just removing a useless statement that was left in place for legacy reasons. Moreover, it is quite hard to collect benchmark data, since I have not been able to see the query finish without the fix...

@gatorsmile
Member

Above is my major comment #21133 (comment)

cc @juliuszsompolski @cloud-fan Please see whether it makes sense.

@cloud-fan
Contributor

LGTM

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented May 2, 2018

Test build #90060 has finished for PR 21133 at commit aab21a7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Since the SparkR failure is not related to this PR, I merge it to master. Thanks!

@asfgit asfgit closed this in 8dbf56c May 2, 2018
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Dec 10, 2018

Author: Marco Gaido <[email protected]>

Closes apache#21133 from mgaido91/SPARK-24013.