[SPARK-24013][SQL] Remove unneeded compress in ApproximatePercentile #21133

mgaido91 · 2018-04-23T20:52:01Z

What changes were proposed in this pull request?

ApproximatePercentile contains workaround logic that compresses the samples, because QuantileSummaries originally ignored the compression threshold. That problem was fixed in SPARK-17439, but the workaround was never removed, so we now compress the samples many more times than needed.

This can cause serious performance degradation in queries like:

select approx_percentile(id, array(0.1)) from range(10000000)
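
To illustrate how a leftover caller-side check can multiply the number of compressions, here is a toy model in plain Scala. It is not Spark's QuantileSummaries: it tracks only sizes, and the three constants (`bufferLimit`, `retainedFloor`, `callerThreshold`) are hypothetical. The key assumption is that compression cannot shrink the retained samples below a floor that sits above the caller's threshold, so once that floor is reached the extra check fires on every insert.

```scala
object RedundantCompressSketch {
  // All three sizes are hypothetical and chosen only to keep the numbers small.
  val bufferLimit = 100     // the summary compresses itself after this many buffered inserts
  val retainedFloor = 50    // assumed: compression never keeps fewer samples than this
  val callerThreshold = 40  // the caller's extra "compress if too many samples" check

  /** Counts compress calls over n inserts, with or without the caller-side check. */
  def compressCalls(n: Int, extraCheck: Boolean): Int = {
    var sampled = 0   // samples retained after the last compression
    var buffered = 0  // inserts not yet merged into the samples
    var calls = 0

    def compress(): Unit = {
      // Merge the buffer, then shrink down to (at most) the floor.
      sampled = math.min(sampled + buffered, retainedFloor)
      buffered = 0
      calls += 1
    }

    for (_ <- 1 to n) {
      buffered += 1
      // Threshold handled inside the summary itself.
      if (buffered >= bufferLimit) compress()
      // Leftover workaround in the caller: fires whenever the retained size is large.
      if (extraCheck && sampled >= callerThreshold) compress()
    }
    calls
  }

  def main(args: Array[String]): Unit = {
    println(s"internal threshold only: ${compressCalls(10000, extraCheck = false)} compress calls")
    println(s"with caller-side check:  ${compressCalls(10000, extraCheck = true)} compress calls")
  }
}
```

In this model the internal threshold alone compresses once per full buffer, while the extra check ends up compressing on almost every insert. Each real compression also walks the retained samples, which is where the cost would come from.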

How was this patch tested?

added UT

SparkQA · 2018-04-24T00:19:02Z

Test build #89742 has finished for PR 21133 at commit 0ac3b4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

juliuszsompolski

Thanks!

juliuszsompolski · 2018-04-26T10:35:37Z

sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala

+
+  test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
+    failAfter(20 seconds) {
+      assert(sql("select approx_percentile(id, array(0.1)) from range(10000000)").count() == 1)


When you do .count(), column pruning removes the approx_percentile from the query, so the test does not execute approx_percentile.

Nice catch, thanks. I started out using collect during my tests, then moved to count, but that was a mistake. I am fixing it.

juliuszsompolski · 2018-04-26T10:54:57Z

...c/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

-      // which may cause QuantileSummaries to occupy unbounded memory. We have to hack around here
-      // to make sure QuantileSummaries doesn't occupy infinite memory.
-      // TODO: Figure out why QuantileSummaries ignores construction parameter compressThresHold
-      if (summaries.sampled.length >= compressThresHoldBufferLength) compress()


I tested whether this change causes compress() to never be called at all, letting memory consumption grow unbounded, but it appears to work fine: the memory usage seen through jmap -histo:live while running sql("select approx_percentile(id, array(0.1)) from range(10000000000L)").collect() remains stable.
The compress() is being called from QuantileSummaries.insert(), so it seems that the above TODO got resolved at some point.

Yes, the TODO was resolved in SPARK-17439. I thought I clearly stated it in the description, but if this is not the case or you have any suggestion about how to improve the description, I am happy to improve it.

Sorry, it's my fault for not reading the description attentively :-).

No problem at all, thanks for checking this :) I addressed your comment on the test. Any more comments?

juliuszsompolski · 2018-04-27T11:05:03Z

sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala

+  test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
+    failAfter(30 seconds) {
+      checkAnswer(sql("select approx_percentile(id, array(0.1)) from range(10000000)"),
+        Row(Array(999160)))


nit:
With the approximate nature of the algorithm, couldn't the exact answer become flaky through small changes in code or config (e.g. the split of range into tasks, and then a different merging of partial aggregates producing slightly different results)?
Maybe just asserting on collect().length == 1 would do?

This is not the only place where an exact answer is checked, so I don't think it is an issue: such a change would require updating the answers in many test cases anyway. What do you think?

Ok. Yeah, looking at the other tests in this suite it's definitely fine :-).
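
For reference, a middle ground between an exact answer and collect().length == 1 would be a tolerance check. The sketch below is plain Scala with a hypothetical helper name (`withinTolerance`); it assumes the default approx_percentile accuracy of 10000 and reads 1/accuracy as a bound on the rank error. It relies on the fact that for ids 0 until n each value equals its own rank. The thread settled on keeping the exact answer, so this is only an illustration of the alternative.

```scala
object ApproxPercentileTolerance {
  /**
   * For ids 0 until n, the exact p-quantile sits near rank p * n.
   * An approximate result is accepted if it lies within n / accuracy of that rank.
   */
  def withinTolerance(result: Long, p: Double, n: Long, accuracy: Int): Boolean = {
    val targetRank = p * n
    val allowedRankError = n.toDouble / accuracy
    math.abs(result - targetRank) <= allowedRankError
  }

  def main(args: Array[String]): Unit = {
    // The value asserted in the test above, checked against the assumed default accuracy.
    println(withinTolerance(999160L, p = 0.1, n = 10000000L, accuracy = 10000))
  }
}
```

With these assumptions the asserted value 999160 is 840 ranks away from the target of 1000000, inside the allowed error of 1000.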

SparkQA · 2018-04-27T14:31:06Z

Test build #89921 has finished for PR 21133 at commit 2fa8da7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-27T14:39:45Z

cc @cloud-fan

gatorsmile · 2018-04-27T18:57:15Z

...c/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

    def add(value: Double): Unit = {
      summaries = summaries.insert(value)
      // The result of QuantileSummaries.insert is un-compressed
      isCompressed = false


If we remove the following call to compress(), will this flag still be valid?

I think so, since we still compress in many places: in merge, getPercentiles and in quantileSummaries.

Is it possible for insert to return whether the result is compressed or not?

I'll try adding a flag to the underlying class so that it reports whether it is compressed or not. I think this is the cleanest way.
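
A minimal sketch of that idea, in plain Scala: the summary itself carries a `compressed` flag, so the caller no longer has to guess the state after each insert. `ToySummaries`, its fields, the buffer limit of 4 and the "keep every other sample" compression are all hypothetical stand-ins and do not reflect the actual QuantileSummaries code.

```scala
object CompressedFlagSketch {
  /** Hypothetical stand-in for a summary class; not Spark's QuantileSummaries. */
  final case class ToySummaries(
      sampled: Vector[Double],
      buffer: Vector[Double],
      bufferLimit: Int,
      compressed: Boolean) {

    /** Adds a value; the returned instance records whether it ended up compressed. */
    def insert(x: Double): ToySummaries = {
      val next = copy(buffer = buffer :+ x, compressed = false)
      // Compression is triggered here, inside the summary, not by the caller.
      if (next.buffer.length >= bufferLimit) next.compress() else next
    }

    /** Toy compression: merge the buffer, sort, and keep every other sample. */
    def compress(): ToySummaries = {
      val merged = (sampled ++ buffer).sorted
      val kept = merged.zipWithIndex.collect { case (v, i) if i % 2 == 0 => v }
      copy(sampled = kept, buffer = Vector.empty, compressed = true)
    }
  }

  val empty: ToySummaries =
    ToySummaries(Vector.empty, Vector.empty, bufferLimit = 4, compressed = true)

  def main(args: Array[String]): Unit = {
    val afterThree = Seq(1.0, 2.0, 3.0).foldLeft(empty)(_.insert(_))
    val afterFour = afterThree.insert(4.0)
    println(s"after 3 inserts compressed = ${afterThree.compressed}")
    println(s"after 4 inserts compressed = ${afterFour.compressed}")
  }
}
```

The caller can then read `compressed` from the value returned by `insert` instead of resetting its own flag to false unconditionally.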

cloud-fan · 2018-04-28T03:28:28Z

sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala

  }
+
+  test("SPARK-24013: unneeded compress can cause performance issues with sorted input") {
+    failAfter(30 seconds) {


This test looks pretty weird. Can we add some kind of unit test instead, move this test to the PR description, and say the perf has improved a lot after this patch?

I agree that this is not the best UT, but I couldn't find a better way to test this. If anybody has an idea for a better test, I am happy to follow the suggestion.

We can add a UT for ApproximatePercentile, and check that after calling add, isCompressed is still false.

SparkQA · 2018-04-29T11:19:15Z

Test build #89966 has finished for PR 21133 at commit d47d9bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

juliuszsompolski · 2018-04-30T09:49:28Z

Maybe we could add the former test as a benchmark to AggregateBenchmark?

mgaido91 · 2018-04-30T09:58:45Z

@juliuszsompolski I am not sure. This is actually not a performance improvement (strictly speaking that would mean changing an algorithm/code block in order to perform better). Here we are just removing a useless statement which has been wrongly there for legacy reasons. Moreover it is also quite hard to get the benchmark data, since I have not been able to see the query finish without the fix...

gatorsmile · 2018-04-30T16:52:38Z

Above is my major comment #21133 (comment)

cc @juliuszsompolski @cloud-fan Please see whether it makes sense.

cloud-fan · 2018-05-02T14:40:02Z

LGTM

cloud-fan · 2018-05-02T14:40:08Z

retest this please

SparkQA · 2018-05-02T18:19:04Z

Test build #90060 has finished for PR 21133 at commit aab21a7.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-05-02T18:58:07Z

Since the SparkR failure is not related to this PR, I am merging it to master. Thanks!

## What changes were proposed in this pull request?

`ApproximatePercentile` contains a workaround logic to compress the samples since at the beginning `QuantileSummaries` was ignoring the compression threshold. This problem was fixed in SPARK-17439, but the workaround logic was not removed. So we are compressing the samples many more times than needed: this could lead to critical performance degradation.

This can create serious performance issues in queries like:

```
select approx_percentile(id, array(0.1)) from range(10000000)
```

## How was this patch tested?

added UT

Author: Marco Gaido <[email protected]>

Closes apache#21133 from mgaido91/SPARK-24013.

[SPARK-24013][SQL] Remove unneeded compress in ApproximatePercentile

0ac3b4f

juliuszsompolski reviewed Apr 26, 2018

address comment

2fa8da7

juliuszsompolski reviewed Apr 27, 2018

gatorsmile reviewed Apr 27, 2018

cloud-fan reviewed Apr 28, 2018

improve ut

d47d9bd

move compress to QuantileSummaries

aab21a7

asfgit closed this in 8dbf56c May 2, 2018
