
Conversation

@aray (Contributor) commented Nov 18, 2015

Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result.

Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer.

Added multiple unit tests to DataFrameAggregateSuite and verified that it passes the Hive compatibility suite:

```
build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
```

This is an alternative to PR #9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.
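A minimal sketch of the kind of query affected (hypothetical data, not code from the PR; assumes a SparkSession with `spark.implicits._` imported):

```scala
import org.apache.spark.sql.functions.sum

// "b" is both a grouping column and an aggregate input. GROUPING SETS
// expansion nulls out grouping columns for the subtotal rows; before this
// fix, sum("b") could see those null placeholders and return null.
val df = Seq(("x", 1), ("x", 2), ("y", 3)).toDF("a", "b")
df.cube("a", "b").agg(sum("b")).show()
```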

@SparkQA commented Nov 18, 2015

Test build #46232 has finished for PR 9815 at commit 12914fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Nov 18, 2015

retest this please

@SparkQA commented Nov 19, 2015

Test build #46256 has finished for PR 9815 at commit 12914fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Nov 19, 2015

@yhuai can you take a look at this PR?

Review comment (Contributor):

Is this the only change we need to fix this problem?

Review comment (Contributor, Author):

No. This was a (minor) preexisting problem that was not caught by any of the test cases; it's now more important to fix since we duplicate all the grouping columns in the analyzer rule.

@yhuai (Contributor) commented Nov 19, 2015

@aray Thank you for the PR! Since we are in the QA period for the 1.6 release, it would be great if we could just fix the problem without any other changes. Is this the minimal fix for this issue?

@aray (Contributor, Author) commented Nov 19, 2015

@yhuai I do think this is the minimal fix. However, as I stated in the summary, we are simplifying the rule instead of adding more exceptions that might themselves have bugs. Let me know if I can clarify anything else.

Review comment (Contributor):

these *overlapping* cases will fail without the fix, right?

Review comment (Contributor, Author):

Correct.

@SparkQA commented Nov 19, 2015

Test build #46334 has finished for PR 9815 at commit 2162b6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Contributor):

So, we will rely on our optimizer to remove this Project if it is not necessary, right?

@yhuai (Contributor) commented Nov 19, 2015

Thank you for the fix! I am merging it to master and branch 1.6.

asfgit pushed a commit that referenced this pull request Nov 19, 2015
Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result.

Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer.

Added multiple unit tests to DataFrameAggregateSuite and verified it passes hive compatibility suite:
```
build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
```

This is an alternative to PR #9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.

Author: Andrew Ray <[email protected]>

Closes #9815 from aray/groupingset-agg-fix.

(cherry picked from commit 37cff1b)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 37cff1b Nov 19, 2015
@gatorsmile (Member):

Thank you @aray @yhuai ! The code changes look great!

Based on my test cases, rollup and cube still return incorrect results when the table contains null values. :(

If you do not have time, I can take a look at it.

@gatorsmile (Member):

This might not be related to the rollup logic; it looks like a DataFrame bug. I will try to fix it soon.

Thanks!

@yhuai (Contributor) commented Nov 22, 2015

@gatorsmile Can you create a jira (with repro in the description) and ping me from that jira?

@gatorsmile (Member):

Sorry, I think it is a test case issue.

```scala
testData = Seq((1, 2), (2, 2), (3, 4), (null.asInstanceOf[Int], 5)).toDF("a", "b")
```

Scala converts null.asInstanceOf[Int] into zero, so Spark treats it as zero. Never mind. I tried another way, null.asInstanceOf[java.lang.Integer], and it works fine.
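The boxing pitfall described above can be seen in plain Scala, independent of Spark (a standalone sketch):

```scala
// Casting null to a primitive value type yields that type's default value,
// so the intended null never reaches the DataFrame.
val asInt: Int = null.asInstanceOf[Int]                                 // 0
val asInteger: java.lang.Integer = null.asInstanceOf[java.lang.Integer] // null

assert(asInt == 0)
assert(asInteger == null)
```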

@yhuai (Contributor) commented Nov 22, 2015

Oh, I see. Yeah, we need to use Integer to get null; Int is not nullable.
