
Conversation

@aray (Contributor) commented Nov 18, 2015

Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result.

Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer.

Added multiple unit tests to DataFrameAggregateSuite and verified that it passes the Hive compatibility suite:

```
build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
```

This is an alternative to PR #9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.
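A minimal sketch of the kind of query affected (hypothetical data, not code from the PR; assumes a SparkSession with `spark.implicits._` imported):

```scala
import org.apache.spark.sql.functions.sum

// "b" is both a grouping column and an aggregate input. GROUPING SETS
// expansion nulls out grouping columns for the subtotal rows; before this
// fix, sum("b") could see those null placeholders and return null.
val df = Seq(("x", 1), ("x", 2), ("y", 3)).toDF("a", "b")
df.cube("a", "b").agg(sum("b")).show()
```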

@SparkQA commented Nov 18, 2015

Test build #46232 has finished for PR 9815 at commit 12914fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Nov 18, 2015

retest this please

@SparkQA commented Nov 19, 2015

Test build #46256 has finished for PR 9815 at commit 12914fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray (Contributor, Author) commented Nov 19, 2015

@yhuai can you take a look at this PR?

Review comment (Contributor):

Is this the only change we need to fix this problem?

Review comment (Contributor, Author):

No. This was a (minor) preexisting problem that was not caught by any of the test cases; it's now more important to fix since we duplicate all the grouping columns in the analyzer rule.

@yhuai (Contributor) commented Nov 19, 2015

@aray Thank you for the PR! Since we are in the QA period for the 1.6 release, it would be great if we could just fix the problem without any other changes. Is this the minimal fix for this issue?

@aray (Contributor, Author) commented Nov 19, 2015

@yhuai I do think this is the minimal fix. However, as I stated in the summary, we are simplifying the rule instead of adding more exceptions that might themselves have bugs. Let me know if I can clarify anything else.

Review comment (Contributor):

these *overlapping* cases will fail without the fix, right?

Review comment (Contributor, Author):

Correct.

@SparkQA commented Nov 19, 2015

Test build #46334 has finished for PR 9815 at commit 2162b6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Review comment (Contributor):

So, we will rely on our optimizer to remove this Project if it is not necessary, right?

@yhuai (Contributor) commented Nov 19, 2015

Thank you for the fix! I am merging it to master and branch 1.6.

asfgit pushed a commit that referenced this pull request Nov 19, 2015
Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null) result.

Also simplifies the analyzer rule a bit and leaves column pruning to the optimizer.

Added multiple unit tests to DataFrameAggregateSuite and verified it passes hive compatibility suite:
```
build/sbt -Phive -Dspark.hive.whitelist='groupby.*_grouping.*' 'test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite'
```

This is an alternative to PR #9419, but I think it's better as it simplifies the analyzer rule instead of adding another special case to it.

Author: Andrew Ray <[email protected]>

Closes #9815 from aray/groupingset-agg-fix.

(cherry picked from commit 37cff1b)
Signed-off-by: Yin Huai <[email protected]>
@asfgit asfgit closed this in 37cff1b Nov 19, 2015
@gatorsmile (Member):

Thank you @aray @yhuai ! The code changes look great!

Based on my test cases, rollup and cube still return incorrect results when the table contains null values. :(

If you do not have time, I can take a look at it.

@gatorsmile (Member):

This might not be related to the rollup logic; it looks like a DataFrame bug. I will try to fix it soon.

Thanks!

@yhuai (Contributor) commented Nov 22, 2015

@gatorsmile Can you create a jira (with repro in the description) and ping me from that jira?

@gatorsmile (Member):

Sorry, I think it is a test case issue.

```scala
testData = Seq((1, 2), (2, 2), (3, 4), (null.asInstanceOf[Int], 5)).toDF("a", "b")
```

Scala converts null.asInstanceOf[Int] into zero, so Spark treats it as zero. Never mind. I tried another way, null.asInstanceOf[java.lang.Integer], and it works fine.
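The boxing pitfall described above can be seen in plain Scala, independent of Spark (a standalone sketch):

```scala
// Casting null to a primitive value type yields that type's default value,
// so the intended null never reaches the DataFrame.
val asInt: Int = null.asInstanceOf[Int]                                 // 0
val asInteger: java.lang.Integer = null.asInstanceOf[java.lang.Integer] // null

assert(asInt == 0)
assert(asInteger == null)
```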

@yhuai (Contributor) commented Nov 22, 2015

Oh, I see. Yeah, we need to use Integer to get null; Int is not nullable.
