-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11275][SQL] Rollup and Cube Generates the Incorrect Results when Aggregation Functions Use Group By Columns #9419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -208,21 +208,21 @@ class Analyzer( | |
| // We will insert another Projection if the GROUP BY keys contains the | ||
| // non-attribute expressions. And the top operators can references those | ||
| // expressions by its alias. | ||
| // e.g. SELECT key%5 as c1 FROM src GROUP BY key%5 ==> | ||
| // SELECT a as c1 FROM (SELECT key%5 AS a FROM src) GROUP BY a | ||
| // e.g. SELECT key%5 as c1 FROM src GROUP BY key%5 with rollup ==> | ||
| // SELECT a as c1 FROM (SELECT key%5 AS a FROM src) GROUP BY a with rollup | ||
|
|
||
| // find all of the non-attribute expressions in the GROUP BY keys | ||
| val nonAttributeGroupByExpressions = new ArrayBuffer[Alias]() | ||
|
|
||
| // The pair of (the original GROUP BY key, associated attribute) | ||
| val groupByExprPairs = x.groupByExprs.map(_ match { | ||
| val groupByExprPairs = x.groupByExprs.map { | ||
| case e: NamedExpression => (e, e.toAttribute) | ||
| case other => { | ||
| val alias = Alias(other, other.toString)() | ||
| nonAttributeGroupByExpressions += alias // add the non-attributes expression alias | ||
| (other, alias.toAttribute) | ||
| } | ||
| }) | ||
| } | ||
|
|
||
| // substitute the non-attribute expressions for aggregations. | ||
| val aggregation = x.aggregations.map(expr => expr.transformDown { | ||
|
|
@@ -232,18 +232,55 @@ class Analyzer( | |
| // substitute the group by expressions. | ||
| val newGroupByExprs = groupByExprPairs.map(_._2) | ||
|
|
||
| val child = if (nonAttributeGroupByExpressions.length > 0) { | ||
| val child = if (nonAttributeGroupByExpressions.nonEmpty) { | ||
| // insert additional projection if contains the | ||
| // non-attribute expressions in the GROUP BY keys | ||
| Project(x.child.output ++ nonAttributeGroupByExpressions, x.child) | ||
| } else { | ||
| x.child | ||
| } | ||
|
|
||
| // We will insert another Projection if the GROUP BY keys are contained in the | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe in addition to describing what we do we should explain why we are inserting this projection here as well? |
||
| // aggregation. And the top operators can references those keys by its alias. | ||
| // e.g. SELECT a, b, sum(a) FROM src GROUP BY a, b with rollup ==> | ||
| // SELECT a, b, sum(a1) FROM (SELECT a, b, a AS a1 FROM src) GROUP BY a, b with rollup | ||
|
|
||
| // collect all the distinct attributes that are in both aggregation functions and group by clauses | ||
| val attrInAggregatedFuncAndGroupBy = aggregation.collect { | ||
| case aggFunc: Alias => aggFunc.collect {case a : Attribute if newGroupByExprs.contains(a) => a} | ||
| }.flatten.distinct | ||
|
|
||
| val alias4AttrInAggregatedFuncAndGroupBy = new ArrayBuffer[Alias]() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems we already know the initial size of this so lets at try and use that to reduce allocations. |
||
|
|
||
| // Generate alias for each attribute in attrInAggregatedFuncAndGroupBy | ||
| val attrInAggregatedFuncPairs = attrInAggregatedFuncAndGroupBy.map(a => { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is just being used for lookup, could we - do we expect this to always be pretty small? |
||
| val alias = Alias(a, a.toString)() | ||
| alias4AttrInAggregatedFuncAndGroupBy += alias | ||
| (a, alias.toAttribute) | ||
| }) | ||
|
|
||
| val nonAttributeGroupByExpressionsToAttribute = nonAttributeGroupByExpressions.map(a=>a.toAttribute) | ||
|
|
||
| val newAggregation = aggregation.map { | ||
| case a : Alias => a.transform { | ||
| // must avoid the alias replacement by the first step; otherwise, the following case will fail: | ||
| // select a + b, b, sum(a - b) from test group by a + b, b with cube | ||
| case e => attrInAggregatedFuncPairs.find(_._1==e && !nonAttributeGroupByExpressionsToAttribute.contains(e)). | ||
| map(_._2).getOrElse(e) | ||
| }.asInstanceOf[NamedExpression] | ||
| case other => other | ||
| } | ||
|
|
||
| val newChild = if (alias4AttrInAggregatedFuncAndGroupBy.nonEmpty) { | ||
| Project(child.output ++ alias4AttrInAggregatedFuncAndGroupBy, child) | ||
| } else { | ||
| child | ||
| } | ||
|
|
||
| Aggregate( | ||
| newGroupByExprs :+ VirtualColumn.groupingIdAttribute, | ||
| aggregation, | ||
| Expand(x.bitmasks, newGroupByExprs, gid, child)) | ||
| newAggregation, | ||
| Expand(x.bitmasks, newGroupByExprs, gid, newChild)) | ||
| } | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take this for example:
I think we probably need to add extra column(s) for the output of
Expand, (e.g.(a-b)in this case). Will that be more simple for this fixing?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @chenghao-intel ,
Could you explain it a little bit more?
So far, this query is correctly processed and returned a correct result. Since b is part of an aggregated function, the fix added extra columns for b. Below is the generated plan:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, actually there are so big difference as I mentioned.
But I got error when do the query like below, can you please take look at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenghao-intel , a good catch! Thank you!
I fixed this issue if you integrate the latest change. Also added two more test cases to cover it.