-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21580][SQL]Integers in aggregation expressions are wrongly taken as group-by ordinal #18779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
A specified title for this might be better. Such as "Integers in aggregation expressions are wrongly taken as group-by ordinal". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use pattern matching here like:
ng match {
case a: Alias => if (!isIntLiteral(a.child)) filterGroups :+= ng
case _ => filterGroups :+= ng
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also fix the indent of code here. Like:
newGroups.foreach { ng =>
ng match {
...
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please leave a comment explaining why we need to filter the group-by exprs.
|
BTW, we should modify the PR description too. Please briefly describe the problem, what the fix is. Thanks. |
|
Thanks,i will update @viirya |
Group by ordinal|
Test build #80067 has finished for PR 18779 at commit
|
|
I checked the root cause of this and It seems the current fix is similar to |
|
@maropu I am not sure if I understand your idea correctly. Those int literals in group-by is intended to be wrapped in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also comment on why this filtering is safe from changing grouping result. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may need to trim all possible Aliases, e.g., Alias(Alias(1, 'a'), 'b').
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have they been trimmed in aggregateExpressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule is before CleanupAliases.
|
@viirya sorry for my ambiguous explanation and I mean; I think replacing |
|
@maropu Thanks for clarifying it. Although they looks similar, from semantics I'd treat them different rules. However, I don't have strong opinion on this. Btw, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to RemoveLiteralFromGroupExpressions, we shouldn't drop all grouping expressions if they are all int literals after resolved from UnresolvedOrdinal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks no problem here, although we drop all grouping expressions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the comments at
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Lines 1254 to 1257 in 75b168f
| // All grouping expressions are literals. We should not drop them all, because this can | |
| // change the return semantics when the input of the Aggregate is empty (SPARK-17114). We | |
| // instead replace this by single, easy to hash/sort, literal expression. | |
| a.copy(groupingExpressions = Seq(Literal(0, IntegerType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, We have supported GlobalAggregates and group by null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the aggregates that have an empty input, if we remove all grouping expressions, this triggers the ungrouped code path (which aways returns a single row). So the query semantics are changed.
You meant this is not an issue anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we reimplement this by using ResolvedOrdinal, we probably need not consider the empty issue here, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu yap, I think so. Because RemoveLiteralFromGroupExpressions takes care of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I'm not very familiar with this module , how to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@10110346 Please see previous comment: #18779 (comment)
The basic idea is, instead of resolve UnresolvedOrdinal to actual aggregate expressions, we resolve it to ResolvedOrdinal.
Then in the beginning of optimization, we replace all ResolvedOrdinal to actual aggregate expressions.
What do you think?
|
yea, just suggestion. |
|
@maropu As an alternative approach, we can avoid to resolve Instead, we can create a By doing this, we can fix this issue (because we don't replace the int literal with |
|
Aha, I feel better idea to add |
|
Test build #80070 has finished for PR 18779 at commit
|
|
Test build #80105 has finished for PR 18779 at commit
|
|
@viirya Only to |
|
@10110346 Can't we also do the same on order by ordinal? |
|
@viirya Maybe adding |
|
I have updated this PR, please help to review it again. @viirya also cc @cloud-fan @gatorsmile |
|
@10110346 Why? In the query Is there any problem? |
|
@viirya |
|
Actually I don't think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add order-by test too. Maybe add to DataFrameSuite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok,thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to have change like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to keep this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not necessary, i will remove it.thanks
|
Test build #80237 has finished for PR 18779 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run this test. Even without transform -> resolveOperators, it still works. Can you check it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I see. When we input query like df.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)), the query plan looks like:
Sort [7#22 ASC NULLS FIRST, a#5 ASC NULLS FIRST, b#6 ASC NULLS FIRST], true
+- Project [7 AS 7#22, a#5, b#6]
+- Project [_1#2 AS a#5, _2#3 AS b#6]
+- LocalRelation [_1#2, _2#3]
We have a Project below Sort. The ordinal 1 be replaced with the attribute 7#22. So we won't get an int literal 7 here. That is why it passes.
Can you have a test for ordinal order-by that show different behavior? If not, I think ordinal order-by should be safe from this bug. And we don't need to add this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe we still can keep this test for the case that someone changes the Sort/Project relationship in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've verified that the above tests will fail without this fix.
|
LGTM. cc @gatorsmile for final check. |
|
@10110346 Thanks for working this! Sorry I've confused you in previous comments. Current changes looks good to me. |
|
Test build #80244 has finished for PR 18779 at commit
|
|
Test build #80246 has finished for PR 18779 at commit
|
|
retest this please. |
|
Test build #80249 has finished for PR 18779 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more comment. You need to use withTempView
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please use upper cases for SQL keywords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here.
|
LGTM except a few minor comments. The fix looks great now! Thanks everyone! |
|
I learned a lot from you, thanks all. |
|
Test build #80277 has finished for PR 18779 at commit
|
|
Really appreciate your patience and your work! This is how we work in Spark SQL. Thanks! |
…ken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is integer, after replaced with this aggregateExpression, the groupExpression still considered as an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <[email protected]> Closes #18779 from 10110346/groupby. (cherry picked from commit 894d5a4) Signed-off-by: gatorsmile <[email protected]>
|
Thanks! Merging to master/2.2 |
…ken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is integer, after replaced with this aggregateExpression, the groupExpression still considered as an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <[email protected]> Closes apache#18779 from 10110346/groupby. (cherry picked from commit 894d5a4) Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
select 3, 4, sum(b) from data group by 1, 2;select 3 as c, 4 as d, sum(b) from data group by c, d;When running these two cases, the following exception occurred:
Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10The cause of this failure:
If an aggregateExpression is integer, after replaced with this aggregateExpression, the
groupExpression still considered as an ordinal.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using
resolveOperatorsinSubstituteUnresolvedOrdinals.How was this patch tested?
Added unit test case