
Conversation

@10110346
Contributor

@10110346 10110346 commented Jul 31, 2017

What changes were proposed in this pull request?

```sql
create temporary view data as select * from values
  (1, 1),
  (1, 2),
  (2, 1),
  (2, 2),
  (3, 1),
  (3, 2)
  as data(a, b);

select 3, 4, sum(b) from data group by 1, 2;
select 3 as c, 4 as d, sum(b) from data group by c, d;
```
When running these two cases, the following exception occurs:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`

The cause of this failure:
If an aggregate expression is an integer literal, then after the group-by ordinal is replaced with this aggregate expression, the group expression is still treated as an ordinal.

The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
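The re-entrance problem can be reduced to a minimal, self-contained sketch (toy classes, not Spark's actual API): a plain `transform`-style traversal applies the rule unconditionally, even to an already-analyzed plan, while a `resolveOperators`-style traversal skips resolved operators, so the substitution rule cannot fire a second time.

```scala
// Toy model of the re-entrance bug; Agg, IntLit, and Ordinal are
// illustrative stand-ins, not Spark's actual classes.
sealed trait Expr
case class IntLit(v: Int) extends Expr
case class Ordinal(pos: Int) extends Expr

case class Agg(groups: Seq[Expr], resolved: Boolean)

// The rule: wrap bare int literals in the group-by list as ordinals.
def substituteOrdinals(a: Agg): Agg =
  a.copy(groups = a.groups.map {
    case IntLit(v) => Ordinal(v)
    case other     => other
  })

// Plain transform: applies the rule unconditionally, even to an analyzed plan.
def transform(a: Agg): Agg = substituteOrdinals(a)

// resolveOperators-style traversal: skips already-resolved operators.
def resolveOperators(a: Agg): Agg =
  if (a.resolved) a else substituteOrdinals(a)

// After analysis, `group by 1, 2` was resolved to the select-list
// literals 3 and 4; re-running the rule must leave them alone.
val analyzed = Agg(Seq(IntLit(3), IntLit(4)), resolved = true)
assert(transform(analyzed).groups == Seq(Ordinal(3), Ordinal(4)))      // the bug
assert(resolveOperators(analyzed).groups == Seq(IntLit(3), IntLit(4))) // the fix
```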

How was this patch tested?

Added unit test case

@viirya
Member

viirya commented Jul 31, 2017

A more specific title might be better, such as "Integers in aggregation expressions are wrongly taken as group-by ordinal".

Member

@viirya viirya Jul 31, 2017

We can use pattern matching here, like:

```scala
ng match {
  case a: Alias => if (!isIntLiteral(a.child)) filterGroups :+= ng
  case _ => filterGroups :+= ng
}
```

Member

@viirya viirya Jul 31, 2017

Please also fix the indentation of the code here, like:

```scala
newGroups.foreach { ng =>
  ng match {
    ...
  }
}
```

Member

Please leave a comment explaining why we need to filter the group-by exprs.

@viirya
Member

viirya commented Jul 31, 2017

BTW, we should modify the PR description too. Please briefly describe the problem and what the fix is. Thanks.

@10110346
Contributor Author

Thanks, I will update. @viirya

@10110346 10110346 changed the title [SPARK-21580][SQL]There's a bug with Group by ordinal [SPARK-21580][SQL]Integers in aggregation expressions are wrongly taken as group-by ordinal Jul 31, 2017
@SparkQA

SparkQA commented Jul 31, 2017

Test build #80067 has finished for PR 18779 at commit 5319670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Jul 31, 2017

I checked the root cause of this: SubstituteUnresolvedOrdinals wrongly wraps int literals with UnresolvedOrdinal again.

17/07/31 16:54:50 TRACE HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals ===
!GlobalLimit 21                                                                           'GlobalLimit 21
!+- LocalLimit 21                                                                         +- 'LocalLimit 21
!   +- Aggregate [3, 4], [3 AS 3#56, 4 AS 4#57, sum(cast(b#1 as bigint)) AS sum(b)#58L]      +- 'Aggregate [unresolvedordinal(3), unresolvedordinal(4)], [3 AS 3#56, 4 AS 4#57, sum(cast(b#1 as bigint)) AS sum(b)#58L]
       +- SubqueryAlias data                                                                    +- SubqueryAlias data
          +- Project [a#0, b#1]                                                                    +- Project [a#0, b#1]
             +- SubqueryAlias data                                                                    +- SubqueryAlias data
                +- LocalRelation [a#0, b#1]                                                              +- LocalRelation [a#0, b#1]

The current fix seems similar to RemoveLiteralFromGroupExpressions in the optimizer, so would it be better to move that optimizer rule here?

@viirya
Member

viirya commented Jul 31, 2017

@maropu I am not sure if I understand your idea correctly. Those int literals in group-by are intended to be wrapped in UnresolvedOrdinal, so they can later be replaced with the aggregation expressions at the corresponding ordinal positions by ResolveOrdinalInOrderByAndGroupBy.

Member

Please also comment on why this filtering is safe from changing grouping result. Thanks.

Member

I think we may need to trim all possible Aliases, e.g., Alias(Alias(1, 'a'), 'b').

Contributor Author

Have they been trimmed in aggregateExpressions?

Member

This rule is before CleanupAliases.

@maropu
Member

maropu commented Jul 31, 2017

@viirya Sorry for my ambiguous explanation; I mean:

scala> sql("""select 3, 4, sum(b) from data group by 1, 2""").show
17/07/31 17:13:23 TRACE HiveSessionStateBuilder$$anon$1:

// 1. Replace literals in grouping exprs with unresolvedordinal here
=== Applying Rule org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals ===
!'Aggregate [1, 2], [unresolvedalias(3, None), unresolvedalias(4, None), unresolvedalias('sum('b), None)]   'Aggregate [unresolvedordinal(1), unresolvedordinal(2)], [unresolvedalias(3, None), unresolvedalias(4, None), unresolvedalias('sum('b), None)]
 +- 'UnresolvedRelation `data`                                                                              +- 'UnresolvedRelation `data`
...

17/07/31 17:13:23 TRACE HiveSessionStateBuilder$$anon$1:

// 2. And then, resolve unresolvedordinal by using agg expressions here (but, the resolved expressions are literals)
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy ===
!'Aggregate [unresolvedordinal(1), unresolvedordinal(2)], [3 AS 3#4, 4 AS 4#5, sum(cast(b#1 as bigint)) AS sum(b)#6L]   Aggregate [3 AS 3#4, 4 AS 4#5], [3 AS 3#4, 4 AS 4#5, sum(cast(b#1 as bigint)) AS sum(b)#6L]
 +- SubqueryAlias data                                                                                                  +- SubqueryAlias data
    +- Project [a#0, b#1]                                                                                                  +- Project [a#0, b#1]
       +- SubqueryAlias data                                                                                                  +- SubqueryAlias data
          +- LocalRelation [a#0, b#1]                                                                                            +- LocalRelation [a#0, b#1]
...

17/07/31 17:13:23 TRACE HiveSessionStateBuilder$$anon$1:

// 3. The resolved expressions above are literals, so wrap them with unresolvedordinal again here.
=== Applying Rule org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals ===
!GlobalLimit 21                                                                        'GlobalLimit 21
!+- LocalLimit 21                                                                      +- 'LocalLimit 21
!   +- Aggregate [3, 4], [3 AS 3#4, 4 AS 4#5, sum(cast(b#1 as bigint)) AS sum(b)#6L]      +- 'Aggregate [unresolvedordinal(3), unresolvedordinal(4)], [3 AS 3#4, 4 AS 4#5, sum(cast(b#1 as bigint)) AS sum(b)#6L]
       +- SubqueryAlias data                                                                 +- SubqueryAlias data
          +- Project [a#0, b#1]                                                                 +- Project [a#0, b#1]
             +- SubqueryAlias data                                                                 +- SubqueryAlias data
                +- LocalRelation [a#0, b#1]                                                           +- LocalRelation [a#0, b#1]

// 4. Then, it fails here.
org.apache.spark.sql.AnalysisException: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$43.apply(Analyzer.scala:1008)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy$$anonfun$apply$11$$anonfun$43.apply(Analyzer.scala:1004)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

I think replacing unresolvedordinals with literals in ResolveOrdinalInOrderByAndGroupBy causes this failure, so we need to avoid this (I know this PR fixes it in that way). The solution in this PR feels similar to RemoveLiteralFromGroupExpressions in the optimizer, so I thought we'd better move that rule into the analyzer phase.
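The four-step cycle above can be reduced to a toy sketch (all names here are illustrative stand-ins, not Spark's classes): ordinals resolve to the select-list expressions, and when those expressions are themselves int literals, the substitution rule wraps them again, producing the out-of-range ordinal seen in the error.

```scala
// Toy reproduction of the substitute/resolve loop; Lit and Ord stand in
// for Spark's Literal and UnresolvedOrdinal.
sealed trait E
case class Lit(v: Int) extends E
case class Ord(pos: Int) extends E
case class AggExpr(name: String) extends E // stands in for sum(b)

// select 3, 4, sum(b)
val selectList: Seq[E] = Seq(Lit(3), Lit(4), AggExpr("sum(b)"))

// SubstituteUnresolvedOrdinals: int literals in group-by become ordinals.
def substitute(groups: Seq[E]): Seq[E] = groups.map {
  case Lit(v) => Ord(v)
  case e      => e
}

// ResolveOrdinalInOrderByAndGroupBy: ordinals become select-list expressions.
def resolve(groups: Seq[E]): Seq[E] = groups.map {
  case Ord(p) if 1 <= p && p <= selectList.size => selectList(p - 1)
  case Ord(p) =>
    sys.error(s"GROUP BY position $p is not in select list " +
      s"(valid range is [1, ${selectList.size}])")
  case e => e
}

// group by 1, 2
val step1 = substitute(Seq(Lit(1), Lit(2))) // Seq(Ord(1), Ord(2))
val step2 = resolve(step1)                  // Seq(Lit(3), Lit(4)) -- literals again
val step3 = substitute(step2)               // Seq(Ord(3), Ord(4)) -- wrapped again
// resolve(step3) now throws: position 4 is outside [1, 3]
```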

@viirya
Member

viirya commented Jul 31, 2017

@maropu Thanks for clarifying it.

Although they look similar, semantically I'd treat them as different rules. However, I don't have a strong opinion on this.

Btw, RemoveLiteralFromGroupExpressions reminds me that we should not drop all literal grouping expressions here either.

Member

Similar to RemoveLiteralFromGroupExpressions, we shouldn't drop all grouping expressions if they are all int literals after being resolved from UnresolvedOrdinal.

Contributor Author

It looks like there is no problem here, even if we drop all grouping expressions.

Member

Please see the comments at:

```scala
// All grouping expressions are literals. We should not drop them all, because this can
// change the return semantics when the input of the Aggregate is empty (SPARK-17114). We
// instead replace this by single, easy to hash/sort, literal expression.
a.copy(groupingExpressions = Seq(Literal(0, IntegerType)))
```

We should not make the grouping expressions empty.

Contributor Author

Actually, we already support GlobalAggregates and group by null.

Member

When an aggregate has an empty input, if we remove all grouping expressions, we trigger the ungrouped code path (which always returns a single row), so the query semantics are changed.

You meant this is not an issue anymore?

Member

@maropu maropu Aug 1, 2017

If we reimplement this by using ResolvedOrdinal, we probably don't need to consider the empty-input issue here, I think.

Member

@maropu yap, I think so. Because RemoveLiteralFromGroupExpressions takes care of this.

Contributor Author

Thanks. I'm not very familiar with this module; how should we do it?

Member

@10110346 Please see previous comment: #18779 (comment)

The basic idea is, instead of resolving UnresolvedOrdinal to the actual aggregate expressions, we resolve it to ResolvedOrdinal.

Then at the beginning of optimization, we replace all ResolvedOrdinals with the actual aggregate expressions.

What do you think?

@maropu
Member

maropu commented Jul 31, 2017

Yea, just a suggestion.

@viirya
Member

viirya commented Jul 31, 2017

@maropu As an alternative approach, we can avoid resolving UnresolvedOrdinal to the actual expressions in this rule.

Instead, we can create a ResolvedOrdinal and replace it with actual agg expression in an optimization rule.

By doing this, we can fix this issue (because we don't replace the int literal with UnresolvedOrdinal again). We also don't need to duplicate the logic of removing literals from grouping expressions.

@10110346 @maropu What do you think?
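The marker-node proposal above can be sketched like this (a toy model with hypothetical names, not the actual Spark implementation): the analyzer turns an UnresolvedOrdinal into a ResolvedOrdinal marker instead of the actual expression, so re-running SubstituteUnresolvedOrdinals is harmless, and the marker is only swapped for the real aggregate expression at the start of optimization.

```scala
// Toy model of the ResolvedOrdinal idea; all names are illustrative.
sealed trait E
case class Lit(v: Int) extends E
case class UnresolvedOrd(pos: Int) extends E
case class ResolvedOrd(pos: Int) extends E

// SubstituteUnresolvedOrdinals: only bare int literals are wrapped;
// a ResolvedOrd marker passes through untouched, so the rule is idempotent.
def substitute(groups: Seq[E]): Seq[E] = groups.map {
  case Lit(v) => UnresolvedOrd(v)
  case e      => e
}

// Analyzer: resolve ordinals to markers instead of actual expressions.
def analyze(groups: Seq[E], selectSize: Int): Seq[E] = groups.map {
  case UnresolvedOrd(p) if 1 <= p && p <= selectSize => ResolvedOrd(p)
  case e => e
}

// Optimizer: markers finally become the real select-list expressions.
def optimize(groups: Seq[E], selectList: Seq[E]): Seq[E] = groups.map {
  case ResolvedOrd(p) => selectList(p - 1)
  case e              => e
}

// group by 1, 2 against a 3-column select list (3, 4, and a sum(b) stand-in)
val analyzed = analyze(substitute(Seq(Lit(1), Lit(2))), selectSize = 3)
assert(substitute(analyzed) == analyzed) // re-running the rule changes nothing
assert(optimize(analyzed, Seq(Lit(3), Lit(4), Lit(0))) == Seq(Lit(3), Lit(4)))
```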

@maropu
Member

maropu commented Jul 31, 2017

Aha, I think adding ResolvedOrdinal is the better idea.

@SparkQA

SparkQA commented Jul 31, 2017

Test build #80070 has finished for PR 18779 at commit fae3190.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80105 has finished for PR 18779 at commit a5667c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@10110346
Contributor Author

10110346 commented Aug 2, 2017

@viirya For group-by ordinal only, I think this is a good idea.
But it will also result in inconsistent processing between order-by ordinal and group-by ordinal,
and I feel it's more complicated than the current changes.

@viirya
Member

viirya commented Aug 2, 2017

@10110346 Can't we also do the same for order-by ordinal?

@10110346
Contributor Author

10110346 commented Aug 2, 2017

@viirya Maybe adding ResolvedOrdinal is not ideal.
I have another problem:
`select a, 4 AS k, count(b) from data group by k, 1;`
This test case throws the same exception:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`

@10110346
Contributor Author

10110346 commented Aug 2, 2017

I have updated this PR; please help review it again. @viirya also cc @cloud-fan @gatorsmile

@viirya
Member

viirya commented Aug 2, 2017

@10110346 Why?

In the query `select a, 4 AS k, count(b) from data group by k, 1`, 1 is resolved to ResolvedOrdinal(1) first. Then at the beginning of optimization, ResolvedOrdinal(1) is replaced by a.

Is there any problem?

@10110346
Contributor Author

10110346 commented Aug 2, 2017

@viirya k is resolved to 4 in ResolveAggAliasInGroupBy, and then 4 is resolved to ResolvedOrdinal(4).

@viirya
Member

viirya commented Aug 2, 2017

Actually I don't think the SubstituteUnresolvedOrdinals rule should run with fixedPoint; it should run with Once.

Member

Please also add an order-by test, maybe in DataFrameSuite.

Contributor Author

OK, thanks.

Member

Why do we need a change like this?

Member

Do we still need to keep this change?

Contributor Author

It is not necessary; I will remove it. Thanks.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80237 has finished for PR 18779 at commit 791bc33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

I've run this test. Even without the `transform` -> `resolveOperators` change, it still works. Can you check it again?

Member

OK, I see. When we input a query like `df.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3))`, the query plan looks like:

Sort [7#22 ASC NULLS FIRST, a#5 ASC NULLS FIRST, b#6 ASC NULLS FIRST], true
+- Project [7 AS 7#22, a#5, b#6]
   +- Project [_1#2 AS a#5, _2#3 AS b#6]
      +- LocalRelation [_1#2, _2#3]

We have a Project below the Sort. The ordinal 1 is replaced with the attribute 7#22, so we won't get an int literal 7 here. That is why it passes.

Can you write a test for ordinal order-by that shows different behavior? If not, I think ordinal order-by is safe from this bug, and we don't need to add this test.

Member

Hmm, maybe we should still keep this test, in case someone changes the Sort/Project relationship in the future.

Member

I've verified that the above tests will fail without this fix.

@viirya
Member

viirya commented Aug 4, 2017

LGTM. cc @gatorsmile for final check.

@viirya
Member

viirya commented Aug 4, 2017

@10110346 Thanks for working on this! Sorry I've confused you in previous comments. The current changes look good to me.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80244 has finished for PR 18779 at commit 2dc3610.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80246 has finished for PR 18779 at commit c1594c7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Aug 4, 2017

retest this please.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80249 has finished for PR 18779 at commit c1594c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

One more comment: you need to use `withTempView`.

Member

Nit: Please use upper cases for SQL keywords.

Member

The same here.

Member

The same here.

@gatorsmile
Member

LGTM except a few minor comments. The fix looks great now! Thanks everyone!

@10110346
Contributor Author

10110346 commented Aug 5, 2017

I learned a lot from you, thanks all.

@SparkQA

SparkQA commented Aug 5, 2017

Test build #80277 has finished for PR 18779 at commit 2bf42b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Really appreciate your patience and your work! This is how we work in Spark SQL. Thanks!

asfgit pushed a commit that referenced this pull request Aug 5, 2017
…ken as group-by ordinal

## What changes were proposed in this pull request?

create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);

`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`

The cause of this failure:
If an aggregateExpression is integer, after replaced with this aggregateExpression, the
groupExpression still considered as an ordinal.

The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.

## How was this patch tested?
Added unit test case

Author: liuxian <[email protected]>

Closes #18779 from 10110346/groupby.

(cherry picked from commit 894d5a4)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile
Member

Thanks! Merging to master/2.2

@asfgit asfgit closed this in 894d5a4 Aug 5, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018