[SPARK-18814][SQL] CheckAnalysis rejects TPCDS query 32 #16246
Conversation
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?

./dev/run-tests and a new unit test on the problematic pattern.
Test build #69967 has finished for PR 16246 at commit
Test build #69968 has finished for PR 16246 at commit
```scala
// Block cases where GROUP BY columns are not part of the correlated columns
// of a scalar subquery.
sub collect {
  case a @ Aggregate(grouping, _, _) if (isScalarSubq) =>
```
Suggested change: `case a @ Aggregate(grouping, _, _) if (isScalarSubq)` -> `case Aggregate(grouping, _, _) if isScalarSubq`
Test build #70020 has finished for PR 16246 at commit
Test build #70025 has finished for PR 16246 at commit
```scala
// Block cases where GROUP BY columns are not part of the correlated columns
// of a scalar subquery.
sub collect {
  case Aggregate(grouping, _, _) if isScalarSubq =>
```
Doesn't this break if you have nested aggregates in the subquery? I.e.:
```sql
create or replace temporary view t as
select id,
       id % 10 val,
       id % 100 fk
from range(1000);

select *
from t
where t.id = (select max(id)
              from (select tt.val,
                           max(tt.id) as id
                    from t as tt
                    where t.fk = tt.fk
                    group by tt.val))
```
Right. Thank you. It seems we need to have this part from CheckAnalysis included to walk the subquery plan:
```scala
// Skip projects and subquery aliases added by the Analyzer and the SQLBuilder.
def cleanQuery(p: LogicalPlan): LogicalPlan = p match {
  case s: SubqueryAlias => cleanQuery(s.child)
  case p: Project => cleanQuery(p.child)
  case child => child
}

cleanQuery(query) match {
  case a: Aggregate => checkAggregate(a)
  case Filter(_, a: Aggregate) => checkAggregate(a)
  case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
}
```
I think we need to revisit the rule in CheckAnalysis, and make it
@hvanhovell I like your ideas. : ) This JIRA is in the
I am working on the code based on @hvanhovell's proposal.
Sure, thanks! @nsyca
Test build #70094 has finished for PR 16246 at commit
```scala
case s @ ScalarSubquery(query, conditions, _) if conditions.nonEmpty =>
  // ...
  // Collect the columns from the subquery for further checking.
  var subqueryColumns = conditions.flatMap(_.references).collect {
```
NIT: `conditions.flatMap(_.references).filter(query.output.contains)`
```scala
// SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
// are not part of the correlated columns.
val groupByCols = ExpressionSet(agg.groupingExpressions.flatMap(_.references))
val invalidCols = groupByCols.diff(predicateCols)
```
Nit: Using an AttributeSet is more natural (I should have seen that for the initial PR), and probably a little faster. You can perform a diff by subtracting the sets, i.e.: `val invalidCols = groupByCols -- correlatedCols`
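For illustration, a minimal sketch of the suggested rewrite (assumes `spark-catalyst` is on the classpath; the attributes here are made up for the example):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

// Hypothetical grouping and correlated columns.
val cv = AttributeReference("cv", IntegerType)()
val ck = AttributeReference("ck", IntegerType)()

val groupByCols    = AttributeSet(Seq(cv, ck))
val correlatedCols = AttributeSet(Seq(ck))

// Set subtraction replaces the diff call; the result holds only cv.
val invalidCols = groupByCols -- correlatedCols
```

AttributeSet compares attributes by exprId rather than by object equality, which is why it is the more natural structure for this check.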
```scala
subqueryColumns = subqueryColumns.map {
  case xs =>
    p.projectList.collectFirst {
      case e @ Alias(child: AttributeReference, _) if e.toAttribute equals xs =>
```
Nit: It is quicker to compare the exprIds of the alias and the attribute, e.g.: `case e @ Alias(child: Attribute, _) if e.exprId == xs.exprId =>`
```scala
// SPARK-18814: Map any aliases to their AttributeReference children
// for the checking in the Aggregate operators below this Project.
subqueryColumns = subqueryColumns.map {
  case xs =>
```
Nit: you don't need the case statement here.
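As a side note on the Scala itself: `case` inside a function literal builds a pattern-matching anonymous function, which is redundant when the pattern is a simple binding. A Spark-free illustration:

```scala
// These two are equivalent; the second avoids the needless `case`.
val withCase    = Seq(1, 2, 3).map { case xs => xs + 1 }
val withoutCase = Seq(1, 2, 3).map { xs => xs + 1 }
// Both yield Seq(2, 3, 4).
```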
hvanhovell left a comment:
Some minor things, but otherwise LGTM.
LGTM - pending jenkins.
@hvanhovell, I really appreciate your time reviewing my code. I have addressed all four of your comments. Thanks!
Test build #70110 has finished for PR 16246 at commit
```sql
FROM (SELECT c1.cv, avg(c1.cv) avg
      FROM c c1
      WHERE c1.ck = p.pk
      GROUP BY c1.cv));
```
I have a question that is not related to this JIRA. In the above query, if we do not have the GROUP BY c1.cv, it still works. It sounds like the subquery processing ignores GROUP BY clauses. What is the reason?
Nice catch! I think this is a bug. There could be multiple values of c1.cv. Without a GROUP BY clause, which value does it return? Could you please open a JIRA to track this? I will investigate along with my subquery work. Do you think this is a blocker?
It is not a blocker. We are probably missing this case in CheckAnalysis. This currently works because it gets eliminated during optimization (the optimizer prunes the unused output). @nsyca it would be great if you can take a look at it, could you also create a separate JIRA to track this?
I opened SPARK-18863 to track this problem.
Thank you for confirming this is a bug. I expect I can get an error message like
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'c1.`cv`' is not an aggregate function.
Or in some other cases, we should see the error message like
org.apache.spark.sql.AnalysisException: expression 'c1.`cv`' is neither present in the group by, nor is it an aggregate function.
Both error-handling paths are missing.
Ok, I am merging this to master/2.1. Thanks!
## What changes were proposed in this pull request?

Move the checking of GROUP BY columns in correlated scalar subqueries from CheckAnalysis to Analysis to fix a regression caused by SPARK-18504.

This problem can now be reproduced with a simple script:

```scala
Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show
```

The requirements are:
1. We need to reference the same table twice in both the parent query and the subquery. Here it is the table c.
2. We need a correlated predicate, but to a different table. Here it is from c (as c1) in the subquery to p in the parent.
3. We then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we compare `ck#<n1>#<n2>` with the original group by column `ck#<n1>` by their canonicalized forms, `#<n2> != #<n1>`. That is how we trigger the exception added in SPARK-18504.

## How was this patch tested?

SubquerySuite and a simplified version of TPCDS-Q32.

Author: Nattavut Sutyanyong <[email protected]>

Closes #16246 from nsyca/18814.

(cherry picked from commit cccd643)
Signed-off-by: Herman van Hovell <[email protected]>
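The exprId mismatch described in step 3 can be sketched with Catalyst expression objects directly (assumes `spark-catalyst` is on the classpath; the attribute name is illustrative):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

val ck1 = AttributeReference("ck", IntegerType)()  // think of this as ck#<n1>
val ck2 = ck1.newInstance()                        // deduplicated copy, ck#<n2>

// Same name but a fresh exprId, so the canonicalized forms differ
// and the SPARK-18504 check rejects the plan.
assert(ck1.name == ck2.name)
assert(ck1.exprId != ck2.exprId)
assert(ck1.canonicalized != ck2.canonicalized)
```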