
Conversation

@nsyca (Contributor) commented Dec 10, 2016

What changes were proposed in this pull request?

Move the checking of GROUP BY column in correlated scalar subquery from CheckAnalysis
to Analysis to fix a regression caused by SPARK-18504.

This problem can be reproduced with a simple script now.

Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show

The requirements are:

  1. We need to reference the same table twice in both the parent and the subquery. Here is the table c.
  2. We need to have a correlated predicate but to a different table. Here is from c (as c1) in the subquery to p in the parent.
  3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at the `Project` above the `Aggregate` of `avg`. When we then compare `ck#<n1>#<n2>` and the original group by column `ck#<n1>` by their canonicalized forms, we get #<n2> != #<n1>. That is how we trigger the exception added in SPARK-18504.
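Step 3 can be illustrated with a toy model of attribute canonicalization. The `Attr` class below is a hypothetical stand-in for Catalyst's `AttributeReference`, not the actual implementation: canonicalization erases the column name and keeps only the expression id, so the deduplicated reference and the original grouping column no longer compare equal.

```scala
// Toy stand-in for Catalyst attributes: canonicalization keeps only the exprId,
// so equality between canonicalized attributes reduces to exprId equality.
case class Attr(name: String, exprId: Long) {
  def canonicalized: String = s"#$exprId"
}

// Original grouping column ck#<n1> vs. the deduplicated reference ck#<n1>#<n2>,
// which was given a fresh exprId by the Project above the Aggregate.
val groupByCol = Attr("ck", 1)
val dedupedCol = Attr("ck", 2)

// The canonicalized forms differ, so the SPARK-18504 check fires even though
// both attributes denote the same underlying column.
println(groupByCol.canonicalized == dedupedCol.canonicalized) // false
```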

How was this patch tested?

SubquerySuite and a simplified version of TPCDS-Q32

nsyca added 28 commits July 29, 2016 17:43
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
@SparkQA commented Dec 10, 2016

Test build #69967 has finished for PR 16246 at commit e871783.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 10, 2016

Test build #69968 has finished for PR 16246 at commit b93b3ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Block cases where GROUP BY columns are not part of the correlated columns
// of a scalar subquery.
sub collect {
case a @ Aggregate(grouping, _, _) if (isScalarSubq) =>
Member:

case a @ Aggregate(grouping, _, _) if (isScalarSubq)
->
case Aggregate(grouping, _, _) if isScalarSubq

@SparkQA commented Dec 12, 2016

Test build #70020 has finished for PR 16246 at commit f88a205.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 12, 2016

Test build #70025 has finished for PR 16246 at commit 724335a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Block cases where GROUP BY columns are not part of the correlated columns
// of a scalar subquery.
sub collect {
case Aggregate(grouping, _, _) if isScalarSubq =>
Contributor:

Doesn't this break if you have nested aggregates in the subquery? I.e.:

create or replace temporary view t as
select id,
       id % 10 val,
       id % 100 fk
from   range(1000);

select *
from   t
where t.id = (select max(id)
              from   (select tt.val,
                             max(tt.id) as id
                      from   t as tt
                      where  t.fk = tt.fk
                      group by tt.val))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Thank you. It seems we need to have this part from CheckAnalysis included to walk the subquery plan:

// Skip projects and subquery aliases added by the Analyzer and the SQLBuilder.
def cleanQuery(p: LogicalPlan): LogicalPlan = p match {
  case s: SubqueryAlias => cleanQuery(s.child)
  case p: Project => cleanQuery(p.child)
  case child => child
}

cleanQuery(query) match {
  case a: Aggregate => checkAggregate(a)
  case Filter(_, a: Aggregate) => checkAggregate(a)
  case fail => failAnalysis(s"Correlated scalar subqueries must be Aggregated: $fail")
}

@hvanhovell (Contributor)

I think we need to revisit the rule in CheckAnalysis and make it Alias-aware. We only need to keep track of the inner references. We traverse/recurse down the tree and do a few things:

  • When we encounter any operator other than a Project/Filter/SubqueryAlias/Aggregate, we fail.
  • When we encounter a Project, assume that it contains only Alias or AttributeReference expressions (this should be the case anyway), and update the references for columns that are aliased.
  • When we encounter a Filter/SubqueryAlias, do nothing.
  • When we encounter the top-level Aggregate, validate the grouping expressions using the updated references (the current logic should be fine here), and return.
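The traversal above can be sketched with simplified plan nodes. Everything here (the case classes, column names as plain strings, the function name) is a hypothetical stand-in for Catalyst's LogicalPlan and AttributeReference, not the actual implementation:

```scala
// Hypothetical, simplified plan nodes standing in for Catalyst operators.
sealed trait Plan
case class Project(aliases: Map[String, String], child: Plan) extends Plan // output -> input column
case class Filter(child: Plan) extends Plan
case class SubqueryAlias(child: Plan) extends Plan
case class Aggregate(grouping: Set[String], child: Plan) extends Plan
case class Leaf() extends Plan

// Recurse down the subquery plan, rewriting the inner (correlated) references
// through Project aliases; validate the grouping columns at the top-level
// Aggregate; fail on any other operator.
def checkSubquery(plan: Plan, refs: Set[String]): Either[String, Unit] = plan match {
  case Project(aliases, child) =>
    checkSubquery(child, refs.map(r => aliases.getOrElse(r, r)))
  case Filter(child)        => checkSubquery(child, refs) // do nothing
  case SubqueryAlias(child) => checkSubquery(child, refs) // do nothing
  case Aggregate(grouping, _) =>
    val invalid = grouping -- refs
    if (invalid.isEmpty) Right(()) else Left(s"GROUP BY columns not correlated: $invalid")
  case other => Left(s"unexpected operator in correlated scalar subquery: $other")
}
```

For example, a Project that aliases ck to ck2 above an Aggregate grouping on ck passes when the correlated reference is ck2, because the alias map rewrites ck2 back to ck before the Aggregate is validated.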

@gatorsmile (Member)

@hvanhovell I like your ideas. : )

This JIRA is marked as a blocker for Spark 2.1. If we are doing a major refactoring in CheckAnalysis, is it too risky at the last minute? I am fine if you think the above proposal has a limited impact. Thanks!

@nsyca (Contributor Author) commented Dec 13, 2016

I am working on the code based on @hvanhovell's proposal.

@gatorsmile (Member)

Sure, thanks! @nsyca

@SparkQA commented Dec 13, 2016

Test build #70094 has finished for PR 16246 at commit 6040dcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case s @ ScalarSubquery(query, conditions, _) if conditions.nonEmpty =>

// Collect the columns from the subquery for further checking.
var subqueryColumns = conditions.flatMap(_.references).collect {
Contributor:

NIT: conditions.flatMap(_.references).filter(query.output.contains)

// SPARK-18504/SPARK-18814: Block cases where GROUP BY columns
// are not part of the correlated columns.
val groupByCols = ExpressionSet(agg.groupingExpressions.flatMap(_.references))
val invalidCols = groupByCols.diff(predicateCols)
Contributor:

Nit: Using an AttributeSet is more natural (I should have seen that in the initial PR), and probably a little faster. You can perform a diff by subtracting the sets, i.e.: `val invalidCols = groupByCols -- correlatedCols`
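The suggested set subtraction can be seen with plain Scala sets, used here as a stand-in for Catalyst's AttributeSet (`correlatedCols` and the column names are made up for illustration):

```scala
// Plain Scala sets standing in for Catalyst's AttributeSet.
// `--` subtracts one set from another.
val groupByCols    = Set("ck", "cv")
val correlatedCols = Set("ck")

// Grouping columns that are not covered by the correlated columns are invalid.
val invalidCols = groupByCols -- correlatedCols
println(invalidCols) // Set(cv)
```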

subqueryColumns = subqueryColumns.map {
case xs =>
p.projectList.collectFirst {
case e @ Alias(child : AttributeReference, _) if e.toAttribute equals xs =>
@hvanhovell (Contributor), Dec 13, 2016:

Nit: It is quicker to compare the exprIds of the alias and the attribute, e.g.:
case e @ Alias(child: Attribute, _) if e.exprId == xs.exprId =>

// SPARK-18814: Map any aliases to their AttributeReference children
// for the checking in the Aggregate operators below this Project.
subqueryColumns = subqueryColumns.map {
case xs =>
Contributor:

Nit: you don't need the case statement here.

@hvanhovell (Contributor) left a comment:

Some minor things, but otherwise LGTM.

@hvanhovell (Contributor)

LGTM - pending jenkins.

@nsyca (Contributor Author) commented Dec 13, 2016

@hvanhovell, I really appreciate your time reviewing my code. I have addressed all four of your comments. Thanks!

@SparkQA commented Dec 14, 2016

Test build #70110 has finished for PR 16246 at commit 0b6bfd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

FROM (SELECT c1.cv, avg(c1.cv) avg
FROM c c1
WHERE c1.ck = p.pk
GROUP BY c1.cv));
Member:

I have a question that is not related to this JIRA. In the above query, if we do not have the GROUP BY c1.cv, it still works. It sounds like the subquery processing ignores GROUP BY clauses. What is the reason?

@nsyca (Contributor Author), Dec 14, 2016:

Nice catch! I think this is a bug. There could be multiple values of c1.cv. Without a GROUP BY clause, which value does it return? Could you please open a JIRA to track this? I will investigate along with my subquery work. Do you think this is a blocker?

@hvanhovell (Contributor), Dec 14, 2016:

It is not a blocker. We are probably missing this case in CheckAnalysis. This currently works because it gets eliminated during optimization (the optimizer prunes the unused output). @nsyca it would be great if you can take a look at it, could you also create a separate JIRA to track this?

Contributor Author:

I opened SPARK-18863 to track this problem.

Member:

Thank you for confirming this is a bug. I would expect to get an error message like

org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'c1.`cv`' is not an aggregate function. 

Member:

Or in some other cases, we should see the error message like

org.apache.spark.sql.AnalysisException: expression 'c1.`cv`' is neither present in the group by, nor is it an aggregate function.

Both error-handling cases are missing.

@hvanhovell (Contributor) commented Dec 14, 2016

Ok, I am merging this to master/2.1. Thanks!

asfgit pushed a commit that referenced this pull request Dec 14, 2016
## What changes were proposed in this pull request?
Move the checking of GROUP BY column in correlated scalar subquery from CheckAnalysis
to Analysis to fix a regression caused by SPARK-18504.

This problem can be reproduced with a simple script now.

Seq((1,1)).toDF("pk","pv").createOrReplaceTempView("p")
Seq((1,1)).toDF("ck","cv").createOrReplaceTempView("c")
sql("select * from p,c where p.pk=c.ck and c.cv = (select avg(c1.cv) from c c1 where c1.ck = p.pk)").show

The requirements are:
1. We need to reference the same table twice in both the parent and the subquery. Here is the table c.
2. We need to have a correlated predicate but to a different table. Here is from c (as c1) in the subquery to p in the parent.
3. We will then "deduplicate" c1.ck in the subquery to `ck#<n1>#<n2>` at `Project` above `Aggregate` of `avg`. Then when we compare `ck#<n1>#<n2>` and the original group by column `ck#<n1>` by their canonicalized form, which is #<n2> != #<n1>. That's how we trigger the exception added in SPARK-18504.

## How was this patch tested?

SubquerySuite and a simplified version of TPCDS-Q32

Author: Nattavut Sutyanyong <[email protected]>

Closes #16246 from nsyca/18814.

(cherry picked from commit cccd643)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit asfgit closed this in cccd643 Dec 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
@nsyca nsyca deleted the 18814 branch March 24, 2017 18:36