[SPARK-4226][SQL] SparkSQL - Add support for subqueries in predicates('in' clause) #3249

ravipesala · 2014-11-13T18:24:05Z

This PR adds support for subqueries in the predicate 'in' clause. Such queries are transformed into a LeftSemi join, as shown below.

Case 1: Uncorrelated queries

-- original query
select C
from R1
where R1.A in (Select B from R2)
-- rewritten query
Select C
from R1 left semi join R2 on R1.A = R2.B

Case 2: Correlated queries

-- original query
select C
from R1
where R1.A in (Select B from R2 where R1.X = R2.Y)
-- rewritten query
select C
from R1 left semi join
(select B, R2.Y as sq1_col0 from R2) sq1
on R1.X = sq1.sq1_col0 and R1.A = sq1.B

Restriction: aliases must be used, because the query is converted into a join.
The complete specification is available at https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf
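
As a rough illustration of why the two rewrites above preserve results, the following sketch models both cases on in-memory collections. It is not the code from this PR: the row types, the `leftSemiJoin` helper, and the sample data layout are all hypothetical, and SQL NULL semantics are deliberately ignored.

```scala
object SemiJoinSketch {
  // Hypothetical row shapes; field names mirror the columns in the queries above.
  final case class R1Row(a: Int, x: Int, c: String)
  final case class R2Row(b: Int, y: Int)

  // Left semi join: keep a left row (at most once) if any right row matches.
  def leftSemiJoin[L, R](left: Seq[L], right: Seq[R])(on: (L, R) => Boolean): Seq[L] =
    left.filter(l => right.exists(r => on(l, r)))

  // Case 1, original form: R1.A in (select B from R2).
  def uncorrelatedIn(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] = {
    val bs = r2.map(_.b).toSet
    r1.filter(row => bs.contains(row.a)).map(_.c)
  }

  // Case 1, rewritten form: R1 left semi join R2 on R1.A = R2.B.
  def uncorrelatedSemi(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    leftSemiJoin(r1, r2)((l, r) => l.a == r.b).map(_.c)

  // Case 2, original form: the subquery is re-evaluated for each R1 row.
  def correlatedIn(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    r1.filter(l => r2.filter(r => l.x == r.y).map(_.b).contains(l.a)).map(_.c)

  // Case 2, rewritten form: the correlation predicate moves into the join condition.
  def correlatedSemi(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    leftSemiJoin(r1, r2)((l, r) => l.x == r.y && l.a == r.b).map(_.c)
}
```

Because a semi join only asks whether a match exists, duplicate rows in R2 do not duplicate rows of R1, which is the property that makes it a natural target for an IN predicate.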

marmbrus · 2014-12-02T00:20:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala

Can you document what the arguments are here? It's not clear to me why we need exp (i.e., why can't we just get it from the output of child?). Also, style-wise I'd avoid abbreviations when not necessary, and child is kind of an odd name given that it's a different type of tree. Finally, should this be a LeafExpression?

ravipesala · 2014-12-03

Thank you for your comments.
Here exp is the value tested by the predicate. For example, in
SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)
exp is a.key and child is the subquery.
I have now renamed both and added documentation.

marmbrus · 2014-12-02T00:20:54Z

Cool feature! Thanks for working on this.

marmbrus · 2014-12-02T00:51:12Z

ok to test

SparkQA · 2014-12-02T02:46:00Z

Test build #24010 has finished for PR 3249 at commit 59dfab5.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(exp: Expression, child: LogicalPlan) extends Expression

scwf · 2014-12-02T02:51:28Z

Hi @ravipesala, can you rebase this PR?

ravipesala · 2014-12-03T11:26:06Z

Rebased onto master and addressed the review comments.

SparkQA · 2014-12-03T11:31:26Z

Test build #24090 has finished for PR 3249 at commit 353b86b.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression

SparkQA · 2014-12-03T13:06:22Z

Test build #24091 has finished for PR 3249 at commit 0f5cc3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression

SparkQA · 2014-12-05T08:01:43Z

Test build #24177 has finished for PR 3249 at commit ea4e121.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression

SparkQA · 2014-12-05T09:42:39Z

Test build #24178 has finished for PR 3249 at commit d62887e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression

marmbrus · 2014-12-17T21:05:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Space before {.

Also I would consider doing this in two steps to avoid depending on transform for side effects: a collect to get the list and then a transform to replace with true.

ravipesala · 2014-12-21

OK, done in two steps.
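
The two-step approach suggested above can be sketched on a toy expression tree. This is only an illustration under stated assumptions: `Expr`, `InSubquery`, and the local `collect` and `transform` helpers are hypothetical stand-ins, not Catalyst's actual classes or the code in this PR.

```scala
object TwoStepRewriteSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Boolean) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  // Hypothetical stand-in for an IN-subquery predicate; the subquery is just a label here.
  final case class InSubquery(value: Expr, subquery: String) extends Expr

  // Pure bottom-up rewrite: rebuild children first, then apply the rule if it matches.
  def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val rebuilt = e match {
      case And(l, r)        => And(transform(l)(rule), transform(r)(rule))
      case InSubquery(v, s) => InSubquery(transform(v)(rule), s)
      case leaf             => leaf
    }
    if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
  }

  // Pure pre-order traversal that gathers every node the partial function accepts.
  def collect[A](e: Expr)(pf: PartialFunction[Expr, A]): Seq[A] = {
    val here = pf.lift(e).toList
    val below = e match {
      case And(l, r)        => collect(l)(pf) ++ collect(r)(pf)
      case InSubquery(v, _) => collect(v)(pf)
      case _                => Seq.empty[A]
    }
    here ++ below
  }

  // Step 1 gathers the subquery predicates; step 2 replaces them with true.
  // Neither step mutates state from inside the traversal.
  def extractSubqueries(cond: Expr): (Seq[InSubquery], Expr) = {
    val found = collect[InSubquery](cond) { case p: InSubquery => p }
    val remaining = transform(cond) { case _: InSubquery => Lit(true) }
    (found, remaining)
  }
}
```

Keeping the two passes separate means the rewrite rule stays a pure function, so the result does not depend on how many times or in which order the traversal visits a node.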

marmbrus · 2014-12-17T21:35:45Z

Sorry for the delay reviewing this, but I finally had time to take a more thorough look at the code. I think this will be a really cool feature to have, but there is still some work to be done in the analysis rule. At a high level, I think what needs to be done is as follows. First, make sure the subtree is already analyzed, which should simplify some things. Second, when possible we should be working with attributes in a way that uses expression ids (i.e. using AttributeSet and AttributeMap when possible).
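
To illustrate the point about expression ids, here is a minimal sketch of why matching attributes by id is more robust than matching by name. The `ExprId`, `Attribute`, and `AttrMap` types below are simplified, hypothetical models written for this example; they are not Catalyst's real AttributeSet or AttributeMap.

```scala
object ExprIdSketch {
  final case class ExprId(id: Long)
  final case class Attribute(name: String, exprId: ExprId)

  // In a query such as "src a ... in (select b.key from src b)",
  // both sides expose a column named "key" but with distinct ids.
  val outerKey: Attribute = Attribute("key", ExprId(1L))
  val innerKey: Attribute = Attribute("key", ExprId(2L))

  // Name-based lookup cannot tell the two apart.
  def resolveByName(attrs: Seq[Attribute], name: String): Seq[Attribute] =
    attrs.filter(_.name == name)

  // Id-keyed map: lookups ignore the display name entirely.
  final class AttrMap[A](entries: Seq[(Attribute, A)]) {
    private val byId: Map[ExprId, A] =
      entries.map { case (attr, value) => attr.exprId -> value }.toMap

    def get(attr: Attribute): Option[A] = byId.get(attr.exprId)
  }
}
```

With an id-keyed structure, a renamed or aliased reference to the same attribute still resolves to the same entry, while two unrelated columns that happen to share a name stay distinct.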

SparkQA · 2015-02-27T20:41:44Z

Test build #28083 has finished for PR 3249 at commit 7653eee.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(subquery: LogicalPlan) extends Expression
- logError("User class threw exception: " + cause.getMessage, cause)

chenghao-intel · 2015-02-28T01:12:11Z

@ravipesala, do you have any idea how to implement NOT IN? I believe we should consider how NOT IN will be implemented while doing IN. Or should they come within the same PR?

BTW, can you also enable the Hive compatibility tests such as subquery_in.q or subquery_in_having.q, if you think they are also supported by this PR?

chenghao-intel · 2015-02-28T01:16:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala

Instead of modeling the subquery as a fake expression, a better idea is probably to create a new logical plan node like

SubQueryIn(left: LogicalPlan, nested: LogicalPlan, isNotIn: Boolean)

That's also how I implemented EXISTS at https://github.com/apache/spark/pull/4812/files#diff-9a11e98e8f4bd1c4bb18ca6a7a7b8948R262

chenghao-intel · 2015-02-28T01:56:54Z

Thank you @ravipesala for implementing this. However, this PR probably involves some unnecessary join condition transformation; you probably need to understand the rule for pushing down join filters / conditions first. Sorry, please correct me if I misunderstood something.

ravipesala · 2015-03-02T18:26:23Z

@chenghao-intel Thank you for reviewing it. I will go through your comments and fix them. Regarding the NOT IN case, we can use a left outer join. I will try to add it to the same PR.
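
One way to read the left outer join idea for NOT IN is sketched below on in-memory collections: join, then keep only the left rows that found no match. The row types and the `leftOuterJoin` helper are hypothetical, and the sketch assumes neither side contains NULLs. In SQL, NOT IN returns no rows when the subquery produces a NULL, and that behavior is not modeled here.

```scala
object NotInSketch {
  // Hypothetical row shapes for illustration only.
  final case class R1Row(a: Int, c: String)
  final case class R2Row(b: Int)

  // Left outer join: unmatched left rows are paired with None.
  def leftOuterJoin[L, R](left: Seq[L], right: Seq[R])(on: (L, R) => Boolean): Seq[(L, Option[R])] =
    left.flatMap { l =>
      val matches = right.filter(r => on(l, r))
      if (matches.isEmpty) Seq((l, Option.empty[R]))
      else matches.map(r => (l, Option(r)))
    }

  // NOT IN expressed as "outer join, then keep rows with no right-hand match".
  def notInViaOuterJoin(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    leftOuterJoin(r1, r2)((l, r) => l.a == r.b).collect { case (l, None) => l.c }

  // Direct evaluation of the predicate, for comparison (NULL-free data assumed).
  def notInDirect(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] = {
    val bs = r2.map(_.b).toSet
    r1.filterNot(l => bs.contains(l.a)).map(_.c)
  }
}
```

Filtering on the missing right side is what turns the outer join into an anti-join; matched rows, including duplicates produced by repeated values in R2, are all discarded.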

marmbrus · 2015-04-03T00:17:12Z

Sorry for letting this languish. What is the status here and how does this relate to #4812?

SparkQA · 2015-06-24T20:23:41Z

Test build #35706 has finished for PR 3249 at commit 7653eee.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class SubqueryExpression(subquery: LogicalPlan) extends Expression

andrewor14 · 2015-09-02T01:55:52Z

@ravipesala can you answer @marmbrus' question and/or rebase this to master so we can decide how to proceed with it?

andrewor14 · 2015-09-02T01:56:13Z

Also cc @chenghao-intel who wrote the similar patch #4812

rxin · 2015-12-31T02:39:51Z

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

ravipesala force-pushed the SPARK-4226 branch 2 times, most recently from 3f63e1c to 59dfab5 Compare November 21, 2014 17:17

marmbrus reviewed Dec 2, 2014

ravipesala force-pushed the SPARK-4226 branch from 59dfab5 to 353b86b Compare December 3, 2014 11:24

marmbrus reviewed Dec 17, 2014

ravipesala force-pushed the SPARK-4226 branch from d62887e to 8cae35b Compare December 21, 2014 17:17

ravipesala and others added 12 commits February 28, 2015 00:46

Added new expression class

0134915

Added documentation and fixed style issues.

152fd23

Fixed style issue

9e361df

Added more comments to the code.

86a4430

Fixed style issues

4ee8c18

Fixed review comments

834acda

Fixed comments

f1b7d30

Fixed review comments

dc424df

Fixed compilation errors

4afc469

Fixed compilation errors

03db47b

Fixed compilation issues

a27cca6

Fixed test cases

7653eee

ravipesala force-pushed the SPARK-4226 branch from 036f3c6 to 7653eee Compare February 27, 2015 19:20

chenghao-intel reviewed Feb 28, 2015

chenghao-intel mentioned this pull request Oct 10, 2015

[SPARK-4226][SQL]Add subquery (not) in/exists support #9055

Closed

asfgit closed this in 7b4452b Dec 31, 2015


7 participants