-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4226][SQL] SparkSQL - Add support for subqueries in predicates('in' clause) #3249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3f63e1c to
59dfab5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document what the arguments are here? It's not clear to me why we need exp (i.e., why can we just get it from the output of child. Also, style-wise I'd avoid abbreviation when not necessary and child is kind of an odd name given that its a different type of tree. Finally, should this be a LeafExpression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your comments.
Here exp is like predicate value. For example
SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b) . In this exp is a.key and child is subquery.
Now I have updated the names of them and added the documentation.
|
Cool feature! Thanks for working on this. |
|
ok to test |
|
Test build #24010 has finished for PR 3249 at commit
|
|
Hi @ravipesala, can you rebase this PR |
59dfab5 to
353b86b
Compare
|
Rebased with master. And fixed comments |
|
Test build #24090 has finished for PR 3249 at commit
|
|
Test build #24091 has finished for PR 3249 at commit
|
|
Test build #24177 has finished for PR 3249 at commit
|
|
Test build #24178 has finished for PR 3249 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space before {.
Also I would consider doing this in two steps to avoid depending on transform for side effects: a collect to get the list and then a transform to replace with true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Done in two steps.
|
Sorry for the delay reviewing this, but I finally had time to take a more thorough look at the code. I think this will be a really cool feature to have but I think there is some work that still needs to be done in the analysis rule. At a high level, I think what needs to be done is as follows: make sure the sub tree is already analyzed, which should simplify some things. Second, when possible we should be working with attributes in a way that uses expression ids (i.e. using |
d62887e to
8cae35b
Compare
036f3c6 to
7653eee
Compare
|
Test build #28083 has finished for PR 3249 at commit
|
|
@ravipesala, do you have any idea how to implement the BTW, can you also enable the hive compatible test like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making the Subquery as a fake expression, a better idea probably create a new logical plan like
SubQueryIn(left: LogicalPlan, nested: LogicalPlan, isNotIn:Boolean)
That's also how I implement the EXISTS at https://github.com/apache/spark/pull/4812/files#diff-9a11e98e8f4bd1c4bb18ca6a7a7b8948R262
|
Thank you @ravipesala for implementing this, however, this PR probably involve some unnecessary join condition transformation, probably you need to understand the rule of pushing down the join filter / condition first. Sorry, please correct me if I misunderstood something. |
|
@chenghao-intel Thank you for reviewing it.I will go through your comments and fix it. And regarding |
|
Sorry for letting this languish. What is the status here and how does this relate to #4812? |
|
Test build #35706 has finished for PR 3249 at commit
|
|
@ravipesala can you answer @marmbrus' question and/or rebase this to master so we can decide how to proceed with it? |
|
Also cc @chenghao-intel who wrote the similar patch #4812 |
|
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
This PR supports subqueries in preicates 'in' clause. The queries will be transformed to the LeftSemi join as mentioned below.
Case 1 Uncorelated queries
-- original query
select C
from R1
where R1.A in (Select B from R2)
-- rewritten query
Select C
from R1 left semijoin R2 on R1.A = R2.B
Case 2 Corelated queries
-- original query
select C
from R1
where R1.A in (Select B from R2 where R1.X = R2.Y)
-- rewritten query
select C
from R1 left semi join
(select B, R2.Y as sq1_col0 from R2) sq1
on R1.X = sq1.sq1_col0 and R1.A = sq1.B
Restriction : Alias need to be used as we convert it into join queries.
Complete specification is available in https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf