Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Aug 25, 2017

What changes were proposed in this pull request?

Correlated predicate subqueries are rewritten into Join by the rule RewritePredicateSubquery during optimization.

It is possibly that the two sides of the Join have conflicting attributes. The query plans produced by RewritePredicateSubquery become unresolved and break structural integrity.

We should check if there are conflicting attributes in the Join and de-duplicate them by adding a Project.

How was this patch tested?

Added tests.

@SparkQA
Copy link

SparkQA commented Aug 25, 2017

Test build #81117 has finished for PR 19050 at commit edb6271.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Aug 25, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Aug 25, 2017

Test build #81121 has finished for PR 19050 at commit edb6271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just move this into a separate file? The analyzer is way to big as it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain when this becomes a problem? LeftSemi/LeftAnti joins do output attributes from the right hand side of the join. Is the condition messed up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicateResolved in logical Join doesn't consider if LeftSemi/LeftAnti or not.

But as LeftSemi/LeftAnti only do output from one side (left.output in logical.Join), should we revise duplicateResolved for logical.Join? Then we don't need to dedup here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Just for my understanding, Can you please put an example query that depicts this problem ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example:

Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
sql(s"CREATE TABLE t1 USING parquet LOCATION '${path.toURI}'")

val sqlText =
  """
    |SELECT * FROM t1
    |WHERE
    |NOT EXISTS (SELECT * FROM t1)
  """.stripMargin
  val ds = sql(sqlText)
  println(ds.queryExecution.analyzed)
  println(ds.queryExecution.optimizedPlan)


Project [i#236, j#237]
+- Filter NOT exists#235 []
   :  +- Project [i#236, j#237]
   :     +- SubqueryAlias t1
   :        +- Relation[i#236,j#237] parquet
   +- SubqueryAlias t1
      +- Relation[i#236,j#237] parquet

'Join LeftAnti
:- Relation[i#236,j#237] parquet
+- Relation[i#236,j#237] parquet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to fix duplicateResolved.

We should add a project when outputs are duplicate. Could you check this for a correlated subquery?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Thanks for suggestion.

Copy link
Contributor

@dilipbiswal dilipbiswal Aug 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Thanks. I think, we should be de-duping the subquery output in case of correlated subquery. Thats why i was wondering about how we are ending up with duplicate attributes.
@hvanhovell I had a question. Do we anticipate to have future rewrites to convert left-semi/left-anti joins to other types join such as inner join ? Can this be a problem then ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell Because predicate subqueries are rewritten into left semi/anti joins which don't have duplicate outputs. I think you mean correlated scalar subqueries which are rewritten into left outer joins, is it right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell We may not be able to revise duplicateResolved.

Even LeftSemi do output from only left side of the join, we still need duplicateResolved as false if there are duplicate attributes between left and right sides.

Otherwise, if there is a condition, the condition will be pushdown to left side of the join, because all attribute references in the condition is belonging to one side. It changes the join results.

@viirya viirya force-pushed the SPARK-21835 branch 2 times, most recently from 121ad5a to bf07e2a Compare August 28, 2017 04:22
@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81173 has started for PR 19050 at commit bf07e2a.

@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81172 has finished for PR 19050 at commit 121ad5a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-21835][SQL][WIP] RewritePredicateSubquery should not produce unresolved query plans [SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans Aug 28, 2017
@SparkQA
Copy link

SparkQA commented Aug 28, 2017

Test build #81179 has finished for PR 19050 at commit 82b5bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Aug 28, 2017

ping @hvanhovell Does the current change look good for you?

@viirya
Copy link
Member Author

viirya commented Aug 28, 2017

also cc @cloud-fan for review. Thanks.

@viirya
Copy link
Member Author

viirya commented Aug 31, 2017

ping @cloud-fan @hvanhovell Can you have time to review this? Thanks.

@viirya
Copy link
Member Author

viirya commented Sep 4, 2017

ping @cloud-fan @hvanhovell This blocks the #18956 going forward, can you help review this change? Thanks.

@viirya
Copy link
Member Author

viirya commented Sep 5, 2017

@cloud-fan and @hvanhovell seems too busy these days. maybe @gatorsmile can also help review this. Thanks.

}
}

def dedupJoin(plan: LogicalPlan): LogicalPlan = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> private def

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.


def dedupJoin(plan: LogicalPlan): LogicalPlan = {
plan transform {
case j @ Join(left, right, joinType, joinCond) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All join types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be safe, this only makes sense for LeftAnti and LeftSemi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
Project(p.output, Filter(newCond.get, inputPlan))
}
dedupJoin(rewritten)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment above this line to explain it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rethinking it, we can be more conservative. Instead of doing a dedup at the end, we should do it when we convert it to the Join.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Will follow it.

|WHERE
|NOT EXISTS (SELECT * FROM t1)
""".stripMargin
val ds = sql(sqlText)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useless ds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, missing this. I'll remove it.

val optimizedPlan = sql(sqlText).queryExecution.optimizedPlan
val join = optimizedPlan.collect {
case j: Join => j
}.head.asInstanceOf[Join]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        val join = optimizedPlan.collectFirst { case j: Join => j }.get

val joinNodes = optimizedPlan.collect {
case j: Join => j
}.map(_.asInstanceOf[Join])
joinNodes.map(j => assert(j.duplicateResolved))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        val joinNodes = optimizedPlan.collect { case j: Join => j }
        joinNodes.foreach(j => assert(j.duplicateResolved))

@gatorsmile
Copy link
Member

LGTM pending Jenkins

@viirya
Copy link
Member Author

viirya commented Sep 6, 2017

Thanks @gatorsmile

@SparkQA
Copy link

SparkQA commented Sep 6, 2017

Test build #81439 has finished for PR 19050 at commit c1325fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 6, 2017

Test build #81440 has finished for PR 19050 at commit 8550828.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merging to master

@asfgit asfgit closed this in f2e22ae Sep 6, 2017
mahak pushed a commit to mahak/spark that referenced this pull request Sep 7, 2017
…duce unresolved query plans

## What changes were proposed in this pull request?

This is a follow-up of apache#19050 to deal with `ExistenceJoin` case.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#19151 from viirya/SPARK-21835-followup.
@viirya viirya deleted the SPARK-21835 branch December 27, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants