Skip to content
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -31,53 +31,43 @@ object CTESubstitution extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = {
LegacyBehaviorPolicy.withName(SQLConf.get.getConf(LEGACY_CTE_PRECEDENCE_POLICY)) match {
case LegacyBehaviorPolicy.EXCEPTION =>
assertNoNameConflictsInCTE(plan, inTraverse = false)
traverseAndSubstituteCTE(plan, inTraverse = false)
assertNoNameConflictsInCTE(plan)
traverseAndSubstituteCTE(plan)
case LegacyBehaviorPolicy.LEGACY =>
legacyTraverseAndSubstituteCTE(plan)
case LegacyBehaviorPolicy.CORRECTED =>
traverseAndSubstituteCTE(plan, inTraverse = false)
traverseAndSubstituteCTE(plan)
}
}

/**
* Check the plan to be traversed has naming conflicts in nested CTE or not, traverse through
* child, innerChildren and subquery for the current plan.
* child, innerChildren and subquery expressions for the current plan.
*/
private def assertNoNameConflictsInCTE(
plan: LogicalPlan,
inTraverse: Boolean,
cteNames: Set[String] = Set.empty): Unit = {
plan.foreach {
namesInChildren: Set[String] = Set.empty,
namesInExpressions: Set[String] = Set.empty): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we separate children and expression? Can we just have one parameter outerNames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this trick is necessary to find ambiguous situations in one traversal pass. Ambiguous here doesn't only mean the the same CTE name is used again in a nested CTE, but it is also important that we don't want to throw an exception when the legacy and the corrected substitution returns the same result. E.g.

WITH t(c) AS (SELECT 1)
SELECT max(c) FROM (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

returns 2 in both modes so we don't throw the exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really support this? I don't think it's possible at analysis time as the data can be from a table.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And from the current code, namesInExpressions is not used, just carry over with recursive calls.

Copy link
Contributor Author

@peter-toth peter-toth Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do. What we want to catch here is that legacy and corrected modes MAY produce different results due to the different substitution order on the same data. But there are cases like the above when both modes does the substitution in the same order and we can be sure that the result is the same.

We usually say that in legacy mode the outer CTE takes precedence while in corrected mode the inner one does.
While this statement is entirely true for corrected mode, the legacy mode is a bit weird in terms that in case of we have a nested conflicting CTE in a subquery the inner takes precedence (similar to the corrected mode). But if we have a nested CTE in a subquery expression or in an inner children the outer does (in contrast to the corrected mode).

namesInExpressions is used, when we traverse subquery expressions or inner children (CTE definitions) we pass namesInExpressions into namesInChildren.
I admit that the naming of these parameters might be confusing namesInChildren is what we actually check for conflicts.
namesInChildren is what we pass along when we traverse children (in this case the 2 different modes do the substitution in the same order and inner always takes precedence so we don't expand this set with new names when we encounter a With).
namesInExpressions is what we pass along when we traverse subquery expressions and inner children (in this case the 2 different modes do the substitution different order). But once we encounter a subquery expressions or an inner children we need to switch the possible conflicting set to the namesInExpressions.

Copy link
Contributor

@cloud-fan cloud-fan Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WITH t(c) AS (SELECT 1)
SELECT max(c) FROM (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)

This query can't be parsed by Spark 2.4. Seems we support CTE in subquery since 3.0. So legacy behavior is the same as the correct behavior.

However, it's a subquery, not subquery expression. I think namesInExpressions is a confusing name. How about naming the two parameters as outerCTERelationNames and namesInSubqueries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, renamed.

plan match {
case w @ With(child, relations) =>
val newNames = relations.map {
case (cteName, _) =>
if (cteNames.contains(cteName)) {
if (namesInChildren.contains(cteName)) {
throw new AnalysisException(s"Name $cteName is ambiguous in nested CTE. " +
s"Please set ${LEGACY_CTE_PRECEDENCE_POLICY.key} to CORRECTED so that name " +
"defined in inner CTE takes precedence. If set it to LEGACY, outer CTE " +
"definitions will take precedence. See more details in SPARK-28228.")
} else {
cteName
}
}.toSet
child.transformExpressions {
case e: SubqueryExpression =>
assertNoNameConflictsInCTE(e.plan, inTraverse = true, cteNames ++ newNames)
e
}
w.innerChildren.foreach { p =>
assertNoNameConflictsInCTE(p, inTraverse = true, cteNames ++ newNames)
}

case other if inTraverse =>
other.transformExpressions {
case e: SubqueryExpression =>
assertNoNameConflictsInCTE(e.plan, inTraverse = true, cteNames)
e
}
}.toSet ++ namesInExpressions
assertNoNameConflictsInCTE(child, namesInChildren, newNames)
w.innerChildren.foreach(assertNoNameConflictsInCTE(_, newNames, newNames))

case _ =>
case other =>
other.subqueries.foreach(
assertNoNameConflictsInCTE(_, namesInExpressions, namesInExpressions))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass namesInExpressions twice?

Copy link
Contributor Author

@peter-toth peter-toth Apr 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do. What would you pass in here?

other.children.foreach(assertNoNameConflictsInCTE(_, namesInChildren, namesInExpressions))
}
}

Expand Down Expand Up @@ -131,37 +121,27 @@ object CTESubstitution extends Rule[LogicalPlan] {
* SELECT * FROM t
* )
* @param plan the plan to be traversed
* @param inTraverse whether the current traverse is called from another traverse, only in this
* case name collision can occur
* @return the plan where CTE substitution is applied
*/
private def traverseAndSubstituteCTE(plan: LogicalPlan, inTraverse: Boolean): LogicalPlan = {
private def traverseAndSubstituteCTE(plan: LogicalPlan): LogicalPlan = {
plan.resolveOperatorsUp {
case With(child: LogicalPlan, relations) =>
// child might contain an inner CTE that has priority so traverse and substitute inner CTEs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an incorrect assumption that only direct child can contain a nested, conflicting CTE as subquery expression. As the added new test case shows such a conflicting CTE can be in any node under With.

// in child first
val traversedChild: LogicalPlan = child transformExpressions {
case e: SubqueryExpression => e.withNewPlan(traverseAndSubstituteCTE(e.plan, true))
}

// Substitute CTE definitions from last to first as a CTE definition can reference a
// previous one
relations.foldRight(traversedChild) {
relations.foldRight(child) {
case ((cteName, ctePlan), currentPlan) =>
// A CTE definition might contain an inner CTE that has priority, so traverse and
// substitute CTE defined in ctePlan.
// A CTE definition might not be used at all or might be used multiple times. To avoid
// computation if it is not used and to avoid multiple recomputation if it is used
// multiple times we use a lazy construct with call-by-name parameter passing.
lazy val substitutedCTEPlan = traverseAndSubstituteCTE(ctePlan, true)
lazy val substitutedCTEPlan = traverseAndSubstituteCTE(ctePlan)
substituteCTE(currentPlan, cteName, substitutedCTEPlan)
}

// CTE name collision can occur only when inTraverse is true, it helps to avoid eager CTE
// substitution in a subquery expression.
case other if inTraverse =>
case other =>
other.transformExpressions {
case e: SubqueryExpression => e.withNewPlan(traverseAndSubstituteCTE(e.plan, true))
case e: SubqueryExpression => e.withNewPlan(traverseAndSubstituteCTE(e.plan))
}
}
}
Expand Down
8 changes: 8 additions & 0 deletions sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
Original file line number Diff line number Diff line change
Expand Up @@ -103,3 +103,11 @@ SELECT (
SELECT * FROM t
)
);

-- CTE in subquery expression shadows outer 4
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
);
15 changes: 14 additions & 1 deletion sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 12
-- Number of queries: 13


-- !query
Expand Down Expand Up @@ -166,3 +166,16 @@ SELECT (
struct<scalarsubquery():int>
-- !query output
1


-- !query
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
-- !query schema
struct<c:int>
-- !query output
1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding a legacy result.

16 changes: 15 additions & 1 deletion sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 12
-- Number of queries: 13


-- !query
Expand Down Expand Up @@ -172,3 +172,17 @@ struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Name t is ambiguous in nested CTE. Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name defined in inner CTE takes precedence. If set it to LEGACY, outer CTE definitions will take precedence. See more details in SPARK-28228.;


-- !query
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Name t is ambiguous in nested CTE. Please set spark.sql.legacy.ctePrecedencePolicy to CORRECTED so that name defined in inner CTE takes precedence. If set it to LEGACY, outer CTE definitions will take precedence. See more details in SPARK-28228.;
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 12
-- Number of queries: 13


-- !query
Expand Down Expand Up @@ -166,3 +166,16 @@ SELECT (
struct<scalarsubquery():int>
-- !query output
3


-- !query
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
-- !query schema
struct<c:int>
-- !query output