Skip to content

Conversation

@nsyca
Copy link
Contributor

@nsyca nsyca commented Nov 24, 2016

What changes were proposed in this pull request?

  • Raise Analysis exception when correlated predicates exist in the descendant operators of either operand of a Full outer join in a subquery as well as in a FOJ operator itself
  • Raise Analysis exception when correlated predicates exists in a Window operator (a side effect inadvertently introduced by SPARK-17348)

How was this patch tested?

Run sql/test catalyst/test and new test cases, added to SubquerySuite, showing the reported incorrect results.

nsyca added 21 commits July 29, 2016 17:43
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
// in a Full (Outer) Join operator and its descendants
case j @ Join(left, right, FullOuter, _) =>
failOnOuterReference(j)
failOnOuterReferenceInSubTree(left, "a FULL OUTER JOIN")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call failOnOuterReferenceInSubTree(j, "a FULL OUTER JOIN") you only need to do that once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@nsyca
Copy link
Contributor Author

nsyca commented Nov 24, 2016

@hvanhovell I want to get your opinion on this. The more I read the code in this block of pullOutCorrelatedPredicates, I feel it may have subtle side effects in the future.

  // Simplify the predicates before pulling them out.
  val transformed = BooleanSimplification(sub) transformUp {
    case f @ Filter(cond, child) => ...
    case p @ Project(expressions, child) => ...
    case a @ Aggregate(grouping, expressions, child) => ...
    case w : Window => ...
    case j @ Join(left, _, RightOuter, _) => ...
    case j @ Join(left, right, FullOuter, _) => ...
    case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
    case u: Union => ...
    case s: SetOperation => ...
    case e: Expand => ...
    case l : LocalLimit => ...
    case g : GlobalLimit => ...
    case s : Sample => ...
    case p =>
      failOnOuterReference(p)
      ...
  }

The code disallows operators in a sub plan of an operator hosting correlation on a case by case basis. As it is today, it only blocks Union/Intersect/Except/Expand/LocalLimit/GlobalLimit/Sample/FOJ and right table of LOJ (and left table of ROJ). That means any LogicalPlan operators that are not in the list above are permitted to be under a correlation point. Is this risky? There are many (30+ at least from browsing the LogicalPlan type hierarchy) operators derived from LogicalPlan class. Should we whitelist what operators allowed? For the case of ScalarSubquery, it explicitly checks that only SubqueryAlias/Project/Filter/Aggregate are allowed (CheckAnalysis.scala around line 126-165 in and after def cleanQuery). If we go this route, we should allow, in addition to the ones allowed in ScalarSubquery: Join, Distinct, Sort, OneRowRelation. I am debating about including Window though.

@hvanhovell
Copy link
Contributor

@nsyca +1000 on whitelisting operators. That is what we should have done from the start.

Let's break it down:

  1. LeafNodes should not be a problem. We don't need to explicitly handle them
  2. We should allow the following UnaryNode: Project, Filter, Aggregate, SubqueryAlias, Distinct, Generate (only when join=true), BroadcastHint, Sort, Repartition & RedistributeData (parent of SortPartitions and RepartitionByExpression). We need to find out what other systems allow for Window.
  3. The only BinaryNode we should allow is Join with special cases for Left/Right/Full. We should also make sure that the LeftAnti and LeftSemi are handled properly.

@SparkQA
Copy link

SparkQA commented Nov 24, 2016

Test build #69135 has finished for PR 16005 at commit 099f75d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nsyca
Copy link
Contributor Author

nsyca commented Nov 24, 2016

@hvanhovell I will work on the whitelist in a new JIRA under SPARK-18455. It will be my top priority task and I hope we can merge it in the next minor release of 2.0.x. Let's have this PR scoped for the FOJ and Window cases. Shall we?

@SparkQA
Copy link

SparkQA commented Nov 24, 2016

Test build #69137 has finished for PR 16005 at commit 5d6cc8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

hvanhovell commented Nov 24, 2016

LGTM. Merging to master/2.1. Thanks!

asfgit pushed a commit that referenced this pull request Nov 24, 2016
…orrect results

## What changes were proposed in this pull request?

- Raise Analysis exception when correlated predicates exist in the descendant operators of either operand of a Full outer join in a subquery as well as in a FOJ operator itself
- Raise Analysis exception when correlated predicates exists in a Window operator (a side effect inadvertently introduced by SPARK-17348)

## How was this patch tested?

Run sql/test catalyst/test and new test cases, added to SubquerySuite, showing the reported incorrect results.

Author: Nattavut Sutyanyong <[email protected]>

Closes #16005 from nsyca/FOJ-incorrect.1.

(cherry picked from commit a367d5f)
Signed-off-by: Herman van Hovell <[email protected]>
@hvanhovell
Copy link
Contributor

Hmmm. I cannot merge to 2.0 :(... Can you open a backport against 2.0?

@asfgit asfgit closed this in a367d5f Nov 24, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
…orrect results

## What changes were proposed in this pull request?

- Raise Analysis exception when correlated predicates exist in the descendant operators of either operand of a Full outer join in a subquery as well as in a FOJ operator itself
- Raise Analysis exception when correlated predicates exists in a Window operator (a side effect inadvertently introduced by SPARK-17348)

## How was this patch tested?

Run sql/test catalyst/test and new test cases, added to SubquerySuite, showing the reported incorrect results.

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16005 from nsyca/FOJ-incorrect.1.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…orrect results

## What changes were proposed in this pull request?

- Raise Analysis exception when correlated predicates exist in the descendant operators of either operand of a Full outer join in a subquery as well as in a FOJ operator itself
- Raise Analysis exception when correlated predicates exists in a Window operator (a side effect inadvertently introduced by SPARK-17348)

## How was this patch tested?

Run sql/test catalyst/test and new test cases, added to SubquerySuite, showing the reported incorrect results.

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16005 from nsyca/FOJ-incorrect.1.
@nsyca nsyca deleted the FOJ-incorrect.1 branch March 14, 2017 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants