[SPARK-24696][SQL] ColumnPruning rule fails to remove extra Project #21674

maryannxue · 2018-06-29T21:33:14Z

What changes were proposed in this pull request?

The ColumnPruning rule adds an extra Project when an input node produces more fields than needed. As a post-processing step, it must then remove the lower Project in any "Project - Filter - Project" pattern; otherwise it conflicts with PushPredicatesThroughProject and causes an infinite optimization loop. The current post-processing method is defined as:

  private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
    case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
      if p2.outputSet.subsetOf(child.outputSet) =>
      p1.copy(child = f.copy(child = child))
  }

This method works when there is only one Filter, but not when there are two or more. In this case, there is a deterministic filter and a non-deterministic filter, so they stay as separate Filter nodes and cannot be combined.

A simplified illustration of the optimization process that forms the infinite loop is shown below (F1 stands for the first filter, F2 for the second filter, P for Project, S for a scan of the relation, and PredicatePushDown is short for PushPredicatesThroughProject):

                             F1 - F2 - P - S
PredicatePushDown      =>    F1 - P - F2 - S
ColumnPruning          =>    F1 - P - F2 - P - S
                       =>    F1 - P - F2 - S        (Project removed)
PredicatePushDown      =>    P - F1 - F2 - S
ColumnPruning          =>    P - F1 - P - F2 - S
                       =>    P - F1 - P - F2 - P - S 
                       =>    P - F1 - F2 - P - S    (only one Project removed)
RemoveRedundantProject =>    F1 - F2 - P - S        (goes back to the loop start)

So the problem is that the ColumnPruning rule adds a Project under a Filter (and fails to remove it in the end), and that new Project triggers PushPredicatesThroughProject. Once the filters have been pushed through the Project, the ColumnPruning rule adds a new Project, and this goes on and on.
The fix: the rule still applies top-down when adding Projects, but the later removal of extra Projects goes bottom-up, so that every extra Project can be matched.
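The top-down versus bottom-up difference described above can be reproduced on a toy plan model. The sketch below is not Catalyst code: `Plan`, `Scan`, `Project`, `Filter`, `transformDown`, `transformUp`, and `render` are hypothetical stand-ins defined here for illustration, and the `outputSet.subsetOf` guard of the real rule is left out on the assumption that it always holds for these plans.

```scala
// Toy logical plan: a hypothetical stand-in for Catalyst's LogicalPlan.
sealed trait Plan
case object Scan extends Plan
case class Project(child: Plan) extends Plan
case class Filter(name: String, child: Plan) extends Plan

object PruneDemo {
  type Rule = PartialFunction[Plan, Plan]

  // Rebuild a node with f applied to its (single) child.
  def mapChildren(p: Plan, f: Plan => Plan): Plan = p match {
    case Scan         => Scan
    case Project(c)   => Project(f(c))
    case Filter(n, c) => Filter(n, f(c))
  }

  // Pre-order: apply the rule to the node, then recurse into the result's children.
  def transformDown(p: Plan)(rule: Rule): Plan = {
    val after = if (rule.isDefinedAt(p)) rule(p) else p
    mapChildren(after, c => transformDown(c)(rule))
  }

  // Post-order: rewrite the children first, then apply the rule to the node.
  def transformUp(p: Plan)(rule: Rule): Plan = {
    val after = mapChildren(p, c => transformUp(c)(rule))
    if (rule.isDefinedAt(after)) rule(after) else after
  }

  // Simplified form of the rule: drop the lower Project in Project - Filter - Project.
  val removeProjectBeforeFilter: Rule = {
    case Project(Filter(n, Project(child))) => Project(Filter(n, child))
  }

  def render(p: Plan): String = p match {
    case Scan         => "S"
    case Project(c)   => "P - " + render(c)
    case Filter(n, c) => n + " - " + render(c)
  }

  // P - F1 - P - F2 - P - S, the plan ColumnPruning builds before cleanup.
  val plan: Plan = Project(Filter("F1", Project(Filter("F2", Project(Scan)))))

  def main(args: Array[String]): Unit = {
    println(render(transformDown(plan)(removeProjectBeforeFilter))) // P - F1 - F2 - P - S
    println(render(transformUp(plan)(removeProjectBeforeFilter)))   // P - F1 - F2 - S
  }
}
```

In the top-down pass, the match at the root consumes the middle Project, so the lowest Project is never again seen beneath a "Project - Filter" pair and survives. The bottom-up pass removes the lowest Project first, which leaves the upper pattern intact for the next match.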

How was this patch tested?

Added an optimization rule test in ColumnPruningSuite and an end-to-end test in SQLQuerySuite.

gatorsmile

LGTM

maropu · 2018-06-29T23:24:28Z

LGTM, too.

SparkQA · 2018-06-30T01:24:51Z

Test build #92485 has finished for PR 21674 at commit 11fde8b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-30T01:28:44Z

Test build #92486 has finished for PR 21674 at commit f45a8b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-06-30T01:51:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala

      }
    }

+    def rand(e: Long): Expression = Rand(Literal.create(e, LongType))


We can just use Rand(seed: Long). See object Rand in randomExpressions.

Since we already have a bunch of expressions here, I don't think it would hurt to add this one?

I mean: def rand(e: Long): Expression = Rand(e).

I addressed the comment when I merged the code.

viirya · 2018-06-30T01:56:59Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

    }
  }
+
+  test("SPARK-24696 ColumnPruning rule fails to remove extra Project") {


The test in Jira is simpler than this. Do we need to have two tables and a join? Why not just use the test in Jira?

The new unit test in ColumnPruningSuite.scala already covers that.

gatorsmile · 2018-06-30T06:53:51Z

Thanks! Merged to master/2.3

Author: maryannxue
Closes #21674 from maryannxue/spark-24696.

maryannxue added 2 commits June 29, 2018 14:27

SPARK-24696 ColumnPruning rule fails to remove extra Project

11fde8b

Refine test

f45a8b8

gatorsmile reviewed Jun 29, 2018

viirya reviewed Jun 30, 2018

asfgit closed this in 797971e Jun 30, 2018
