fix column pruning through filter
Davies Liu committed Mar 18, 2016
commit ffe12702a074b71833b84f4d8bf081005767da27
@@ -71,7 +71,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
PushPredicateThroughAggregate,
LimitPushDown,
ColumnPruning,
-InferFiltersFromConstraints,
+// TODO: enable this once it's fixed
+// InferFiltersFromConstraints,
// Operator combine
CollapseRepartition,
CollapseProject,
@@ -305,21 +306,14 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper {
}

/**
-* Attempts to eliminate the reading of unneeded columns from the query plan using the following
-* transformations:
-*
-* - Inserting Projections beneath the following operators:
-* - Aggregate
-* - Generate
-* - Project <- Join
-* - LeftSemiJoin
+* Attempts to eliminate the reading of unneeded columns from the query plan.
*/
object ColumnPruning extends Rule[LogicalPlan] {
private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
output1.size == output2.size &&
output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))

-def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
Member:
Here we are using transform, which is actually transformDown. In this rule, ColumnPruning, we could add many Project nodes to the child, which could easily cause a stack overflow. That is why my PR #11745 changes it to transformUp. Do you think this change makes sense?

Contributor Author:
Column pruning has to go from top to bottom, or you will need multiple runs of this rule. The added Project is exactly the same whether you go from the top or from the bottom. Going from the bottom will sometimes not work (because the added Project will be moved by other rules, for example filter push down).

Have you actually seen the stack overflow on this rule? I do not think so.

Member:
If we are using transformUp, the assumption made by removeProjectBeforeFilter does not hold. The following case does not cover all the cases:

case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
   if p2.outputSet.subsetOf(child.outputSet) =>

Member:
I saw the stack overflow in my local environment.

Member:
I think my PR #11745 covers all the cases, even if we change it from transform to transformUp.

Contributor Author:
We should not change transform to transformUp. It would be great if you could post a test case that causes a StackOverflow, thanks!

Member:
Will do it tonight. I do not have it at hand right now.

Member:
I am unable to reproduce the stack overflow now, as long as we keep the following lines in ColumnPruning:

    // Eliminate no-op Projects
    case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child

If we remove the above case, we get the stack overflow easily, because we can generate duplicate Projects. Anyway, I am fine if you want to use transformDown.

Contributor Author:
There is no reason we should remove this line.

Member:
If transformDown is required here, could you change transform to transformDown? I got this from the comment on the function transform:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L242-L243
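
Editor's sketch of the traversal-order point made in this thread: transform delegates to transformDown, so a rule sees a parent before its children, while transformUp sees the children first. The Node, Leaf, and Unary types below are toy stand-ins invented for illustration and are not Catalyst's TreeNode; only the names transform, transformDown, and transformUp come from the discussion above.

```scala
object TransformOrderSketch {
  // Toy tree, NOT Spark's TreeNode: just enough structure to show visit order.
  sealed trait Node {
    def children: Seq[Node]
    def withChildren(newChildren: Seq[Node]): Node

    // Pre-order: apply the rule to this node, then recurse into the result's children.
    def transformDown(rule: PartialFunction[Node, Node]): Node = {
      val after = if (rule.isDefinedAt(this)) rule(this) else this
      after.withChildren(after.children.map(_.transformDown(rule)))
    }

    // Post-order: rewrite the children first, then apply the rule to the rebuilt node.
    def transformUp(rule: PartialFunction[Node, Node]): Node = {
      val rebuilt = withChildren(children.map(_.transformUp(rule)))
      if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
    }

    // As noted in the thread, transform is simply transformDown.
    def transform(rule: PartialFunction[Node, Node]): Node = transformDown(rule)
  }

  final case class Leaf(name: String) extends Node {
    def children: Seq[Node] = Nil
    def withChildren(newChildren: Seq[Node]): Node = this
  }

  final case class Unary(name: String, child: Node) extends Node {
    def children: Seq[Node] = Seq(child)
    def withChildren(newChildren: Seq[Node]): Node = copy(child = newChildren.head)
  }

  // Records the order in which an identity rule is applied to the Unary nodes.
  def visitOrder(run: (Node, PartialFunction[Node, Node]) => Node, plan: Node): List[String] = {
    val seen = scala.collection.mutable.ListBuffer.empty[String]
    run(plan, { case u: Unary => seen += u.name; u })
    seen.toList
  }

  val plan: Node = Unary("Project", Unary("Filter", Leaf("t")))

  val downOrder: List[String] = visitOrder((p, r) => p.transformDown(r), plan)
  val upOrder: List[String] = visitOrder((p, r) => p.transformUp(r), plan)
  val defaultOrder: List[String] = visitOrder((p, r) => p.transform(r), plan)
}
```

In this toy model the top-down order lets a rule decide what the parent needs before the child is visited, which is the property the author relies on for column pruning.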

// Prunes the unused columns from project list of Project/Aggregate/Expand
case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty =>
p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains)))
@@ -398,7 +392,7 @@ object ColumnPruning extends Rule[LogicalPlan] {
} else {
p
}
-}
+})

/** Applies a projection only when the child is producing unnecessary attributes */
private def prunedChild(c: LogicalPlan, allReferences: AttributeSet) =
@@ -407,6 +401,16 @@ object ColumnPruning extends Rule[LogicalPlan] {
} else {
c
}

+/**
+* The Project before Filter is not necessary but conflict with PushPredicatesThroughProject,
+* so remove it.
+*/
+private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform {
Member:
Same here. We still need to explicitly use transformDown.

Contributor Author:
We usually use transform everywhere, even when we know that transformDown is better, for example in all the rules that push down a predicate.

I think it's fine; otherwise we should update all of those places.

Contributor Author:
It's still correct if someone suddenly changes transform to transformUp.

Member:
I see.

Is it possible that there are two consecutive Projects following the Filter?

Contributor Author:
Two consecutive Projects will be combined by other rules.

Member:
CollapseProject is called after this rule. Anyway, we can leave it as is if no test case fails because of it.

+case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
+if p2.outputSet.subsetOf(child.outputSet) =>
+p1.copy(child = f.copy(child = child))
+}
}

/**
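
Editor's sketch of what removeProjectBeforeFilter does, on a toy plan model. The Plan, Relation, Filter, and Project classes below are hypothetical simplifications that track column names as plain strings; they are not Catalyst's LogicalPlan classes, and the subset check on names only approximates the outputSet check in the diff.

```scala
object RemoveProjectSketch {
  // Toy logical plan: every node exposes the set of column names it outputs.
  sealed trait Plan { def output: Set[String] }
  final case class Relation(output: Set[String]) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan {
    def output: Set[String] = child.output
  }
  final case class Project(columns: Set[String], child: Plan) extends Plan {
    def output: Set[String] = columns
  }

  // Top-down rewrite: Project -> Filter -> Project(child) becomes Project -> Filter -> child
  // when the inner Project only selects columns the child already produces.
  def removeProjectBeforeFilter(plan: Plan): Plan = {
    val rewritten = plan match {
      case Project(cols, Filter(cond, Project(inner, child))) if inner.subsetOf(child.output) =>
        Project(cols, Filter(cond, child))
      case other => other
    }
    // Recurse into the children of the (possibly rewritten) node.
    rewritten match {
      case Project(cols, child) => Project(cols, removeProjectBeforeFilter(child))
      case Filter(cond, child)  => Filter(cond, removeProjectBeforeFilter(child))
      case r: Relation          => r
    }
  }

  val table: Relation = Relation(Set("a", "b", "c"))

  // The inner Project is a pure pruning Project, so it is dropped.
  val pruned: Plan =
    removeProjectBeforeFilter(Project(Set("a"), Filter("b > 1", Project(Set("a", "b"), table))))

  // The inner Project introduces a column "x" the child does not produce, so it is kept.
  val computed: Plan = Project(Set("a"), Filter("x > 1", Project(Set("a", "x"), table)))
  val kept: Plan = removeProjectBeforeFilter(computed)
}
```

The outer Project still restricts the final output, so dropping the inner pruning Project does not change the result in this model; it only leaves the Filter sitting directly on its input.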
@@ -98,7 +98,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
if (iteration > batch.strategy.maxIterations) {
// Only log if this is a rule that is supposed to run more than once.
if (iteration != 2) {
-logInfo(s"Max iterations (${iteration - 1}) reached for batch ${batch.name}")
+logWarning(s"Max iterations (${iteration - 1}) reached for batch ${batch.name}")
}
continue = false
}
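
Editor's sketch of the kind of fixed-point loop this hunk sits in: rules are applied in passes until the value stops changing or an iteration limit is reached. The runToFixedPoint helper and its hitLimit flag are hypothetical and are not RuleExecutor's API; in particular, this sketch reports the limit only when the last pass still changed the value, which is an assumption of the example rather than the behaviour of the code above.

```scala
object FixedPointSketch {
  // Hypothetical helper, NOT Spark's RuleExecutor: applies `rules` in order, pass after pass,
  // until a pass leaves the value unchanged or `maxIterations` passes have run.
  // Returns the final value and whether the limit was hit before convergence.
  def runToFixedPoint[T](start: T, rules: Seq[T => T], maxIterations: Int): (T, Boolean) = {
    var current = start
    var iteration = 1
    var continue = true
    var hitLimit = false
    while (continue) {
      val next = rules.foldLeft(current)((value, rule) => rule(value))
      iteration += 1
      if (iteration > maxIterations) {
        // This is the spot where the hunk above logs a warning instead of an info message.
        hitLimit = next != current
        continue = false
      }
      if (next == current) {
        continue = false
      }
      current = next
    }
    (current, hitLimit)
  }

  // Toy "rule": halve an integer until it reaches 1.
  val halve: Int => Int = n => if (n > 1) n / 2 else n

  val converged: (Int, Boolean) = runToFixedPoint(16, Seq(halve), 100)
  val truncated: (Int, Boolean) = runToFixedPoint(16, Seq(halve), 2)
}
```

Hitting the limit without converging usually means some rules keep undoing each other's work, which is presumably why the commit raises the message from info to warning.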