-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-13919] [SQL] fix column pruning through filter #11828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
ffe1270
d956aad
6c474a5
b26d1c0
920de45
b1118e5
6e698cc
580bf4e
6a41cd4
bb8f0cc
6ac25f7
cd7132e
d2da9e4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -306,21 +306,21 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper { | |
| } | ||
|
|
||
| /** | ||
| * Attempts to eliminate the reading of unneeded columns from the query plan using the following | ||
| * transformations: | ||
| * Attempts to eliminate the reading of unneeded columns from the query plan. | ||
| * | ||
| * - Inserting Projections beneath the following operators: | ||
| * - Aggregate | ||
| * - Generate | ||
| * - Project <- Join | ||
| * - LeftSemiJoin | ||
| * Since adding Project before Filter conflicts with PushPredicatesThroughProject, this rule will | ||
| * remove the Project p2 in the following pattern: | ||
| * | ||
| * p1 @ Project(_, Filter(_, p2 @ Project(_, child))) if p2.outputSet.subsetOf(p2.inputSet) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is our target, why not add a new case before the last case, and handle
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to insert the p2 to pruning the columns further, for example Project(Filter(Join)), we need p2 to prune the columns from Join. |
||
| * | ||
| * p2 is usually inserted by this rule and useless, p1 could prune the columns anyway. | ||
| */ | ||
| object ColumnPruning extends Rule[LogicalPlan] { | ||
| private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean = | ||
| output1.size == output2.size && | ||
| output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2)) | ||
|
|
||
| def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
| def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here, we are using
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Column pruning have to be from top to bottom, or you will need multiple run of this rule. The added Projection is exactly the same whenever you go from top or bottom. If going from bottom, it will not work sometimes (because the added Project will be moved by other rules, for sample filter push down). Have you actually see the stack overflow on this rule? I donot think so.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we are using case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child)))
if p2.outputSet.subsetOf(child.outputSet) =>
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I saw the stack overflow in my local environment.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think my PR: #11745 covers all the cases even if we change it from
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should not change transform to transformUp, it will be great if you can post a test case that cause StackOverflow, thanks!
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will do it tonight. I did not have it now.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unable to reproduce the stack overflow now, if we keep the following lines in // Eliminate no-op Projects
case p @ Project(projectList, child) if sameOutput(child.output, p.output) => childIf we remove the above line, we will get the stack overflow easily because we can generate duplicate
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is no reason we should remove this line.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If |
||
| // Prunes the unused columns from project list of Project/Aggregate/Expand | ||
| case p @ Project(_, p2: Project) if (p2.outputSet -- p.references).nonEmpty => | ||
| p.copy(child = p2.copy(projectList = p2.projectList.filter(p.references.contains))) | ||
|
|
@@ -399,7 +399,7 @@ object ColumnPruning extends Rule[LogicalPlan] { | |
| } else { | ||
| p | ||
| } | ||
| } | ||
| }) | ||
|
|
||
| /** Applies a projection only when the child is producing unnecessary attributes */ | ||
| private def prunedChild(c: LogicalPlan, allReferences: AttributeSet) = | ||
|
|
@@ -408,6 +408,16 @@ object ColumnPruning extends Rule[LogicalPlan] { | |
| } else { | ||
| c | ||
| } | ||
|
|
||
| /** | ||
| * The Project before Filter is not necessary but conflict with PushPredicatesThroughProject, | ||
| * so remove it. | ||
| */ | ||
| private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same here. We still need to explicitly use
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We usually use I think it's fine, or we should update all these places.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's still correct if someone change
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. Is that possible there are two continuous
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. two continuous Project will be combined together by other rules.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child))) | ||
| if p2.outputSet.subsetOf(child.outputSet) => | ||
| p1.copy(child = f.copy(child = child)) | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,6 +34,7 @@ class ColumnPruningSuite extends PlanTest { | |
|
|
||
| object Optimize extends RuleExecutor[LogicalPlan] { | ||
| val batches = Batch("Column pruning", FixedPoint(100), | ||
| PushPredicateThroughProject, | ||
| ColumnPruning, | ||
| CollapseProject) :: Nil | ||
| } | ||
|
|
@@ -133,12 +134,16 @@ class ColumnPruningSuite extends PlanTest { | |
|
|
||
| test("Column pruning on Filter") { | ||
| val input = LocalRelation('a.int, 'b.string, 'c.double) | ||
| val plan1 = Filter('a > 1, input).analyze | ||
| comparePlans(Optimize.execute(plan1), plan1) | ||
| val query = Project('a :: Nil, Filter('c > Literal(0.0), input)).analyze | ||
| val expected = | ||
| Project('a :: Nil, | ||
| Filter('c > Literal(0.0), | ||
| Project(Seq('a, 'c), input))).analyze | ||
| comparePlans(Optimize.execute(query), expected) | ||
| comparePlans(Optimize.execute(query), query) | ||
| val plan2 = Filter('b > 1, Project(Seq('a, 'b), input)).analyze | ||
| val expected2 = Project(Seq('a, 'b), Filter('b > 1, input)).analyze | ||
| comparePlans(Optimize.execute(plan2), expected2) | ||
| val plan3 = Project(Seq('a), Filter('b > 1, Project(Seq('a, 'b), input))).analyze | ||
| val expected3 = Project(Seq('a), Filter('b > 1, input)).analyze | ||
| comparePlans(Optimize.execute(plan3), expected3) | ||
| } | ||
|
|
||
| test("Column pruning on except/intersect/distinct") { | ||
|
|
@@ -297,7 +302,7 @@ class ColumnPruningSuite extends PlanTest { | |
| SortOrder('b, Ascending) :: Nil, | ||
| UnspecifiedFrame)).as('window) :: Nil, | ||
| 'a :: Nil, 'b.asc :: Nil) | ||
| .select('a, 'c, 'window).where('window > 1).select('a, 'c).analyze | ||
| .where('window > 1).select('a, 'c).analyze | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any reason why removing
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If so, it becomes harder for Optimizer to judge which plan is better. Based on my understanding, the general principle of Could you explain the current strategy for this rule? We might need to add more test cases to check if it does the desired work.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After more thinking, can we modify the existing operator
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added comment for that. I don't think that's necessary or good idea to add the functionality of Project into Filter.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. It is easier to understand it now. : ) |
||
|
|
||
| val optimized = Optimize.execute(originalQuery.analyze) | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved from ResolveStar without any changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, this can speed up resolution for nested plans, thanks for fixing it!