[SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala #32439

sigmod · 2021-05-04T22:14:49Z

What changes were proposed in this pull request?

Added the following TreePattern enums:

ALIAS
AND_OR
AVERAGE
GENERATE
INTERSECT
SORT
SUM
DISTINCT_LIKE
PROJECT
REPARTITION_OPERATION
UNION

Added tree traversal pruning to the following rules in Optimizer.scala:

EliminateAggregateFilter
RemoveRedundantAggregates
RemoveNoopOperators
RemoveNoopUnion
LimitPushDown
ColumnPruning
CollapseRepartition
OptimizeRepartition
OptimizeWindowFunctions
CollapseWindow
TransposeWindow
InferFiltersFromGenerate
InferFiltersFromConstraints
CombineUnions
CombineFilters
EliminateSorts
PruneFilters
EliminateLimits
DecimalAggregates
ConvertToLocalRelation
ReplaceDistinctWithAggregate
ReplaceIntersectWithSemiJoin
ReplaceExceptWithAntiJoin
RewriteExceptAll
RewriteIntersectAll
RemoveLiteralFromGroupExpressions
RemoveRepetitionFromGroupExpressions
OptimizeLimitZero
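
As a rough illustration of what pattern-based pruning does for the rules listed above, here is a minimal, self-contained sketch. It is a toy model and not Spark's code: `Node`, `Pattern`, `removeSort`, and `run` are hypothetical names invented for this example, and the real `TreeNode`/`TreePattern` implementation differs. Each node records which patterns occur anywhere in its subtree, so a rule that only cares about `SORT` can skip every subtree that has none.

```scala
// Toy model only: these types are simplified stand-ins, not Spark's classes.
object TreePatternPruningSketch {
  object Pattern extends Enumeration {
    val PROJECT, SORT, UNION = Value
  }

  final case class Node(name: String, own: Set[Pattern.Value], children: List[Node] = Nil) {
    // Patterns present anywhere in this subtree, computed once per node.
    lazy val patterns: Set[Pattern.Value] = own ++ children.flatMap(_.patterns)
    def containsPattern(p: Pattern.Value): Boolean = patterns.contains(p)
  }

  // Hypothetical rule: drop a Sort node and keep its single child.
  val removeSort: PartialFunction[Node, Node] = {
    case Node("Sort", _, child :: Nil) => child
  }

  // Top-down transform that skips a whole subtree when `cond` is false for its root.
  // Returns the rewritten plan and the number of nodes the rule was tried on.
  def run(plan: Node, cond: Node => Boolean): (Node, Int) = {
    var visited = 0
    def go(node: Node): Node =
      if (!cond(node)) node
      else {
        visited += 1
        val next = removeSort.applyOrElse(node, identity[Node])
        next.copy(children = next.children.map(go))
      }
    (go(plan), visited)
  }

  val plan: Node = Node("Union", Set(Pattern.UNION), List(
    Node("Project", Set(Pattern.PROJECT), List(Node("ScanA", Set.empty))),
    Node("Sort", Set(Pattern.SORT), List(
      Node("Project", Set(Pattern.PROJECT), List(Node("ScanB", Set.empty)))))
  ))

  val pruned: (Node, Int) = run(plan, _.containsPattern(Pattern.SORT))
  val unpruned: (Node, Int) = run(plan, _ => true)

  def main(args: Array[String]): Unit = {
    println(s"visited with pruning: ${pruned._2}, without pruning: ${unpruned._2}")
    println(s"same result: ${pruned._1 == unpruned._1}")
  }
}
```

In this toy plan both runs produce the same rewritten tree, but the pruned run only tries the rule on the nodes whose subtree contains a `Sort`.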

Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

perf diff:
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
RemoveRedundantAggregates | 51290766 | 67070477 | 1.31
RemoveNoopOperators | 192371141 | 196631275 | 1.02
RemoveNoopUnion | 49222561 | 43266681 | 0.88
LimitPushDown | 40885185 | 21672646 | 0.53
ColumnPruning | 2003406120 | 1285562149 | 0.64
CollapseRepartition | 40648048 | 72646515 | 1.79
OptimizeRepartition | 37813850 | 20600803 | 0.54
OptimizeWindowFunctions | 174426904 | 46741409 | 0.27
CollapseWindow | 38959957 | 24542426 | 0.63
TransposeWindow | 33533191 | 20414930 | 0.61
InferFiltersFromGenerate | 21758688 | 15597344 | 0.72
InferFiltersFromConstraints | 518009794 | 493282321 | 0.95
CombineUnions | 67694022 | 70550382 | 1.04
CombineFilters | 35265060 | 29005424 | 0.82
EliminateSorts | 57025509 | 19795776 | 0.35
PruneFilters | 433964815 | 465579200 | 1.07
EliminateLimits | 44275393 | 24476859 | 0.55
DecimalAggregates | 83143172 | 28816090 | 0.35
ReplaceDistinctWithAggregate | 21783760 | 18287489 | 0.84
ReplaceIntersectWithSemiJoin | 22311271 | 16566393 | 0.74
ReplaceExceptWithAntiJoin | 23838520 | 16588808 | 0.70
RewriteExceptAll | 32750296 | 29421957 | 0.90
RewriteIntersectAll | 29760454 | 21243599 | 0.71
RemoveLiteralFromGroupExpressions | 28151861 | 25270947 | 0.90
RemoveRepetitionFromGroupExpressions | 29587030 | 23447041 | 0.79
OptimizeLimitZero | 18081943 | 15597344 | 0.86
Accumulated | 4129959311 | 3112676285 | 0.75
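
For readers who want to re-derive the last column, the following small sketch recomputes experiment/baseline for a few rows copied from the table above. `PerfRatioSketch` and `ratio` are names made up for this example. A ratio below 1.0 means the rule spent less total time in the experiment; a ratio above 1.0 (for example CollapseRepartition) means it spent more.

```scala
object PerfRatioSketch {
  // (rule, baseline total, experiment total), copied from the table above.
  val rows: Seq[(String, Long, Long)] = Seq(
    ("ColumnPruning", 2003406120L, 1285562149L),
    ("CollapseRepartition", 40648048L, 72646515L),
    ("OptimizeWindowFunctions", 174426904L, 46741409L),
    ("Accumulated", 4129959311L, 3112676285L)
  )

  // experiment / baseline, rounded to two decimals as in the table.
  def ratio(baseline: Long, experiment: Long): BigDecimal =
    (BigDecimal(experiment) / BigDecimal(baseline))
      .setScale(2, BigDecimal.RoundingMode.HALF_UP)

  def main(args: Array[String]): Unit =
    rows.foreach { case (rule, base, exp) =>
      println(s"$rule: ${ratio(base, exp)}")
    }
}
```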

How was this patch tested?

Existing tests.

SparkQA · 2021-05-05T00:54:21Z

Test build #138144 has finished for PR 32439 at commit 7fbd213.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ExtractValue extends Expression

SparkQA · 2021-05-05T01:25:05Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42665/

SparkQA · 2021-05-05T01:27:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42665/

SparkQA · 2021-05-05T07:37:15Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42689/

SparkQA · 2021-05-05T07:51:41Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42691/

SparkQA · 2021-05-05T08:11:47Z

Test build #138170 has finished for PR 32439 at commit 7a6b879.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-05T08:20:14Z

Test build #138168 has finished for PR 32439 at commit 98830eb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-05T15:40:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42696/

SparkQA · 2021-05-05T15:45:13Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42696/

sigmod · 2021-05-05T17:46:01Z

@hvanhovell @gengliangwang @dbaliafroozeh @maryannxue this PR is ready for review.

SparkQA · 2021-05-05T18:37:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42701/

SparkQA · 2021-05-05T18:37:37Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42701/

SparkQA · 2021-05-05T19:28:13Z

Test build #138175 has finished for PR 32439 at commit 381e38f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-05T22:17:10Z

Test build #138180 has finished for PR 32439 at commit 564d8fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-06T02:38:23Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42707/

SparkQA · 2021-05-06T06:22:56Z

Test build #138186 has finished for PR 32439 at commit f376055.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maryannxue · 2021-05-11T03:50:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala


-  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
+  def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(
+    plan.transformWithPruning(AlwaysProcess.fn, ruleId) {


I might have missed something from the previous commit, but what is the difference between a regular transform and transformWithPruning(AlwaysProcess.fn, ...)?

transform internally calls
transformWithPruning(AlwaysProcess.fn, UnknownRuleId) ...

The argument comments are here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

Lines 434 to 441 in e08c40f

* @param cond a Lambda expression to prune tree traversals. If `cond.apply` returns false
* on a TreeNode T, skips processing T and its subtree; otherwise, processes
* T and its subtree recursively.
* @param ruleId is a unique Id for `rule` to prune unnecessary tree traversals. When it is
* UnknownRuleId, no pruning happens. Otherwise, if `rule` (with id `ruleId`)
* has been marked as in effective on a TreeNode T, skips processing T and its
* subtree. Do not pass it if the rule is not purely functional and reads a
* varying initial state for different invocations.

So here, plan.transformWithPruning(AlwaysProcess.fn, ruleId) means there is no pruning based on TreePattern bits, but there is pruning based on rule IDs: if the rule is known to be ineffective on a tree instance T, it is skipped the next time it is invoked on the same tree instance T.
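
The rule-ID half of that behaviour can be sketched with a toy model. This is an assumption-laden illustration, not Spark's implementation: `Node`, `RuleId`, `removeSort`, and `countApplications` are hypothetical names, and the condition is left out because `AlwaysProcess.fn` never prunes. A node remembers the rule IDs that left it unchanged, and a later invocation with the same ID on the same instance is skipped, while an unknown ID disables that bookkeeping.

```scala
import scala.collection.mutable

// Toy model only: simplified stand-ins for the idea, not Spark's TreeNode code.
object RuleIdPruningSketch {
  final case class RuleId(id: Int)
  val UnknownRuleId: RuleId = RuleId(-1) // stand-in: never recorded, never pruned

  final class Node(val name: String, val children: List[Node]) {
    // Rule ids already observed to leave this exact tree instance unchanged.
    private val ineffective = mutable.BitSet.empty
    def isIneffective(r: RuleId): Boolean =
      r != UnknownRuleId && ineffective.contains(r.id)
    def markIneffective(r: RuleId): Unit =
      if (r != UnknownRuleId) ineffective += r.id
  }

  // Hypothetical rule: drop a Sort node that has exactly one child.
  val removeSort: PartialFunction[Node, Node] = {
    case n if n.name == "Sort" && n.children.size == 1 => n.children.head
  }

  // Runs the rule top-down and returns how many nodes it was tried on.
  def countApplications(plan: Node, ruleId: RuleId): Int = {
    var applied = 0
    def go(node: Node): Node =
      if (node.isIneffective(ruleId)) node
      else {
        applied += 1
        val afterRule = removeSort.applyOrElse(node, identity[Node])
        val newChildren = afterRule.children.map(go)
        val unchanged = (afterRule eq node) &&
          newChildren.zip(node.children).forall { case (a, b) => a eq b }
        if (unchanged) { node.markIneffective(ruleId); node }
        else new Node(afterRule.name, newChildren)
      }
    go(plan)
    applied
  }

  private def freshPlan(): Node =
    new Node("Union", List(new Node("ScanA", Nil), new Node("ScanB", Nil)))

  private val planA = freshPlan()
  val firstKnown: Int = countApplications(planA, RuleId(7))
  val secondKnown: Int = countApplications(planA, RuleId(7))

  private val planB = freshPlan()
  val firstUnknown: Int = countApplications(planB, UnknownRuleId)
  val secondUnknown: Int = countApplications(planB, UnknownRuleId)

  def main(args: Array[String]): Unit = {
    println(s"known rule id:   first=$firstKnown second=$secondKnown")
    println(s"unknown rule id: first=$firstUnknown second=$secondUnknown")
  }
}
```

With a known rule ID, the second invocation on the same plan instance does no work in this model, whereas with the unknown ID every invocation walks the whole tree again.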

I see. Thanks!

SparkQA · 2021-05-11T08:08:32Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42888/

SparkQA · 2021-05-11T08:08:33Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42888/

gengliangwang · 2021-05-11T08:33:42Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

      case _ => RoundRobinPartitioning(numPartitions)
    }
  }
+


nit: unnecessary change.

gengliangwang

LGTM

SparkQA · 2021-05-11T11:45:46Z

Test build #138365 has finished for PR 32439 at commit 27e92ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShuffledJoin extends JoinCodegenSupport

maryannxue

LGTM

SparkQA · 2021-05-12T00:05:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42930/

SparkQA · 2021-05-12T00:05:37Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42930/

SparkQA · 2021-05-12T03:58:57Z

Test build #138408 has finished for PR 32439 at commit 0df10d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-12T04:06:59Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42940/

SparkQA · 2021-05-12T08:16:52Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42960/

SparkQA · 2021-05-12T08:37:47Z

Test build #138419 has finished for PR 32439 at commit 2408a18.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-05-12T12:31:25Z

Test build #138439 has finished for PR 32439 at commit 5027ebc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2021-05-12T12:41:56Z

Thanks, merging to master

dongjoon-hyun · 2021-05-12T16:04:54Z

Hi, @gengliangwang and @sigmod.
The last commit seems to break the Java 11 build consistently. Could you take a look at this? Thanks!

gengliangwang · 2021-05-12T16:52:26Z

@dongjoon-hyun Thanks! @sigmod and I are looking at it. We will revert this if we can't resolve it shortly.

### What changes were proposed in this pull request?

After merging #32439, there is a flaky error from the GitHub Actions job "Java 11 build with Maven":

```
Error: ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
```

We can resolve it by increasing the JVM stack size to 256M. The container for GitHub Actions jobs has 7G of memory, so this should be fine.

### Why are the changes needed?

Fix the flaky test failure in the Java 11 build test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GitHub Actions test

Closes #32521 from gengliangwang/increaseStackSize.

Authored-by: Gengliang Wang
Signed-off-by: Dongjoon Hyun

sigmod added 3 commits May 3, 2021 15:54

update

6a35a88

update

c9d0ef0

merge master

7fbd213

github-actions bot added the SQL label May 4, 2021

sigmod added 2 commits May 4, 2021 23:43

support pruning in more rules

98830eb

update

7a6b879

sigmod changed the title ~~[WIP][SPARK-35298] Migrate to transformWithPruning for rules in Optimizer.scala~~ [WIP][SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala May 5, 2021

fix test failure

381e38f

sigmod changed the title ~~[WIP][SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala~~ [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala May 5, 2021

update

564d8fb

merge master

f376055

maryannxue reviewed May 11, 2021

merge master

27e92ea

gengliangwang reviewed May 11, 2021

gengliangwang approved these changes May 11, 2021

maryannxue approved these changes May 11, 2021

sigmod added 2 commits May 11, 2021 15:59

Address Gengliang's comment

0df10d1

Merge branch 'master' into optimizer

2408a18

Merge branch 'master' into optimizer

5027ebc

gengliangwang closed this in d92018e May 12, 2021

gengliangwang mentioned this pull request May 12, 2021

[SPARK-35387][INFRA] Increase the JVM stack size for Java 11 build test #32521

Closed
