Skip to content

Conversation

@sigmod
Copy link
Contributor

@sigmod sigmod commented May 4, 2021

What changes were proposed in this pull request?

Added the following TreePattern enums:

  • ALIAS
  • AND_OR
  • AVERAGE
  • GENERATE
  • INTERSECT
  • SORT
  • SUM
  • DISTINCT_LIKE
  • PROJECT
  • REPARTITION_OPERATION
  • UNION

Added tree traversal pruning to the following rules in Optimizer.scala:

  • EliminateAggregateFilter
  • RemoveRedundantAggregates
  • RemoveNoopOperators
  • RemoveNoopUnion
  • LimitPushDown
  • ColumnPruning
  • CollapseRepartition
  • OptimizeRepartition
  • OptimizeWindowFunctions
  • CollapseWindow
  • TransposeWindow
  • InferFiltersFromGenerate
  • InferFiltersFromConstraints
  • CombineUnions
  • CombineFilters
  • EliminateSorts
  • PruneFilters
  • EliminateLimits
  • DecimalAggregates
  • ConvertToLocalRelation
  • ReplaceDistinctWithAggregate
  • ReplaceIntersectWithSemiJoin
  • ReplaceExceptWithAntiJoin
  • RewriteExceptAll
  • RewriteIntersectAll
  • RemoveLiteralFromGroupExpressions
  • RemoveRepetitionFromGroupExpressions
  • OptimizeLimitZero

Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

perf diff:
Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline
RemoveRedundantAggregates | 51290766 | 67070477 | 1.31
RemoveNoopOperators | 192371141 | 196631275 | 1.02
RemoveNoopUnion | 49222561 | 43266681 | 0.88
LimitPushDown | 40885185 | 21672646 | 0.53
ColumnPruning | 2003406120 | 1285562149 | 0.64
CollapseRepartition | 40648048 | 72646515 | 1.79
OptimizeRepartition | 37813850 | 20600803 | 0.54
OptimizeWindowFunctions | 174426904 | 46741409 | 0.27
CollapseWindow | 38959957 | 24542426 | 0.63
TransposeWindow | 33533191 | 20414930 | 0.61
InferFiltersFromGenerate | 21758688 | 15597344 | 0.72
InferFiltersFromConstraints | 518009794 | 493282321 | 0.95
CombineUnions | 67694022 | 70550382 | 1.04
CombineFilters | 35265060 | 29005424 | 0.82
EliminateSorts | 57025509 | 19795776 | 0.35
PruneFilters | 433964815 | 465579200 | 1.07
EliminateLimits | 44275393 | 24476859 | 0.55
DecimalAggregates | 83143172 | 28816090 | 0.35
ReplaceDistinctWithAggregate | 21783760 | 18287489 | 0.84
ReplaceIntersectWithSemiJoin | 22311271 | 16566393 | 0.74
ReplaceExceptWithAntiJoin | 23838520 | 16588808 | 0.70
RewriteExceptAll | 32750296 | 29421957 | 0.90
RewriteIntersectAll | 29760454 | 21243599 | 0.71
RemoveLiteralFromGroupExpressions | 28151861 | 25270947 | 0.90
RemoveRepetitionFromGroupExpressions | 29587030 | 23447041 | 0.79
OptimizeLimitZero | 18081943 | 15597344 | 0.86
Accumulated | 4129959311 | 3112676285 | 0.75

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label May 4, 2021
@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138144 has finished for PR 32439 at commit 7fbd213.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ExtractValue extends Expression

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42665/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42665/

@sigmod sigmod changed the title [WIP][SPARK-35298] Migrate to transformWithPruning for rules in Optimizer.scala [WIP][SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala May 5, 2021
@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42689/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42691/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138170 has finished for PR 32439 at commit 7a6b879.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138168 has finished for PR 32439 at commit 98830eb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42696/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42696/

@sigmod sigmod changed the title [WIP][SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala [SPARK-35298][SQL] Migrate to transformWithPruning for rules in Optimizer.scala May 5, 2021
@sigmod
Copy link
Contributor Author

sigmod commented May 5, 2021

@hvanhovell @gengliangwang @dbaliafroozeh @maryannxue this PR is ready for review.

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42701/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42701/

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138175 has finished for PR 32439 at commit 381e38f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 5, 2021

Test build #138180 has finished for PR 32439 at commit 564d8fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 6, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42707/

@SparkQA
Copy link

SparkQA commented May 6, 2021

Test build #138186 has finished for PR 32439 at commit f376055.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(plan transform {
def apply(plan: LogicalPlan): LogicalPlan = removeProjectBeforeFilter(
plan.transformWithPruning(AlwaysProcess.fn, ruleId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed sth. from the previous commit, but what difference is there between regular transform and transformWithPruning(AlwaysProcess.fn, ...) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transform internally calls
transformWithPruning(AlwaysProcess.fn, UnknownRuleId) ...

The argument comments are here:

* @param cond a Lambda expression to prune tree traversals. If `cond.apply` returns false
* on a TreeNode T, skips processing T and its subtree; otherwise, processes
* T and its subtree recursively.
* @param ruleId is a unique Id for `rule` to prune unnecessary tree traversals. When it is
* UnknownRuleId, no pruning happens. Otherwise, if `rule` (with id `ruleId`)
* has been marked as in effective on a TreeNode T, skips processing T and its
* subtree. Do not pass it if the rule is not purely functional and reads a
* varying initial state for different invocations.

So here,
plan.transformWithPruning(AlwaysProcess.fn, ruleId) means there's no pruning based on TreePattern bits, but there's pruning based on ruleIds (if the rule is known to ineffective on a tree instance T, it will be skipped next time when it is invoked on the same tree instance T).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

@SparkQA
Copy link

SparkQA commented May 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42888/

@SparkQA
Copy link

SparkQA commented May 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42888/

case _ => RoundRobinPartitioning(numPartitions)
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented May 11, 2021

Test build #138365 has finished for PR 32439 at commit 27e92ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait ShuffledJoin extends JoinCodegenSupport

Copy link
Contributor

@maryannxue maryannxue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42930/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42930/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Test build #138408 has finished for PR 32439 at commit 0df10d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42940/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42960/

@SparkQA
Copy link

SparkQA commented May 12, 2021

Test build #138419 has finished for PR 32439 at commit 2408a18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 12, 2021

Test build #138439 has finished for PR 32439 at commit 5027ebc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

Thanks, merging to master

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 12, 2021

Hi, @gengliangwang and @sigmod .
The last commit seems to break JAVA 11 consistently. Could you take a look at this? Thanks!

Screen Shot 2021-05-12 at 9 05 53 AM

@gengliangwang
Copy link
Member

@dongjoon-hyun Thanks! @sigmod and I are looking at it. We will revert this if we can't resolve it shortly.

dongjoon-hyun pushed a commit that referenced this pull request May 12, 2021
### What changes were proposed in this pull request?

After merging #32439, there is flaky error from the Github action job "Java 11 build with Maven":

```
Error:  ## Exception when compiling 473 sources to /home/runner/work/spark/spark/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.reflect.internal.Trees.itransform(Trees.scala:1376)
scala.reflect.internal.Trees.itransform$(Trees.scala:1374)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:51)
```
We can resolve it by increasing the stack size of JVM to 256M. The container for Github action jobs has 7G memory so this should be fine.

### Why are the changes needed?

Fix flaky test failure in Java 11 build test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Github action test

Closes #32521 from gengliangwang/increaseStackSize.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants