[SPARK-37328][SQL] OptimizeSkewedJoin should work for complex queries with multiple joins #34974

cloud-fan · 2021-12-21T14:42:25Z

What changes were proposed in this pull request?

In #32816, we moved the rule OptimizeSkewedJoin from queryStageOptimizerRules to queryStagePreparationRules. As a result, the input plan of OptimizeSkewedJoin is now the entire query plan, whereas previously it was a single shuffle stage, which is usually a small part of the query plan.

However, to simplify the implementation, OptimizeSkewedJoin has a check that skips the rule if the input plan has more than 2 shuffle stages. This is problematic now, because the input plan is the entire query plan and is very likely to contain more than 2 shuffles.

This PR proposes to remove the check from OptimizeSkewedJoin completely, as it's no longer needed.

We can and should optimize all the skewed joins in the query plan.
If the optimization introduces extra shuffles, we can return the original input plan, or accept the optimized plan if the force-apply config is true.
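
To make the reasoning above concrete, here is a minimal Python sketch of the two behaviours. It is a toy model, not Spark's Scala implementation: `PlanNode`, `count_shuffle_stages`, `old_check_allows` and `choose_plan` are hypothetical names invented for illustration, and the guard is written only as this description states it (skip when the input plan has more than 2 shuffle stages).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Toy stand-in for a physical plan node (not a Spark class)."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def count_shuffle_stages(plan: PlanNode) -> int:
    """Count ShuffleQueryStage nodes anywhere in the plan tree."""
    own = 1 if plan.name == "ShuffleQueryStage" else 0
    return own + sum(count_shuffle_stages(child) for child in plan.children)


def old_check_allows(plan: PlanNode) -> bool:
    """The removed guard, as described in this PR: the rule is skipped
    when the input plan has more than 2 shuffle stages."""
    return count_shuffle_stages(plan) <= 2


def choose_plan(original, optimized, extra_shuffles: int, force_apply: bool):
    """Fallback described in this PR: keep the optimized plan only if it adds
    no extra shuffles, or if the force-apply config is true."""
    if extra_shuffles == 0 or force_apply:
        return optimized
    return original


def smj() -> PlanNode:
    """A sort-merge join over two shuffle stages."""
    return PlanNode("SMJ", [PlanNode("ShuffleQueryStage"), PlanNode("ShuffleQueryStage")])


single_join = smj()                              # what the rule used to see
whole_query = PlanNode("Union", [smj(), smj()])  # what it sees after #32816

print(old_check_allows(single_join))   # True: 2 shuffle stages
print(old_check_allows(whole_query))   # False: 4 shuffle stages, rule skipped
```

With the guard removed, only `choose_plan` decides the outcome: the whole optimized plan is either accepted or dropped in favour of the original input plan.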

Why are the changes needed?

Fix a performance regression in the master branch (not yet released).

Does this PR introduce any user-facing change?

no

How was this patch tested?

a new test

cloud-fan · 2021-12-21T14:43:39Z

cc @zhengruifeng @ulysses-you

SparkQA · 2021-12-21T15:28:12Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50916/

SparkQA · 2021-12-21T16:24:11Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50916/

SparkQA · 2021-12-21T19:35:24Z

Test build #146441 has finished for PR 34974 at commit 844077f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ulysses-you

I agree with removing this check, as I mentioned before. Without the check, I think we can also support optimizing skewed joins through a union.

zhengruifeng

LGTM

cloud-fan · 2021-12-22T08:54:23Z

thanks for the review, merging to master!

### What changes were proposed in this pull request?

#34974 solved most scenarios of data skew in union. This PR adds tests for it.

### Why are the changes needed?

Added tests for the following scenarios:

Scenario 1

```
Union
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
```

Scenario 2

```
Union
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
    HashAggregate
```

Scenario 3: not yet supported. SMJ-3 will introduce a new shuffle, so SMJ-1 cannot be optimized.

```
Union
    SMJ-1
        ShuffleQueryStage
        ShuffleQueryStage
    SMJ-2
        SMJ-3
            ShuffleQueryStage
            ShuffleQueryStage
        HashAggregate
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the added test

Closes #34908 from mcdull-zhang/skewed_union.

Authored-by: mcdull-zhang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

OptimizeSkewedJoin should work for complex queries with multiple joins

844077f

github-actions bot added the SQL label Dec 21, 2021

cloud-fan mentioned this pull request Dec 21, 2021

[SPARK-37328][SQL] Fix bug that OptimizeSkewedJoin may not work after it was moved from queryStageOptimizerRules to queryStagePreparationRules. #34602

Closed

ulysses-you approved these changes Dec 22, 2021


zhengruifeng approved these changes Dec 22, 2021


cloud-fan closed this in e0b9318 Dec 22, 2021

zhengruifeng mentioned this pull request Dec 29, 2021

[SPARK-37652][SQL]Add test for optimize skewed join through union #34908

Closed
