@cloud-fan
Contributor

What changes were proposed in this pull request?

In #32816, we moved the rule OptimizeSkewedJoin from queryStageOptimizerRules to queryStagePreparationRules. This means that the input plan of OptimizeSkewedJoin is now the entire query plan, whereas before it was a single shuffle stage, which is usually a small part of the query plan.

However, to simplify the implementation, OptimizeSkewedJoin has a check that it will be skipped if the input plan has more than 2 shuffle stages. This is obviously problematic now, as the input plan is the entire query plan and is very likely to have more than 2 shuffles.

This PR proposes to remove the check from OptimizeSkewedJoin completely, as it's no longer needed.

  1. We can and should optimize all the skewed joins in the query plan.
  2. If it introduces extra shuffles, we can return the original input plan, or accept it if the force-apply config is true.
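
To illustrate the reasoning above, here is a minimal sketch in plain Python rather than Spark's actual Scala code. The `Node` class, `old_rule_applies`, and `choose_plan` are hypothetical names invented for this example; they model only the behavior described in this PR (skip when there are more than 2 shuffle stages, fall back to the original plan when extra shuffles appear unless force-apply is on).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a physical plan node (hypothetical, not Spark's SparkPlan)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def count_shuffle_stages(plan: Node) -> int:
    """Count ShuffleQueryStage nodes anywhere in the toy plan tree."""
    own = 1 if plan.name == "ShuffleQueryStage" else 0
    return own + sum(count_shuffle_stages(child) for child in plan.children)


def old_rule_applies(plan: Node) -> bool:
    """Old check as described above: skip plans with more than 2 shuffle stages."""
    return count_shuffle_stages(plan) <= 2


def choose_plan(original: Node, optimized: Node, extra_shuffles: int, force_apply: bool) -> Node:
    """New behavior as described above: keep the optimized plan unless it adds
    shuffles, in which case fall back to the original unless force-apply is set."""
    if extra_shuffles == 0 or force_apply:
        return optimized
    return original


def stage() -> Node:
    return Node("ShuffleQueryStage")


def smj(left: Node, right: Node) -> Node:
    return Node("SMJ", [left, right])


# Before #32816: the rule saw one join over two shuffle stages.
single_join = smj(stage(), stage())
# After #32816: the rule sees the whole query, e.g. a union of two joins.
whole_query = Node("Union", [smj(stage(), stage()), smj(stage(), stage())])

print(old_rule_applies(single_join))  # True
print(old_rule_applies(whole_query))  # False: the old check would skip this plan
```

In this toy model the old check rejects `whole_query` even though both joins could be handled independently, which is the regression this PR addresses.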

Why are the changes needed?

Fix a performance regression in the master branch (not released yet)

Does this PR introduce any user-facing change?

no

How was this patch tested?

a new test

@cloud-fan
Contributor Author

cc @zhengruifeng @ulysses-you

@SparkQA

SparkQA commented Dec 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50916/

@SparkQA

SparkQA commented Dec 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50916/

@SparkQA

SparkQA commented Dec 21, 2021

Test build #146441 has finished for PR 34974 at commit 844077f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@ulysses-you ulysses-you left a comment

I agree with removing this check, as I mentioned before. Without the check, I think we can also support optimizing skewed joins through a union.

Contributor

@zhengruifeng zhengruifeng left a comment

LGTM

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in e0b9318 Dec 22, 2021
cloud-fan pushed a commit that referenced this pull request Feb 9, 2022
### What changes were proposed in this pull request?

#34974 solved most scenarios of data skew under a union.
This PR adds tests for them.

### Why are the changes needed?

Added tests for the following scenarios:

<b>Scenario 1</b>
```
Union
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
```

<b>Scenario 2</b>
```
Union
    SMJ
        ShuffleQueryStage
        ShuffleQueryStage
    HashAggregate
```

<b>Scenario 3: not yet supported. SMJ-3 will introduce a new shuffle, so SMJ-1 cannot be optimized.</b>
```
Union
    SMJ-1
        ShuffleQueryStage
        ShuffleQueryStage
    SMJ-2
        SMJ-3
            ShuffleQueryStage
            ShuffleQueryStage
        HashAggregate
```
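
The three plan shapes above can be modeled with a small Python sketch to show why only the third one is unsupported. This is an illustration under a stated assumption, not Spark's implementation: it assumes that a skewed join feeding another join needs an extra shuffle, and that any extra shuffle makes the rule fall back to the original plan for the whole query. The tuple encoding and the helper names (`join`, `candidates`, `all_joins_optimizable`) are hypothetical.

```python
# Each toy plan node is a (name, children) tuple; this is not Spark's plan API.
S = ("ShuffleQueryStage", [])
AGG = ("HashAggregate", [])


def join(label, left, right):
    """Build a toy sort-merge join node with two children."""
    return (label, [left, right])


scenario_1 = ("Union", [join("SMJ", S, S), join("SMJ", S, S)])
scenario_2 = ("Union", [join("SMJ", S, S), AGG])
scenario_3 = ("Union", [join("SMJ-1", S, S),
                        join("SMJ-2", join("SMJ-3", S, S), AGG)])


def is_join(node):
    return node[0].startswith("SMJ")


def candidates(node, inside_join=False):
    """Return (label, feeds_another_join) for every SMJ directly over two shuffle stages."""
    name, children = node
    found = []
    if is_join(node) and len(children) == 2 and all(c[0] == "ShuffleQueryStage" for c in children):
        found.append((name, inside_join))
    for child in children:
        found.extend(candidates(child, inside_join or is_join(node)))
    return found


def all_joins_optimizable(plan):
    # Assumption: one candidate that feeds another join forces an extra shuffle,
    # which reverts the whole plan, so no join in the query gets optimized.
    return not any(feeds_join for _, feeds_join in candidates(plan))


for label, plan in [("1", scenario_1), ("2", scenario_2), ("3", scenario_3)]:
    print(label, all_joins_optimizable(plan))
```

Under that assumption, scenarios 1 and 2 report `True` and scenario 3 reports `False`, because SMJ-3 sits beneath SMJ-2 and the fallback also discards the optimization of SMJ-1.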

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Pass the added test

Closes #34908 from mcdull-zhang/skewed_union.

Authored-by: mcdull-zhang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>