-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-37328][SQL] OptimizeSkewedJoin should work for complex queries with multiple joins #34974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #146441 has finished for PR 34974 at commit
|
ulysses-you
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with removing this check as I mentioned before, and without the check I think we can also support to optimize the skew join through union.
zhengruifeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
thanks for the review, merging to master! |
### What changes were proposed in this pull request? #34974, solved most scenarios of data skew in union. add test for it. ### Why are the changes needed? Added tests for the following scenarios: <b>scenes 1</b> ``` Union SMJ ShuffleQueryStage ShuffleQueryStage SMJ ShuffleQueryStage ShuffleQueryStage ``` <b>scenes 2</b> ``` Union SMJ ShuffleQueryStage ShuffleQueryStage HashAggregate ``` <b>scenes 3: not yet supported, SMJ-3 will introduce a new shuffle, so SMJ-1 cannot be optimized</b> ``` Union SMJ-1 ShuffleQueryStage ShuffleQueryStage SMJ-2 SMJ-3 ShuffleQueryStage ShuffleQueryStage HashAggregate ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Pass the added test Closes #34908 from mcdull-zhang/skewed_union. Authored-by: mcdull-zhang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In #32816 , we moved the rule
OptimizeSkewedJoinfromqueryStageOptimizerRulestoqueryStagePreparationRules. This means that the input plan ofOptimizeSkewedJoinwill be the entire query plan, while before the input plan was a shuffle stage which is usually a small part of the query plan.However, to simplify the implementation,
OptimizeSkewedJoinhas a check that it will be skipped if the input plan has more than 2 shuffle stages. This is obviously problematic now, as the input plan is the entire query plan and is very likely to have more than 2 shuffles.This PR proposes to remove the check from
OptimizeSkewedJoincompletely, as it's no longer needed.Why are the changes needed?
Fix a performance regression in the master branch (not released yet)
Does this PR introduce any user-facing change?
no
How was this patch tested?
a new test