[SPARK-32806][SQL] SortMergeJoin with partial hash distribution can be optimized to remove shuffle #29655
Conversation
Test build #128330 has finished for PR 29655 at commit
cc: @cloud-fan @viirya @maropu Thanks in advance!
 * If the above conditions are met, shuffle can be eliminated for the sort merge join
 * because rows are sorted before join logic is applied.
 */
case class OptimizeSortMergeJoinWithPartialHashDistribution(conf: SQLConf) extends Rule[SparkPlan] {
Can't we implement this optimization in EnsureRequirements instead? Any reason to apply this rule after EnsureRequirements inserts shuffles?
Also, could you add fine-grained tests for this rule?
To do this inside EnsureRequirements.ensureDistributionAndOrdering, it would require a new Partitioning and Distribution that know both sides of the join, so I didn't go that route. Doing this outside seemed less intrusive. But please let me know if doing it inside EnsureRequirements makes more sense. Thanks.
This is done after EnsureRequirements because reordering keys may already eliminate shuffles, in which case this rule is not applied.
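To illustrate the key-reordering point, here is a minimal sketch (table names and bucketing are hypothetical, and it assumes a running SparkSession named spark): if t1 and t2 are both bucketed by (a, b) with the same bucket count, EnsureRequirements can already reorder the join keys to match the bucketing, so no shuffle is planned and this new rule never needs to fire.

// Hypothetical setup: t1 and t2 both bucketed by (a, b) with the same bucket count.
// Even though the predicate lists b before a, EnsureRequirements can reorder the
// join keys to (a, b) to match the children's HashPartitioning, avoiding a shuffle.
val reordered = spark.sql(
  "SELECT * FROM t1 JOIN t2 ON t1.b = t2.b AND t1.a = t2.a")
reordered.explain() // ideally shows SortMergeJoin with no Exchange on either side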
Great to see this PR; I bumped into the same issue. I'm not a Spark contributor, so I really can't judge the importance of the "unknown error code -9" in the test above, but maybe the above can help you.
I didn't notice the discussion, and thanks for sharing, @andyvanyperenAM. I've checked it and, yeah, they look the same. Could you check the discussion in SPARK-18067/#19054, @imback82? I think we need to address all the issues described there, e.g., data skew. cc: @hvanhovell
Thanks for the pointers! I wasn't aware of the existing PR. I will take a look.
My first thought is that I share the same concerns as @hvanhovell in the previous discussion.
Is the concern with the data skew, or are there any other concerns? I couldn't find more in the discussion. The main scenario this PR is going after is to allow bucketed tables to be utilized by more workloads. Since bucketed tables are created by users, we rarely observed cases where users pick bucket columns with low cardinality, similar to how users pick partition columns. I could make the rule more restrictive to check whether the sources are bucketed tables. (btw, if this approach is fine, I could extend the rule to support
cc: @c21
c21 left a comment
Thanks @imback82 for working on this, but I think #19054 seems to be a better approach to me (i.e., add leftDistributionKeys and rightDistributionKeys in SortMergeJoinExec/ShuffledHashJoinExec, and avoid shuffle by adding logic in EnsureRequirements.reorderJoinPredicates). @tejasapatil and I are on the same team, so just bringing more context on this: we added #19054 in our internal fork and don't see many OOM issues. If #19054 is the better approach in other people's opinions as well, I can redo that PR against latest master for review.
Adding the rule after EnsureRequirements seems to add more burden on future development: we would need to keep it in mind every time, since there is now a rule after EnsureRequirements that can remove shuffles.
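For readers without the context of #19054, here is a minimal pure-Scala sketch of the key-reordering idea behind it; String keys stand in for catalyst Expressions, and the function name and shape are assumptions, not the PR's actual code.

// Given the keys a child is already hash-distributed on, try to reorder the join
// key pairs (keeping left/right alignment) so that the left keys start with the
// distribution keys; if that succeeds, the existing distribution can be reused
// and no shuffle is needed.
def reorderJoinKeys(
    leftKeys: Seq[String],
    rightKeys: Seq[String],
    leftDistributionKeys: Seq[String]): Option[(Seq[String], Seq[String])] = {
  val remaining = scala.collection.mutable.ArrayBuffer(leftKeys.zip(rightKeys): _*)
  val picked = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  val matched = leftDistributionKeys.forall { key =>
    remaining.indexWhere(_._1 == key) match {
      case -1 => false
      case i => picked += remaining.remove(i); true
    }
  }
  if (matched) {
    val reordered = picked ++ remaining
    Some((reordered.map(_._1).toSeq, reordered.map(_._2).toSeq))
  } else None
}

// Example: join keys written as (b, a) can be reordered to match a distribution on (a, b):
// reorderJoinKeys(Seq("b", "a"), Seq("y", "x"), Seq("a", "b")) == Some((Seq("a", "b"), Seq("x", "y")))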
ExtractShuffleExchangeExecChild(
  lChild,
  lChildOutputPartitioning: HashPartitioning),
_),
nit: why can't we just pattern match on ShuffleExchangeExec(_, leftChild, _) here? It seems simpler to me.
val mapping = leftKeyToRightKeyMapping(leftKeys, rightKeys)
(leftPartitioning.numPartitions == rightPartitioning.numPartitions) &&
  leftPartitioning.expressions.zip(rightPartitioning.expressions)
    .forall {
      case (le, re) => mapping.get(le.canonicalized)
        .map(_.exists(_.semanticEquals(re)))
        .getOrElse(false)
    }
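A hypothetical reconstruction of leftKeyToRightKeyMapping (not the PR's actual code) to clarify what the check above relies on: each canonicalized left join key maps to every right join key it is equated with in the join condition.

import org.apache.spark.sql.catalyst.expressions.Expression

// For each canonicalized left join key, collect all right join keys that appear
// with it in an equi-join pair, so the caller can test whether the right side's
// partitioning expression is a valid counterpart of the left side's.
def leftKeyToRightKeyMapping(
    leftKeys: Seq[Expression],
    rightKeys: Seq[Expression]): Map[Expression, Seq[Expression]] = {
  leftKeys.zip(rightKeys)
    .groupBy { case (leftKey, _) => leftKey.canonicalized }
    .map { case (leftKey, pairs) => leftKey -> pairs.map { case (_, rightKey) => rightKey } }
}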
Sorry if I'm missing anything, but I feel this might not be correct. We should make sure that leftPartitioning.expressions and rightPartitioning.expressions have the same size, and the order of the expressions matters, right?
The expression sizes are different here, so we should not remove the shuffle:
t1 has 1024 buckets on column (a)
t2 has 1024 buckets on columns (a, b)
SELECT *
FROM t1
JOIN t2
ON t1.a = t2.a AND t1.b = t2.b
The expression sizes are the same here, but the order is wrong, so we should not remove the shuffle:
t1 has 1024 buckets on columns (a, b)
t2 has 1024 buckets on columns (b, a)
SELECT *
FROM t1
JOIN t2
ON t1.a = t2.a AND t1.a = t2.b AND t1.b = t2.a AND t1.b = t2.b
Thanks. I agree with your concerns for both cases. But, for the first example, only one side will be shuffled, so the rule should not kick in. For the second example, we have t1.a = t2.b AND t1.b = t2.a which matches the bucket ordering, so this should also be fine.
Sorry if I'm missing anything:
But, for the first example, only one side will be shuffled, so the rule should not kick in.
If the number of buckets for t1 is less than the number of shuffle partitions, shouldn't it shuffle both sides (in EnsureRequirements)? So the rule kicks in here and removes both shuffles, but we shouldn't remove any shuffle here.
For the second example, we have t1.a = t2.b AND t1.b = t2.a which matches the bucket ordering, so this should also be fine.
I think it's unsafe if we do not shuffle both sides. HashPartitioning(Seq(a, b)) and HashPartitioning(Seq(b, a)) are not the same thing; e.g., the tuple (a: 1, b: 2) will be assigned to different buckets given the current Murmur3Hash implementation.
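This order sensitivity is easy to check with Spark's built-in hash function, which uses Murmur3 and folds its arguments left to right. A minimal sketch, assuming a running SparkSession named spark:

// hash(1, 2) and hash(2, 1) generally differ, so a row with (a = 1, b = 2) lands
// in different buckets under HashPartitioning(Seq(a, b)) vs HashPartitioning(Seq(b, a)).
spark.sql("SELECT hash(1, 2) AS h_ab, hash(2, 1) AS h_ba").show()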
If the number of buckets for t1 is less than the number of shuffle partitions, shouldn't it shuffle both sides (in EnsureRequirements)? So the rule kicks in here and removes both shuffles, but we shouldn't remove any shuffle here.
You are right. Thanks for the catch!
I think it's unsafe if we do not shuffle both sides. HashPartitioning(Seq(a, b)) and HashPartitioning(Seq(b, a)) are not the same thing; e.g., the tuple (a: 1, b: 2) will be assigned to different buckets given the current Murmur3Hash implementation.
Yes, I understand that they produce different hash values. However, the join condition here is t1.a = t2.b AND t1.b = t2.a. On the other hand, this rule will not be applied if the condition is t1.a = t2.a AND t1.b = t2.b. Please let me know if I missed something. Thanks!
cc @cloud-fan I think we would like to get your opinions here, as you reviewed #19054 in the past and have context on this. Thanks.
Even so, I think removing shuffles in the middle of stages (e.g., in many join cases) can, in theory, make the probability of OOM higher in case of data skew. Since we can control input distributions somewhat, e.g., via the bucketing technique, it might be worth trying the restrictive approach that @imback82 suggested above, I think.
Hi @c21, thanks for the update, and sorry if this is the incorrect place to give this a bump. Kind regards,
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR proposes to optimize SortMergeJoin (SMJ) when each of its children has a hash output partitioning that "partially" satisfies the required distribution. In this case, where a child's output partitioning expressions are a subset of the required distribution expressions (the join key expressions), the shuffle can be removed, because rows will be sorted by the join keys before the join logic is applied (the required child ordering for SMJ is on the join keys).
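As a concrete sketch of "partial" satisfaction, assuming the Spark 3.0-era physical-plan API where SMJ requires HashClusteredDistribution on its join keys (column names taken from the example below):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.physical.{HashClusteredDistribution, HashPartitioning}
import org.apache.spark.sql.types.IntegerType

val i1 = AttributeReference("i1", IntegerType)()
val j1 = AttributeReference("j1", IntegerType)()

// The child is hash-partitioned on a strict subset (i1) of the join keys (i1, j1).
val childPartitioning = HashPartitioning(Seq(i1), 8)
val required = HashClusteredDistribution(Seq(i1, j1))

// false: HashPartitioning must match the distribution's expressions exactly,
// so EnsureRequirements inserts a shuffle even though rows that agree on all
// join keys are already co-located (they agree on i1).
println(childPartitioning.satisfies(required))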
This PR introduces OptimizeSortMergeJoinWithPartialHashDistribution, which removes the shuffle for the sort merge join if the conditions described above are met. This rule can be turned on by setting spark.sql.execution.sortMergeJoin.optimizePartialHashDistribution.enabled to true (false by default).
Why are the changes needed?
To remove unnecessary shuffles in certain scenarios.
Does this PR introduce any user-facing change?
Suppose the following case where t1 is bucketed by i1, and t2 by i2. Now if you join the two tables by t1("i1") === t2("i2") && t1("j1") === t2("j2"):
Before this change:
After the PR:
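A minimal end-to-end sketch of this scenario (the data, bucket counts, and a warehouse-enabled SparkSession named spark are assumptions; the exact plans depend on the Spark version):

// Create two bucketed tables matching the example above.
spark.range(100).selectExpr("id AS i1", "id % 10 AS j1")
  .write.bucketBy(8, "i1").saveAsTable("t1")
spark.range(100).selectExpr("id AS i2", "id % 10 AS j2")
  .write.bucketBy(8, "i2").saveAsTable("t2")

spark.conf.set(
  "spark.sql.execution.sortMergeJoin.optimizePartialHashDistribution.enabled", "true")

val t1 = spark.table("t1")
val t2 = spark.table("t2")
// With the flag on, the Exchange under each side of the SortMergeJoin should be
// gone, since each side is already hash-partitioned on a subset of its join keys.
t1.join(t2, t1("i1") === t2("i2") && t1("j1") === t2("j2")).explain()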
How was this patch tested?
Added tests.