[SPARK-42132][SQL] Deduplicate attributes in groupByKey.cogroup #39673
Ideally, |
Can one of the admins verify this patch?
Force-pushed from 4e5901e to 16721f1.
cc @cloud-fan @viirya @gengliangwang for help with reviews.
+CC @HyukjinKwon, @cloud-fan
it's a bit weird to do dedup here. Can we update the DeduplicateRelations rule to handle CoGroup specially?
I don't really see how DeduplicateRelations can be modified so that it does not rewrite all attributes of CoGroup.
The DeduplicateRelations.apply method calls renewDuplicatedRelations, which calls rewriteAttrs(attrMap) on the CoGroup, which in turn rewrites all attributes.
If you are suggesting adding case cogroup @ CoGroup(...) => to
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DeduplicateRelations.scala
Lines 45 to 83 in f1a6c5e
```scala
  newPlan.resolveOperatorsUpWithPruning(
    _.containsAnyPattern(JOIN, LATERAL_JOIN, AS_OF_JOIN, INTERSECT, EXCEPT, UNION, COMMAND),
    ruleId) {
    case p: LogicalPlan if !p.childrenResolved => p
    // To resolve duplicate expression IDs for Join.
    case j @ Join(left, right, _, _, _) if !j.duplicateResolved =>
      j.copy(right = dedupRight(left, right))
    // Resolve duplicate output for LateralJoin.
    case j @ LateralJoin(left, right, _, _) if right.resolved && !j.duplicateResolved =>
      j.copy(right = right.withNewPlan(dedupRight(left, right.plan)))
    // Resolve duplicate output for AsOfJoin.
    case j @ AsOfJoin(left, right, _, _, _, _, _) if !j.duplicateResolved =>
      j.copy(right = dedupRight(left, right))
    // intersect/except will be rewritten to join at the beginning of optimizer. Here we need to
    // deduplicate the right side plan, so that we won't produce an invalid self-join later.
    case i @ Intersect(left, right, _) if !i.duplicateResolved =>
      i.copy(right = dedupRight(left, right))
    case e @ Except(left, right, _) if !e.duplicateResolved =>
      e.copy(right = dedupRight(left, right))
    // Only after we finish by-name resolution for Union
    case u: Union if !u.byName && !u.duplicateResolved =>
      // Use projection-based de-duplication for Union to avoid breaking the checkpoint sharing
      // feature in streaming.
      val newChildren = u.children.foldRight(Seq.empty[LogicalPlan]) { (head, tail) =>
        head +: tail.map {
          case child if head.outputSet.intersect(child.outputSet).isEmpty =>
            child
          case child =>
            val projectList = child.output.map { attr =>
              Alias(attr, attr.name)()
            }
            Project(projectList, child)
        }
      }
      u.copy(children = newChildren)
    case merge: MergeIntoTable if !merge.duplicateResolved =>
      merge.copy(sourceTable = dedupRight(merge.targetTable, merge.sourceTable))
  }
}
```
then this won't work because all attributes of CoGroup have already been rewritten at this point.
I'd appreciate some pointers or a sketch of a solution.
Force-pushed from 16721f1 to cc5773c.
Fixed in #41554.
What changes were proposed in this pull request?
This deduplicates attributes that exist on both sides of a CoGroup by aliasing the occurrence on the right side.
Why are the changes needed?
Usually, the DeduplicateRelations rule does exactly this. However, the generic QueryPlan.rewriteAttrs replaces all occurrences of the duplicate reference with the new reference, whereas CoGroup uses the old reference for both left and right group attributes, value attributes, and group order. Only the occurrences in the right attributes must be replaced.
Further, the right deserialization expression is not touched at all.
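To make the difference concrete, here is a toy sketch in plain Scala. It is not Spark code: `Attr`, `ToyCoGroup`, `rewriteAll`, and `rewriteRightOnly` are hypothetical names used for illustration only, and the model keeps just four attribute lists. It contrasts a generic rewrite that replaces the duplicate reference everywhere with a targeted rewrite that touches only the right-side fields.

```scala
// Toy model only: Attr and ToyCoGroup are hypothetical stand-ins,
// not Spark's Attribute or CoGroup classes.
final case class Attr(name: String, id: Int)

final case class ToyCoGroup(
    leftGroup: Seq[Attr],
    rightGroup: Seq[Attr],
    leftAttr: Seq[Attr],
    rightAttr: Seq[Attr])

object DedupSketch {
  // Replace every attribute that has an entry in the mapping.
  private def rewrite(attrs: Seq[Attr], mapping: Map[Attr, Attr]): Seq[Attr] =
    attrs.map(a => mapping.getOrElse(a, a))

  // Generic rewrite: the duplicate is replaced on both sides,
  // so left and right still share the same reference afterwards.
  def rewriteAll(c: ToyCoGroup, mapping: Map[Attr, Attr]): ToyCoGroup =
    ToyCoGroup(
      rewrite(c.leftGroup, mapping),
      rewrite(c.rightGroup, mapping),
      rewrite(c.leftAttr, mapping),
      rewrite(c.rightAttr, mapping))

  // Targeted rewrite: only the right-side fields receive the new reference.
  def rewriteRightOnly(c: ToyCoGroup, mapping: Map[Attr, Attr]): ToyCoGroup =
    c.copy(
      rightGroup = rewrite(c.rightGroup, mapping),
      rightAttr = rewrite(c.rightAttr, mapping))

  def main(args: Array[String]): Unit = {
    val id = Attr("id", 1)
    val idNew = Attr("id", 2) // fresh reference for the right side
    val plan = ToyCoGroup(Seq(id), Seq(id), Seq(id), Seq(id))
    val mapping = Map(id -> idNew)
    println(rewriteAll(plan, mapping))
    println(rewriteRightOnly(plan, mapping))
  }
}
```

In this toy model, the generic rewrite leaves both sides pointing at the same new reference, so the duplication is not resolved, while the targeted rewrite gives the two sides distinct references.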
The following DataFrame cannot be evaluated:
The query plan:
Evaluating this plan fails with:
Does this PR introduce any user-facing change?
This fixes correctness.
How was this patch tested?
Unit test in DataFrameSuite.