[SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join #34172
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143819 has finished for PR 34172 at commit
cc @cloud-fan FYI
thanks, merging to master/3.2!
… avoid ambiguous self join
### What changes were proposed in this pull request?
This PR fixes an issue where an ambiguous self join is not detected when the left and right DataFrames are swapped.
For example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause appears to be that `collectConflictPlans`, an inner function in `DeduplicateRelations`, doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
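To illustrate why a missing tag copy could hide the ambiguity, here is a minimal, self-contained sketch. It does not use Spark at all: `ToyPlan`, `dedupWithoutTags`, and `dedupWithTags` are hypothetical names invented for this illustration, and the tag store is a plain map rather than Spark's tree-node tags. It only shows the general idea that metadata attached to a node is lost when a fresh copy is created, unless it is carried over explicitly.

```scala
import scala.collection.mutable

// Toy stand-in for a plan node (an assumption for illustration, NOT Spark's LogicalPlan).
final class ToyPlan(val name: String) {
  private val tags = mutable.Map.empty[String, Any]

  def setTag(key: String, value: Any): Unit = {
    tags(key) = value
  }

  def getTag(key: String): Option[Any] = tags.get(key)

  // Copy tags from another node, keeping any tag already set on this node.
  def copyTagsFrom(other: ToyPlan): Unit = {
    other.tags.foreach { case (k, v) => tags.getOrElseUpdate(k, v) }
  }

  // Models creating a fresh copy of the node; tags are not carried over.
  def newInstance(): ToyPlan = new ToyPlan(name)
}

object TagCopyDemo {
  // Hypothetical tag key, named after the tag mentioned in this PR.
  val DatasetIdTag = "dataset_id"

  // Copy that drops the tag: a later check has nothing to match against.
  def dedupWithoutTags(plan: ToyPlan): ToyPlan = plan.newInstance()

  // Copy that carries the tag over, so a later check can still see it.
  def dedupWithTags(plan: ToyPlan): ToyPlan = {
    val copy = plan.newInstance()
    copy.copyTagsFrom(plan)
    copy
  }

  def main(args: Array[String]): Unit = {
    val original = new ToyPlan("Project")
    original.setTag(DatasetIdTag, Set(1L))

    println(dedupWithoutTags(original).getTag(DatasetIdTag)) // tag is missing on the copy
    println(dedupWithTags(original).getTag(DatasetIdTag))    // tag survives on the copy
  }
}
```

In this toy model, any check that looks up `dataset_id` on the copied node can only succeed for the second variant, which mirrors the behavior described above.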
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #34172 from sarutak/fix-deduplication-issue.
Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit fa1805d)
Signed-off-by: Wenchen Fan <[email protected]>
Thank you, @sarutak and @cloud-fan. According to the JIRA issue, do we need this at branch-3.1, too?
@dongjoon-hyun Thank you for letting me know. It seems better to backport. I'll do it.
…ld copy dataset_id tag to avoid ambiguous self join
### What changes were proposed in this pull request?
This PR mainly backports the change from SPARK-36874 (#34172), and partially backports SPARK-34634 (#31752) to handle ambiguous self joins involving `ScriptTransformation`.
This PR fixes an issue where an ambiguous self join is not detected when the left and right DataFrames are swapped.
For example:
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause appears to be that `collectConflictPlans`, an inner function in `ResolveReference.dedupRight`, doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #34205 from sarutak/backport-SPARK-36874.
Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>