[SPARK-36874][SQL] DeduplicateRelations should copy dataset_id tag to avoid ambiguous self join #34172

sarutak · 2021-10-04T18:54:31Z

What changes were proposed in this pull request?

This PR fixes an issue where an ambiguous self join is not detected when the left and right DataFrames are swapped.
For example:

val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.

The root cause appears to be that the inner function collectConflictPlans in DeduplicateRelations doesn't copy the dataset_id tag when it copies a LogicalPlan.
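To illustrate the root cause described above, here is a minimal, self-contained sketch of how a tag can be lost when a plan node is copied. It does not use Spark: `PlanNode`, `newInstanceDroppingTags`, and `newInstanceKeepingTags` are hypothetical names invented for this illustration, and only the tag name `dataset_id` comes from this PR. It is a toy model of the failure mode, not the actual patch.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a plan node. This is NOT Spark's LogicalPlan.
final class PlanNode(val name: String, val exprIds: Seq[Long]) {
  // Side-channel metadata attached to the node, keyed by tag name.
  private val tags = mutable.Map.empty[String, Any]

  def setTag(key: String, value: Any): Unit = tags(key) = value
  def getTag(key: String): Option[Any] = tags.get(key)

  // Naive copy: assigns fresh expression ids but silently drops all tags.
  def newInstanceDroppingTags(nextId: () => Long): PlanNode =
    new PlanNode(name, exprIds.map(_ => nextId()))

  // Copy that also carries the tags over to the new node.
  def newInstanceKeepingTags(nextId: () => Long): PlanNode = {
    val copy = newInstanceDroppingTags(nextId)
    tags.foreach { case (k, v) => copy.setTag(k, v) }
    copy
  }
}

object TagCopyDemo {
  // Tag name taken from the PR text; the value type here is an assumption.
  val DatasetIdTag = "dataset_id"

  def main(args: Array[String]): Unit = {
    var counter = 100L
    val nextId = () => { counter += 1; counter }

    val original = new PlanNode("relation", Seq(1L, 2L, 3L))
    original.setTag(DatasetIdTag, Set(42L))

    val dropped = original.newInstanceDroppingTags(nextId)
    val kept = original.newInstanceKeepingTags(nextId)

    println(dropped.getTag(DatasetIdTag)) // None
    println(kept.getTag(DatasetIdTag))    // Some(Set(42))
  }
}
```

In this toy model, any later check that relies on the `dataset_id` tag would see nothing on the naively copied node, which mirrors why the detection could miss one join order.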

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.


SparkQA · 2021-10-04T19:37:04Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48332/

SparkQA · 2021-10-04T20:34:16Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48332/

SparkQA · 2021-10-04T23:46:31Z

Test build #143819 has finished for PR 34172 at commit d7c1f65.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-10-05T01:41:07Z

cc @cloud-fan FYI

cloud-fan · 2021-10-05T03:16:50Z

thanks, merging to master/3.2!

… avoid ambiguous self join

### What changes were proposed in this pull request?

This PR fixes an issue where an ambiguous self join is not detected when the left and right DataFrames are swapped.
For example:

```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause appears to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34172 from sarutak/fix-deduplication-issue.

Authored-by: Kousuke Saruta
Signed-off-by: Wenchen Fan
(cherry picked from commit fa1805d)
Signed-off-by: Wenchen Fan

dongjoon-hyun · 2021-10-06T17:56:20Z

Thank you, @sarutak and @cloud-fan. According to the JIRA issue, do we need this in branch-3.1, too?

sarutak · 2021-10-07T02:17:28Z

@dongjoon-hyun Thank you for letting me know. It seems better to backport. I'll do it.

…ld copy dataset_id tag to avoid ambiguous self join

### What changes were proposed in this pull request?

This PR mainly backports the change from SPARK-36874 (#34172), and partially backports SPARK-34634 (#31752) to handle the ambiguous self join for `ScriptTransformation`.

This PR fixes an issue where an ambiguous self join is not detected when the left and right DataFrames are swapped.
For example:

```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause appears to be that the inner function `collectConflictPlans` in `ResolveReference.dedupRight` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34205 from sarutak/backport-SPARK-36874.

Authored-by: Kousuke Saruta
Signed-off-by: Hyukjin Kwon


Fix the DeduplicateRelations to copy dataset_id tag to avoid ambiguous self join.

d7c1f65

github-actions bot added the SQL label Oct 4, 2021

cloud-fan approved these changes Oct 5, 2021

cloud-fan closed this in fa1805d Oct 5, 2021

sarutak mentioned this pull request Oct 7, 2021

[SPARK-36874][SPARK-34634][SQL][3.1] ResolveReference.dedupRight should copy dataset_id tag to avoid ambiguous self join #34205

Closed
