Skip to content

Conversation

@WangGuangxin
Copy link
Contributor

What changes were proposed in this pull request?

When we do self join with transform in a CTE, spark will throw AnalysisException.

A simple way to reproduce is

create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

before this patch, it throws

org.apache.spark.sql.AnalysisException: cannot resolve '`t1.b`' given input columns: [t1.b]; line 6 pos 41;
'Project ['t1.b]
+- 'Join Inner, ('t1.b = 't2.b)
   :- SubqueryAlias t1
   :  +- SubqueryAlias temp
   :     +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
   :        +- SubqueryAlias t
   :           +- Project [a#1]
   :              +- SubqueryAlias t
   :                 +- LocalRelation [a#1]
   +- SubqueryAlias t2
      +- SubqueryAlias temp
         +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
            +- SubqueryAlias t
               +- Project [a#1]
                  +- SubqueryAlias t
                     +- LocalRelation [a#1]

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Add a UT

@WangGuangxin WangGuangxin changed the title SPAKR-34634 ResolveReferences.dedupRight should handle ScriptTransform [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransform Mar 5, 2021
@github-actions github-actions bot added the SQL label Mar 5, 2021
@WangGuangxin WangGuangxin changed the title [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransform [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation Mar 5, 2021
@WangGuangxin WangGuangxin changed the title [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation [SPARK-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation Mar 5, 2021
@WangGuangxin
Copy link
Contributor Author

@cloud-fan @Ngone51 @maropu Could you please help review this?

@maropu
Copy link
Member

maropu commented Mar 5, 2021

ok to test

@maropu
Copy link
Member

maropu commented Mar 5, 2021

This issue can happen in v2.4, too (reading the jira ticket)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this test into transform.sql?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40383/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40383/

@WangGuangxin WangGuangxin force-pushed the selfjoin-with-transform branch from 0d2fb72 to 59ffb92 Compare March 5, 2021 09:48
@WangGuangxin
Copy link
Contributor Author

This issue can happen in v2.4, too (reading the jira ticket)?

yes, I've updated the jira's affects version

FROM t
) tmp;

-- SPARK-34634 self join using CTE contains transform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: SPARK-34634:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40390/

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40390/

Seq((oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))))

case oldVersion @ ScriptTransformation(_, _, output, _, _)
if AttributeSet(output).intersect(conflictingAttributes).nonEmpty =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 4 indents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@Ngone51
Copy link
Member

Ngone51 commented Mar 5, 2021

LGTM. The k8s test failure looks unrelated.

@SparkQA
Copy link

SparkQA commented Mar 5, 2021

Test build #135808 has finished for PR 31752 at commit 59ffb92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40411/

@SparkQA
Copy link

SparkQA commented Mar 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40411/

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

@WangGuangxin, it conflicts with other branches. Do you mind creating a PR to backport?

WangGuangxin added a commit to WangGuangxin/spark that referenced this pull request Mar 8, 2021
…ransformation

When we do self join with transform in a CTE, spark will throw AnalysisException.

A simple way to reproduce is

```
create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
```

before this patch, it throws

```
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.b`' given input columns: [t1.b]; line 6 pos 41;
'Project ['t1.b]
+- 'Join Inner, ('t1.b = 't2.b)
   :- SubqueryAlias t1
   :  +- SubqueryAlias temp
   :     +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
   :        +- SubqueryAlias t
   :           +- Project [a#1]
   :              +- SubqueryAlias t
   :                 +- LocalRelation [a#1]
   +- SubqueryAlias t2
      +- SubqueryAlias temp
         +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
            +- SubqueryAlias t
               +- Project [a#1]
                  +- SubqueryAlias t
                     +- LocalRelation [a#1]
```

NO

Add a UT

Closes apache#31752 from WangGuangxin/selfjoin-with-transform.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@WangGuangxin
Copy link
Contributor Author

@WangGuangxin, it conflicts with other branches. Do you mind creating a PR to backport?

sure. I'll send out later

HyukjinKwon pushed a commit that referenced this pull request Oct 7, 2021
…ld copy dataset_id tag to avoid ambiguous self join

### What changes were proposed in this pull request?

This PR backports the change of SPARK-36874 (#34172) mainly, and SPARK-34634 (#31752) partially to care about the ambiguous self join for `ScriptTransformation`.
This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped.
This is an example.
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2)) // Ambiguous self join is not detected.
```

The root cause seems that an inner function `collectConflictPlans` in `ResolveReference.dedupRight.` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34205 from sarutak/backport-SPARK-36874.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ld copy dataset_id tag to avoid ambiguous self join

### What changes were proposed in this pull request?

This PR backports the change of SPARK-36874 (apache#34172) mainly, and SPARK-34634 (apache#31752) partially to care about the ambiguous self join for `ScriptTransformation`.
This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped.
This is an example.
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")

df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.

df2.join(df1, df1("key1") === df2("key2)) // Ambiguous self join is not detected.
```

The root cause seems that an inner function `collectConflictPlans` in `ResolveReference.dedupRight.` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes apache#34205 from sarutak/backport-SPARK-36874.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants