[SPARK-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation #31752

WangGuangxin · 2021-03-05T04:56:31Z

What changes were proposed in this pull request?

When a CTE that contains a TRANSFORM clause is self-joined, Spark throws an AnalysisException.

A simple way to reproduce it:

create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

Before this patch, it throws:

org.apache.spark.sql.AnalysisException: cannot resolve '`t1.b`' given input columns: [t1.b]; line 6 pos 41;
'Project ['t1.b]
+- 'Join Inner, ('t1.b = 't2.b)
   :- SubqueryAlias t1
   :  +- SubqueryAlias temp
   :     +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
   :        +- SubqueryAlias t
   :           +- Project [a#1]
   :              +- SubqueryAlias t
   :                 +- LocalRelation [a#1]
   +- SubqueryAlias t2
      +- SubqueryAlias temp
         +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,	)),List((field.delim,	)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
            +- SubqueryAlias t
               +- Project [a#1]
                  +- SubqueryAlias t
                     +- LocalRelation [a#1]
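
In the plan above, both `ScriptTransformation` nodes output the same attribute `b#2`, so the two sides of the self join cannot be told apart. The idea behind deduplicating the right side can be sketched with a toy model. This is only an illustration: `Attr`, `Transform`, `freshId`, and this `dedupRight` are hypothetical stand-ins and not Spark's Catalyst classes or the code in this patch.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
object DedupSketch {
  // An attribute is identified by its expression ID, not by its name.
  final case class Attr(name: String, id: Long)

  // A stand-in for an operator that declares its own output attributes,
  // as ScriptTransformation does in the plan above.
  final case class Transform(script: String, output: Seq[Attr])

  // Hypothetical ID generator; it only has to return IDs not used so far.
  private var nextId: Long = 100L
  private def freshId(): Long = { nextId += 1; nextId }

  // If the right side shares any attribute ID with the left side,
  // re-issue every output attribute of the right side with a fresh ID.
  // Otherwise the right side is returned unchanged.
  def dedupRight(left: Transform, right: Transform): Transform = {
    val leftIds = left.output.map(_.id).toSet
    if (right.output.exists(a => leftIds.contains(a.id)))
      right.copy(output = right.output.map(a => a.copy(id = freshId())))
    else right
  }

  def main(args: Array[String]): Unit = {
    // Both sides of the self join start from the same CTE output, b#2.
    val temp = Transform("cat", Seq(Attr("b", 2L)))
    val right = dedupRight(temp, temp)
    println(s"left:  ${temp.output}")
    println(s"right: ${right.output}")
  }
}
```

After the toy `dedupRight` runs, the right side keeps the column name `b` but carries a new ID, so a reference such as `t1.b` maps to exactly one attribute.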

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test.

WangGuangxin · 2021-03-05T06:49:45Z

@cloud-fan @Ngone51 @maropu Could you please help review this?

maropu · 2021-03-05T07:22:35Z

ok to test

maropu · 2021-03-05T07:23:39Z

This issue can happen in v2.4, too (reading the jira ticket)?

maropu · 2021-03-05T07:25:18Z

sql/core/src/test/resources/sql-tests/inputs/selfjoin-with-transform.sql

Could you move this test into transform.sql?

SparkQA · 2021-03-05T09:02:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40383/

SparkQA · 2021-03-05T09:28:26Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40383/

WangGuangxin · 2021-03-05T09:52:18Z

This issue can happen in v2.4, too (reading the jira ticket)?

Yes, I've updated the JIRA ticket's Affects Version field.

maropu · 2021-03-05T10:36:51Z

sql/core/src/test/resources/sql-tests/inputs/transform.sql

  FROM t
 ) tmp;
+
+-- SPARK-34634 self join using CTE contains transform


super nit: SPARK-34634:

SparkQA · 2021-03-05T11:05:34Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40390/

SparkQA · 2021-03-05T11:39:38Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40390/

Ngone51 · 2021-03-05T13:08:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

          Seq((oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))))

+        case oldVersion @ ScriptTransformation(_, _, output, _, _)
+          if AttributeSet(output).intersect(conflictingAttributes).nonEmpty =>


nit: 4 indents

Ngone51 · 2021-03-05T13:23:01Z

LGTM. The k8s test failure looks unrelated.

SparkQA · 2021-03-05T14:43:41Z

Test build #135808 has finished for PR 31752 at commit 59ffb92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-03-06T15:12:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40411/

SparkQA · 2021-03-06T15:46:46Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40411/

HyukjinKwon

LGTM

HyukjinKwon · 2021-03-07T06:55:42Z

Merged to master.

HyukjinKwon · 2021-03-07T06:56:03Z

@WangGuangxin, it conflicts with other branches. Do you mind creating a PR to backport?

…ransformation

Closes apache#31752 from WangGuangxin/selfjoin-with-transform.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

WangGuangxin · 2021-03-08T15:22:37Z

@WangGuangxin, it conflicts with other branches. Do you mind creating a PR to backport?

Sure, I'll send one out later.

…ld copy dataset_id tag to avoid ambiguous self join

### What changes were proposed in this pull request?

This PR mainly backports the change from SPARK-36874 (#34172) and partially backports SPARK-34634 (#31752) to handle the ambiguous self join for `ScriptTransformation`.

This PR fixes an issue where an ambiguous self join is not detected if the left and right DataFrames are swapped. For example:

```
import spark.implicits._ // assumes a SparkSession named `spark`, for toDF and $
val df1 = Seq((1, 2, "A1"), (2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```

The root cause seems to be that the inner function `collectConflictPlans` in `ResolveReferences.dedupRight` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #34205 from sarutak/backport-SPARK-36874.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

WangGuangxin changed the title ~~SPAKR-34634 ResolveReferences.dedupRight should handle ScriptTransform~~ [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransform Mar 5, 2021

github-actions bot added the SQL label Mar 5, 2021

WangGuangxin changed the title ~~[SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransform~~ [SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation Mar 5, 2021

WangGuangxin changed the title ~~[SPAKR-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation~~ [SPARK-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation Mar 5, 2021

maropu reviewed Mar 5, 2021

SPAKR-34634 ResolveReferences.dedupRight should handle ScriptTransfor…

59ffb92

…m operator

WangGuangxin force-pushed the selfjoin-with-transform branch from 0d2fb72 to 59ffb92 Compare March 5, 2021 09:48

maropu reviewed Mar 5, 2021

maropu approved these changes Mar 5, 2021

Ngone51 reviewed Mar 5, 2021

fix style

d8ae6f0

HyukjinKwon approved these changes Mar 7, 2021

HyukjinKwon closed this in 9ec8696 Mar 7, 2021

sarutak mentioned this pull request Oct 7, 2021

[SPARK-36874][SPARK-34634][SQL][3.1] ResolveReference.dedupRight should copy dataset_id tag to avoid ambiguous self join #34205

Closed
