-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-34634][SQL] ResolveReferences.dedupRight should handle ScriptTransformation #31752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@cloud-fan @Ngone51 @maropu Could you please help review this? |
|
ok to test |
|
This issue can happen in v2.4, too (reading the jira ticket)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this test into transform.sql?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
0d2fb72 to
59ffb92
Compare
yes, I've updated the jira's affects version |
| FROM t | ||
| ) tmp; | ||
|
|
||
| -- SPARK-34634 self join using CTE contains transform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: SPARK-34634:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
| Seq((oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions)))) | ||
|
|
||
| case oldVersion @ ScriptTransformation(_, _, output, _, _) | ||
| if AttributeSet(output).intersect(conflictingAttributes).nonEmpty => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 4 indents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
|
LGTM. The k8s test failure looks unrelated. |
|
Test build #135808 has finished for PR 31752 at commit
|
|
Kubernetes integration test starting |
|
Kubernetes integration test status success |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master. |
|
@WangGuangxin, it conflicts with other branches. Do you mind creating a PR to backport? |
…ransformation
When we do self join with transform in a CTE, spark will throw AnalysisException.
A simple way to reproduce is
```
create temporary view t as select * from values 0, 1, 2 as t(a);
WITH temp AS (
SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
```
before this patch, it throws
```
org.apache.spark.sql.AnalysisException: cannot resolve '`t1.b`' given input columns: [t1.b]; line 6 pos 41;
'Project ['t1.b]
+- 'Join Inner, ('t1.b = 't2.b)
:- SubqueryAlias t1
: +- SubqueryAlias temp
: +- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
: +- SubqueryAlias t
: +- Project [a#1]
: +- SubqueryAlias t
: +- LocalRelation [a#1]
+- SubqueryAlias t2
+- SubqueryAlias temp
+- ScriptTransformation [a#1], cat, [b#2], ScriptInputOutputSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.DelimitedJSONSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- SubqueryAlias t
+- Project [a#1]
+- SubqueryAlias t
+- LocalRelation [a#1]
```
NO
Add a UT
Closes apache#31752 from WangGuangxin/selfjoin-with-transform.
Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sure. I'll send out later |
…ld copy dataset_id tag to avoid ambiguous self join ### What changes were proposed in this pull request? This PR backports the change of SPARK-36874 (#34172) mainly, and SPARK-34634 (#31752) partially to care about the ambiguous self join for `ScriptTransformation`. This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped. This is an example. ``` val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value") val df2 = df1.filter($"value" === "A2") df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown. df2.join(df1, df1("key1") === df2("key2)) // Ambiguous self join is not detected. ``` The root cause seems that an inner function `collectConflictPlans` in `ResolveReference.dedupRight.` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes #34205 from sarutak/backport-SPARK-36874. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ld copy dataset_id tag to avoid ambiguous self join ### What changes were proposed in this pull request? This PR backports the change of SPARK-36874 (apache#34172) mainly, and SPARK-34634 (apache#31752) partially to care about the ambiguous self join for `ScriptTransformation`. This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped. This is an example. ``` val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value") val df2 = df1.filter($"value" === "A2") df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown. df2.join(df1, df1("key1") === df2("key2)) // Ambiguous self join is not detected. ``` The root cause seems that an inner function `collectConflictPlans` in `ResolveReference.dedupRight.` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. Closes apache#34205 from sarutak/backport-SPARK-36874. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
When we do self join with transform in a CTE, spark will throw AnalysisException.
A simple way to reproduce is
before this patch, it throws
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Add a UT