
Conversation

@zhengruifeng (Contributor) commented Apr 3, 2024

What changes were proposed in this pull request?

Update the column resolution logic in Spark Connect.

Why are the changes needed?

Currently, the following self-join query fails in Spark Connect:

        df = spark.createDataFrame([(1, 2), (3, 4)], schema=["a", "b"])
        df2 = df.select(df.a.alias("aa"), df.b)
        df3 = df2.join(df, df2.b == df.b)

AnalysisException: [AMBIGUOUS_COLUMN_REFERENCE] Column "b" is ambiguous. It's because you joined several DataFrame together, and some of these DataFrames are the same.
This column points to one of the DataFrames but Spark is unable to figure out which one.
Please alias the DataFrames with different names via `DataFrame.alias` before joining them,
and specify the column using qualified name, e.g. `df.alias("a").join(df.alias("b"), col("a.id") > col("b.id"))`. SQLSTATE: 42702

Does this PR introduce any user-facing change?

Yes. The above query runs successfully after this PR.

This PR only affects Spark Connect; it won't affect Classic Spark.

How was this patch tested?

Added tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng force-pushed the fix_connect_self_join_depth branch from ffcbae7 to 0124dec on Apr 3, 2024 at 11:43
@zhengruifeng changed the title from "[WIP][SQL][CONNECT] Fix a self-join case with depth" to "[WIP][SQL][CONNECT] Fix a self-join failure" on Apr 3, 2024
@zhengruifeng changed the title from "[WIP][SQL][CONNECT] Fix a self-join failure" to "[SPARK-47713][SQL][CONNECT] Fix a self-join failure" on Apr 3, 2024
@zhengruifeng requested a review from cloud-fan on Apr 3, 2024 at 11:52
@zhengruifeng marked this pull request as ready for review on Apr 3, 2024 at 11:52
Contributor:

When can this condition be false — when `resolved.isEmpty` is true?

@zhengruifeng (Contributor, Author):

I think there are two cases:

1. Analyzer rules that support missing column resolution:

https://github.com/apache/spark/blob/923f04606fe6bee5913f8fce7aaa643984f79756/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala#L583-L607

2. The plan id is missing in some way (which would be a bug).

Contributor:

Shall we still have the `matched` flag? The new code is confusing, and I can't understand it even after reading your comment.

Member:

A comment in the code would be very helpful.

@zhengruifeng (Contributor, Author):

OK, let me restore `matched`.

@zhengruifeng force-pushed the fix_connect_self_join_depth branch from 923f046 to b400131 on Apr 4, 2024 at 01:01
@zhengruifeng (Contributor, Author):

The Row constructor doesn't accept duplicated names, so the test is written this way.

[screenshot omitted]
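The constraint here is at the Python level: keyword arguments must be unique, so a row like `Row(b=1, b=2)` cannot even be written. A minimal plain-Python sketch (no Spark needed) of why a test would instead build rows as tuples with an explicit schema:

```python
# Row(b=1, b=2) is rejected by Python itself: "keyword argument repeated".
# compile() lets us observe the SyntaxError without importing pyspark.
try:
    compile("Row(b=1, b=2)", "<sketch>", "eval")
    is_syntax_error = False
except SyntaxError:
    is_syntax_error = True

# Plain tuples carry no such restriction, so a DataFrame with duplicated
# column names can be built through an explicit schema instead, e.g.
# spark.createDataFrame([(1, 1)], schema=["b", "b"]).
```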

@zhengruifeng force-pushed the fix_connect_self_join_depth branch from b400131 to 3aa147e on Apr 7, 2024 at 01:03
@zhengruifeng force-pushed the fix_connect_self_join_depth branch from 3aa147e to eda211c ("fix scala style") on Apr 7, 2024 at 03:04
@cloud-fan (Contributor):

Thanks, merging to master!

@cloud-fan closed this in 3a39ac2 on Apr 8, 2024
@zhengruifeng deleted the fix_connect_self_join_depth branch on Apr 8, 2024 at 04:10