-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| |SELECT | ||
| | 1 | ||
| |FROM foo | ||
| |FULL OUTER JOIN bar USING(id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full outer join will trigger the rule AddMetadataColumns to add meta columns over the whole plan.
Before rule AddMetadataColumns:
!Filter isnotnull(id#223)
+- Project [coalesce(id#227, id#225) AS id#228]
+- Join FullOuter, (id#227 = id#225)
:- Project [coalesce(id#223, id#224) AS id#227]
: +- Join FullOuter, (id#223 = id#224)
: :- SubqueryAlias foo
: : +- CTERelationRef 2, true, [id#223]
: +- SubqueryAlias bar
: +- CTERelationRef 3, true, [id#224]
+- SubqueryAlias spark_catalog.default.view_1
+- View (`spark_catalog`.`default`.`view_1`, [id#225])
+- Project [cast(id#226 as string) AS id#225]
+- WithCTE
:- CTERelationDef 4, false
: +- SubqueryAlias source
: +- Project [a#218, s#219]
: +- SubqueryAlias spark_catalog.default.table_1
: +- Relation spark_catalog.default.table_1[a#218,s#219] parquet
:- CTERelationDef 5, false
: +- SubqueryAlias renamed
: +- Project [s#219.id AS id#226]
: +- SubqueryAlias source
: +- CTERelationRef 4, true, [a#218, s#219]
+- Project [id#226]
+- SubqueryAlias renamed
+- CTERelationRef 5, true, [id#226]
After the rule:
Filter isnotnull(id#223)
+- Project [id#227, id#225, id#223, id#224, coalesce(id#227, id#225) AS id#228]
+- Join FullOuter, (id#227 = id#225)
:- Project [id#223, id#224, coalesce(id#223, id#224) AS id#227]
: +- Join FullOuter, (id#223 = id#224)
: :- SubqueryAlias foo
: : +- CTERelationRef 2, true, [id#223]
: +- SubqueryAlias bar
: +- CTERelationRef 3, true, [id#224]
+- SubqueryAlias spark_catalog.default.view_1
+- View (`spark_catalog`.`default`.`view_1`, [id#225])
+- Project [cast(id#226 as string) AS id#225]
+- WithCTE
:- CTERelationDef 4, false
: +- SubqueryAlias source
: +- Project [_metadata#221, a#218, s#219]
: +- SubqueryAlias spark_catalog.default.table_1
: +- Relation spark_catalog.default.table_1[a#218,s#219,_metadata#221] parquet
:- CTERelationDef 5, false
: +- SubqueryAlias renamed
: +- Project [s#219.id AS id#226]
: +- SubqueryAlias source
: +- CTERelationRef 4, true, [a#218, s#219]
+- Project [id#226]
+- SubqueryAlias renamed
+- CTERelationRef 5, true, [id#226]
| case p: Project => | ||
| val newProj = p.copy( | ||
| projectList = p.metadataOutput ++ p.projectList, | ||
| projectList = p.projectList ++ p.metadataOutput, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this cause any user-facing query result change? I guess it's not. Just want to get your confirmation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. The analyzer adds the metadata columns so the parent operators can access theym. If the metadata columns are not used, they will be pruned.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @gengliangwang .
|
+1, LGTM. Merging to 3.3/master. |
…roject list
### What changes were proposed in this pull request?
For the following query:
```
CREATE TABLE table_1 (
a ARRAY<STRING>,
s STRUCT<id: STRING>)
USING parquet;
CREATE VIEW view_1 (id)
AS WITH source AS (
SELECT * FROM table_1
),
renamed AS (
SELECT
s.id
FROM source
)
SELECT id FROM renamed;
with foo AS (
SELECT 'a' as id
),
bar AS (
SELECT 'a' as id
)
SELECT
1
FROM foo
FULL OUTER JOIN bar USING(id)
FULL OUTER JOIN view_1 USING(id)
WHERE foo.id IS NOT NULL
```
There will be the following error:
```
class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)
```
This is caused by the inconsistent metadata column positions in the following two nodes:
* Table relation: at the ending position
* Project list: at the beginning position
<img width="1442" alt="image" src="https://user-images.githubusercontent.com/1097932/207992343-438714bc-e1d1-46f7-9a79-84ab83dd299f.png">
When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output.
<img width="1438" alt="image" src="https://user-images.githubusercontent.com/1097932/207992431-f4cfc774-4cab-4728-b109-2ebff94e5fe2.png">
Thus the column `a ARRAY<STRING>` is casted as `s STRUCT<id: STRING>` and cause the error.
This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation.
### Why are the changes needed?
Bug fix
### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug in the analysis rule `AddMetadataColumns`
### How was this patch tested?
New test case
Closes #39081 from gengliangwang/fixMetadata.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit 172f719)
Signed-off-by: Max Gekk <[email protected]>
|
late LGTM |
### What changes were proposed in this pull request? Ideally it's OK to always propagate metadata columns, as column pruning will kick in later and prune them aways if they are not used. However, it may cause problems in cases like CTE. #39081 fixed such a bug. This PR only propagates metadata columns if they are used, to keep the analyzed plan simple and reliable. ### Why are the changes needed? avoid potential bugs. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #39152 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…tadataColumnSuite ### What changes were proposed in this pull request? Move the new test case for Metadata column in #39081 to `MetadataColumnSuite` ### Why are the changes needed? All metadata column related test cases should go into `MetadataColumnSuite`. For example: - #37758 - #39152 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests Closes #39425 from gengliangwang/moveTest. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
What changes were proposed in this pull request?
For the following query:
There will be the following error:
This is caused by the inconsistent metadata column positions in the following two nodes:
When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output.

Thus the column
a ARRAY<STRING>is casted ass STRUCT<id: STRING>and cause the error.This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation.
Why are the changes needed?
Bug fix
Does this PR introduce any user-facing change?
Yes, it fixes a bug in the analysis rule
AddMetadataColumnsHow was this patch tested?
New test case