[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081

gengliangwang · 2022-12-16T00:10:16Z

What changes were proposed in this pull request?

For the following query:

CREATE TABLE table_1 (
  a ARRAY<STRING>,
 s STRUCT<id: STRING>)
USING parquet;

CREATE VIEW view_1 (id)
AS WITH source AS (
    SELECT * FROM table_1
),
renamed AS (
    SELECT
     s.id
    FROM source
)
SELECT id FROM renamed;

with foo AS (
  SELECT 'a' as id
),
bar AS (
  SELECT 'a' as id
)
SELECT
  1
FROM foo
FULL OUTER JOIN bar USING(id)
FULL OUTER JOIN view_1 USING(id)
WHERE foo.id IS NOT NULL

There will be the following error:

class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)

This is caused by the inconsistent metadata column positions in the following two nodes:

Table relation: at the ending position
Project list: at the beginning position

When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output.

Thus the column a ARRAY<STRING> is casted as s STRUCT<id: STRING> and cause the error.

This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation.

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

Yes, it fixes a bug in the analysis rule AddMetadataColumns

How was this patch tested?

New test case

gengliangwang · 2022-12-16T00:12:13Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+             |SELECT
+             |  1
+             |FROM foo
+             |FULL OUTER JOIN bar USING(id)


The full outer join will trigger the rule AddMetadataColumns to add meta columns over the whole plan.
Before rule AddMetadataColumns:

!Filter isnotnull(id#223) +- Project [coalesce(id#227, id#225) AS id#228] +- Join FullOuter, (id#227 = id#225) :- Project [coalesce(id#223, id#224) AS id#227] : +- Join FullOuter, (id#223 = id#224) : :- SubqueryAlias foo : : +- CTERelationRef 2, true, [id#223] : +- SubqueryAlias bar : +- CTERelationRef 3, true, [id#224] +- SubqueryAlias spark_catalog.default.view_1 +- View (`spark_catalog`.`default`.`view_1`, [id#225]) +- Project [cast(id#226 as string) AS id#225] +- WithCTE :- CTERelationDef 4, false : +- SubqueryAlias source : +- Project [a#218, s#219] : +- SubqueryAlias spark_catalog.default.table_1 : +- Relation spark_catalog.default.table_1[a#218,s#219] parquet :- CTERelationDef 5, false : +- SubqueryAlias renamed : +- Project [s#219.id AS id#226] : +- SubqueryAlias source : +- CTERelationRef 4, true, [a#218, s#219] +- Project [id#226] +- SubqueryAlias renamed +- CTERelationRef 5, true, [id#226]

After the rule:

Filter isnotnull(id#223) +- Project [id#227, id#225, id#223, id#224, coalesce(id#227, id#225) AS id#228] +- Join FullOuter, (id#227 = id#225) :- Project [id#223, id#224, coalesce(id#223, id#224) AS id#227] : +- Join FullOuter, (id#223 = id#224) : :- SubqueryAlias foo : : +- CTERelationRef 2, true, [id#223] : +- SubqueryAlias bar : +- CTERelationRef 3, true, [id#224] +- SubqueryAlias spark_catalog.default.view_1 +- View (`spark_catalog`.`default`.`view_1`, [id#225]) +- Project [cast(id#226 as string) AS id#225] +- WithCTE :- CTERelationDef 4, false : +- SubqueryAlias source : +- Project [_metadata#221, a#218, s#219] : +- SubqueryAlias spark_catalog.default.table_1 : +- Relation spark_catalog.default.table_1[a#218,s#219,_metadata#221] parquet :- CTERelationDef 5, false : +- SubqueryAlias renamed : +- Project [s#219.id AS id#226] : +- SubqueryAlias source : +- CTERelationRef 4, true, [a#218, s#219] +- Project [id#226] +- SubqueryAlias renamed +- CTERelationRef 5, true, [id#226]

dongjoon-hyun · 2022-12-16T04:31:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

      case p: Project =>
        val newProj = p.copy(
-          projectList = p.metadataOutput ++ p.projectList,
+          projectList = p.projectList ++ p.metadataOutput,


Does this cause any user-facing query result change? I guess it's not. Just want to get your confirmation.

I don't think so. The analyzer adds the metadata columns so the parent operators can access theym. If the metadata columns are not used, they will be pruned.

dongjoon-hyun

+1, LGTM. Thank you, @gengliangwang .

MaxGekk · 2022-12-16T07:42:51Z

+1, LGTM. Merging to 3.3/master.
Thank you, @gengliangwang and @dongjoon-hyun @viirya for review.

…roject list ### What changes were proposed in this pull request? For the following query: ``` CREATE TABLE table_1 ( a ARRAY<STRING>, s STRUCT<id: STRING>) USING parquet; CREATE VIEW view_1 (id) AS WITH source AS ( SELECT * FROM table_1 ), renamed AS ( SELECT s.id FROM source ) SELECT id FROM renamed; with foo AS ( SELECT 'a' as id ), bar AS ( SELECT 'a' as id ) SELECT 1 FROM foo FULL OUTER JOIN bar USING(id) FULL OUTER JOIN view_1 USING(id) WHERE foo.id IS NOT NULL ``` There will be the following error: ``` class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app') at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108) at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108) ``` This is caused by the inconsistent metadata column positions in the following two nodes: * Table relation: at the ending position * Project list: at the beginning position <img width="1442" alt="image" src="https://user-images.githubusercontent.com/1097932/207992343-438714bc-e1d1-46f7-9a79-84ab83dd299f.png"> When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output. <img width="1438" alt="image" src="https://user-images.githubusercontent.com/1097932/207992431-f4cfc774-4cab-4728-b109-2ebff94e5fe2.png"> Thus the column `a ARRAY<STRING>` is casted as `s STRUCT<id: STRING>` and cause the error. This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation. ### Why are the changes needed? Bug fix ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug in the analysis rule `AddMetadataColumns` ### How was this patch tested? New test case Closes #39081 from gengliangwang/fixMetadata. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Max Gekk <[email protected]> (cherry picked from commit 172f719) Signed-off-by: Max Gekk <[email protected]>

cloud-fan · 2022-12-19T08:09:48Z

late LGTM

### What changes were proposed in this pull request? Ideally it's OK to always propagate metadata columns, as column pruning will kick in later and prune them aways if they are not used. However, it may cause problems in cases like CTE. #39081 fixed such a bug. This PR only propagates metadata columns if they are used, to keep the analyzed plan simple and reliable. ### Why are the changes needed? avoid potential bugs. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #39152 from cloud-fan/follow. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…tadataColumnSuite ### What changes were proposed in this pull request? Move the new test case for Metadata column in #39081 to `MetadataColumnSuite` ### Why are the changes needed? All metadata column related test cases should go into `MetadataColumnSuite`. For example: - #37758 - #39152 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests Closes #39425 from gengliangwang/moveTest. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>

Metadata column should be appended at the end of project

0f284f0

github-actions bot added the SQL label Dec 16, 2022

gengliangwang commented Dec 16, 2022

View reviewed changes

gengliangwang requested review from MaxGekk, cloud-fan, dongjoon-hyun and viirya December 16, 2022 00:12

dongjoon-hyun reviewed Dec 16, 2022

View reviewed changes

dongjoon-hyun approved these changes Dec 16, 2022

View reviewed changes

MaxGekk approved these changes Dec 16, 2022

View reviewed changes

viirya approved these changes Dec 16, 2022

View reviewed changes

MaxGekk closed this in 172f719 Dec 16, 2022

cloud-fan mentioned this pull request Dec 21, 2022

[SPARK-41660][SQL] Only propagate metadata columns if they are used #39152

Closed

gengliangwang mentioned this pull request Jan 6, 2023

[SPARK-41538][FollowUp][TESTS] Move a metadata column test case to MetadataColumnSuite #39425

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081

[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081

Uh oh!

gengliangwang commented Dec 16, 2022

Uh oh!

gengliangwang Dec 16, 2022

Uh oh!

dongjoon-hyun Dec 16, 2022 •

edited

Loading

Uh oh!

gengliangwang Dec 16, 2022 •

edited

Loading

Uh oh!

dongjoon-hyun left a comment

Uh oh!

MaxGekk commented Dec 16, 2022

Uh oh!

cloud-fan commented Dec 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081

[SPARK-41538][SQL] Metadata column should be appended at the end of project list #39081

Uh oh!

Conversation

gengliangwang commented Dec 16, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

gengliangwang Dec 16, 2022

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Dec 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gengliangwang Dec 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

MaxGekk commented Dec 16, 2022

Uh oh!

cloud-fan commented Dec 19, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

dongjoon-hyun Dec 16, 2022 •

edited

Loading

gengliangwang Dec 16, 2022 •

edited

Loading