Skip to content

Conversation

@gengliangwang
Copy link
Member

What changes were proposed in this pull request?

For the following query:

CREATE TABLE table_1 (
  a ARRAY<STRING>,
 s STRUCT<id: STRING>)
USING parquet;

CREATE VIEW view_1 (id)
AS WITH source AS (
    SELECT * FROM table_1
),
renamed AS (
    SELECT
     s.id
    FROM source
)
SELECT id FROM renamed;

with foo AS (
  SELECT 'a' as id
),
bar AS (
  SELECT 'a' as id
)
SELECT
  1
FROM foo
FULL OUTER JOIN bar USING(id)
FULL OUTER JOIN view_1 USING(id)
WHERE foo.id IS NOT NULL

There will be the following error:

class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)

This is caused by the inconsistent metadata column positions in the following two nodes:

  • Table relation: at the ending position
  • Project list: at the beginning position

image

When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output.
image

Thus the column a ARRAY<STRING> is casted as s STRUCT<id: STRING> and cause the error.

This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation.

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

Yes, it fixes a bug in the analysis rule AddMetadataColumns

How was this patch tested?

New test case

@github-actions github-actions bot added the SQL label Dec 16, 2022
|SELECT
| 1
|FROM foo
|FULL OUTER JOIN bar USING(id)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full outer join will trigger the rule AddMetadataColumns to add meta columns over the whole plan.
Before rule AddMetadataColumns:

!Filter isnotnull(id#223)
+- Project [coalesce(id#227, id#225) AS id#228]
   +- Join FullOuter, (id#227 = id#225)
      :- Project [coalesce(id#223, id#224) AS id#227]
      :  +- Join FullOuter, (id#223 = id#224)
      :     :- SubqueryAlias foo
      :     :  +- CTERelationRef 2, true, [id#223]
      :     +- SubqueryAlias bar
      :        +- CTERelationRef 3, true, [id#224]
      +- SubqueryAlias spark_catalog.default.view_1
         +- View (`spark_catalog`.`default`.`view_1`, [id#225])
            +- Project [cast(id#226 as string) AS id#225]
               +- WithCTE
                  :- CTERelationDef 4, false
                  :  +- SubqueryAlias source
                  :     +- Project [a#218, s#219]
                  :        +- SubqueryAlias spark_catalog.default.table_1
                  :           +- Relation spark_catalog.default.table_1[a#218,s#219] parquet
                  :- CTERelationDef 5, false
                  :  +- SubqueryAlias renamed
                  :     +- Project [s#219.id AS id#226]
                  :        +- SubqueryAlias source
                  :           +- CTERelationRef 4, true, [a#218, s#219]
                  +- Project [id#226]
                     +- SubqueryAlias renamed
                        +- CTERelationRef 5, true, [id#226]

After the rule:

Filter isnotnull(id#223)
+- Project [id#227, id#225, id#223, id#224, coalesce(id#227, id#225) AS id#228]
   +- Join FullOuter, (id#227 = id#225)
      :- Project [id#223, id#224, coalesce(id#223, id#224) AS id#227]
      :  +- Join FullOuter, (id#223 = id#224)
      :     :- SubqueryAlias foo
      :     :  +- CTERelationRef 2, true, [id#223]
      :     +- SubqueryAlias bar
      :        +- CTERelationRef 3, true, [id#224]
      +- SubqueryAlias spark_catalog.default.view_1
         +- View (`spark_catalog`.`default`.`view_1`, [id#225])
            +- Project [cast(id#226 as string) AS id#225]
               +- WithCTE
                  :- CTERelationDef 4, false
                  :  +- SubqueryAlias source
                  :     +- Project [_metadata#221, a#218, s#219]
                  :        +- SubqueryAlias spark_catalog.default.table_1
                  :           +- Relation spark_catalog.default.table_1[a#218,s#219,_metadata#221] parquet
                  :- CTERelationDef 5, false
                  :  +- SubqueryAlias renamed
                  :     +- Project [s#219.id AS id#226]
                  :        +- SubqueryAlias source
                  :           +- CTERelationRef 4, true, [a#218, s#219]
                  +- Project [id#226]
                     +- SubqueryAlias renamed
                        +- CTERelationRef 5, true, [id#226]

case p: Project =>
val newProj = p.copy(
projectList = p.metadataOutput ++ p.projectList,
projectList = p.projectList ++ p.metadataOutput,
Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause any user-facing query result change? I guess it's not. Just want to get your confirmation.

Copy link
Member Author

@gengliangwang gengliangwang Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The analyzer adds the metadata columns so the parent operators can access theym. If the metadata columns are not used, they will be pruned.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @gengliangwang .

@MaxGekk
Copy link
Member

MaxGekk commented Dec 16, 2022

+1, LGTM. Merging to 3.3/master.
Thank you, @gengliangwang and @dongjoon-hyun @viirya for review.

@MaxGekk MaxGekk closed this in 172f719 Dec 16, 2022
MaxGekk pushed a commit that referenced this pull request Dec 16, 2022
…roject list

### What changes were proposed in this pull request?

For the following query:
```
CREATE TABLE table_1 (
  a ARRAY<STRING>,
 s STRUCT<id: STRING>)
USING parquet;

CREATE VIEW view_1 (id)
AS WITH source AS (
    SELECT * FROM table_1
),
renamed AS (
    SELECT
     s.id
    FROM source
)
SELECT id FROM renamed;

with foo AS (
  SELECT 'a' as id
),
bar AS (
  SELECT 'a' as id
)
SELECT
  1
FROM foo
FULL OUTER JOIN bar USING(id)
FULL OUTER JOIN view_1 USING(id)
WHERE foo.id IS NOT NULL
```

There will be the following error:
```
class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.apache.spark.sql.types.ArrayType cannot be cast to class org.apache.spark.sql.types.StructType (org.apache.spark.sql.types.ArrayType and org.apache.spark.sql.types.StructType are in unnamed module of loader 'app')
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema$lzycompute(complexTypeExtractors.scala:108)
    at org.apache.spark.sql.catalyst.expressions.GetStructField.childSchema(complexTypeExtractors.scala:108)
```
This is caused by the inconsistent metadata column positions in the following two nodes:
* Table relation: at the ending position
* Project list: at the beginning position
<img width="1442" alt="image" src="https://user-images.githubusercontent.com/1097932/207992343-438714bc-e1d1-46f7-9a79-84ab83dd299f.png">

When the InlineCTE rule executes, the metadata column in the project is wrongly combined with the table output.
<img width="1438" alt="image" src="https://user-images.githubusercontent.com/1097932/207992431-f4cfc774-4cab-4728-b109-2ebff94e5fe2.png">

Thus the column `a ARRAY<STRING>` is casted as `s STRUCT<id: STRING>` and cause the error.

This PR is to fix the issue by putting the Metadata column at the end of project list, so that it is consistent with the table relation.
### Why are the changes needed?

Bug fix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug in the analysis rule `AddMetadataColumns`

### How was this patch tested?

New test case

Closes #39081 from gengliangwang/fixMetadata.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit 172f719)
Signed-off-by: Max Gekk <[email protected]>
@cloud-fan
Copy link
Contributor

late LGTM

cloud-fan added a commit that referenced this pull request Dec 21, 2022
### What changes were proposed in this pull request?

Ideally it's OK to always propagate metadata columns, as column pruning will kick in later and prune them aways if they are not used. However, it may cause problems in cases like CTE. #39081 fixed such a bug.

This PR only propagates metadata columns if they are used, to keep the analyzed plan simple and reliable.

### Why are the changes needed?

avoid potential bugs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #39152 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
gengliangwang added a commit that referenced this pull request Jan 6, 2023
…tadataColumnSuite

### What changes were proposed in this pull request?

Move the new test case for Metadata column in #39081 to `MetadataColumnSuite`

### Why are the changes needed?

All metadata column related test cases should go into `MetadataColumnSuite`. For example:

- #37758
- #39152

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests

Closes #39425 from gengliangwang/moveTest.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants