[SPARK-26551][SQL] Fix schema pruning error when selecting one complex field and having is not null predicate on another one #23474

viirya · 2019-01-06T04:31:30Z

What changes were proposed in this pull request?

Schema pruning throws an error when a query selects one complex (nested) field and applies an is not null predicate to another one:

val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))

java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(ParquetSchemaPruning.scala:153)

How was this patch tested?

Added tests.

viirya · 2019-01-06T04:31:59Z

cc @dbtsai @cloud-fan @dongjoon-hyun

SparkQA · 2019-01-06T08:05:02Z

Test build #100820 has finished for PR 23474 at commit 4f5a91a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-01-06T08:08:58Z

retest this please.

SparkQA · 2019-01-06T09:51:51Z

Test build #100825 has finished for PR 23474 at commit 4f5a91a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-01-06T10:22:50Z

retest this please...

SparkQA · 2019-01-06T14:17:00Z

Test build #100829 has finished for PR 23474 at commit 4f5a91a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-01-06T21:07:34Z

Thank you for pinging me, @viirya !

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

SparkQA · 2019-01-07T06:09:44Z

Test build #100861 has finished for PR 23474 at commit 6dbd753.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-07T06:41:04Z

retest this please

SparkQA · 2019-01-07T08:05:01Z

Test build #100864 has finished for PR 23474 at commit 6dbd753.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2019-01-07T08:26:35Z

retest this please.

SparkQA · 2019-01-07T12:24:41Z

Test build #100869 has finished for PR 23474 at commit 6dbd753.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2019-01-07T20:07:20Z

When we read name.first and check if name is not null, we are marking name as contentAccessed = false to avoid reading the entire name column.

Now, the issue is that when we read name.first and check if name.middle is not null, we still mark name.middle as contentAccessed = false, which results in java.lang.IllegalArgumentException: middle does not exist.

To fix the root cause and avoid misunderstanding what contentAccessed means, we could mark name.middle as contentAccessed = true in the second case; the change here would then not be required.

What do you think?

Thanks.
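A minimal sketch of the failure described above, using a toy model instead of Spark's classes. Everything here (`DType`, `Struct`, `RootField`, `dropByName`, `prunedSchema`) is a hypothetical stand-in, not the Spark API, and nullability is left out. The filter mirrors the pre-fix rule `!rootFields.exists(_.field.name == opt.field.name)`: an optional field is dropped as soon as any required field has the same top-level name.

```scala
object NameOnlyFilterSketch {
  // Toy stand-ins for Spark's DataType/StructType (assumption: not the real API).
  sealed trait DType
  case object Str extends DType
  final case class Struct(fields: Vector[(String, DType)]) extends DType

  // A root field request: top-level column name plus the nested type it needs.
  final case class RootField(name: String, dataType: DType)

  // Union two types; struct fields are merged by name, left fields first.
  def merge(left: DType, right: DType): DType = (left, right) match {
    case (Struct(l), Struct(r)) =>
      val rightByName = r.toMap
      val fromLeft = l.map { case (n, t) =>
        n -> rightByName.get(n).map(rt => merge(t, rt)).getOrElse(t)
      }
      val leftNames = l.map(_._1).toSet
      Struct(fromLeft ++ r.filterNot { case (n, _) => leftNames(n) })
    case (a, b) if a == b => a
    case (a, b) => throw new IllegalArgumentException(s"cannot merge $a and $b")
  }

  // Pre-fix rule: drop an optional field whenever a required field shares its name.
  def dropByName(required: Seq[RootField], optional: Seq[RootField]): Seq[RootField] =
    optional.filter(opt => !required.exists(_.name == opt.name)) ++ required

  // Merge the surviving requests into one pruned type per top-level column.
  def prunedSchema(fields: Seq[RootField]): Map[String, DType] =
    fields.groupBy(_.name).map { case (n, fs) =>
      n -> fs.map(_.dataType).reduce((a, b) => merge(a, b))
    }

  // name.last comes from `last = 'Jones'`; name.middle from `name.middle is not null`.
  val required = Seq(RootField("name", Struct(Vector("last" -> Str))))
  val optional = Seq(RootField("name", Struct(Vector("middle" -> Str))))

  // The optional request is dropped, so the pruned `name` struct keeps only `last`.
  val pruned: Map[String, DType] = prunedSchema(dropByName(required, optional))
}
```

In this model the pruned `name` type contains only `last`, which matches the "middle does not exist. Available: last" message in the stack trace.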

viirya · 2019-01-08T03:37:14Z

I think it is harder to mark name.middle as contentAccessed = false. It is true that we can check all field accesses and see if middle is not accessed by others, but I feel that is more difficult and the current fix is simpler.

dbtsai · 2019-01-08T19:44:48Z

If it's really difficult to mark name.middle as contentAccessed = true in this case (which I feel is a less hacky solution), can we reformat the code as follows, with documentation?

      !rootFields.exists { root =>
        root.field.name == opt.field.name && {
          // If the merged field type of root and opt field is different from opt field type,
          // we will keep it.
          // For example, when root field type is `struct<name:struct<last:string>>`,
          // and opt field type is `struct<name:struct<middle:string>>`, the merged field type will be
          // `struct<name:struct<last:string,middle:string>>`. Since the merged one contains more
          // nested fields than opt field type, we have to keep it.
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }
      }

Add @hvanhovell @gatorsmile for more input.

Thanks!
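The merge check in the snippet above can be tried out on a toy model. This is a sketch under assumptions: `DType`, `Struct`, `RootField`, and `keepNeededOptional` are hypothetical names rather than Spark classes, plain equality stands in for `sameType` because the model has no nullability, and the optional field for `name is not null` is assumed to carry the full `name` struct.

```scala
object MergeCheckSketch {
  // Toy stand-ins for Spark's DataType/StructType (assumption: not the real API).
  sealed trait DType
  case object Str extends DType
  final case class Struct(fields: Vector[(String, DType)]) extends DType
  final case class RootField(name: String, dataType: DType)

  // Union two types; struct fields are merged by name, left fields first.
  def merge(left: DType, right: DType): DType = (left, right) match {
    case (Struct(l), Struct(r)) =>
      val rightByName = r.toMap
      val fromLeft = l.map { case (n, t) =>
        n -> rightByName.get(n).map(rt => merge(t, rt)).getOrElse(t)
      }
      val leftNames = l.map(_._1).toSet
      Struct(fromLeft ++ r.filterNot { case (n, _) => leftNames(n) })
    case (a, b) if a == b => a
    case (a, b) => throw new IllegalArgumentException(s"cannot merge $a and $b")
  }

  // Drop an optional field only if merging a same-named required field into it
  // adds nothing, i.e. the merged type equals the optional field's own type.
  def keepNeededOptional(required: Seq[RootField], optional: Seq[RootField]): Seq[RootField] =
    optional.filter { opt =>
      !required.exists { root =>
        root.name == opt.name && merge(opt.dataType, root.dataType) == opt.dataType
      }
    } ++ required

  private val fullName = Struct(Vector("first" -> Str, "middle" -> Str, "last" -> Str))

  // Case 1: read name.first, filter `name is not null`.
  // The required field is already contained in the optional one, so the optional is dropped.
  val case1: Seq[RootField] = keepNeededOptional(
    required = Seq(RootField("name", Struct(Vector("first" -> Str)))),
    optional = Seq(RootField("name", fullName)))

  // Case 2: filter on name.last, plus `name.middle is not null`.
  // Merging adds `last` to the optional type, so the optional field must be kept.
  val case2: Seq[RootField] = keepNeededOptional(
    required = Seq(RootField("name", Struct(Vector("last" -> Str)))),
    optional = Seq(RootField("name", Struct(Vector("middle" -> Str)))))
}
```

In case 1 only the `first` request survives, so the whole `name` column is not read; in case 2 both the `middle` and `last` requests survive, which is the behavior the fix is after.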

dongjoon-hyun · 2019-01-08T22:11:14Z

+1 for @dbtsai 's refactored code.

dbtsai · 2019-01-10T18:24:27Z

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala


    optRootFields.filter { opt =>
-      !rootFields.exists(_.field.name == opt.field.name)
+      val optFieldType = StructType(Array(opt.field))


Can we move optFieldType right after val rootFieldType = StructType(Array(root.field))? Thanks!

Do you mean moving it inside the exists call?

I put it outside the exists call so it can be reused, no? Is moving it after rootFieldType for readability?

It's not very expensive, and we only need to compute it when root.field.name == opt.field.name. As a result, I feel moving it right after val rootFieldType will be more readable.

        root.field.name == opt.field.name && {
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }

Ok. I see. Let me move it. Thanks.

dbtsai · 2019-01-10T18:26:14Z

LGTM. Just one minor comment. Thanks!

SparkQA · 2019-01-10T20:18:53Z

Test build #101024 has finished for PR 23474 at commit ff1cc85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-11T12:32:59Z

Test build #101075 has finished for PR 23474 at commit ff2fa67.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(ParquetSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes #23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh
Signed-off-by: DB Tsai
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai

dbtsai · 2019-01-11T19:24:32Z

LGTM. Merged into master and 2.4 branch. Thanks!

Fix schema pruning error when selecting one complex field and having …

4f5a91a

…is not null predicate on another one.

HyukjinKwon reviewed Jan 7, 2019

...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala

Address comment.

6dbd753

Address comment.

ff1cc85

dbtsai reviewed Jan 10, 2019

Move optFieldType.

ff2fa67

asfgit closed this in 50ebf3a Jan 11, 2019

viirya deleted the SPARK-26551 branch December 27, 2023 18:36
