Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Jan 6, 2019

What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
java.lang.IllegalArgumentException: middle does not exist. Available: last                                                                    
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)                                                         
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)                                                                              
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)                                                                     
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)                                
[info]   at scala.Option.map(Option.scala:163)                                                                                                         
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)                                           
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)                                                 
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)                                                                             

How was this patch tested?

Added tests.

@viirya
Copy link
Member Author

viirya commented Jan 6, 2019

cc @dbtsai @cloud-fan @dongjoon-hyun

@SparkQA
Copy link

SparkQA commented Jan 6, 2019

Test build #100820 has finished for PR 23474 at commit 4f5a91a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jan 6, 2019

retest this please.

@SparkQA
Copy link

SparkQA commented Jan 6, 2019

Test build #100825 has finished for PR 23474 at commit 4f5a91a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jan 6, 2019

retest this please...

@SparkQA
Copy link

SparkQA commented Jan 6, 2019

Test build #100829 has finished for PR 23474 at commit 4f5a91a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Thank you for pinging me, @viirya !

@SparkQA
Copy link

SparkQA commented Jan 7, 2019

Test build #100861 has finished for PR 23474 at commit 6dbd753.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Jan 7, 2019

Test build #100864 has finished for PR 23474 at commit 6dbd753.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jan 7, 2019

retest this please.

@SparkQA
Copy link

SparkQA commented Jan 7, 2019

Test build #100869 has finished for PR 23474 at commit 6dbd753.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Copy link
Member

dbtsai commented Jan 7, 2019

When we read name.first and check if name is not null, we are marking name as contentAccessed = false to avoid reading the entire name column.

Now, the issue is when we read name.first and check if name.middle is not null, we are still marking name.middle as contentAccessed = false resulting java.lang.IllegalArgumentException: middle does not exist..

To fix the root cause, and avoid the misunderstanding of the meaning of contentAccessed, we might mark contentAccessed = true in the second case; thus, this change here will not be required.

What do you think?

Thanks.

@viirya
Copy link
Member Author

viirya commented Jan 8, 2019

I think It is harder to mark name.middle as contentAccessed = false. It is true we can check all field accesses and see if middle is not accessed by others. But I feel it is more difficult to do that and current fix is simpler.

@dbtsai
Copy link
Member

dbtsai commented Jan 8, 2019

If it's really difficult to mark name.middle as contentAccessed = true in this case (which I feel is a less hacky solution), can we reformat the code with the following with documentation?

      !rootFields.exists { root =>
        root.field.name == opt.field.name && {
          // If the merged field type of root and opt field is different from opt field type,
          // we will keep it.
          // For example, when root field type is `struct<name:struct<last:string>>`,
          // and opt field type is `struct<name:struct<middle:string>>`, the merged field type will be
          // `struct<name:struct<last:string,middle:string>>`. Since the merged one contains more
          // nested fields than opt field type, we have to keep it.
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }

Add @hvanhovell @gatorsmile for more input.

Thanks!

@dongjoon-hyun
Copy link
Member

+1 for @dbtsai 's refactored code.


optRootFields.filter { opt =>
!rootFields.exists(_.field.name == opt.field.name)
val optFieldType = StructType(Array(opt.field))
Copy link
Member

@dbtsai dbtsai Jan 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move optFieldType right after val rootFieldType = StructType(Array(root.field))? Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean moving it inside the exists call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make it out of exist call so it can be reused, isn't? Moving it to after rootFieldType is for readability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not very expensive, and we only need to compute it when root.field.name == opt.field.name. As a result, I feel moving it right after val rootFieldType will be more readable.

        root.field.name == opt.field.name && {
          val rootFieldType = StructType(Array(root.field))
          val optFieldType = StructType(Array(opt.field))
          val merged = optFieldType.merge(rootFieldType)
          merged.sameType(optFieldType)
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I see. Let me move it. Thanks.

@dbtsai
Copy link
Member

dbtsai commented Jan 10, 2019

LGTM. Just one minor comment. Thanks!

@SparkQA
Copy link

SparkQA commented Jan 10, 2019

Test build #101024 has finished for PR 23474 at commit ff1cc85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2019

Test build #101075 has finished for PR 23474 at commit ff2fa67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 11, 2019
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes #23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai <[email protected]>
@dbtsai
Copy link
Member

dbtsai commented Jan 11, 2019

LGTM. Merged into master and 2.4 branch. Thanks!

@asfgit asfgit closed this in 50ebf3a Jan 11, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes apache#23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes apache#23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
…x field and having is not null predicate on another one

## What changes were proposed in this pull request?

Schema pruning has errors when selecting one complex field and having is not null predicate on another one:

```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```

```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info]   at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info]   at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info]   at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info]   at scala.Option.map(Option.scala:163)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info]   at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(Parque
tSchemaPruning.scala:153)
```

## How was this patch tested?

Added tests.

Closes apache#23474 from viirya/SPARK-26551.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai <[email protected]>
@viirya viirya deleted the SPARK-26551 branch December 27, 2023 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants