[SPARK-26551][SQL] Fix schema pruning error when selecting one complex field and having is not null predicate on another one #23474
…is not null predicate on another one.
|
Thank you for pinging me, @viirya ! |
...src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
|
When we read Now, the issue is when we read To fix the root cause, and avoid the misunderstanding of the meaning of What do you think? Thanks. |
|
I think It is harder to mark |
|
If it's really difficult to mark

    !rootFields.exists { root =>
      root.field.name == opt.field.name && {
        // If the merged field type of root and opt field is different from opt field type,
        // we will keep it.
        // For example, when root field type is `struct<name:struct<last:string>>`,
        // and opt field type is `struct<name:struct<middle:string>>`, the merged field type will be
        // `struct<name:struct<last:string,middle:string>>`. Since the merged one contains more
        // nested fields than opt field type, we have to keep it.
        val rootFieldType = StructType(Array(root.field))
        val optFieldType = StructType(Array(opt.field))
        val merged = optFieldType.merge(rootFieldType)
        merged.sameType(optFieldType)
      }
    }

Add @hvanhovell @gatorsmile for more input. Thanks! |
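To make the merge-and-compare check above easier to follow, here is a minimal, self-contained sketch. It does not use Spark: `DataType`, `StructField`, `StructType`, `merge`, and `keepOptional` are toy stand-ins written for illustration only, and plain equality stands in for `sameType`. The idea, as I read the snippet, is that an optional root field is dropped only when merging a same-named required root field into it adds nothing new.

```scala
object PruningSketch {
  // Toy schema model (hypothetical; not Spark's classes).
  sealed trait DataType
  case object StringType extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Merge `right` into `left`: structs are merged field by field,
  // leaf types must be identical.
  def merge(left: DataType, right: DataType): DataType = (left, right) match {
    case (StructType(ls), StructType(rs)) =>
      val mergedLeft = ls.map { l =>
        rs.find(_.name == l.name) match {
          case Some(r) => StructField(l.name, merge(l.dataType, r.dataType))
          case None    => l
        }
      }
      // Fields that exist only on the right are appended.
      val onlyRight = rs.filterNot(r => ls.exists(_.name == r.name))
      StructType(mergedLeft ++ onlyRight)
    case (l, r) if l == r => l
    case (l, r) =>
      throw new IllegalArgumentException(s"Cannot merge $l and $r")
  }

  // Keep the optional root field unless some same-named required root field
  // merges into it without changing its type.
  def keepOptional(rootFields: Seq[StructField], opt: StructField): Boolean =
    !rootFields.exists { root =>
      root.name == opt.name && {
        val rootFieldType = StructType(Seq(root))
        val optFieldType = StructType(Seq(opt))
        merge(optFieldType, rootFieldType) == optFieldType
      }
    }
}
```

In this toy model, a required `name` with only `last` and an optional `name` with only `middle` merge into a wider type, so the optional field is kept. If the optional `name` already contains `last`, the merge changes nothing and the optional field is dropped.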
|
+1 for @dbtsai 's refactored code. |
|
|
optRootFields.filter { opt =>
  !rootFields.exists(_.field.name == opt.field.name)
  val optFieldType = StructType(Array(opt.field))
Can we move `optFieldType` right after `val rootFieldType = StructType(Array(root.field))`? Thanks!
Do you mean moving it inside the exists call?
I put it outside the `exists` call so it can be reused, no? Is moving it after `rootFieldType` for readability?
It's not very expensive, and we only need to compute it when `root.field.name == opt.field.name`. So I feel that moving it right after `val rootFieldType` will be more readable.
    root.field.name == opt.field.name && {
      val rootFieldType = StructType(Array(root.field))
      val optFieldType = StructType(Array(opt.field))
      val merged = optFieldType.merge(rootFieldType)
      merged.sameType(optFieldType)
    }
Ok. I see. Let me move it. Thanks.
|
LGTM. Just one minor comment. Thanks! |
…x field and having is not null predicate on another one
## What changes were proposed in this pull request?
Schema pruning fails when a query selects one complex field and has an `is not null` predicate on another one:
```scala
val query = sql("select * from contacts")
  .where("name.middle is not null")
  .select(
    "id",
    "name.first",
    "name.middle",
    "name.last"
  )
  .where("last = 'Jones'")
  .select(count("id"))
```
```
java.lang.IllegalArgumentException: middle does not exist. Available: last
[info] at org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:303)
[info] at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:119)
[info] at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:302)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema.$anonfun$getProjection$6(ProjectionOverSchema.scala:58)
[info] at scala.Option.map(Option.scala:163)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema.getProjection(ProjectionOverSchema.scala:56)
[info] at org.apache.spark.sql.execution.ProjectionOverSchema.unapply(ProjectionOverSchema.scala:32)
[info] at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning$$anonfun$$nestedInanonfun$buildNewProjection$1$1.applyOrElse(ParquetSchemaPruning.scala:153)
```
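The trace shows the failure surfacing in a by-name field lookup (`StructType.fieldIndex`) against a struct that only has `last` left. As a rough illustration of that failure mode, and not the real Spark code, the sketch below uses a hypothetical `PrunedStruct` class that keeps just the surviving field names and throws the same kind of exception when asked for a field that was pruned away.

```scala
object FieldIndexSketch {
  // Hypothetical stand-in for a pruned struct schema: only the surviving
  // field names, in order. Not Spark's StructType.
  final case class PrunedStruct(fieldNames: Seq[String]) {
    // Look a field up by name; fail if pruning removed it.
    def fieldIndex(name: String): Int = {
      val idx = fieldNames.indexOf(name)
      if (idx < 0) {
        throw new IllegalArgumentException(
          s"$name does not exist. Available: ${fieldNames.mkString(", ")}")
      }
      idx
    }
  }
}
```

With `PrunedStruct(Seq("last"))`, looking up `last` succeeds, while looking up `middle` throws, which is the situation the projection runs into when the pruned schema is missing a field the query still references.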
## How was this patch tested?
Added tests.
Closes #23474 from viirya/SPARK-26551.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 50ebf3a)
Signed-off-by: DB Tsai <[email protected]>
|
LGTM. Merged into master and 2.4 branch. Thanks! |