[SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC #12777
Conversation
```scala
// Only the values in the Spark types below can be recognized by
// the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
case ByteType | ShortType | FloatType | DoubleType => true
case IntegerType | LongType | StringType => true
```
Note to myself: this should be okay because `CatalystTypeConverters.createToScalaConverter()` is always called in `DataSourceStrategy` for the values in `sources.Filter`. I checked all the cases and there seem to be no cases where a value is converted to one of `DateWritable`, `HiveDecimal`, `HiveChar`, or `HiveVarchar`.

Since all the values in `sources.Filter` are converted according to their `DataType`, this should be okay. I checked the test code, and `ParquetFilters` does it this way as well.

I had to do this because `IsNull` and `IsNotNull` carry no value in `sources.Filter`, so there is no way to check their types with the original `isSearchableLiteral`.
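To make the discussion concrete, here is a minimal sketch of the guard in question. It is not the exact code in `OrcFilters`; the `shouldPushDown` helper and its `dataTypeMap` parameter are assumptions for illustration only.

```scala
import org.apache.spark.sql.sources.{Filter, IsNotNull, IsNull}
import org.apache.spark.sql.types._

// Sketch only: decide whether a source filter is safe to push down to ORC.
// IsNull/IsNotNull carry no literal value, so their attribute's Catalyst
// type must be checked directly rather than via isSearchableLiteral.
def shouldPushDown(filter: Filter, dataTypeMap: Map[String, DataType]): Boolean = {
  def isSearchableType(dataType: DataType): Boolean = dataType match {
    case ByteType | ShortType | FloatType | DoubleType => true
    case IntegerType | LongType | StringType => true
    case _ => false // e.g. ArrayType: ORC statistics report null for these
  }
  filter match {
    case IsNull(attribute) => isSearchableType(dataTypeMap(attribute))
    case IsNotNull(attribute) => isSearchableType(dataTypeMap(attribute))
    case _ => true // other predicates carry literals that are vetted separately
  }
}
```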
@liancheng @yhuai Could you take a look, please?

BTW, while doing this, I realised there are an unused class and some unused functions left over from an earlier change. Could I maybe test this like #8346? I got a confirmation from @liancheng that this is unused and can be removed.
Test build #57329 has finished for PR 12777 at commit

Test build #57330 has finished for PR 12777 at commit

Test build #57333 has finished for PR 12777 at commit

Hi @yhuai, would you mind taking a look at this, please?

LGTM, merging to master and branch-2.0. Thanks for fixing this! And yes,
[SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This is apparently OK because neither `IsNull` nor `IsNotNull` takes a type as an argument while filters (`SearchArgument`) are built on the Spark side (in Hive 1.2.x), but the filters do not behave correctly because the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`) on the ORC side. So `IsNull` is always `true`, which ends up making `IsNotNull` always `false`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.)

This looks prevented in Hive 1.3.x and later, which forces a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) to be given when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not seem to do this.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` resembles `ParquetFilters`.

## How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, and `sbt scalastyle`.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #12777 from HyukjinKwon/SPARK-14962.

(cherry picked from commit fa928ff)
Signed-off-by: Cheng Lian <[email protected]>
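To make the failure mode concrete, here is a hedged reproduction sketch in the spirit of the JIRA report; the path, column names, and counts are illustrative assumptions, not taken verbatim from the report.

```scala
// Illustrative only: before this fix, with ORC filter pushdown enabled,
// a null check on an ArrayType column could be answered from column
// statistics that always report null for unsupported types, so the
// IS NOT NULL query below could wrongly return zero rows.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

spark.range(10)
  .selectExpr("id", "array(id) AS arr")
  .write.mode("overwrite").orc("/tmp/spark-14962")

// Expected: 10. Before the fix: potentially 0, because every row group
// was pruned; after the fix the predicate is simply not pushed to ORC.
spark.read.orc("/tmp/spark-14962").where("arr IS NOT NULL").count()
```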
```scala
def isSearchableType(dataType: DataType): Boolean = dataType match {
  // Only the values in the Spark types below can be recognized by
  // the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
  case ByteType | ShortType | FloatType | DoubleType => true
```
What about `BooleanType`?
@tedyu Let me test it; I will make a follow-up or another PR.
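For what it's worth, a quick manual check along these lines would show whether `BooleanType` survives pushdown (a sketch only; the path and column name are arbitrary, and this is not the follow-up test itself):

```scala
// Sketch: write a boolean column to ORC and confirm a null check still
// returns every row with ORC filter pushdown enabled.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.range(5).selectExpr("id % 2 = 0 AS flag").write.mode("overwrite").orc("/tmp/bool-orc")
assert(spark.read.orc("/tmp/bool-orc").where("flag IS NOT NULL").count() == 5)
```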
What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This is apparently OK because neither `IsNull` nor `IsNotNull` takes a type as an argument while filters (`SearchArgument`) are built on the Spark side (in Hive 1.2.x), but the filters do not behave correctly because the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`) on the ORC side. So `IsNull` is always `true`, which ends up making `IsNotNull` always `false`. (Please see RecordReaderImpl.java#L296-L318 and RecordReaderImpl.java#L359-L365 in Hive 1.2.)

This looks prevented in Hive 1.3.x and later, which forces a type (`PredicateLeaf.Type`) to be given when building a filter (`SearchArgument`), but Hive 1.2.x does not seem to do this.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` resembles `ParquetFilters`.

How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, and `sbt scalastyle`.
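For illustration, here is a hedged sketch of what such a regression test could look like; `withTempPath` and the `spark` session are assumed to come from Spark's shared test utilities, and the actual tests in `OrcQuerySuite`/`OrcFilterSuite` may differ.

```scala
// Inside a suite like OrcQuerySuite (with the shared session helpers):
// IS NOT NULL on an ArrayType column must not be pushed down to ORC,
// so no row may be wrongly pruned by column statistics.
test("SPARK-14962: do not push down isnull/isnotnull on unsupported types") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    spark.range(10).selectExpr("array(id) AS a").write.orc(path)
    val df = spark.read.orc(path).where("a IS NOT NULL")
    assert(df.count() === 10)
  }
}
```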