Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Apr 29, 2016

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

This appears harmless because neither `IsNull` nor `IsNotNull` takes a type as an argument while building filters (`SearchArgument`) on the Spark side (Hive 1.2.x). However, the filters do not evaluate correctly, because on the ORC side the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`). As a result, `IsNull` always evaluates to `true`, which makes `IsNotNull` always `false`. (Please see RecordReaderImpl.java#L296-L318 and RecordReaderImpl.java#L359-L365 in Hive 1.2.)

Hive 1.3.x and later appear to prevent this by requiring a type (`PredicateLeaf.Type`) when building a filter (`SearchArgument`), but Hive 1.2.x does not.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` now resembles `ParquetFilters`.
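To make the failure mode concrete, here is a deliberately simplified model (an illustrative sketch, not Hive's actual `RecordReaderImpl` code): for a type ORC does not track, the stored minimum comes back as `null`, and Hive 1.2.x reads a null minimum as "this stripe holds only nulls", so `IsNull` is always satisfied and `IsNotNull` never is.

```scala
// Hypothetical model of ORC stripe-level pruning for IsNull/IsNotNull in
// Hive 1.2.x. All names here are made up for illustration; None models the
// null statistics ORC stores for unsupported types such as array columns.
object NullPushdownModel {
  def storedMin(typeIsTracked: Boolean, actualMin: Any): Option[Any] =
    if (typeIsTracked) Some(actualMin) else None

  // A missing minimum is interpreted as "the stripe contains only nulls".
  def isNullSatisfied(min: Option[Any]): Boolean = min.isEmpty
  def isNotNullSatisfied(min: Option[Any]): Boolean = min.nonEmpty

  def main(args: Array[String]): Unit = {
    // An array column: ORC keeps no min/max statistics for it.
    val arrayColStats = storedMin(typeIsTracked = false, actualMin = Seq(1, 2))
    assert(isNullSatisfied(arrayColStats))     // IsNull: always true
    assert(!isNotNullSatisfied(arrayColStats)) // IsNotNull: always false, rows wrongly skipped
  }
}
```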

How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.

```scala
// Only the values in the Spark types below can be recognized by
// the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
case ByteType | ShortType | FloatType | DoubleType => true
case IntegerType | LongType | StringType => true
```
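For context, the guard might look like the following self-contained sketch. The `DataType` hierarchy here is a simplified stand-in for Spark's `org.apache.spark.sql.types` classes, not the real implementation, and the default `false` case is an assumption extrapolated from the excerpt above:

```scala
// Simplified stand-ins for Spark SQL's DataType hierarchy (illustrative only).
sealed trait DataType
case object ByteType extends DataType
case object ShortType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object FloatType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

object OrcFilterGuard {
  // Build an ORC SearchArgument leaf (including IsNull/IsNotNull) only when
  // `SearchArgumentImpl.BuilderImpl.boxLiteral()` can box values of the type.
  def isSearchableType(dataType: DataType): Boolean = dataType match {
    case ByteType | ShortType | FloatType | DoubleType => true
    case IntegerType | LongType | StringType => true
    case _ => false // e.g. ArrayType: ORC statistics are null, skip pushdown
  }
}
```

With such a guard in place, an `IsNotNull` filter on an array column is simply never converted into a `SearchArgument`, leaving that filtering to Spark.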
Member Author

@HyukjinKwon HyukjinKwon Apr 29, 2016


Note to myself: this should be okay because `CatalystTypeConverters.createToScalaConverter()` is always called in `DataSourceStrategy` for the values in `sources.Filter`. I checked all the cases, and none of them appear to be converted to `DateWritable`, `HiveDecimal`, `HiveChar`, or `HiveVarchar`.

Since all the values in `sources.Filter` are converted according to their `DataType`, this should be okay. I checked the test code, and `ParquetFilters` does it the same way.

+I had to add this check because `IsNull` and `IsNotNull` carry no value in `sources.Filter`, so there is no way to check their types with the original `isSearchableLiteral`.

@HyukjinKwon
Member Author

HyukjinKwon commented Apr 29, 2016

@liancheng @yhuai Could you take a look please?

@HyukjinKwon HyukjinKwon changed the title [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types. [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC Apr 29, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Apr 29, 2016

BTW, while working on this, I realised there are an unused class and some unused functions left over from the change to `HadoopFsRelation`.

  • The class `OrcTableScan` is no longer used, and `OrcRelation.buildReader` seems to do exactly the same thing.
  • The functions `initializeLocalJobFunc`, `initializeDriverSideJobFunc` and `overrideMinSplitSize` in `ParquetRelation` are also unused, although I suspect the `overrideMinSplitSize()` case is not handled in `ParquetRelation.buildReader`.

Could I maybe test this like #8346 for `overrideMinSplitSize` and then remove them? I would appreciate any feedback.

+I got confirmation from @liancheng that this is unused and can be removed.
+Let me first test whether it still happens before creating a JIRA.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57329 has finished for PR 12777 at commit c959f7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57330 has finished for PR 12777 at commit 0cbfbe4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57333 has finished for PR 12777 at commit f51d04b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @yhuai, would you mind taking a look at this, please?

@liancheng
Contributor

LGTM, merging to master and branch-2.0. Thanks for fixing this!

And yes, OrcTableScan should be safe to remove now.

@asfgit asfgit closed this in fa928ff May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…ypes in ORC

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

This appears harmless because neither `IsNull` nor `IsNotNull` takes a type as an argument while building filters (`SearchArgument`) on the Spark side (Hive 1.2.x). However, the filters do not evaluate correctly, because on the ORC side the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`). As a result, `IsNull` always evaluates to `true`, which makes `IsNotNull` always `false`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.)

Hive 1.3.x and later appear to prevent this by requiring a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` now resembles `ParquetFilters`.

## How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #12777 from HyukjinKwon/SPARK-14962.

(cherry picked from commit fa928ff)
Signed-off-by: Cheng Lian <[email protected]>
```scala
def isSearchableType(dataType: DataType): Boolean = dataType match {
  // Only the values in the Spark types below can be recognized by
  // the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
  case ByteType | ShortType | FloatType | DoubleType => true
```
Contributor


What about `BooleanType`?

Member Author

@HyukjinKwon HyukjinKwon May 7, 2016


@tedyu Let me test it and then make a follow-up or another PR.

@HyukjinKwon HyukjinKwon deleted the SPARK-14962 branch January 2, 2018 03:40