Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Apr 29, 2016

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

This appears harmless because neither `IsNull` nor `IsNotNull` takes a type as an argument while building filters (`SearchArgument`) on the Spark side (Hive 1.2.x). However, the filters do not evaluate correctly, because on the ORC side the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`). As a result, `IsNull` always evaluates to `true`, which makes `IsNotNull` always `false`. (Please see RecordReaderImpl.java#L296-L318 and RecordReaderImpl.java#L359-L365 in Hive 1.2.)

Hive 1.3.x and later appear to prevent this by requiring a type (`PredicateLeaf.Type`) when building a filter (`SearchArgument`), but Hive 1.2.x does not.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` now resembles `ParquetFilters`.
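To make the failure mode concrete, here is a deliberately simplified model (an illustrative sketch, not Hive's actual `RecordReaderImpl` code): for a type ORC does not track, the stored minimum comes back as `null`, and Hive 1.2.x reads a null minimum as "this stripe holds only nulls", so `IsNull` is always satisfied and `IsNotNull` never is.

```scala
// Hypothetical model of ORC stripe-level pruning for IsNull/IsNotNull in
// Hive 1.2.x. All names here are made up for illustration; None models the
// null statistics ORC stores for unsupported types such as array columns.
object NullPushdownModel {
  def storedMin(typeIsTracked: Boolean, actualMin: Any): Option[Any] =
    if (typeIsTracked) Some(actualMin) else None

  // A missing minimum is interpreted as "the stripe contains only nulls".
  def isNullSatisfied(min: Option[Any]): Boolean = min.isEmpty
  def isNotNullSatisfied(min: Option[Any]): Boolean = min.nonEmpty

  def main(args: Array[String]): Unit = {
    // An array column: ORC keeps no min/max statistics for it.
    val arrayColStats = storedMin(typeIsTracked = false, actualMin = Seq(1, 2))
    assert(isNullSatisfied(arrayColStats))     // IsNull: always true
    assert(!isNotNullSatisfied(arrayColStats)) // IsNotNull: always false, rows wrongly skipped
  }
}
```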

How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.

```scala
// Only the values in the Spark types below can be recognized by
// the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
case ByteType | ShortType | FloatType | DoubleType => true
case IntegerType | LongType | StringType => true
```
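For context, the guard might look like the following self-contained sketch. The `DataType` hierarchy here is a simplified stand-in for Spark's `org.apache.spark.sql.types` classes, not the real implementation, and the default `false` case is an assumption extrapolated from the excerpt above:

```scala
// Simplified stand-ins for Spark SQL's DataType hierarchy (illustrative only).
sealed trait DataType
case object ByteType extends DataType
case object ShortType extends DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object FloatType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

object OrcFilterGuard {
  // Build an ORC SearchArgument leaf (including IsNull/IsNotNull) only when
  // `SearchArgumentImpl.BuilderImpl.boxLiteral()` can box values of the type.
  def isSearchableType(dataType: DataType): Boolean = dataType match {
    case ByteType | ShortType | FloatType | DoubleType => true
    case IntegerType | LongType | StringType => true
    case _ => false // e.g. ArrayType: ORC statistics are null, skip pushdown
  }
}
```

With such a guard in place, an `IsNotNull` filter on an array column is simply never converted into a `SearchArgument`, leaving that filtering to Spark.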
Member Author

@HyukjinKwon HyukjinKwon Apr 29, 2016


Note to myself: this should be okay because `CatalystTypeConverters.createToScalaConverter()` is always called in `DataSourceStrategy` for the values in `sources.Filter`. I checked all the cases, and none of them appear to be converted to `DateWritable`, `HiveDecimal`, `HiveChar`, or `HiveVarchar`.

Since all the values in `sources.Filter` are converted according to their `DataType`, this should be okay. I checked the test code, and `ParquetFilters` does it the same way.

+I had to add this check because `IsNull` and `IsNotNull` carry no value in `sources.Filter`, so there is no way to check their types with the original `isSearchableLiteral`.

@HyukjinKwon
Member Author

HyukjinKwon commented Apr 29, 2016

@liancheng @yhuai Could you take a look please?

@HyukjinKwon HyukjinKwon changed the title [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types. [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC Apr 29, 2016
@HyukjinKwon
Member Author

HyukjinKwon commented Apr 29, 2016

BTW, while working on this, I realised there are an unused class and some unused functions left over from the change to `HadoopFsRelation`.

  • The class `OrcTableScan` is no longer used, and `OrcRelation.buildReader` seems to do exactly the same thing.
  • The functions `initializeLocalJobFunc`, `initializeDriverSideJobFunc` and `overrideMinSplitSize` in `ParquetRelation` are also unused, although I suspect the `overrideMinSplitSize()` case is not handled in `ParquetRelation.buildReader`.

Could I maybe test this like #8346 for `overrideMinSplitSize` and then remove them? I would appreciate any feedback.

+I got confirmation from @liancheng that this is unused and can be removed.
+Let me first test whether it still happens before creating a JIRA.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57329 has finished for PR 12777 at commit c959f7e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57330 has finished for PR 12777 at commit 0cbfbe4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2016

Test build #57333 has finished for PR 12777 at commit f51d04b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @yhuai, would you mind taking a look at this, please?

@liancheng
Contributor

LGTM, merging to master and branch-2.0. Thanks for fixing this!

And yes, OrcTableScan should be safe to remove now.

@asfgit asfgit closed this in fa928ff May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…ypes in ORC

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

This appears harmless because neither `IsNull` nor `IsNotNull` takes a type as an argument while building filters (`SearchArgument`) on the Spark side (Hive 1.2.x). However, the filters do not evaluate correctly, because on the ORC side the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`). As a result, `IsNull` always evaluates to `true`, which makes `IsNotNull` always `false`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.)

Hive 1.3.x and later appear to prevent this by requiring a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` now resembles `ParquetFilters`.

## How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #12777 from HyukjinKwon/SPARK-14962.

(cherry picked from commit fa928ff)
Signed-off-by: Cheng Lian <[email protected]>
```scala
def isSearchableType(dataType: DataType): Boolean = dataType match {
  // Only the values in the Spark types below can be recognized by
  // the `SearchArgumentImpl.BuilderImpl.boxLiteral()` method.
  case ByteType | ShortType | FloatType | DoubleType => true
```
Contributor


What about `BooleanType`?

Member Author

@HyukjinKwon HyukjinKwon May 7, 2016


@tedyu Let me test it and then make a follow-up or another PR.

@HyukjinKwon HyukjinKwon deleted the SPARK-14962 branch January 2, 2018 03:40