Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jun 30, 2018

What changes were proposed in this pull request?

ByteType and ShortType support pushdown to parquet data source.
Benchmark result.

How was this patch tested?

unit tests

@SparkQA
Copy link

SparkQA commented Jun 30, 2018

Test build #92503 has finished for PR 21682 at commit e9d5625.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Jul 2, 2018

retest this please

@maropu
Copy link
Member

maropu commented Jul 2, 2018

@HyukjinKwon btw, why we don't support pushdown for these types? Any historical reason?

private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
case BooleanType =>
(n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean])
case ByteType =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us update the comment in the line 34.

Note: For now timestamp SQL type is not supported for pushing down, because the corresponding
      Parquet type can be Long or INT96. There is no context about Parquet type here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably INT64 or INT96 (which is long vs binary) might be a bit better while we are here.

@gatorsmile
Copy link
Member

cc @gengliangwang

@gatorsmile
Copy link
Member

@maropu We accidentally dropped the pushdown of these data types when we refactored the file formats. You can check the change history and find the PR.

@SparkQA
Copy link

SparkQA commented Jul 2, 2018

Test build #92519 has finished for PR 21682 at commit e9d5625.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
case BooleanType =>
(n: String, v: Any) => FilterApi.notEq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean])
case ByteType =>
Copy link
Member

@HyukjinKwon HyukjinKwon Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for doubly sure, is it an official way to handle short and byte? Can you double check and leave the description about the API specification (or related code bit or reference) @wangyum? Parquet one is pretty core and we should better be clear on this (so that we could also probably judge if we need a configuration or not). @rdblue too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, both byte and short would be stored as integers in Parquet. Because Parquet uses bit packing, it doesn't matter if you store them as ints (or even longs) because they'll get packed into the same space.

The important thing is to match the Parquet file's type when pushing a filter. Since Spark stores ByteType and ShortType in Parquet as INT32, this is correct.

@maropu
Copy link
Member

maropu commented Jul 2, 2018

@gatorsmile aha, thanks.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 2, 2018

re: #21682 (comment) - I couldn't recall related things about it. Thanks @gatorsmile. So, is it a regression? shall we update the migration guide to note that the performance effect? - like if your workload are heavily dependent on short and byte in Parquet, avoid upgrading to xx version. This optimization was restored in yy+ version. Please consider using yy+ version.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 2, 2018

It is a regression that was introduced in Spark 1.2. Almost 4 years ago. https://issues.apache.org/jira/browse/SPARK-4453

Thus, I think we do not need to document it. The pushdown can be treated as new features. : )

@HyukjinKwon
Copy link
Member

Wow .. so it was 4 years ago .. okay.

case ByteType =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use Byte instead of java.lang.Byte here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Also, there's no need to use Option.map because the value cannot be null. That's why the IntegerType case just casts the value.

Copy link
Member Author

@wangyum wangyum Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeEq.lift(nameToType(name)).map(_(name, null)) makes value to null.

scala> null.asInstanceOf[Short].toInt.asInstanceOf[Integer]
res49: Integer = 0

scala> null.asInstanceOf[java.lang.Short].toInt.asInstanceOf[Integer]
java.lang.NullPointerException
  at scala.Predef$.Short2short(Predef.scala:360)
  ... 51 elided

That's why I use Option.map here.

case ShortType =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(b => b.asInstanceOf[java.lang.Short].toInt.asInstanceOf[Integer]).orNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use Short instead of java.lang.Short here.

(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull)
case ShortType =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually fold the cases into like:

case ByteType | ShortType | IntegerType =>
  ... b.asInstanceOf[Number].intValue()

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about like this:

  • makeEq and makeNotEq
    case ByteType | ShortType =>
      (n: String, v: Any) => FilterApi.notEq(
        intColumn(n),
        Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull)
    case IntegerType =>
      (n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer])
  • makeLt, makeLtEq, makeGt and makeGtEq:
    case ByteType | ShortType =>
      (n: String, v: Any) => FilterApi.gtEq(
        intColumn(n),
        v.asInstanceOf[Number].intValue.asInstanceOf[Integer])
    case IntegerType =>
      (n: String, v: Any) => FilterApi.gtEq(intColumn(n), v.asInstanceOf[java.lang.Integer])

@rdblue
Copy link
Contributor

rdblue commented Jul 3, 2018

+1

I agree with some of the minor refactoring suggestions, but overall this looks correct to me.

case ShortType => -> case ByteType | ShortType =>
case ByteType | ShortType =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value may be null.

case ByteType | ShortType =>
(n: String, v: Any) => FilterApi.lt(
intColumn(n),
v.asInstanceOf[Number].intValue.asInstanceOf[Integer])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value cannot be null.

@SparkQA
Copy link

SparkQA commented Jul 4, 2018

Test build #92589 has finished for PR 21682 at commit b16a607.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

ping @wangyum

wangyum added 2 commits July 10, 2018 08:23
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@SparkQA
Copy link

SparkQA commented Jul 10, 2018

Test build #92781 has finished for PR 21682 at commit 3e66530.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case ParquetByteType | ParquetShortType | ParquetIntegerType =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when v can be null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeEq and makeNotEq may be null:

case sources.IsNull(name) if canMakeFilterOn(name) =>
makeEq.lift(nameToType(name)).map(_(name, null))

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in a289009 Jul 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants