
Conversation

@wangyum
Member

@wangyum wangyum commented Jun 13, 2018

What changes were proposed in this pull request?

Support Decimal type push down to the Parquet data sources.
The Decimal comparator used is BINARY_AS_SIGNED_INTEGER_COMPARATOR.
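For context, a minimal sketch (not the exact code in this PR) of how an equality filter on a decimal column stored as Parquet INT32 or INT64 could be built; the pushed-down literal is simply the decimal's unscaled value, and the helper names are illustrative:

import java.math.{BigDecimal => JBigDecimal}

import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.filter2.predicate.FilterApi.{intColumn, longColumn}

// 32-bit decimals (small precision) are stored as INT32 and 64-bit decimals as
// INT64; in both cases the stored value is the unscaled integer, so the filter
// literal must be unscaled the same way.
def decimalEqInt32(name: String, v: JBigDecimal): FilterPredicate =
  FilterApi.eq(intColumn(name), Int.box(v.unscaledValue().intValue()))

def decimalEqInt64(name: String, v: JBigDecimal): FilterPredicate =
  FilterApi.eq(longColumn(name), Long.box(v.unscaledValue().longValue()))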

How was this patch tested?

unit tests and manual tests.

manual tests:

spark.range(10000000).selectExpr("id", "cast(id as decimal(9)) as d1", "cast(id as decimal(9, 2)) as d2", "cast(id as decimal(18)) as d3", "cast(id as decimal(18, 4)) as d4", "cast(id as decimal(38)) as d5", "cast(id as decimal(38, 18)) as d6").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/decimal")
val df = spark.read.parquet("/tmp/spark/parquet/decimal/")
spark.sql("set spark.sql.parquet.filterPushdown.decimal=true")
// Only read about 1 MB data
df.filter("d2 = 10000").show
// Only read about 1 MB data
df.filter("d4 = 10000").show
spark.sql("set spark.sql.parquet.filterPushdown.decimal=false")
// Read 174.3 MB data
df.filter("d2 = 10000").show
// Read 174.3 MB data
df.filter("d4 = 10000").show

@SparkQA

SparkQA commented Jun 13, 2018

Test build #91769 has finished for PR 21556 at commit 9832661.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case decimal: DecimalType if DecimalType.is32BitDecimalType(decimal) =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(_.asInstanceOf[java.math.BigDecimal].unscaledValue().intValue()
Member Author

REF:

val unscaledLong = row.getDecimal(ordinal, precision, scale).toUnscaledLong
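To illustrate the reference above, a small REPL-style check (not code from the PR) showing that java.math.BigDecimal's unscaled value agrees with Spark's Decimal.toUnscaledLong, which is what the writer stores for INT32/INT64-backed decimals:

import java.math.{BigDecimal => JBigDecimal}
import org.apache.spark.sql.types.Decimal

val jbd = new JBigDecimal("123.45")
// Both yield 12345: the value with the decimal point removed.
assert(jbd.unscaledValue().longValue() == Decimal(jbd).toUnscaledLong)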

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91881 has finished for PR 21556 at commit 51d8540.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Jun 15, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91894 has finished for PR 21556 at commit 51d8540.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Jun 15, 2018

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91909 has finished for PR 21556 at commit 51d8540.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum changed the title [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType support push down [SPARK-24549][SQL] Support Decimal type push down to the parquet data sources Jun 28, 2018
@SparkQA

SparkQA commented Jun 28, 2018

Test build #92413 has finished for PR 21556 at commit 0b5d0e7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Jun 28, 2018

Jenkins, retest this please.

@maropu
Member

maropu commented Jun 28, 2018

Could you add benchmark code and results (from your environment) to FilterPushdownBenchmark for this type?

@SparkQA

SparkQA commented Jun 28, 2018

Test build #92414 has finished for PR 21556 at commit 0b5d0e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.booleanConf
.createWithDefault(true)

val PARQUET_READ_LEGACY_FORMAT = buildConf("spark.sql.parquet.readLegacyFormat")
Contributor

This property name doesn't mention pushdown, but the description says it is only valid for push-down. Can you make the property name clearer?

Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().longValue()
.asInstanceOf[java.lang.Long]).orNull)
case decimal: DecimalType
if pushDownDecimal && ((DecimalType.is32BitDecimalType(decimal) && readLegacyFormat)
Contributor

Please add comments here to explain what differs when readLegacyFormat is true.

if pushDownDecimal && (DecimalType.is32BitDecimalType(decimal) && !readLegacyFormat) =>
(n: String, v: Any) => FilterApi.eq(
intColumn(n),
Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue()
Contributor

Does this need to validate the scale of the decimal, or is scale adjusted in the analyzer?

test("filter pushdown - decimal") {
Seq(true, false).foreach { legacyFormat =>
withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) {
Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType
Contributor

Since this is providing a column name, it would be better to use something more readable than _1.

}
}

test("incompatible parquet file format will throw exeception") {
Contributor

If we can detect the case where the data is written with the legacy format, then why do we need a property to read with the legacy format? Why not do the right thing without a property?


@maropu
Member

maropu commented Jul 2, 2018

@wangyum Thanks for the benchmarks!
@dongjoon-hyun In the benchmarks above, the ORC results, except for the decimal(9, 2) case, show worse performance than the Parquet ones. Is this expected?

wangyum added 2 commits July 4, 2018 23:17
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
intColumn(n),
Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)

case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
Member Author

DecimalType contains a variable field, decimalMetadata, so it seems difficult to define a constant for it as we did for the other types.

Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().longValue()
.asInstanceOf[java.lang.Long]).orNull)
// Legacy DecimalType
case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, decimal) if pushDownDecimal &&
Contributor

The binary used for the legacy type and for fixed-length storage should be the same, so I don't understand why there are two different conversion methods. Also, because this is using the Parquet schema now, there's no need to base the length of this binary on what older versions of Spark did -- in other words, if the underlying Parquet type is fixed-length, then just convert the decimal to a fixed-length array of that size without worrying about legacy types.

I think this should pass in the fixed array's length and convert the BigDecimal value to that length array for all cases. That works no matter what the file contains.
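A minimal sketch of the conversion described above, assuming the fixed array's length is taken from the file's Parquet type (the method name is illustrative):

import java.math.{BigDecimal => JBigDecimal}
import org.apache.parquet.io.api.Binary

// Sign-extend the unscaled value to exactly numBytes big-endian bytes, matching
// the FIXED_LEN_BYTE_ARRAY length declared in the file, regardless of how the
// writer chose that length.
def decimalToFixedLengthBinary(decimal: JBigDecimal, numBytes: Int): Binary = {
  val bytes = decimal.unscaledValue().toByteArray        // minimal two's-complement bytes
  require(bytes.length <= numBytes, s"decimal does not fit in $numBytes bytes")
  val fixed = new Array[Byte](numBytes)
  val sign = if (bytes.head < 0) -1.toByte else 0.toByte  // 0xFF padding for negative values
  java.util.Arrays.fill(fixed, 0, numBytes - bytes.length, sign)
  System.arraycopy(bytes, 0, fixed, numBytes - bytes.length, bytes.length)
  Binary.fromConstantByteArray(fixed, 0, numBytes)
}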

intColumn(n),
Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull)

case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal =>
Contributor

Since this uses the file schema, I think it should validate that the file uses the same scale as the value passed in. That's a cheap sanity check to ensure correctness.

Member Author

It seems invalid values are already filtered out by:

protected[sql] def translateFilter(predicate: Expression): Option[Filter] = {

Contributor

That doesn't validate the value against the decimal scale from the file, which is what I'm suggesting. The decimal scale must match exactly and this is a good place to check because this has the file information. If the scale doesn't match, then the schema used to read this file is incorrect, which would cause data corruption.

In my opinion, it is better to add a check if it is cheap instead of debating whether or not some other part of the code covers the case. If this were happening per record then I would opt for a different strategy, but because this is at the file level it is a good idea to add it here.

Member Author

I see. I will do it.

Member Author

Added a check to canMakeFilterOn and added a test case:

    val decimal = new JBigDecimal(10).setScale(scale)
    assert(decimal.scale() === scale)
    assertResult(Some(lt(intColumn("cdecimal1"), 1000: Integer))) {
      parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal))
    }

    val decimal1 = new JBigDecimal(10).setScale(scale + 1)
    assert(decimal1.scale() === scale + 1)

    assertResult(None) {
      parquetFilters.createFilter(parquetSchema, sources.LessThan("cdecimal1", decimal1))
    }
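For reference, a minimal sketch of what that scale check could look like (the isDecimalMatched name also appears later in this thread; the exact body here is an assumption):

import java.math.{BigDecimal => JBigDecimal}
import org.apache.parquet.schema.DecimalMetadata

// Refuse to build a pushed-down predicate unless the literal's scale equals the
// scale declared in the file's Parquet schema; a mismatch means the read schema
// is wrong for this file.
def isDecimalMatched(value: Any, decimalMeta: DecimalMetadata): Boolean = value match {
  case decimal: JBigDecimal => decimal.scale == decimalMeta.getScale
  case _ => false
}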

@SparkQA

SparkQA commented Jul 4, 2018

Test build #92619 has finished for PR 21556 at commit f160648.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private case class ParquetSchemaType(
originalType: OriginalType,
primitiveTypeName: PrimitiveTypeName,
decimalMetadata: DecimalMetadata)
Member Author

Don't need DecimalMetadata.

@SparkQA

SparkQA commented Jul 5, 2018

Test build #92641 has finished for PR 21556 at commit c7308ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 11, 2018

Test build #92843 has finished for PR 21556 at commit 16528f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ => false
}

// Since SPARK-24716, ParquetFilter accepts parquet file schema to convert to
Contributor

Is this issue reference correct? The PR says this is for SPARK-24549.

(n: String, v: Any) =>
FilterApi.gtEq(intColumn(n), dateToDays(v.asInstanceOf[Date]).asInstanceOf[Integer])

case ParquetSchemaType(DECIMAL, INT32, 0, _) if pushDownDecimal =>
Contributor

Why match 0 instead of _?

Member Author

In fact, the length is always 0, but I have replaced it with _.

case ParquetStringType => value.isInstanceOf[String]
case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
case ParquetDateType => value.isInstanceOf[Date]
case ParquetSchemaType(DECIMAL, INT32, 0, decimalMeta) =>
Contributor

Can the decimal cases be collapsed to a single case on ParquetSchemaType(DECIMAL, _, _, decimalMetadata)?

Member Author

No.

Contributor

Have you tried not using | and ignoring the physical type with _?

case ParquetDoubleType => value.isInstanceOf[JDouble]
case ParquetStringType => value.isInstanceOf[String]
case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
case ParquetDateType => value.isInstanceOf[Date]
Contributor

Why is there no support for timestamp?

Member Author

It was not supported originally. Do we need to support it?

Contributor

Not in this PR that adds Decimal support. We should consider it in the future, though.

System.arraycopy(bytes, 0, decimalBuffer, numBytes - bytes.length, bytes.length)
decimalBuffer
}
Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
Contributor
@rdblue rdblue Jul 11, 2018

This byte array is not reused; it is allocated each time this function runs. This should use the fromConstantByteArray variant. That tells Parquet that it isn't necessary to make defensive copies of the bytes.
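To illustrate the distinction (assuming parquet-mr's two Binary factory methods): fromReusedByteArray tells Parquet the caller may overwrite the buffer later, so Parquet must copy it defensively, while fromConstantByteArray promises the bytes never change:

import org.apache.parquet.io.api.Binary

val buf = Array[Byte](0, 0, 0, 1)
// Use this when the same buffer is refilled for the next value; Parquet copies it.
val reused: Binary = Binary.fromReusedByteArray(buf, 0, buf.length)
// Use this when, as here, a fresh array is allocated per call and never mutated.
val constant: Binary = Binary.fromConstantByteArray(buf, 0, buf.length)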

Native ORC Vectorized 3981 / 4049 4.0 253.1 1.0X
Native ORC Vectorized (Pushdown) 702 / 735 22.4 44.6 5.4X
Parquet Vectorized 4407 / 4852 3.6 280.2 1.0X
Parquet Vectorized (Pushdown) 1602 / 1634 9.8 101.8 2.8X
Contributor

Any thoughts on why this is slower than the other tests with decimal(18, 2) and decimal(38, 2)? This seems very strange to me.

Contributor

Maybe it is that the data is more dense, so we need to read more values in the row group that contains the one we're looking for?

Member Author
@wangyum wangyum Jul 12, 2018

Because 1024 * 1024 * 15 is out of the decimal(9, 2) range, there are no stats for that column. I will update the benchmark.

Contributor

I'm not sure I understand. That's less than 2^24, so it should fit in an int. It should also fit in 8 base-ten digits so decimal(9,2) should work. And last, if the values don't fit in an int, I'm not sure how we would be able to store them in the first place, regardless of how stats are handled.

Did you verify that there are no stats for the file produced here? If that's the case, it would make sense with these numbers. I think we just need to look for a different reason why stats are missing.

Member Author
@wangyum wangyum Jul 13, 2018

Here is a test:

// decimal(9, 2) max values is 9999999.99
// 1024 * 1024 * 15 =          15728640
val path = "/tmp/spark/parquet"
spark.range(1024 * 1024 * 15).selectExpr("cast((id) as decimal(9, 2)) as id").orderBy("id").write.mode("overwrite").parquet(path)

The generated parquet metadata:

$ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta  /tmp/spark/parquet
file:        file:/tmp/spark/parquet/part-00000-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 O:DECIMAL R:0 D:1

row group 1: RC:5728640 TS:36 OFFSET:4 
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:38/36/0.95 VC:5728640 ENC:PLAIN,BIT_PACKED,RLE ST:[no stats for this column]
file:        file:/tmp/spark/parquet/part-00001-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 O:DECIMAL R:0 D:1

row group 1: RC:651016 TS:2604209 OFFSET:4 
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:2604325/2604209/1.00 VC:651016 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 0.00, max: 651015.00, num_nulls: 0]
file:        file:/tmp/spark/parquet/part-00002-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 O:DECIMAL R:0 D:1

row group 1: RC:3231146 TS:12925219 OFFSET:4 
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:12925864/12925219/1.00 VC:3231146 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 651016.00, max: 3882161.00, num_nulls: 0]
file:        file:/tmp/spark/parquet/part-00003-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 O:DECIMAL R:0 D:1

row group 1: RC:2887956 TS:11552408 OFFSET:4 
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:11552986/11552408/1.00 VC:2887956 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 3882162.00, max: 6770117.00, num_nulls: 0]
file:        file:/tmp/spark/parquet/part-00004-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet 
creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"decimal(9,2)","nullable":true,"metadata":{}}]} 

file schema: spark_schema 
--------------------------------------------------------------------------------
id:          OPTIONAL INT32 O:DECIMAL R:0 D:1

row group 1: RC:3229882 TS:12920163 OFFSET:4 
--------------------------------------------------------------------------------
id:           INT32 SNAPPY DO:0 FPO:4 SZ:12920808/12920163/1.00 VC:3229882 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 6770118.00, max: 9999999.00, num_nulls: 0]

As you can see, file:/tmp/spark/parquet/part-00000-26b38556-494a-4b89-923e-69ea73365488-c000.snappy.parquet has no stats generated for that column.

scala> spark.read.parquet(path).filter("id is null").count
res0: Long = 5728640 

Contributor

Okay, I see. The tenths and hundredths are always 0, which makes the precision-8 numbers actually precision-10. It is still odd that this is causing Parquet to have no stats, but I'm happy with the fix. Thanks for explaining.

Binary.fromReusedByteArray(fixedLengthBytes, 0, numBytes)
}

private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = {
Contributor

Since makeEq is called for EqualsNullSafe and valueCanMakeFilterOn allows null values through, I think these could be null, like the String case. I think this should use the Option pattern from String for all values, unless I'm missing some reason why these will never be null.

Member Author

ParquetBooleanType, ParquetLongType, ParquetFloatType and ParquetDoubleType do not need Option. Here is an example:

scala> import org.apache.parquet.io.api.Binary
import org.apache.parquet.io.api.Binary

scala> Option(null).map(s => Binary.fromString(s.asInstanceOf[String])).orNull
res7: org.apache.parquet.io.api.Binary = null

scala> Binary.fromString(null.asInstanceOf[String])
java.lang.NullPointerException
  at org.apache.parquet.io.api.Binary$FromStringBinary.encodeUTF8(Binary.java:224)
  at org.apache.parquet.io.api.Binary$FromStringBinary.<init>(Binary.java:214)
  at org.apache.parquet.io.api.Binary.fromString(Binary.java:554)
  ... 52 elided

scala> null.asInstanceOf[java.lang.Long]
res9: Long = null

scala> null.asInstanceOf[java.lang.Boolean]
res10: Boolean = null

scala> Option(null).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull
res11: Integer = null

scala> null.asInstanceOf[Number].intValue.asInstanceOf[Integer]
java.lang.NullPointerException
  ... 52 elided

Contributor

Sounds good. Thanks!

makeEq.lift(nameToType(name)).map(_(name, value))
case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name) =>
case sources.Not(sources.EqualNullSafe(name, value)) if canMakeFilterOn(name, value) =>
makeNotEq.lift(nameToType(name)).map(_(name, value))
Contributor

Since makeNotEq is also used for EqualNullSafe, I think it should handle null values as well.

Member Author

I handled null values in valueCanMakeFilterOn:

def valueCanMakeFilterOn(name: String, value: Any): Boolean = {
  value == null || (nameToType(name) match {
    case ParquetBooleanType => value.isInstanceOf[JBoolean]
    case ParquetByteType | ParquetShortType | ParquetIntegerType => value.isInstanceOf[Number]
    case ParquetLongType => value.isInstanceOf[JLong]
    case ParquetFloatType => value.isInstanceOf[JFloat]
    case ParquetDoubleType => value.isInstanceOf[JDouble]
    case ParquetStringType => value.isInstanceOf[String]
    case ParquetBinaryType => value.isInstanceOf[Array[Byte]]
    case ParquetDateType => value.isInstanceOf[Date]
    case ParquetSchemaType(DECIMAL, INT32, _, decimalMeta) =>
      isDecimalMatched(value, decimalMeta)
    case ParquetSchemaType(DECIMAL, INT64, _, decimalMeta) =>
      isDecimalMatched(value, decimalMeta)
    case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, _, decimalMeta) =>
      isDecimalMatched(value, decimalMeta)
    case _ => false
  })
}

Contributor

Maybe I'm missing something, but that returns true for all null values.

@SparkQA

SparkQA commented Jul 12, 2018

Test build #92936 has finished for PR 21556 at commit 33d1f18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 12, 2018

Test build #92938 has finished for PR 21556 at commit f73eab2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Jul 12, 2018

@wangyum, can you explain what was happening with the decimal(9,2) benchmark more clearly? I asked additional questions, but the thread is on a line that changed so it's collapsed by default.

Also, valueCanMakeFilterOn returns true for all null values, so I think we still have a problem there. Conversion from EqualNullSafe needs to support null filter values.

@HyukjinKwon
Member

HyukjinKwon commented Jul 13, 2018

@rdblue, so basically you mean that equality comparison and null-safe equality comparison are currently pushed down identically and should be distinguished, otherwise there could be a potential problem? If so, yup, I agree.

I think the optimizer won't actually give us a chance to push down an equality or null-safe equality comparison with an actual null value. However, sure, we shouldn't rely on that. I think we should actually disallow either null-safe equality comparison or equality comparison with null in ParquetFilters.

The thing is, I remember checking a few years ago that Parquet's equality comparison API is itself actually null-safe - this should of course be double checked.

Since this PR doesn't change the existing behaviour here and this looks like it needs more investigation (e.g., checking whether what I remembered about Parquet's equality comparison is still true), it is probably okay to leave it as is and proceed separately.

case _ => false
}

// Decimal type must make sure that filter value's scale matched the file.
Member

Shall we leave this comment around the decimal cases below or around isDecimalMatched?


// Decimal type must make sure that filter value's scale matched the file.
// If doesn't matched, which would cause data corruption.
// Other types must make sure that filter value's type matched the file.
Member

I would say something like: Parquet's type in the given file should match the value's type in the pushed filter in order to push down the filter to Parquet.

@rdblue
Contributor

rdblue commented Jul 13, 2018

@HyukjinKwon, even if the values are null, the makeEq function only casts null to Java Integer so the handling is still safe. It just looks odd that null.asInstanceOf[JInt] is safe. Thanks to @wangyum for explaining it. Even if the null-safe equality predicate contains a null value, this should be safe.

And, passing null in an equals predicate is supported by Parquet.
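A tiny sketch of that (based on parquet-mr's documented contract that a null literal in eq/notEq matches null / non-null values; the column name is made up):

import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.intColumn

// eq with a null literal keeps rows where the column is null, which lines up
// with null-safe equality (`<=>` NULL) semantics.
val isNullPredicate = FilterApi.eq(intColumn("d2"), null.asInstanceOf[Integer])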

@rdblue
Contributor

rdblue commented Jul 13, 2018

+1, I think this looks ready to go.

@wangyum
Member Author

wangyum commented Jul 14, 2018

@SparkQA

SparkQA commented Jul 14, 2018

Test build #92996 has finished for PR 21556 at commit e713698.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@rdblue, ah, I misunderstood then. Thanks for clarifying it.

@rdblue
Contributor

rdblue commented Jul 14, 2018

I misunderstood how it was safe as well. It was Yuming's clarification that helped.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@SparkQA

SparkQA commented Jul 15, 2018

Test build #93014 has finished for PR 21556 at commit e31c201.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaSummarizerExample
  • class SerializableConfiguration(@transient var value: Configuration)
  • class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)
  • case class SchemaType(dataType: DataType, nullable: Boolean)
  • implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T])
  • implicit class AvroDataFrameReader(reader: DataFrameReader)
  • class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
  • trait ComplexTypeMergingExpression extends Expression
  • case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes
  • abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike

@wangyum
Member Author

wangyum commented Jul 15, 2018

retest this please

@SparkQA

SparkQA commented Jul 15, 2018

Test build #93017 has finished for PR 21556 at commit e31c201.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaSummarizerExample
  • class SerializableConfiguration(@transient var value: Configuration)
  • class IncompatibleSchemaException(msg: String, ex: Throwable = null) extends Exception(msg, ex)
  • case class SchemaType(dataType: DataType, nullable: Boolean)
  • implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T])
  • implicit class AvroDataFrameReader(reader: DataFrameReader)
  • class KMeansModel (@Since("1.0.0") val clusterCenters: Array[Vector],
  • trait ComplexTypeMergingExpression extends Expression
  • case class Size(child: Expression) extends UnaryExpression with ExpectsInputTypes
  • abstract class ArraySetLike extends BinaryArrayExpressionWithImplicitCast
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 9549a28 Jul 16, 2018