[SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side #15049
Conversation
Test build #65218 has finished for PR 15049 at commit
Test build #65221 has finished for PR 15049 at commit
Test build #65222 has finished for PR 15049 at commit
Cc @yhuai too. I remember we discussed this in another (loosely) related PR before.
Hi @davies, what do you think about this? I can add some more benchmarks if you think I need more.
Hi @liancheng, I would like to cc you here too if you don't mind. Could you please add some comments?
gentle ping @liancheng and @davies
ping @liancheng and @yhuai...
gentle ping @davies @liancheng @yhuai
ping..
Force-pushed f11bff4 to 6397cbd.
Test build #70925 has finished for PR 15049 at commit
Hi all, could you please guide me a bit further on what I should do in this PR?
Force-pushed 6397cbd to 402a051.
Test build #75360 has finished for PR 15049 at commit
Force-pushed 402a051 to 9167eda.
I simply changed the default to `true`.
Test build #75878 has finished for PR 15049 at commit
Test build #75877 has finished for PR 15049 at commit
Test build #76810 has finished for PR 15049 at commit
gentle ping. I guess this would not harm as it is
Test build #76811 has finished for PR 15049 at commit
cc @jiangxb1987
retest this please
Test build #83161 has finished for PR 15049 at commit
I'm curious whether ORC filter pushdown also shows a similar pattern, i.e., Spark-side filtering is faster?
They do look similar for block filtering. However, to my knowledge, ORC's filter pushdown does not support filtering record by record but only skips blocks (stripes). I am aware of ORC's bloom filters too; my untested rough guess is that they are faster than Spark-side filtering.
Add something like: When spark.sql.parquet.filterPushdown is disabled, this config doesn't have any effect?
Sure, let me add more details about it soon.
Will this also disable Parquet's native row group filtering?
Nope, I am sure it still keeps row group filtering enabled. I will double-check this and be back within a few days.
@jiangxb1987 If you do not like the solution here, could you submit the one you propose?
Hm, I am active here. Could you share what problem you see in this solution and discuss it first?
I think there's no point in disabling row group filtering; @jiangxb1987 asked whether this actually disables row group filtering too, which might degrade performance. The current change does not do this.
I added a test for this concern - #15049 (comment).
If we want to disable both, I think we can simply disable Parquet predicate pushdown BTW.
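For reference, a minimal sketch of that existing knob (the session setup here is illustrative, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disabling pushdown entirely turns off both row group and
// record-level filtering on the Parquet side; all filtering then
// happens in Spark.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
```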
It should be fine to make this change. I was thinking we could make it by setting the value of `ParquetInputFormat.RECORD_FILTERING_ENABLED` to false. Both ways work and I don't have a strong preference. Sorry for the late response.
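For context, a minimal sketch of this alternative, assuming parquet-mr reads the flag from the Hadoop configuration (the `hadoopConf` setup is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetInputFormat

// Turn off parquet-mr's record-by-record filtering via its own flag while
// leaving predicate pushdown (and thus row group skipping) in place.
val hadoopConf = new Configuration()
hadoopConf.setBoolean(ParquetInputFormat.RECORD_FILTERING_ENABLED, false)
```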
Oh, I have been looking at the JIRA as well, but I thought that flag only exists in 1.9.0 as specified in the JIRA, also given #14671 (comment). Looks like the flag exists in Parquet 1.8.2 as well.
Yup, I don't have a strong preference either.
Force-pushed f9b454b to 86c863a.
@jiangxb1987, here I added a test to make sure that disabling 'spark.sql.parquet.recordFilter' still keeps row group level filtering enabled.
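A rough sketch of what such a check could look like (not the PR's actual test; the conf name used here is the renamed one settled on later in this review, and the path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// With record-level filtering disabled, query results must still be correct
// because Spark re-applies the filter on its side, while row group skipping
// can still happen via the pushed-down predicate.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "false")

val path = "/tmp/parquet-record-filter-test" // hypothetical location
spark.range(0, 100000).toDF("id").coalesce(1)
  .write.mode("overwrite").parquet(path)

// Correctness check: exactly one matching record should come back.
assert(spark.read.parquet(path).filter($"id" === 42L).count() == 1)
```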
Test build #83184 has finished for PR 15049 at commit
Test build #83186 has finished for PR 15049 at commit
cc @cloud-fan too.
change the config in `beforeEach` or `beforeAll`?
Sure, let me try.
Force-pushed cdbdbf7 to ece76d0.
LGTM
Test build #83764 has finished for PR 15049 at commit
```scala
val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordFilter")
  .doc("Whether to allow the record-level filtering via Parquet API. When " +
    "'spark.sql.parquet.filterPushdown' is disabled, this configuration does not " +
    "have any effect.")
```
If true, enable parquet native record level filtering using the pushed down filters. This conf only has an effect when both 'spark.sql.parquet.filterPushdown' and 'spark.sql.parquet.enableVectorizedReader' are enabled.
I think we are currently doing only row group level filtering if `spark.sql.parquet.enableVectorizedReader` is enabled.
And even if we enable `spark.sql.parquet.enableVectorizedReader`, we can still fall back if the schema contains a non-AtomicType. Let me just take out "and 'spark.sql.parquet.enableVectorizedReader' are enabled", if you wouldn't mind.
```scala
  .booleanConf
  .createWithDefault(false)
```

```scala
val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordFilter")
```
`spark.sql.parquet.recordLevelFilter.enabled`
Test build #83815 has finished for PR 15049 at commit
| "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + | ||
| "is enabled.") | ||
| .booleanConf | ||
| .createWithDefault(true) |
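For reference, a hedged sketch of how a user would toggle this once merged, using the renamed conf suggested above (session setup assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Opt back in to Parquet-side record-level filtering; per the doc text
// above, this only has an effect while filter pushdown itself is enabled.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "true")
```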
How about making the default value false, since we'll always do record filtering on the Spark side anyway? WDYT?
Yup, that was the initial proposal, but I switched this back to true (#15049 (comment)) long ago to make this PR sound safer and more compelling. I would like to change it to false if that's fine with you, @gatorsmile, @viirya and @cloud-fan too.
I think it makes a lot of sense to turn off Parquet record-level filtering by default.
Let me change it to false for now. Thanks @jiangxb1987.
BTW, there was another small discussion about this default value before - #14671 (comment).
From the benchmark numbers, it looks like Spark-side filtering is always better. This default value should not change the final results either, so a default value of false makes sense.
LGTM
Test build #83827 has finished for PR 15049 at commit
retest this please.
Test build #83836 has finished for PR 15049 at commit
thanks, merging to master!
Thanks @gatorsmile, @viirya, @jiangxb1987 and @cloud-fan, sincerely. It was an almost 1.5-year-old PR!
What changes were proposed in this pull request?
There is a concern that Spark-side codegen row-by-row filtering might be faster than Parquet's in general, due to type boxing and additional function calls that Spark's implementation tries to avoid.
So, this PR adds an option to enable/disable record-by-record filtering on the Parquet side.
It sets the default to `false` to take advantage of the improvement. This was also discussed in #14671.
How was this patch tested?
Manual benchmarks were performed. I generated a billion (1,000,000,000) records and tested equality comparisons concatenated with `OR`; the filter combinations ranged from 5 to 30 predicates. Spark-side filtering indeed seems faster in this test case, and the gap increases as the filter tree becomes larger.
The details are as below:
Code
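The original benchmark code is collapsed in this capture; below is a rough sketch of the shape it describes (a generated billion-row dataset filtered by an OR-chain of equality predicates), with the output path and timing helper as assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical output location for the generated data.
val path = "/tmp/parquet-filter-benchmark"
spark.range(1000L * 1000 * 1000).write.mode("overwrite").parquet(path)

// Time a filter made of `n` equality predicates chained with OR,
// mirroring the 5-to-30 predicate combinations described above.
def timeFilter(n: Int): Long = {
  val predicate = (0 until n).map(i => col("id") === i).reduce(_ || _)
  val start = System.nanoTime()
  spark.read.parquet(path).filter(predicate).count()
  (System.nanoTime() - start) / 1000000L // elapsed millis
}

Seq(5, 10, 20, 30).foreach { n =>
  println(s"$n predicates: ${timeFilter(n)} ms")
}
```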
Result