[SPARK-17310][SQL] Add an option to disable record-level filter in Parquet-side #15049
Conversation
Test build #65218 has finished for PR 15049 at commit
Test build #65221 has finished for PR 15049 at commit
Test build #65222 has finished for PR 15049 at commit
Cc @yhuai too. I remember we discussed this in another (loosely) related PR before.
Hi @davies, what do you think about this? I can add some more benchmarks if you think I need more.
Hi @liancheng, I would like to cc you here too if you don't mind. Could you please add some comments?
gentle ping @liancheng and @davies
ping @liancheng and @yhuai...
gentle ping @davies @liancheng @yhuai
ping..
Force-pushed f11bff4 to 6397cbd.
Test build #70925 has finished for PR 15049 at commit
Hi all, could you please guide me a bit further on what I should do in this PR?
Force-pushed 6397cbd to 402a051.
Test build #75360 has finished for PR 15049 at commit
Force-pushed 402a051 to 9167eda.
I simply changed the default to `true`.
Test build #75878 has finished for PR 15049 at commit
Test build #75877 has finished for PR 15049 at commit
Test build #76810 has finished for PR 15049 at commit
gentle ping. I guess this would not harm as it is
Test build #76811 has finished for PR 15049 at commit
cc @jiangxb1987
retest this please
Test build #83161 has finished for PR 15049 at commit
I'm curious whether ORC filter pushdown also shows a similar pattern, i.e., Spark-side filtering is faster?
They do look similar for block filtering. However, to my knowledge, ORC's filter pushdown does not support filtering record by record but only skips blocks (stripes). I am aware of ORC's bloom filters too; my untested rough guess is that they are faster than Spark-side filtering.
Add something like: When spark.sql.parquet.filterPushdown is disabled, this config doesn't have any effect?
Sure, let me add more details about it soon.
Will this also disable Parquet's native row group filtering?
Nope, I am sure it still keeps row group filtering enabled. I will double-check this and be back within a few days.
@jiangxb1987 If you do not like the solution here, could you submit the one you propose?
Hm, I am active here. Could you share what problem you see in this solution and discuss it first?
I think there's no point in disabling row group filtering; @jiangxb1987 asked whether this actually disables row group filtering too, which might degrade performance. The current change does not do this.
I added a test for this concern - #15049 (comment).
If we want to disable both, I think we can simply disable Parquet predicate pushdown BTW.
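For reference, a minimal sketch of that existing knob (the session setup here is illustrative, not the PR's code):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disabling pushdown entirely turns off both row group and
// record-level filtering on the Parquet side; all filtering then
// happens in Spark.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
```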
It should be fine to make this change. I was thinking we could make it by setting the value of `ParquetInputFormat.RECORD_FILTERING_ENABLED` to false. Both ways work and I don't have a strong preference. Sorry for the late response.
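For context, a minimal sketch of this alternative, assuming parquet-mr reads the flag from the Hadoop configuration (the `hadoopConf` setup is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetInputFormat

// Turn off parquet-mr's record-by-record filtering via its own flag while
// leaving predicate pushdown (and thus row group skipping) in place.
val hadoopConf = new Configuration()
hadoopConf.setBoolean(ParquetInputFormat.RECORD_FILTERING_ENABLED, false)
```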
Oh, I have been looking at the JIRA as well, but I thought that flag only exists in 1.9.0 as specified in the JIRA, also given #14671 (comment). Looks like the flag exists in Parquet 1.8.2 as well.
Yup, I don't have a strong preference either.
Force-pushed f9b454b to 86c863a.
@jiangxb1987, here I added a test to make sure that disabling 'spark.sql.parquet.recordFilter' still keeps row group level filtering enabled.
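A rough sketch of what such a check could look like (not the PR's actual test; the conf name used here is the renamed one settled on later in this review, and the path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// With record-level filtering disabled, query results must still be correct
// because Spark re-applies the filter on its side, while row group skipping
// can still happen via the pushed-down predicate.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "false")

val path = "/tmp/parquet-record-filter-test" // hypothetical location
spark.range(0, 100000).toDF("id").coalesce(1)
  .write.mode("overwrite").parquet(path)

// Correctness check: exactly one matching record should come back.
assert(spark.read.parquet(path).filter($"id" === 42L).count() == 1)
```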
Test build #83184 has finished for PR 15049 at commit
Test build #83186 has finished for PR 15049 at commit
cc @cloud-fan too.
change the config in `beforeEach` or `beforeAll`?
Sure, let me try.
Force-pushed cdbdbf7 to ece76d0.
LGTM
Test build #83764 has finished for PR 15049 at commit
```scala
val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordFilter")
  .doc("Whether to allow the record-level filtering via Parquet API. When " +
    "'spark.sql.parquet.filterPushdown' is disabled, this configuration does not " +
    "have any effect.")
```
If true, enable parquet native record level filtering using the pushed down filters. This conf only has an effect when both 'spark.sql.parquet.filterPushdown' and 'spark.sql.parquet.enableVectorizedReader' are enabled.
I think we are currently doing only row group level filtering if `spark.sql.parquet.enableVectorizedReader` is enabled.
And even if we enable `spark.sql.parquet.enableVectorizedReader`, we can still fall back if the schema contains a non-AtomicType. Let me just take out "and 'spark.sql.parquet.enableVectorizedReader' are enabled", if you wouldn't mind.
```scala
  .booleanConf
  .createWithDefault(false)
```

```scala
val PARQUET_RECORD_FILTER_ENABLED = buildConf("spark.sql.parquet.recordFilter")
```
`spark.sql.parquet.recordLevelFilter.enabled`
Test build #83815 has finished for PR 15049 at commit
| "filters. This configuration only has an effect when 'spark.sql.parquet.filterPushdown' " + | ||
| "is enabled.") | ||
| .booleanConf | ||
| .createWithDefault(true) |
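For reference, a hedged sketch of how a user would toggle this once merged, using the renamed conf suggested above (session setup assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Opt back in to Parquet-side record-level filtering; per the doc text
// above, this only has an effect while filter pushdown itself is enabled.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.recordLevelFilter.enabled", "true")
```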
How about making the default value false, since we'll always do record filtering on the Spark side anyway? WDYT?
Yup, that was the initial proposal, but I switched this back to true (#15049 (comment)) long ago to make this PR sound safer and more compelling. I would like to change it to false if that's fine with you, @gatorsmile, @viirya and @cloud-fan too.
I think it makes a lot of sense to turn off Parquet record-level filtering by default.
Let me change it to false for now. Thanks @jiangxb1987.
BTW, there was another small discussion about this default value before - #14671 (comment).
From the benchmark numbers, it looks like Spark-side filtering is always better. This default value should not change the final results either, so a default value of false makes sense.
LGTM
Test build #83827 has finished for PR 15049 at commit
retest this please.
Test build #83836 has finished for PR 15049 at commit
thanks, merging to master!
Thanks @gatorsmile, @viirya, @jiangxb1987 and @cloud-fan, sincerely. It was an almost 1.5-year-old PR!
What changes were proposed in this pull request?
There is a concern that Spark-side codegen row-by-row filtering might be faster than Parquet's in general, due to type boxing and additional function calls that Spark's implementation tries to avoid.
So, this PR adds an option to enable/disable record-by-record filtering on the Parquet side.
It sets the default to `false` to take advantage of the improvement. This was also discussed in #14671.
How was this patch tested?
Manual benchmarks were performed. I generated a billion (1,000,000,000) records and tested equality comparisons concatenated with `OR`; the filter combinations ranged from 5 to 30 predicates. Spark-side filtering indeed seems faster in this test case, and the gap increases as the filter tree becomes larger.
The details are as below:
Code
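The original benchmark code is collapsed in this capture; below is a rough sketch of the shape it describes (a generated billion-row dataset filtered by an OR-chain of equality predicates), with the output path and timing helper as assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical output location for the generated data.
val path = "/tmp/parquet-filter-benchmark"
spark.range(1000L * 1000 * 1000).write.mode("overwrite").parquet(path)

// Time a filter made of `n` equality predicates chained with OR,
// mirroring the 5-to-30 predicate combinations described above.
def timeFilter(n: Int): Long = {
  val predicate = (0 until n).map(i => col("id") === i).reduce(_ || _)
  val start = System.nanoTime()
  spark.read.parquet(path).filter(predicate).count()
  (System.nanoTime() - start) / 1000000L // elapsed millis
}

Seq(5, 10, 20, 30).foreach { n =>
  println(s"$n predicates: ${timeFilter(n)} ms")
}
```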
Result