[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption #22427
Conversation
Could you review this, @gatorsmile, @cloud-fan, @dbtsai, @HyukjinKwon, @maropu and @wangyum?
Test build #96092 has finished for PR 22427 at commit
cc @rdblue
Just a question; I'm not familiar with the internals of either, though. These parameters (
Thank you for the review, @maropu.
Thanks for the explanation! The change looks good to me.
Thank you, @maropu!
Merged to master/2.4.
…memory assumption

## What changes were proposed in this pull request?

This PR aims to fix three things in `FilterPushdownBenchmark`.

**1. Use the same memory assumption.**

The following configurations are used in ORC and Parquet.

- Memory buffer for writing
  - parquet.block.size (default: 128MB)
  - orc.stripe.size (default: 64MB)
- Compression chunk size
  - parquet.page.size (default: 1MB)
  - orc.compress.size (default: 256KB)

SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`, but it did not also align `orc.compress.size`. As a result, the current benchmark compares ORC using 256KB of memory for compression against Parquet using 1MB. To compare correctly, the two formats need consistent settings (a minimal sketch of aligning these options follows this description).

**2. Dictionary encoding should not be enforced for all cases.**

SPARK-24206 enforced dictionary encoding for all test cases. This PR restores the default behavior in general and enforces dictionary encoding only in the case of `prepareStringDictTable`.

**3. Generate test results on AWS r3.xlarge.**

SPARK-24206 generated the results on AWS so that they are easy to reproduce and compare. This PR also updates the results on the same machine type, for the same reason. Specifically, AWS r3.xlarge with Instance Store is used.

## How was this patch tested?

Manual. Enable the test cases and run `FilterPushdownBenchmark` on `AWS r3.xlarge`. It takes about 4 hours 15 minutes.

Closes #22427 from dongjoon-hyun/SPARK-25438.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fefaa3c)
Signed-off-by: Dongjoon Hyun <[email protected]>
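To make point 1 concrete, here is a minimal sketch of writing the same rows with Parquet and ORC under identical memory settings. It is an illustration, not the benchmark's actual code: the object name, paths, and generated data are placeholders, while the option keys are the standard Parquet/ORC ones named in the description above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: write the same rows with Parquet and ORC under identical
// memory settings, so neither format gets a larger write buffer or
// compression chunk than the other. 1MB is the default parquet.page.size.
object AlignedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterPushdownBenchmark-write-sketch")
      .getOrCreate()

    // Placeholder data; the real benchmark generates its own test tables.
    val df = spark.range(0, 1000000)
      .selectExpr("id", "cast(id as string) as value")

    val blockSize: Long = 1024 * 1024 // 1MB for every buffer below

    df.write
      .option("parquet.block.size", blockSize) // memory buffer for writing
      .option("parquet.page.size", blockSize)  // compression chunk size
      .mode("overwrite")
      .parquet("/tmp/pushdown-sketch/parquet")

    df.write
      .option("orc.stripe.size", blockSize)    // memory buffer for writing
      .option("orc.compress.size", blockSize)  // compression chunk size
      .mode("overwrite")
      .orc("/tmp/pushdown-sketch/orc")

    spark.stop()
  }
}
```

For point 2, the dictionary-only table (`prepareStringDictTable`) would additionally force dictionary encoding, for example via ORC's `orc.dictionary.key.threshold`, while all other tables keep the format defaults; the exact options used are in the PR diff.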
                                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------
Parquet Vectorized                         11499 / 11539          1.4         731.1       1.0X
Parquet Vectorized (Pushdown)                669 /   672         23.5          42.5      17.2X
Native ORC Vectorized                       7343 /  7363          2.1         466.8       1.6X
Native ORC Vectorized (Pushdown)            7559 /  7568          2.1         480.6       1.5X
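As a quick aid to reading those columns, the derived values can be reproduced from the best time. The row count below is an assumption (it is not stated in this excerpt); it is chosen because it reproduces the reported per-row cost:

```scala
// Sanity check on the derived columns above. The row count is an assumption
// (15 * 1024 * 1024); it reproduces the reported 731.1 ns per row for the
// 11499 ms Parquet baseline.
val rows = 15L * 1024 * 1024

val perRowNs = 11499.0 * 1e6 / rows    // ~= 731.1 ns per row
val rateMps  = rows / 11499.0 / 1000.0 // ~= 1.4 million rows per second
val relative = 11499.0 / 669.0         // ~= 17.2X for the pushdown run

println(f"$perRowNs%.1f ns/row, $rateMps%.1f M rows/s, $relative%.1fX")
```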
Does ORC support StringStartsWith pushdown?
It seems ORC doesn't support custom filters yet: #21623 (comment)
ORC doesn't support custom filter pushdown yet. That's expected and consistent with the previous result, @cloud-fan. :) Also, thank you for bringing up my previous comment, @wangyum.
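For illustration only, this is the shape of query where the difference shows up, reusing the hypothetical paths from the write sketch above and assuming a SparkSession named `spark` is in scope:

```scala
// A LIKE 'prefix%' predicate is translated to a StringStartsWith source
// filter: the Parquet reader can evaluate it inside the scan, while the ORC
// reader cannot yet, so Spark filters the rows after reading them -- which
// is why the ORC (Pushdown) timings above stay close to the non-pushdown ones.
val parquetScan = spark.read.parquet("/tmp/pushdown-sketch/parquet").where("value LIKE '10%'")
val orcScan     = spark.read.orc("/tmp/pushdown-sketch/orc").where("value LIKE '10%'")

parquetScan.count() // predicate pushed into the Parquet scan
orcScan.count()     // predicate evaluated by Spark on rows read from ORC
```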
HyukjinKwon left a comment:
Sorry, I was a bit busy. Late LGTM.