[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption #22427

dongjoon-hyun · 2018-09-15T09:16:46Z

What changes were proposed in this pull request?

This PR aims to fix three things in FilterPushdownBenchmark.

1. Use the same memory assumption.
The following configurations are used in ORC and Parquet.

- Memory buffer for writing
  - parquet.block.size (default: 128MB)
  - orc.stripe.size (default: 64MB)
- Compression chunk size
  - parquet.page.size (default: 1MB)
  - orc.compress.size (default: 256KB)

SPARK-24692 used 1MB, the default value of parquet.page.size, for parquet.block.size and orc.stripe.size, but it did not set orc.compress.size to match. As a result, the current benchmark compares ORC with a 256KB compression chunk against Parquet with a 1MB one. For a fair comparison, the settings need to be consistent.

2. Dictionary encoding should not be enforced for all cases.
SPARK-24206 enforced dictionary encoding for all test cases. This PR restores the default behavior in general and enforces dictionary encoding only for prepareStringDictTable.

3. Generate test result on AWS r3.xlarge
SPARK-24206 generated the result on AWS so that it could be reproduced and compared easily. This PR updates the result on the same machine again for the same reason. Specifically, AWS r3.xlarge with Instance Store is used.
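Points 1 and 2 can be illustrated with a minimal, Spark-free sketch. It only models the option maps; it is not the benchmark's actual code. The helper names `writer_options` and `memory_assumption` are hypothetical, and the use of `orc.dictionary.key.threshold` set to 1.0 to force dictionary encoding is an assumption, since the description does not name the knob it uses.

```python
MB = 1024 * 1024
KB = 1024

# Defaults as quoted in the PR description.
DEFAULTS = {
    "parquet.block.size": 128 * MB,
    "orc.stripe.size": 64 * MB,
    "parquet.page.size": 1 * MB,
    "orc.compress.size": 256 * KB,
}

BLOCK_SIZE = 1 * MB  # the 1MB value the description attributes to SPARK-24692


def writer_options(fmt, use_dictionary=False):
    """Hypothetical helper: explicit writer options for one file format."""
    if fmt == "parquet":
        # parquet.page.size already defaults to 1MB, so only the block size is set.
        opts = {"parquet.block.size": BLOCK_SIZE}
    elif fmt == "orc":
        opts = {"orc.stripe.size": BLOCK_SIZE, "orc.compress.size": BLOCK_SIZE}
        if use_dictionary:
            # Assumption: a threshold of 1.0 makes ORC always dictionary-encode.
            opts["orc.dictionary.key.threshold"] = 1.0
    else:
        raise ValueError(f"unknown format: {fmt}")
    return opts


def memory_assumption(fmt, opts):
    """Return (write buffer bytes, compression chunk bytes) after applying opts."""
    conf = {**DEFAULTS, **opts}
    if fmt == "parquet":
        return conf["parquet.block.size"], conf["parquet.page.size"]
    return conf["orc.stripe.size"], conf["orc.compress.size"]


# Before: only the write buffers were aligned, so ORC compressed in 256KB chunks.
before_orc = memory_assumption("orc", {"orc.stripe.size": BLOCK_SIZE})
# After: both formats use 1MB for the write buffer and the compression chunk.
after_orc = memory_assumption("orc", writer_options("orc"))
after_parquet = memory_assumption("parquet", writer_options("parquet"))
```

With these maps, `before_orc` differs from the Parquet pair in its second element, while `after_orc` and `after_parquet` are identical. The dictionary option appears only when `use_dictionary=True` is passed, which mirrors enforcing it for a single table rather than for every test case.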

How was this patch tested?

Manual. Enable the test cases and run FilterPushdownBenchmark on AWS r3.xlarge. It takes about 4 hours 15 minutes.


dongjoon-hyun · 2018-09-15T09:17:52Z

Could you review this, @gatorsmile , @cloud-fan , @dbtsai , @HyukjinKwon , @maropu and @wangyum ?

SparkQA · 2018-09-15T13:07:27Z

Test build #96092 has finished for PR 22427 at commit fb14cd5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-09-15T22:05:00Z

cc @rdblue

maropu · 2018-09-15T23:52:24Z

Just a question, since I'm not familiar with the internal logic of either format: are these parameters (memory buffer for writing and compression chunk size) treated in the same manner internally? Also, are they performance-sensitive parameters?

dongjoon-hyun · 2018-09-16T00:36:04Z

Thank you for the review, @maropu .

Yes. It's the same. The first one limits the memory usage of the write operation. The second one limits the memory usage of the compression operation.
Yes. As you can see in this PR, it's performance sensitive. Actually, all parameters of Parquet/ORC are performance sensitive.

maropu · 2018-09-16T00:42:12Z

Thanks for the explanation! The change looks good to me.

dongjoon-hyun · 2018-09-16T00:46:05Z

Thank you, @maropu !

dongjoon-hyun · 2018-09-16T00:47:53Z

Merged to master/2.4.

…memory assumption

Closes #22427 from dongjoon-hyun/SPARK-25438.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fefaa3c)
Signed-off-by: Dongjoon Hyun <[email protected]>

cloud-fan · 2018-09-16T14:43:54Z

sql/core/benchmarks/FilterPushdownBenchmark-results.txt

+Parquet Vectorized                          11499 / 11539          1.4         731.1       1.0X
+Parquet Vectorized (Pushdown)                  669 /  672         23.5          42.5      17.2X
+Native ORC Vectorized                         7343 / 7363          2.1         466.8       1.6X
+Native ORC Vectorized (Pushdown)              7559 / 7568          2.1         480.6       1.5X


Does ORC support StringStartsWith pushdown?

wangyum · 2018-09-16

It seems ORC doesn't support custom filters yet: #21623 (comment)

dongjoon-hyun · 2018-09-16

ORC doesn't support custom filter pushdown yet. It's expected and consistent with the previous result, @cloud-fan . :) Also, thank you for bringing up my previous comment, @wangyum .
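The quoted result lines can be cross-checked with a small sketch. It assumes that the first number on each line is the best time in milliseconds and that the last column is the baseline's best time divided by each case's best time. The column headers are not shown in the excerpt, so both readings are assumptions. The helper `relative_speedup` is hypothetical.

```python
# Best time in ms for each case, copied from the quoted result lines.
best_ms = {
    "Parquet Vectorized": 11499,
    "Parquet Vectorized (Pushdown)": 669,
    "Native ORC Vectorized": 7343,
    "Native ORC Vectorized (Pushdown)": 7559,
}


def relative_speedup(times, baseline):
    """Speedup of every case relative to the baseline case, rounded to one decimal."""
    base = times[baseline]
    return {name: round(base / ms, 1) for name, ms in times.items()}


speedups = relative_speedup(best_ms, "Parquet Vectorized")

# Ratio of ORC without pushdown to ORC with pushdown; a value near 1
# means enabling pushdown made no real difference for this filter.
orc_gain = best_ms["Native ORC Vectorized"] / best_ms["Native ORC Vectorized (Pushdown)"]
```

Under those assumptions the computed values match the last column of the quoted lines. The ORC ratio stays close to 1, which fits the explanation that the filter is not pushed down for ORC.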

HyukjinKwon

Sorry, I was a bit busy. late LGTM.


asfgit closed this in fefaa3c Sep 16, 2018

dongjoon-hyun deleted the SPARK-25438 branch September 16, 2018 00:58

cloud-fan reviewed Sep 16, 2018


HyukjinKwon reviewed Sep 27, 2018

