Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Sep 15, 2018

What changes were proposed in this pull request?

This PR aims to fix three things in FilterPushdownBenchmark.

1. Use the same memory assumption.
The following configurations are used in ORC and Parquet.

  • Memory buffer for writing

    • parquet.block.size (default: 128MB)
    • orc.stripe.size (default: 64MB)
  • Compression chunk size

    • parquet.page.size (default: 1MB)
    • orc.compress.size (default: 256KB)

SPARK-24692 used 1MB, the default value of parquet.page.size, for parquet.block.size and orc.stripe.size. But, it missed to match orc.compress.size. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent.

2. Dictionary encoding should not be enforced for all cases.
SPARK-24206 enforced dictionary encoding for all test cases. This PR recovers the default behavior in general and enforces dictionary encoding only in case of prepareStringDictTable.

3. Generate test result on AWS r3.xlarge
SPARK-24206 generated the result on AWS in order to reproduce and compare easily. This PR also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used.

How was this patch tested?

Manual. Enable the test cases and run FilterPushdownBenchmark on AWS r3.xlarge. It takes about 4 hours 15 minutes.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Sep 15, 2018

Could you review this, @gatorsmile , @cloud-fan , @dbtsai , @HyukjinKwon , @maropu and @wangyum ?

@SparkQA
Copy link

SparkQA commented Sep 15, 2018

Test build #96092 has finished for PR 22427 at commit fb14cd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @rdblue

@maropu
Copy link
Member

maropu commented Sep 15, 2018

Just a question; I'm not familiar with both internal logics though, these parameters (Memory buffer for writing and Compression chunk size) are internally treated in the same manner? Also, they are performace-sensitive parameters?

@dongjoon-hyun
Copy link
Member Author

Thank you for review, @maropu .

  1. Yes. It's the same. The first one limits the memory usage for write operation. The second one limits the memory usage for compression operation.
  2. Yes. As you see in this PR, it's performance sensitive. Actually, all parameters of Parquet/ORC are performance sensitive.

@maropu
Copy link
Member

maropu commented Sep 16, 2018

Thanks for the explanation! The change looks good to me.

@dongjoon-hyun
Copy link
Member Author

Thank you, @maropu !

@dongjoon-hyun
Copy link
Member Author

Merged to master/2.4.

asfgit pushed a commit that referenced this pull request Sep 16, 2018
…memory assumption

## What changes were proposed in this pull request?

This PR aims to fix three things in `FilterPushdownBenchmark`.

**1. Use the same memory assumption.**
The following configurations are used in ORC and Parquet.

- Memory buffer for writing
  - parquet.block.size (default: 128MB)
  - orc.stripe.size (default: 64MB)

- Compression chunk size
  - parquet.page.size (default: 1MB)
  - orc.compress.size (default: 256KB)

SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compress.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent.

**2. Dictionary encoding should not be enforced for all cases.**
SPARK-24206 enforced dictionary encoding for all test cases. This PR recovers the default behavior in general and enforces dictionary encoding only in case of `prepareStringDictTable`.

**3. Generate test result on AWS r3.xlarge**
SPARK-24206 generated the result on AWS in order to reproduce and compare easily. This PR also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used.

## How was this patch tested?

Manual. Enable the test cases and run `FilterPushdownBenchmark` on `AWS r3.xlarge`. It takes about 4 hours 15 minutes.

Closes #22427 from dongjoon-hyun/SPARK-25438.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fefaa3c)
Signed-off-by: Dongjoon Hyun <[email protected]>
@asfgit asfgit closed this in fefaa3c Sep 16, 2018
@dongjoon-hyun dongjoon-hyun deleted the SPARK-25438 branch September 16, 2018 00:58
Parquet Vectorized 11499 / 11539 1.4 731.1 1.0X
Parquet Vectorized (Pushdown) 669 / 672 23.5 42.5 17.2X
Native ORC Vectorized 7343 / 7363 2.1 466.8 1.6X
Native ORC Vectorized (Pushdown) 7559 / 7568 2.1 480.6 1.5X
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does orc support StringStartsWith pushdown?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems ORC doesn't support custom filter yet: #21623 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORC doesn't support customer filter pushdown yet. It's expected and consistent from the previous result, @cloud-fan . :) Also, thank you for bringing the previous my comment, @wangyum .

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was a bit busy. late LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants