[SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code #21288
Conversation
Force-pushed 223bf20 to 8f60902.
Test build #90440 has finished for PR 21288 at commit
Test build #90441 has finished for PR 21288 at commit
retest this please
Test build #90454 has finished for PR 21288 at commit
conf.set("spark.sql.parquet.compression.codec", "snappy")
  .setMaster("local[1]")
  .setAppName("FilterPushdownBenchmark")
  .set("spark.driver.memory", "3g")
Could these (and the master setting) be changed to setIfMissing()? I think it'd be great if these could be set via config.
Aha, ok. Looks good to me.
I just added this along with other benchmark code, e.g., TPCDSQueryBenchmark.
If there's no problem, I'll fix the other places in a follow-up.
Test build #90571 has finished for PR 21288 at commit
Parquet Vectorized (Pushdown)      15015 / 15047    1.0    954.6    1.0X
Native ORC Vectorized              12090 / 12259    1.3    768.7    1.2X
Native ORC Vectorized (Pushdown)   12021 / 12096    1.3    764.2    1.2X
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Hi, @maropu. Thank you for updating this with the new Parquet 1.10. BTW, could you describe the EC2 setup more clearly in the PR description? I want to reproduce this.
ok, I used m4.2xlarge.
Thanks!
val conf = new SparkConf()
conf.set("orc.compression", "snappy")
conf.set("spark.sql.parquet.compression.codec", "snappy")
  .setMaster("local[1]")
I think you can do .setIfMissing("spark.master", "local[1]")
that way perhaps we could get this to run on different backends too
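The suggestion above would look roughly like this (a minimal sketch of Spark's SparkConf API; the exact set of keys the benchmark forces is an assumption here):

```scala
import org.apache.spark.SparkConf

// setIfMissing() only sets a key when it was not already supplied,
// so values passed via spark-submit (--master, --conf) take precedence.
val conf = new SparkConf()
  .setIfMissing("spark.master", "local[1]")
  .setIfMissing("spark.driver.memory", "3g")
  .setAppName("FilterPushdownBenchmark")
```

With this, `spark-submit --master local[4] ...` would run the benchmark with local[4], while a bare run still defaults to local[1].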
ok
Test build #90878 has finished for PR 21288 at commit
retest this please
Test build #90883 has finished for PR 21288 at commit
retest this please |
// Pushdown for few distinct value case (use dictionary encoding)
For ORC, there is a conf called orc.dictionary.key.threshold. Do we need to set it here? cc @dongjoon-hyun
DICTIONARY_KEY_SIZE_THRESHOLD("orc.dictionary.key.threshold",
"hive.exec.orc.dictionary.key.size.threshold",
0.8,
"If the number of distinct keys in a dictionary is greater than this\n" +
"fraction of the total number of non-null rows, turn off \n" +
"dictionary encoding. Use 1 to always use dictionary encoding.")
So far, in the Apache Spark project, we test with only default configurations. snappy will be the only exception because it's Spark's default compression and it makes the Parquet/ORC comparison easy to understand.
The current data fits the threshold. I am just afraid the comment might be invalid if the underlying files are not using dictionary encoding. Even if we do not change the format, we still need to update the comment.
I feel it'd be better to set the option to 1.0 for safety, too.
But currently we don't have a way to pass the option into the ORC output writer? @dongjoon-hyun
Let us add a comment and also change the conf?
ok
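If write-side ORC options were forwarded to the writer, pinning dictionary encoding could look like the sketch below. Treat the option plumbing as an assumption: whether a DataFrameWriter option reaches the underlying ORC writer is exactly what the thread above questions, and the path and data shape are illustrative only.

```scala
// Sketch only: assumes `.option(...)` on the ORC writer is forwarded
// to the ORC library, which may not hold at the time of this discussion.
spark.range(1024 * 1024 * 15)
  .selectExpr("cast(id % 200 as string) as value")  // few distinct values
  .write
  .option("orc.dictionary.key.threshold", "1.0")    // 1.0 = always use dictionary encoding
  .orc("/tmp/orc_dict_bench")
```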
Test build #90904 has finished for PR 21288 at commit
Force-pushed b7859ed to 2c0d5cb.
Test build #91210 has finished for PR 21288 at commit
Test build #91211 has finished for PR 21288 at commit
retest this please
Test build #91219 has finished for PR 21288 at commit
Force-pushed 2c0d5cb to d41e689.
Test build #91228 has finished for PR 21288 at commit
Native ORC Vectorized              8167 / 8185     1.9    519.3     1.0X
Native ORC Vectorized (Pushdown)    365 / 379     43.1     23.2    23.1X
Parquet Vectorized                 2961 / 3123     5.3    188.3     1.0X
Parquet Vectorized (Pushdown)      3057 / 3121     5.1    194.4     1.0X
The difference is huge. What happened?
Yea, I think so. But I'm not sure. I ran it multiple times on the same env (m4.2xlarge), but I didn't get the old performance values... I'll check again later (it would be great if somebody could double-check on the same env).
I have not tried it yet, but is it related to the recent change we made in the parquet reader?
That might be it, but the change seems too big for that... I suspect I made some mistake in the last benchmark runs (I haven't found it yet though).
How about 2.3?
Is it a regression?
I have time today, so I'll check v2.3.
The result in v2.3.1: https://gist.github.com/maropu/88627246b7143ede5ab73c7183ab2128
It is not a regression; I probably ran the bench on the wrong branch or commit.
I re-ran the bench on the current master and updated the PR.
How-to-run: I created a new m4.2xlarge instance, fetched this PR, rebased it onto master, and ran the bench.
Thank you for updating, @maropu .
Test build #91795 has finished for PR 21288 at commit
Test build #91815 has finished for PR 21288 at commit
@maropu Could you fix the style? BTW, based on the latest result, Parquet is generally faster than ORC. cc @dongjoon-hyun @rdblue

ok
Force-pushed fa53156 to d3dd504.
@gatorsmile and @maropu. I really appreciate this effort. Thanks. Since this is a cloud benchmark, I have one thing to recommend. Can we use There are three reasons.
The following is the result on As you see, the result is more consistent with the previous one and is different from this PR. Actually, I was reluctant to say this, but we had better have a standard way to generate a benchmark result on the cloud. If possible, I'd like to use
One more thing; I prefer MacBook performance tests because the cost of EC2 is always a barrier to developers.
Test build #91821 has finished for PR 21288 at commit
Yea, I also agree with that opinion; we'd better run benchmarks on the same machine.
Yea, I had no installation.
I think so, but isn't it somewhat difficult to get consistent results across developers' different laptop envs?
retest this please
Test build #91857 has finished for PR 21288 at commit
dongjoon-hyun left a comment
Thank you, @maropu . I'm looking forward to seeing the new result from you.
I noticed why the big performance value changes happened in #21288 (comment); that's because the commit wrongly set I'll fix the bug and update the results in the following PRs. Sorry, all (I'm running all the benchmarks now).
@dongjoon-hyun I got the same result in the case of the same condition (enough memory), but, if Does Parquet have a smaller memory footprint? I'm currently looking into this (I updated the result for the case with enough memory).
I've checked the metrics and found that GC happened in the case of
Test build #91914 has finished for PR 21288 at commit
Yep. Thank you for progressing this, @maropu !
retest this please
Test build #91946 has finished for PR 21288 at commit
val conf = new SparkConf()
  .setAppName("FilterPushdownBenchmark")
  // Since `spark.master` always exists, overrides this value
  .set("spark.master", "local[1]")
Could you update m4.2xlarge in the PR description and add spark.master at line 34, too?
In the current PR, we cannot use spark.master in command-line options. Are you suggesting we drop .set("spark.master", "local[1]") and always set spark.master via options for this benchmark?
btw, I updated the description. Thanks!
What I mean is adding --master local[1] at line 34, too.
I'm afraid that other developers might misunderstand how to use this:
spark-submit --master local[1] --class <this class> <spark sql test jar>
spark-submit --master local[*] --class <this class> <spark sql test jar>
In both cases, the benchmark always uses local[1]. Or do you have another point of view in mind?
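The precedence being described can be sketched as follows (a sketch of plain SparkConf semantics, not the PR's final code: a hard `set` overwrites values that spark-submit injected, while `setIfMissing` yields to them):

```scala
import org.apache.spark.SparkConf

// `new SparkConf()` picks up values supplied by spark-submit (e.g. --master).
// A hard `set` then overwrites them, which is why both command lines above
// end up running with local[1]:
val forced = new SparkConf().set("spark.master", "local[1]")

// With `setIfMissing`, a value from `--master local[*]` would survive instead:
val overridable = new SparkConf().setIfMissing("spark.master", "local[1]")
```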
LGTM Thanks! Merged to master.
Thanks for the check! btw,
Sure
What changes were proposed in this pull request?
This PR added benchmark code FilterPushdownBenchmark for string pushdown and updated performance results on AWS r3.xlarge.

How was this patch tested?
N/A