[SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks #25828

MaxGekk · 2019-09-18T08:03:08Z

What changes were proposed in this pull request?

Refactored the SQL-related benchmarks so that they depend on SqlBasedBenchmark. In particular, creation of the Spark session is moved into override def getSparkSession: SparkSession.
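As a rough illustration of the pattern described above, here is a minimal, dependency-free sketch. It is not the actual Spark code: `FakeSession` and `SqlBasedBenchmarkSketch` are hypothetical stand-ins for `SparkSession` and `SqlBasedBenchmark`, and the setting values are only examples. The point is that a benchmark customizes its session by overriding a single factory method instead of building the session itself.

```scala
// Hypothetical stand-in for SparkSession so the sketch runs without Spark.
// It only records the settings a real session builder would receive.
final case class FakeSession(settings: Map[String, String])

// Simplified model of the shared trait (assumption, not the real SqlBasedBenchmark):
// the session is created once, through an overridable factory method.
trait SqlBasedBenchmarkSketch {
  // Default session used by benchmarks that need nothing special.
  def getSparkSession: FakeSession =
    FakeSession(Map(
      "spark.master" -> "local[1]",
      "spark.app.name" -> getClass.getName))

  // Lazy in this sketch so the overriding factory is only called on first use.
  lazy val spark: FakeSession = getSparkSession
}

// A benchmark that is happy with the defaults.
object DefaultBenchmark extends SqlBasedBenchmarkSketch

// A benchmark that needs extra settings overrides only the factory method.
object CustomBenchmark extends SqlBasedBenchmarkSketch {
  override def getSparkSession: FakeSession = {
    val base = super.getSparkSession
    base.copy(settings = base.settings + ("spark.sql.orc.impl" -> "native"))
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    println(DefaultBenchmark.spark.settings)
    println(CustomBenchmark.spark.settings)
  }
}
```

With this shape, a change to the default session (for example, a new common setting) would be made once in the trait and picked up by every benchmark that does not override the factory.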

Why are the changes needed?

This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor and extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniform.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the modified benchmarks.

SparkQA · 2019-09-18T12:58:30Z

Test build #110891 has finished for PR 25828 at commit 9a279a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Thank you so much, @MaxGekk . This looks good.
As a verification, let me regenerate the result on EC2~

dongjoon-hyun · 2019-09-18T23:57:16Z

I updated partially. For the other benchmark test suites like FilterPushdownBenchmark, I'm still running.

dongjoon-hyun · 2019-09-19T00:42:36Z

For FilterPushdownBenchmark.scala, I'll create another PR. It seems that we had better reduce the minimum number of runs.

dongjoon-hyun

+1, LGTM. Merged to master.
The first commit of this PR already passed Jenkins.
The last two commits are the test results.

SparkQA · 2019-09-19T02:58:17Z

Test build #110951 has finished for PR 25828 at commit 786a59a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-19T04:55:27Z

Test build #110947 has finished for PR 25828 at commit 9c665a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2019-09-19T05:23:32Z

sql/core/benchmarks/DataSourceReadBenchmark-results.txt

+SQL Json                                           8908           9008         142          1.8         566.4       2.7X
+SQL Parquet Vectorized                              192            229          36         82.1          12.2     125.0X
+SQL Parquet MR                                     2356           2363          10          6.7         149.8      10.2X
+SQL ORC Vectorized                                  329            347          25         47.9          20.9      72.9X


ORC Vectorized is almost 2 times slower now. It would be interesting to find the root cause of this.

Yep. Of course!

Here is the JIRA ticket for that: https://issues.apache.org/jira/browse/SPARK-29169

MaxGekk · 2019-09-19T05:28:31Z

sql/core/benchmarks/DataSourceReadBenchmark-results.txt

+Data column - Parquet MR                           3378           3384           8          4.7         214.8      11.3X
+Data column - ORC Vectorized                        475            481           7         33.1          30.2      80.3X
+Data column - ORC MR                               2324           2356          46          6.8         147.7      16.4X
+Partition column - CSV                            14680          14742          88          1.1         933.3       2.6X


CSV and JSON below are 2 times slower now.

Here is the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-29170

MaxGekk · 2019-09-19T05:31:35Z

sql/core/benchmarks/DataSourceReadBenchmark-results.txt

+SQL CSV                                           14771          14817          65          0.1       14086.3       1.0X
+SQL Json                                          29677          29787         157          0.0       28302.0       0.5X
+SQL Parquet Vectorized                              182            191          13          5.8         173.8      81.1X
+SQL Parquet MR                                     1209           1213           5          0.9        1153.1      12.2X


More than 4 times slower

Here is the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-29171

dongjoon-hyun · 2019-09-19T06:29:34Z

Thank you for filing the JIRAs. Please add the numbers directly to those JIRAs, too.

dongjoon-hyun · 2019-09-19T06:34:54Z

For the record, the results were generated based on this PR. So, Scala 2.12.10 was not applied here.

MaxGekk added 8 commits September 18, 2019 12:10

Extend SqlBasedBenchmark by ExtractBenchmark

50fa5e1

Extend SqlBasedBenchmark by DataSourceReadBenchmark

3fce167

Extend SqlBasedBenchmark by FilterPushdownBenchmark

94dabfb

Extend SqlBasedBenchmark by PrimitiveArrayBenchmark

0e5f450

Remove SQLHelper from direct dependencies of AvroReadBenchmark

116b026

Extend SqlBasedBenchmark by ObjectHashAggregateExecBenchmark

0c01b04

Extend SqlBasedBenchmark by OrcReadBenchmark

9267efc

Rename spark -> sparkSession in DataSourceReadBenchmark

9a279a3

MaxGekk mentioned this pull request Sep 18, 2019

[SPARK-29065][SQL][TEST] Extend EXTRACT benchmark #25772

Closed

dongjoon-hyun added the SQL label Sep 18, 2019

dongjoon-hyun reviewed Sep 18, 2019

regen on EC2 (#20)

9c665a6

add more (#22)

786a59a

dongjoon-hyun approved these changes Sep 19, 2019

dongjoon-hyun closed this in a6a663c Sep 19, 2019

dongjoon-hyun deleted the sql-benchmarks-refactoring branch September 19, 2019 00:52
