Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Sep 18, 2019

What changes were proposed in this pull request?

Refactored SQL-related benchmark and made them depend on SqlBasedBenchmark. In particular, creation of Spark session are moved into override def getSparkSession: SparkSession.

Why are the changes needed?

This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the modified benchmarks.

@SparkQA
Copy link

SparkQA commented Sep 18, 2019

Test build #110891 has finished for PR 25828 at commit 9a279a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @MaxGekk . This looks good.
As a verification, let me regenerate the result on EC2~

@dongjoon-hyun
Copy link
Member

I updated partially. For the other benchmark test suites like FilterPushdownBenchmark, I'm still running.

@dongjoon-hyun
Copy link
Member

For FilterPushdownBenchmark.scala, I'll create another PR. It seems that we had better reduce the number of min run.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
This first commit of this PR already pass the Jenkins.
The last two commits are the test result.

@dongjoon-hyun dongjoon-hyun deleted the sql-benchmarks-refactoring branch September 19, 2019 00:52
@SparkQA
Copy link

SparkQA commented Sep 19, 2019

Test build #110951 has finished for PR 25828 at commit 786a59a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 19, 2019

Test build #110947 has finished for PR 25828 at commit 9c665a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SQL Json 8908 9008 142 1.8 566.4 2.7X
SQL Parquet Vectorized 192 229 36 82.1 12.2 125.0X
SQL Parquet MR 2356 2363 10 6.7 149.8 10.2X
SQL ORC Vectorized 329 347 25 47.9 20.9 72.9X
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORC Vectorized is almost 2 times slower now. It would be interesting to find the root cause of this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Of course!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the JIRA ticket for that: https://issues.apache.org/jira/browse/SPARK-29169

Data column - Parquet MR 3378 3384 8 4.7 214.8 11.3X
Data column - ORC Vectorized 475 481 7 33.1 30.2 80.3X
Data column - ORC MR 2324 2356 46 6.8 147.7 16.4X
Partition column - CSV 14680 14742 88 1.1 933.3 2.6X
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSV and JSON below is 2 times slower now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL CSV 14771 14817 65 0.1 14086.3 1.0X
SQL Json 29677 29787 157 0.0 28302.0 0.5X
SQL Parquet Vectorized 182 191 13 5.8 173.8 81.1X
SQL Parquet MR 1209 1213 5 0.9 1153.1 12.2X
Copy link
Member Author

@MaxGekk MaxGekk Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than 4 times slower

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun
Copy link
Member

Thank you for filing JIRAs. Please add the number directly into that JIRA, too.

@dongjoon-hyun
Copy link
Member

For a record, the results were generated based on this PR. So, Scala 2.12.10 was not applied here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants