[SPARK-29141][SQL][TEST] Use SqlBasedBenchmark in SQL benchmarks #25828
|
Test build #110891 has finished for PR 25828 at commit
|
dongjoon-hyun
left a comment
Thank you so much, @MaxGekk . This looks good.
To verify, let me regenerate the results on EC2.
|
I updated partially. For the other benchmark test suites like |
|
For |
dongjoon-hyun
left a comment
+1, LGTM. Merged to master.
The first commit of this PR already passed Jenkins.
The last two commits are the test results.
|
Test build #110951 has finished for PR 25828 at commit
|
|
Test build #110947 has finished for PR 25828 at commit
|
| SQL Json 8908 9008 142 1.8 566.4 2.7X | ||
| SQL Parquet Vectorized 192 229 36 82.1 12.2 125.0X | ||
| SQL Parquet MR 2356 2363 10 6.7 149.8 10.2X | ||
| SQL ORC Vectorized 329 347 25 47.9 20.9 72.9X |
ORC Vectorized is almost 2 times slower now. It would be interesting to find the root cause of this.
Yep. Of course!
Here is the JIRA ticket for that: https://issues.apache.org/jira/browse/SPARK-29169
| Data column - Parquet MR 3378 3384 8 4.7 214.8 11.3X | ||
| Data column - ORC Vectorized 475 481 7 33.1 30.2 80.3X | ||
| Data column - ORC MR 2324 2356 46 6.8 147.7 16.4X | ||
| Partition column - CSV 14680 14742 88 1.1 933.3 2.6X |
CSV and JSON below are 2 times slower now.
Here is the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-29170
| SQL CSV 14771 14817 65 0.1 14086.3 1.0X | ||
| SQL Json 29677 29787 157 0.0 28302.0 0.5X | ||
| SQL Parquet Vectorized 182 191 13 5.8 173.8 81.1X | ||
| SQL Parquet MR 1209 1213 5 0.9 1153.1 12.2X |
More than 4 times slower
Here is the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-29171
|
Thank you for filing JIRAs. Please add the number directly into that JIRA, too. |
|
For the record, the results were generated based on this PR. So, Scala |
What changes were proposed in this pull request?
Refactored the SQL-related benchmarks and made them depend on SqlBasedBenchmark. In particular, creation of the Spark session is moved into override def getSparkSession: SparkSession.
Why are the changes needed?
This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor and extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniform.
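The idea of a shared trait with an overridable session factory can be sketched roughly as below. This is a Spark-free illustration only: Session, SessionBasedBenchmark, getSession, DefaultBenchmark, CustomBenchmark and the "example.setting" key are all hypothetical stand-ins chosen so the snippet compiles on its own, and they are not the actual SqlBasedBenchmark code from this PR.

```scala
// Hypothetical stand-in for SparkSession so the sketch compiles without Spark.
final case class Session(master: String, conf: Map[String, String])

// Hypothetical analogue of a shared benchmark trait.
trait SessionBasedBenchmark {
  // Default session factory; subclasses override this to customize the session.
  def getSession: Session = Session("local[1]", Map.empty)

  // lazy, so the (possibly overridden) factory runs only on first use,
  // after the subclass has finished initializing.
  lazy val session: Session = getSession
}

// Uses the default session unchanged.
object DefaultBenchmark extends SessionBasedBenchmark

// Customizes the session by overriding only the factory method.
object CustomBenchmark extends SessionBasedBenchmark {
  override def getSession: Session =
    Session("local[1]", Map("example.setting" -> "true")) // made-up key
}

object Demo {
  def main(args: Array[String]): Unit = {
    println(DefaultBenchmark.session)
    println(CustomBenchmark.session)
  }
}
```

With this shape, a change to the default session setup would be made once in the trait, and only benchmarks that need something different carry an override. The lazy val is a design choice of this sketch to sidestep trait initialization-order surprises; whether the real trait does the same is not stated here.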
Does this PR introduce any user-facing change?
No
How was this patch tested?
By running the modified benchmarks.