-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-34815][SQL] Update CSVBenchmark #31917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update Java 11 result together?
$ ls CSV*
CSVBenchmark-jdk11-results.txt
CSVBenchmark-results.txt
|
Sure, running now 👍 |
77f753a to
750f92b
Compare
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #136322 has finished for PR 31917 at commit
|
|
@HyukjinKwon Could you update PR's description and point out the environment in which you run the benchmark, please. |
|
I think the benchmark results include that. |
|
@HyukjinKwon The purpose is to give others enough info about the environment to get the same benchmark results. Do you really think that: is enough? ok, how much memory should I have? 1MB RAM is enough? |
|
@MaxGekk If we care about that, it would be great if we include that in benchmark results. |
|
@HyukjinKwon I care of reproducible benchmark results. Currently, you don't provide enough info to reproduce the same. I would prefer to follow scientific approach, and have a chance to verify your results if it is needed. |
|
Kubernetes integration test starting |
|
@MaxGekk, We should better have a way to do that, or at least document that we should do extra steps. All I read is: spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala Lines 30 to 40 in d65f534
If there are extra steps to do it, let's start another discussion and document it (FWIW I personally don't agree with having extra steps). It would be great if we have an automated script. Until we have them, I don't think it's something required. I already see other envs were used in the past benchmark results. |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
I had an offline discussion with @MaxGekk. I'm thinking about setting a GitHub Actions workflow like "Running tests in your forked repository using GitHub Actions" https://spark.apache.org/developer-tools.html, and we run the benchmark always in GA machines. I guess the machine specifications are still not guaranteed to be same but would expect less variance compared to non-pinned env, and should be very easy for other people to run (just go to your fork, run a benchmark by UI, and download the benchmark results). I will try to take a look probably this week. Meanwhile, I think we can just unblock this PR and go ahead. |
|
+1, LGTM, Merging this to master. |
|
Thank you @MaxGekk! |
|
I filed a JIRA for that: SPARK-34821 |
|
Test build #136331 has finished for PR 31917 at commit
|
What changes were proposed in this pull request?
This PR updates CSVBenchmark especially we have a fix like #31858 that could potentially improve the performance.
Why are the changes needed?
To have the updated benchmark results.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manually ran the benchmark