@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR updates the CSVBenchmark results, especially since a fix such as #31858 could potentially improve performance.

Why are the changes needed?

To have the updated benchmark results.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually ran the benchmark.

@HyukjinKwon HyukjinKwon requested a review from MaxGekk March 22, 2021 02:37
@github-actions github-actions bot added the SQL label Mar 22, 2021
Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you update the Java 11 results as well?

$ ls CSV*
CSVBenchmark-jdk11-results.txt
CSVBenchmark-results.txt

@HyukjinKwon
Member Author

Sure, running now 👍

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/

@SparkQA

SparkQA commented Mar 22, 2021

Test build #136322 has finished for PR 31917 at commit 750f92b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Mar 22, 2021

@HyukjinKwon Could you please update the PR description and specify the environment in which you ran the benchmark?

@HyukjinKwon
Member Author

I think the benchmark results include that.

@MaxGekk
Member

MaxGekk commented Mar 22, 2021

@HyukjinKwon The purpose is to give others enough info about the environment to get the same benchmark results. Do you really think that:

Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz

is enough? OK, how much memory should I have? Is 1 MB of RAM enough?

@HyukjinKwon
Member Author

@MaxGekk If we care about that, it would be great to include it in the benchmark results.

@MaxGekk
Member

MaxGekk commented Mar 22, 2021

@HyukjinKwon I care about reproducible benchmark results. Currently, you don't provide enough info to reproduce them. I would prefer to follow a scientific approach and have a chance to verify your results if needed.
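
As an illustration of the kind of environment record being asked for here, the following is a minimal sketch of a script that prints the OS, architecture, total memory, and Java version so they can be pasted into a PR description. It is not part of Spark; the function names (`total_memory_mb`, `java_version`, `print_env`) are hypothetical, and it assumes memory is exposed through `/proc/meminfo` on Linux or `sysctl -n hw.memsize` on macOS.

```shell
#!/bin/sh
# Print an environment summary to paste next to benchmark results.
# Assumption: Linux exposes total memory via /proc/meminfo and macOS via
# sysctl hw.memsize; any other system is reported as "unknown".

total_memory_mb() {
  if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB on Linux.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
  elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.memsize >/dev/null 2>&1; then
    # hw.memsize is reported in bytes on macOS.
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 ))
  else
    echo "unknown"
  fi
}

java_version() {
  if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr; keep only the first line.
    java -version 2>&1 | head -n 1
  else
    echo "java not found"
  fi
}

print_env() {
  echo "OS: $(uname -s) $(uname -r)"
  echo "Arch: $(uname -m)"
  echo "Memory (MB): $(total_memory_mb)"
  echo "Java: $(java_version)"
}

print_env
```

This only records the machine; it does not capture JVM heap settings or Spark configuration, which may also affect the numbers.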

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/

@HyukjinKwon
Member Author

HyukjinKwon commented Mar 22, 2021

@MaxGekk, we had better have a way to do that, or at least document that extra steps are required. All I can find is:

* Benchmark to measure CSV read/write performance.
* To run this benchmark:
* {{{
* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar>,
* <spark catalyst test jar> <spark sql test jar>
* 2. build/sbt "sql/test:runMain <this class>"
* 3. generate result:
* SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
* Results will be written to "benchmarks/CSVBenchmark-results.txt".
* }}}

If there are extra steps, let's start another discussion and document them (FWIW, I personally don't agree with having extra steps). It would be great to have an automated script.

Until we have them, I don't think this should be required. I can already see that other environments were used for past benchmark results.
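
For reference, step 3 of the instructions quoted above could be sketched as follows for CSVBenchmark. The fully qualified class name is an assumption and should be checked against the checkout; `bench_command` is a hypothetical helper that only prints the command, so it can be reviewed before being run from the Spark source root.

```shell
#!/bin/sh
# Assumed fully qualified class name of the benchmark; verify it in your checkout.
BENCH_CLASS="${BENCH_CLASS:-org.apache.spark.sql.execution.datasources.csv.CSVBenchmark}"

# Print (do not execute) the sbt invocation that regenerates the results file,
# following step 3 of the class documentation quoted above.
bench_command() {
  printf 'SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain %s"\n' "$1"
}

bench_command "$BENCH_CLASS"
```

Printing rather than executing keeps the sketch safe to run anywhere; copy the printed line into a shell at the repository root to produce the results file.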

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/

@SparkQA

SparkQA commented Mar 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/

@HyukjinKwon
Member Author

I had an offline discussion with @MaxGekk.

I'm thinking about setting up a GitHub Actions workflow like "Running tests in your forked repository using GitHub Actions" (https://spark.apache.org/developer-tools.html), so that we always run the benchmark on GitHub Actions machines.

The machine specifications are still not guaranteed to be the same, but I would expect less variance than with a non-pinned environment, and it should be very easy for other people to run (just go to your fork, run a benchmark from the UI, and download the benchmark results). I will try to take a look, probably this week.

Meanwhile, I think we can just unblock this PR and go ahead.

@MaxGekk
Member

MaxGekk commented Mar 22, 2021

+1, LGTM, Merging this to master.

@HyukjinKwon
Member Author

Thank you @MaxGekk!

@MaxGekk MaxGekk closed this in ec70467 Mar 22, 2021
@HyukjinKwon
Member Author

I filed a JIRA for that: SPARK-34821

@SparkQA

SparkQA commented Mar 22, 2021

Test build #136331 has finished for PR 31917 at commit 3575e48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-34815 branch January 4, 2022 00:54