[SPARK-34815][SQL] Update CSVBenchmark #31917

HyukjinKwon · 2021-03-22T02:37:35Z

What changes were proposed in this pull request?

This PR updates the CSVBenchmark results, especially since fixes such as #31858 could potentially improve performance.

Why are the changes needed?

To have the updated benchmark results.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually ran the benchmark.

dongjoon-hyun

Could you update the Java 11 results as well?

$ ls CSV*
CSVBenchmark-jdk11-results.txt
CSVBenchmark-results.txt

HyukjinKwon · 2021-03-22T03:44:33Z

Sure, running now 👍

SparkQA · 2021-03-22T04:25:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/

SparkQA · 2021-03-22T04:31:34Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/

SparkQA · 2021-03-22T05:03:28Z

Test build #136322 has finished for PR 31917 at commit 750f92b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/benchmarks/CSVBenchmark-results.txt

MaxGekk · 2021-03-22T05:49:50Z

@HyukjinKwon Could you please update the PR description and state the environment in which you ran the benchmark?

HyukjinKwon · 2021-03-22T05:54:01Z

I think the benchmark results include that.

MaxGekk · 2021-03-22T05:59:32Z

@HyukjinKwon The purpose is to give others enough info about the environment to get the same benchmark results. Do you really think that:

Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz

is enough? OK, how much memory should I have? Is 1 MB of RAM enough?

HyukjinKwon · 2021-03-22T06:03:44Z

@MaxGekk If we care about that, it would be great to include it in the benchmark results.

MaxGekk · 2021-03-22T06:12:00Z

@HyukjinKwon I care about reproducible benchmark results. Currently, you don't provide enough info to reproduce them. I would prefer to follow a scientific approach and have a chance to verify your results if needed.
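To make the request concrete, here is a minimal sketch of a script that prints the environment details the benchmark header omits, such as memory and CPU count. The function name `describe_env` and the choice of fields are assumptions for illustration, not a Spark convention.

```shell
#!/bin/sh
# Sketch: print environment details that could accompany benchmark results.
# Assumption: the set of fields is illustrative, not something Spark requires.

describe_env() {
  echo "os: $(uname -s) $(uname -r)"
  echo "arch: $(uname -m)"
  # Logical CPU count; getconf _NPROCESSORS_ONLN works on Linux and macOS.
  echo "cpus: $(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
  # Physical memory: /proc/meminfo on Linux, sysctl hw.memsize on macOS.
  if [ -r /proc/meminfo ]; then
    echo "memory_kb: $(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  elif command -v sysctl >/dev/null 2>&1; then
    echo "memory_bytes: $(sysctl -n hw.memsize 2>/dev/null || echo unknown)"
  else
    echo "memory: unknown"
  fi
  # java -version writes to stderr, so redirect it to stdout.
  if command -v java >/dev/null 2>&1; then
    echo "java: $(java -version 2>&1 | head -n 1)"
  else
    echo "java: not found"
  fi
  # JVM options that may affect heap size, if the variable is set.
  echo "JAVA_OPTS: ${JAVA_OPTS:-<unset>}"
}

describe_env
```

The output could be pasted into the PR description next to the updated result files.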

SparkQA · 2021-03-22T06:12:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/

HyukjinKwon · 2021-03-22T06:16:13Z

@MaxGekk, we had better have a way to do that, or at least document that extra steps are required. All I see documented is:

spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala

Lines 30 to 40 in d65f534

    * Benchmark to measure CSV read/write performance.
    * To run this benchmark:
    * {{{
    *   1. without sbt:
    *      bin/spark-submit --class <this class> --jars <spark core test jar>,
    *       <spark catalyst test jar> <spark sql test jar>
    *   2. build/sbt "sql/test:runMain <this class>"
    *   3. generate result:
    *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
    *      Results will be written to "benchmarks/CSVBenchmark-results.txt".
    * }}}

If extra steps are needed, let's start another discussion and document them (FWIW, I personally don't agree with having extra steps). It would be great to have an automated script.

Until we have that, I don't think it should be required. I can see that other environments were used for past benchmark results.
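As a rough sketch of what such an automated script might look like, the following wraps the documented "generate result" step. The fully qualified class name is inferred from the file path quoted above, and the `RUN` variable and `benchmark_command` function are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch: build, and optionally run, the documented result-generation command.
# Assumption: the class name below is inferred from the source path of
# CSVBenchmark.scala; RUN and benchmark_command are hypothetical names.
BENCHMARK_CLASS="org.apache.spark.sql.execution.datasources.csv.CSVBenchmark"

benchmark_command() {
  # $1 is the benchmark's main class.
  printf 'SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain %s"\n' "$1"
}

cmd=$(benchmark_command "$BENCHMARK_CLASS")
# Dry run by default: only print the command.
echo "$cmd"

# Set RUN=1 to execute it; this must be done from the Spark source root.
if [ "${RUN:-0}" = "1" ]; then
  # Record which JDK produced the results before running.
  java -version 2>&1 | head -n 1
  eval "$cmd"
fi
```

Printing the command by default keeps the script safe to run outside a Spark checkout.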

SparkQA · 2021-03-22T06:20:04Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/

SparkQA · 2021-03-22T06:37:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/

SparkQA · 2021-03-22T06:50:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/

SparkQA · 2021-03-22T07:05:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/

SparkQA · 2021-03-22T07:15:36Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/

HyukjinKwon · 2021-03-22T07:42:03Z

I had an offline discussion with @MaxGekk.

I'm thinking about setting up a GitHub Actions workflow, like "Running tests in your forked repository using GitHub Actions" (https://spark.apache.org/developer-tools.html), so that we always run the benchmark on GitHub Actions machines.

I guess the machine specifications are still not guaranteed to be the same, but I would expect less variance than with a non-pinned environment, and it should be very easy for other people to run (just go to your fork, run a benchmark from the UI, and download the benchmark results). I will try to take a look, probably this week.

Meanwhile, I think we can just unblock this PR and go ahead.

MaxGekk · 2021-03-22T07:48:53Z

+1, LGTM, Merging this to master.

HyukjinKwon · 2021-03-22T07:49:46Z

Thank you @MaxGekk!

HyukjinKwon · 2021-03-22T07:53:21Z

I filed a JIRA for that: SPARK-34821

wangyum · 2021-03-22T09:46:46Z

@HyukjinKwon @MaxGekk We can use "Hosting your own runners". Here is an example:
https://github.com/wangyum/spark/blob/test-ci/.github/workflows/benchmark.yml#L11
https://github.com/wangyum/spark/runs/2164700670

SparkQA · 2021-03-22T09:49:55Z

Test build #136331 has finished for PR 31917 at commit 3575e48.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Update CSVBenchmark

750f92b

HyukjinKwon requested a review from MaxGekk March 22, 2021 02:37

github-actions bot added the SQL label Mar 22, 2021

wangyum approved these changes Mar 22, 2021

dongjoon-hyun reviewed Mar 22, 2021

HyukjinKwon force-pushed the SPARK-34815 branch from 77f753a to 750f92b Compare March 22, 2021 04:25

MaxGekk reviewed Mar 22, 2021

sql/core/benchmarks/CSVBenchmark-results.txt

Update JDK 11 results

3575e48

MaxGekk approved these changes Mar 22, 2021

MaxGekk closed this in ec70467 Mar 22, 2021

HyukjinKwon deleted the SPARK-34815 branch January 4, 2022 00:54
