Conversation

@maropu
Member

@maropu maropu commented May 10, 2018

What changes were proposed in this pull request?

This PR adds benchmark code (FilterPushdownBenchmark) for string pushdown and updates the performance results measured on AWS r3.xlarge.

How was this patch tested?

N/A

@maropu maropu force-pushed the UpdateParquetBenchmark branch from 223bf20 to 8f60902 Compare May 10, 2018 05:13
@SparkQA

SparkQA commented May 10, 2018

Test build #90440 has finished for PR 21288 at commit 223bf20.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 10, 2018

Test build #90441 has finished for PR 21288 at commit 8f60902.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 10, 2018

retest this please

@SparkQA

SparkQA commented May 10, 2018

Test build #90454 has finished for PR 21288 at commit 8f60902.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

conf.set("spark.sql.parquet.compression.codec", "snappy")
.setMaster("local[1]")
.setAppName("FilterPushdownBenchmark")
.set("spark.driver.memory", "3g")
Member

These (and the master setting) - change to setIfMissing()? I think it'd be great if these could be set via config.

Member Author

aha, ok. Looks good to me.
I just added this along with the other benchmark code, e.g., TPCDSQueryBenchmark.
If there's no problem, I'll fix the other places in a follow-up.

@SparkQA

SparkQA commented May 14, 2018

Test build #90571 has finished for PR 21288 at commit 4520044.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Parquet Vectorized (Pushdown) 15015 / 15047 1.0 954.6 1.0X
Native ORC Vectorized 12090 / 12259 1.3 768.7 1.2X
Native ORC Vectorized (Pushdown) 12021 / 12096 1.3 764.2 1.2X
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Member

Hi, @maropu. Thank you for updating this with the new Parquet 1.10. BTW, could you describe the EC2 instance more clearly in the PR description? I want to reproduce this.

Member Author

ok, I used m4.2xlarge.

Member

Thanks!

val conf = new SparkConf()
conf.set("orc.compression", "snappy")
conf.set("spark.sql.parquet.compression.codec", "snappy")
.setMaster("local[1]")
Member

I think you can do .setIfMissing("spark.master", "local[1]")
that way perhaps we could get this to run on different backends too

Member Author

ok
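The setIfMissing suggestion above can be sketched in plain Scala. This is not the real SparkConf API or the benchmark code — the object and method names below are made up for illustration — but it mimics the semantics being discussed: set always overwrites, while setIfMissing keeps a value that was already supplied, e.g. via spark-submit --master.

```scala
// Minimal sketch (hypothetical names, not the actual SparkConf class):
// setIfMissing applies the benchmark default only when no value was
// already provided, so a command-line setting survives.
object SetIfMissingSketch {
  def resolveMaster(cmdLineMaster: Option[String]): String = {
    val settings = scala.collection.mutable.Map[String, String]()
    // value coming from the command line, if any
    cmdLineMaster.foreach(m => settings("spark.master") = m)
    // setIfMissing semantics: only fill in the default when the key is absent
    settings.getOrElseUpdate("spark.master", "local[1]")
  }
}
```

With a hard-coded .set, the first branch would be overwritten; with setIfMissing, `--master local[*]` on the command line wins and the benchmark could run on different backends.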

@SparkQA

SparkQA commented May 21, 2018

Test build #90878 has finished for PR 21288 at commit 39e5a50.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 21, 2018

retest this please

@SparkQA

SparkQA commented May 21, 2018

Test build #90883 has finished for PR 21288 at commit 39e5a50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 21, 2018

retest this please

}
}

// Pushdown for few distinct value case (use dictionary encoding)
Member

For ORC, there is a conf called orc.dictionary.key.threshold. Do we need to set the conf here? cc @dongjoon-hyun

  DICTIONARY_KEY_SIZE_THRESHOLD("orc.dictionary.key.threshold",
      "hive.exec.orc.dictionary.key.size.threshold",
      0.8,
      "If the number of distinct keys in a dictionary is greater than this\n" +
          "fraction of the total number of non-null rows, turn off \n" +
          "dictionary encoding.  Use 1 to always use dictionary encoding.")

Member

So far, in the Apache Spark project, we test with only default configurations. snappy is the only exception, because it's Spark's default compression and makes the Parquet/ORC comparison easy to follow.

Member

The current data fits the threshold. I am just afraid the comment might be invalid if the underlying files are not using dictionary encoding. Even if we do not change the format, we still need to update the comment.

Member Author

@maropu maropu May 22, 2018

I feel it'd be better to set the option to 1.0 for safety, too.
But currently we don't have a way to pass the option into the ORC output writer? @dongjoon-hyun

Member

Let us add a comment and also change the conf?

Member Author

ok

@SparkQA

SparkQA commented May 21, 2018

Test build #90904 has finished for PR 21288 at commit 39e5a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the UpdateParquetBenchmark branch from b7859ed to 2c0d5cb Compare May 28, 2018 04:39
@SparkQA

SparkQA commented May 28, 2018

Test build #91210 has finished for PR 21288 at commit b7859ed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 28, 2018

Test build #91211 has finished for PR 21288 at commit 2c0d5cb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented May 28, 2018

retest this please

@SparkQA

SparkQA commented May 28, 2018

Test build #91219 has finished for PR 21288 at commit 2c0d5cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the UpdateParquetBenchmark branch from 2c0d5cb to d41e689 Compare May 28, 2018 13:25
@SparkQA

SparkQA commented May 28, 2018

Test build #91228 has finished for PR 21288 at commit d41e689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Native ORC Vectorized 8167 / 8185 1.9 519.3 1.0X
Native ORC Vectorized (Pushdown) 365 / 379 43.1 23.2 23.1X
Parquet Vectorized 2961 / 3123 5.3 188.3 1.0X
Parquet Vectorized (Pushdown) 3057 / 3121 5.1 194.4 1.0X
Member

The difference is huge. What happened?

Member Author

@maropu maropu May 29, 2018

yea, I think so. But I'm not sure. Though I tried running it multiple times on the same env. (m4.2xlarge), I didn't get the old performance values... I'll check again later (it would be great if somebody double-checks on the same env.).

Member

I have not tried it yet, but is it related to the recent change we made in the parquet reader?

Member Author

That might be, but I feel the performance change is too big for that... I suspect I made some mistake in the last benchmark runs (I haven't found why yet though).

Member

How about 2.3?

Member

Is it a regression?

Member Author

I have time today, so I'll check v2.3.

Member Author

The result in v2.3.1: https://gist.github.com/maropu/88627246b7143ede5ab73c7183ab2128

That is not a regression; I probably ran the bench on a wrong branch or commit.
I re-ran the bench on the current master and updated the PR.

how-to-run: I created a new m4.2xlarge instance, fetched this PR, rebased it onto master, and ran the bench.

Member

Thank you for updating, @maropu .

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91795 has finished for PR 21288 at commit d41e689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91815 has finished for PR 21288 at commit fa53156.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@maropu Could you fix the style?

BTW, based on the latest result, Parquet is generally faster than ORC. cc @dongjoon-hyun @rdblue

@maropu
Member Author

maropu commented Jun 14, 2018

ok

@maropu maropu force-pushed the UpdateParquetBenchmark branch from fa53156 to d3dd504 Compare June 14, 2018 06:25
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 14, 2018

@gatorsmile and @maropu . I really appreciate this effort. Thanks.

Since this is a cloud benchmark, I have one thing to recommend. Can we use r3.xlarge for all benchmarks consistently? As we know, it's difficult to compare the results from different machines.

There are three reasons.

  1. r3.xlarge is cheaper than m4.2xlarge.
  2. Previous benchmark results came from a MacBook (SSD). r3.xlarge also provides SSD.
  3. r3.xlarge is used in the Databricks TPCDS benchmark, too.

The following is the result on r3.xlarge; I launched the machine, built this PR on the latest master, and ran bin/spark-submit --master local[1] --driver-memory 10G --conf spark.ui.enabled=false --class org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark sql/core/target/scala-2.11/spark-sql_2.11-2.4.0-SNAPSHOT-tests.jar. (There is no Hadoop installation; I guess @maropu's setup is the same.)

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9133 / 9275          1.7         580.6       1.0X
Parquet Vectorized (Pushdown)                   85 /  100        185.2           5.4     107.6X
Native ORC Vectorized                         8760 / 8843          1.8         556.9       1.0X
Native ORC Vectorized (Pushdown)               115 /  130        136.4           7.3      79.2X


OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9254 / 9276          1.7         588.4       1.0X
Parquet Vectorized (Pushdown)                  912 /  922         17.2          58.0      10.1X
Native ORC Vectorized                         8966 / 9013          1.8         570.1       1.0X
Native ORC Vectorized (Pushdown)               254 /  276         61.8          16.2      36.4X


OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row (value = '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9106 / 9136          1.7         578.9       1.0X
Parquet Vectorized (Pushdown)                  897 /  910         17.5          57.0      10.2X
Native ORC Vectorized                         8846 / 8889          1.8         562.4       1.0X
Native ORC Vectorized (Pushdown)               254 /  267         61.9          16.2      35.8X


OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row (value <=> '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9095 / 9124          1.7         578.3       1.0X
Parquet Vectorized (Pushdown)                  891 /  899         17.7          56.6      10.2X
Native ORC Vectorized                         8853 / 8941          1.8         562.8       1.0X
Native ORC Vectorized (Pushdown)               246 /  254         64.0          15.6      37.0X


OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 1 string row ('7864320' <= value <= '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9236 / 9273          1.7         587.2       1.0X
Parquet Vectorized (Pushdown)                  902 /  910         17.4          57.4      10.2X
Native ORC Vectorized                         8944 / 8965          1.8         568.6       1.0X
Native ORC Vectorized (Pushdown)               248 /  262         63.4          15.8      37.2X


OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 3.10.0-693.5.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select all string rows (value IS NOT NULL): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          20309 / 20381          0.8        1291.2       1.0X
Parquet Vectorized (Pushdown)               20437 / 20477          0.8        1299.3       1.0X
Native ORC Vectorized                       24929 / 24999          0.6        1585.0       0.8X
Native ORC Vectorized (Pushdown)            24918 / 25040          0.6        1584.3       0.8X

As you can see, this result is more consistent with the previous one and differs from this PR's. Actually, I was reluctant to say this, but we had better have a standard way to generate benchmark results on the cloud. If possible, I'd like to use r3.xlarge.

@dongjoon-hyun
Member

One more thing: I prefer MacBook performance tests because the cost of EC2 is always a barrier for developers.

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91821 has finished for PR 21288 at commit d3dd504.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jun 14, 2018

yea, I agree; we had better run benchmarks on the same machine.
I'll re-run the benchmark on r3.xlarge to check whether I get the same result.

There is no hadoop installation. I guess @maropu also does

yea, I had no installation.

One more thing; I prefer Macbook performance tests because the cost of EC2 is always a barrier to developers.

I think so, but isn't it somewhat difficult to get consistent results across developers' different laptop envs?
Btw, I feel it might help to have a script that runs all the micro-benchmarks, e.g., ./dev/run-micro-benchmarks, to keep the results consistent. The script would then output all the results somewhere (e.g., ./dev/micro-benchmark-results).
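A script like the one suggested above could look roughly as follows. Everything here is hypothetical — the script name, the benchmark class list, and the jar path are only illustrative, and this dry-run version prints the spark-submit command instead of executing it:

```shell
#!/bin/sh
# Hypothetical sketch of a ./dev/run-micro-benchmarks tool: run each
# benchmark class under one fixed, reproducible configuration and collect
# the output in one directory. Class names and the jar path are examples.
RESULT_DIR="${RESULT_DIR:-./dev/micro-benchmark-results}"
TEST_JAR="sql/core/target/scala-2.11/spark-sql_2.11-2.4.0-SNAPSHOT-tests.jar"
BENCHMARKS="
org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark
org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark
"
mkdir -p "$RESULT_DIR"
for cls in $BENCHMARKS; do
  name="${cls##*.}"  # class name without the package prefix
  # dry run: print the command that would be executed
  echo "bin/spark-submit --master local[1] --driver-memory 10G" \
       "--conf spark.ui.enabled=false --class $cls $TEST_JAR" \
       "> $RESULT_DIR/$name.txt"
done
```

Pinning --master and --driver-memory in one place is the point: it avoids exactly the local[1] vs local[*] discrepancy discussed later in this thread.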

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jun 14, 2018

Test build #91857 has finished for PR 21288 at commit d3dd504.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @maropu . I'm looking forward to seeing the new result from you.

@maropu
Member Author

maropu commented Jun 15, 2018

I noticed why the big performance changes happened in #21288 (comment); the commit wrongly set local[*] (the global default) as spark.master instead of local[1]:

// Performance results on r3.xlarge 

// --master local[1] --driver-memory 10G --conf spark.ui.enabled=false (This is the same condition with the @dongjoon-hyun one)
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9292 / 9315          1.7         590.8       1.0X
Parquet Vectorized (Pushdown)                  921 /  933         17.1          58.6      10.1X
Native ORC Vectorized                         9001 / 9021          1.7         572.3       1.0X
Native ORC Vectorized (Pushdown)               257 /  265         61.2          16.3      36.2X

Select 1 string row (value = '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9151 / 9162          1.7         581.8       1.0X
Parquet Vectorized (Pushdown)                  902 /  917         17.4          57.3      10.1X
Native ORC Vectorized                         8870 / 8882          1.8         564.0       1.0X
Native ORC Vectorized (Pushdown)               254 /  268         61.9          16.1      36.0X
...


// --master local[*] --driver-memory 10G --conf spark.ui.enabled=false
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            3959 / 4067          4.0         251.7       1.0X
Parquet Vectorized (Pushdown)                  202 /  245         77.7          12.9      19.6X
Native ORC Vectorized                         3973 / 4055          4.0         252.6       1.0X
Native ORC Vectorized (Pushdown)               286 /  345         55.0          18.2      13.8X

OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            3985 / 4022          3.9         253.4       1.0X
Parquet Vectorized (Pushdown)                  249 /  274         63.3          15.8      16.0X
Native ORC Vectorized                         4066 / 4122          3.9         258.5       1.0X
Native ORC Vectorized (Pushdown)               257 /  310         61.3          16.3      15.5X

I'll fix the bug and update the results in follow-up PRs. Sorry, all (I'm running all the benchmarks now).

@maropu
Member Author

maropu commented Jun 15, 2018

@dongjoon-hyun I got the same result under the same condition (enough memory), but with --driver-memory 3g (smaller memory) I got slightly different results;

// --driver-memory=3g (default)
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10084 / 10154          1.6         641.1       1.0X
Parquet Vectorized (Pushdown)                  967 / 1008         16.3          61.5      10.4X
Native ORC Vectorized                       11088 / 11116          1.4         705.0       0.9X
Native ORC Vectorized (Pushdown)               270 /  278         58.2          17.2      37.3X

Select 1 string row (value = '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                          10032 / 10085          1.6         637.8       1.0X
Parquet Vectorized (Pushdown)                  959 /  998         16.4          61.0      10.5X
Native ORC Vectorized                       11104 / 11128          1.4         706.0       0.9X
Native ORC Vectorized (Pushdown)               259 /  277         60.6          16.5      38.7X
...


// --driver-memory=10g
OpenJDK 64-Bit Server VM 1.8.0_171-b10 on Linux 4.14.33-51.37.amzn1.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9201 / 9300          1.7         585.0       1.0X
Parquet Vectorized (Pushdown)                   89 /  105        176.3           5.7     103.1X
Native ORC Vectorized                         8886 / 8898          1.8         564.9       1.0X
Native ORC Vectorized (Pushdown)               110 /  128        143.4           7.0      83.9X

Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Parquet Vectorized                            9336 / 9357          1.7         593.6       1.0X
Parquet Vectorized (Pushdown)                  927 /  937         17.0          58.9      10.1X
Native ORC Vectorized                         9026 / 9041          1.7         573.9       1.0X
Native ORC Vectorized (Pushdown)               257 /  272         61.1          16.4      36.3X
...

Does Parquet have a smaller memory footprint? I'm currently looking into this (I updated the results for the enough-memory case).

@maropu
Member Author

maropu commented Jun 15, 2018

I've checked the metrics and found that GC happened in the --driver-memory 3g case.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91914 has finished for PR 21288 at commit 4a9cec9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Yep. Thank you for progressing this, @maropu !

@maropu
Member Author

maropu commented Jun 15, 2018

retest this please

@SparkQA

SparkQA commented Jun 16, 2018

Test build #91946 has finished for PR 21288 at commit 4a9cec9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val conf = new SparkConf()
.setAppName("FilterPushdownBenchmark")
// Since `spark.master` always exists, override this value
.set("spark.master", "local[1]")
Member

Could you update m4.2xlarge in the PR description and add spark.master at line 34, too?

Member Author

In the current PR, we cannot set spark.master via command-line options. Are you suggesting we drop .set("spark.master", "local[1]") and always set spark.master via options for this benchmark?

Member Author

btw, I updated the description. Thanks!

Member

What I mean is adding --master local[1] at line 34, too.

Member Author

I'm afraid other developers might misunderstand how to use this:

spark-submit --master local[1] --class <this class> <spark sql test jar>
spark-submit --master local[*] --class <this class> <spark sql test jar>

In both cases, the benchmark always uses local[1]. Or do you have another point of view?

@gatorsmile
Member

LGTM

Thanks! Merged to master.

@asfgit asfgit closed this in 98f363b Jun 24, 2018
@maropu
Member Author

maropu commented Jun 24, 2018

Thanks for the check! btw, DataSourceReadBenchmark has the same issue (the spark.master setup), so is it ok to fix it as a follow-up?
master...maropu:FixDataSourceReadBenchmark
Also, I updated the bench on r3.xlarge.

@gatorsmile
Member

Sure
