Reflected review comments
sarutak committed Jul 11, 2017
commit 14b188a0a872e55b7d2ddc81a6b7e5e244e54052
@@ -17,8 +17,6 @@

 package org.apache.spark.sql.execution.benchmark

-import java.io.File
-
 import org.apache.spark.SparkConf
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.TableIdentifier
@@ -31,7 +29,7 @@ import org.apache.spark.util.Benchmark
 /**
  * Benchmark to measure TPCDS query performance.
  * To run this:
- * spark-submit --class <this class> --jars <spark sql test jar>
+ * spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
  */
 object TPCDSQueryBenchmark {
   val conf =
@@ -61,12 +59,10 @@ object TPCDSQueryBenchmark {
   }

   def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
-    require(dataLocation.nonEmpty,
-      "please modify the value of dataLocation to point to your local TPCDS data")
     val tableSizes = setupTables(dataLocation)
     queries.foreach { name =>
-      val queryString = resourceToString(s"tpcds/$name.sql", "UTF-8",
-        Thread.currentThread().getContextClassLoader)
+      val queryString = resourceToString(s"tpcds/$name.sql",
+        classLoader = Thread.currentThread().getContextClassLoader)

       // This is an indirect hack to estimate the size of each query's input by traversing the
       // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
@@ -102,7 +98,14 @@ object TPCDSQueryBenchmark {
     if (args.length < 1) {
Member:

Could we also allow another way to run this benchmark? We can hardcode the value of dataLocation and run it in IntelliJ directly.

Member:

@sarutak kindly ping

Member Author:

We can pass the argument through the run configuration even when we use an IDE like IntelliJ, right? Or, how about giving dataLocation through a new property?

Member (@gatorsmile, Sep 11, 2017):

@sarutak @maropu Could we do something like https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala?

Later, we can also add another argument for outputting the plans of the TPC-DS queries, instead of running the actual queries.

Member Author:

Good idea. I'll add TPCDSQueryBenchmarkArguments.
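An ApplicationMasterArguments-style parser for this benchmark could look roughly like the sketch below. This is a hypothetical illustration only: the option names (`--data-location`, `--print-plans`) and the `printPlans` flag are assumptions based on the discussion above, not the actual TPCDSQueryBenchmarkArguments class that was later added to Spark.

```scala
// Hypothetical sketch: option names and the printPlans flag are assumptions
// from the review discussion, not the class that was eventually merged.
class TPCDSQueryBenchmarkArguments(val args: Array[String]) {
  var dataLocation: String = null
  // Assumed flag for gatorsmile's idea of printing query plans
  // instead of running the actual queries.
  var printPlans: Boolean = false

  parseArgs(args.toList)

  private def parseArgs(inputArgs: List[String]): Unit = {
    var args = inputArgs
    while (args.nonEmpty) {
      args match {
        case "--data-location" :: value :: tail =>
          dataLocation = value
          args = tail
        case "--print-plans" :: tail =>
          printPlans = true
          args = tail
        case _ =>
          printUsageAndExit(1)
      }
    }
    if (dataLocation == null) {
      printUsageAndExit(-1)
    }
  }

  private def printUsageAndExit(exitCode: Int): Unit = {
    // scalastyle:off println
    System.err.println(
      "Usage: spark-submit --class <this class> <spark sql test jar> " +
        "--data-location <TPCDS data location> [--print-plans]")
    // scalastyle:on println
    System.exit(exitCode)
  }
}
```

Like ApplicationMasterArguments, the loop consumes the argument list by pattern matching, so new options can be added as extra `case` arms without touching the rest of the parser.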

       // scalastyle:off println
       println(
-        "Usage: spark-submit --class <this class> --jars <spark sql test jar> <data location>")
+        s"""
+          |Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
+          |
+          |In order to run this benchmark, please follow the instructions at
+          |https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
+          |locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
+          |dataLocation below needs to be set to the location where the generated data is stored.
+        """.stripMargin)
       // scalastyle:on println
       System.exit(1)
     }
@@ -120,10 +123,6 @@ object TPCDSQueryBenchmark {
       "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
       "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")

-    // In order to run this benchmark, please follow the instructions at
-    // https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
-    // locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
-    // dataLocation below needs to be set to the location where the generated data is stored.
     val dataLocation = args(0)

     tpcdsAll(dataLocation, queries = tpcdsQueries)