Closed
Changes from 4 commits
@@ -17,8 +17,6 @@

package org.apache.spark.sql.execution.benchmark

import java.io.File

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
@@ -31,7 +29,7 @@ import org.apache.spark.util.Benchmark
/**
* Benchmark to measure TPCDS query performance.
* To run this:
* spark-submit --class <this class> --jars <spark sql test jar>
* spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
*/
object TPCDSQueryBenchmark {
val conf =
@@ -61,12 +59,10 @@ object TPCDSQueryBenchmark {
}

def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
require(dataLocation.nonEmpty,
"please modify the value of dataLocation to point to your local TPCDS data")
val tableSizes = setupTables(dataLocation)
queries.foreach { name =>
val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
Member:

Please drop import java.io.File.

Member Author:

Dropped.

.getResource(s"tpcds/$name.sql").getFile))
val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
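The new resourceToString call replaces the fileToString / java.io.File combination above. A minimal standalone sketch of what such a helper can look like (the real utility lives in Spark's SQL test code; this version only illustrates the idea and is not Spark's actual implementation):

```scala
import scala.io.Source

// Hypothetical sketch: load a classpath resource directly into a String,
// so call sites no longer need java.io.File at all.
def resourceToString(
    resource: String,
    classLoader: ClassLoader = Thread.currentThread().getContextClassLoader): String = {
  val in = classLoader.getResourceAsStream(resource)
  require(in != null, s"resource not found: $resource")
  try Source.fromInputStream(in, "UTF-8").mkString finally in.close()
}
```

Reading via getResourceAsStream also works when the SQL files are packaged inside the test jar, which a File-based path does not.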

// This is an indirect hack to estimate the size of each query's input by traversing the
// logical plan and adding up the sizes of all tables that appear in the plan. Note that this
@@ -99,6 +95,20 @@ }
}
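The "indirect hack" described in the comment above can be illustrated with a toy, Spark-free sketch: walk the plan tree, collect every table it references, and add up those tables' sizes. All names and types below are invented for this sketch; they are not Spark's actual logical-plan API.

```scala
// Simplified stand-in for a logical plan tree.
sealed trait Plan { def children: Seq[Plan] }
case class TableScan(name: String) extends Plan { def children: Seq[Plan] = Nil }
case class Join(left: Plan, right: Plan) extends Plan {
  def children: Seq[Plan] = Seq(left, right)
}

// Every table name appearing anywhere in the plan (duplicates kept, since
// the estimate sums the size of each appearance).
def referencedTables(plan: Plan): Seq[String] = plan match {
  case TableScan(name) => Seq(name)
  case other => other.children.flatMap(referencedTables)
}

def estimateInputSize(plan: Plan, tableSizes: Map[String, Long]): Long =
  referencedTables(plan).map(tableSizes).sum
```

This is only an estimate of per-query input size for the benchmark output, not a cost model: it ignores predicates and partition pruning, which is why the PR calls it a hack.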

def main(args: Array[String]): Unit = {
if (args.length < 1) {
Member:

Could we also allow another way to run this benchmark? We could hardcode the value of dataLocation and run it in IntelliJ directly.

Member:

@sarutak kindly ping

Member Author:

We can pass the argument through the run configuration even when using an IDE like IntelliJ, right? Or, how about giving dataLocation through a new property?

Member (@gatorsmile, Sep 11, 2017):

@sarutak @maropu Could we do something like https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala?

Later, we could also add another argument for outputting the plans of the TPC-DS queries instead of running the actual queries.

Member Author:

Good idea. I'll add TPCDSQueryBenchmarkArguments.
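A minimal standalone sketch of what a TPCDSQueryBenchmarkArguments class in the style of ApplicationMasterArguments could look like. The flag name (--data-location) and the error handling here are assumptions for illustration, not the final implementation:

```scala
// Hypothetical argument holder: parses the command line eagerly in the
// constructor and fails fast on anything it does not recognize.
class TPCDSQueryBenchmarkArguments(args: Array[String]) {
  var dataLocation: String = null

  parseArgs(args.toList)

  private def parseArgs(inputArgs: List[String]): Unit = {
    var rest = inputArgs
    while (rest.nonEmpty) {
      rest match {
        case "--data-location" :: value :: tail =>
          dataLocation = value
          rest = tail
        case unknown =>
          throw new IllegalArgumentException(
            s"Unsupported arguments: ${unknown.mkString(" ")}")
      }
    }
    require(dataLocation != null, "--data-location must be set")
  }
}
```

A dedicated class like this also leaves room for the extra plan-only flag suggested above, since new options just become additional match cases.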

// scalastyle:off
println(
s"""
|Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
|
|In order to run this benchmark, please follow the instructions at
|https://github.com/databricks/spark-sql-perf/blob/master/README.md
Member:

To be honest, I took a look at this page, and the instructions are not easy to understand. Maybe we need to resolve that part too.

Member:

I played around with the generator part of spark-sql-perf a bit, and that part is small relative to the rest of the package. Is there any plan to make a new repository under the databricks org that only generates the data, or something similar?

Member:

Thanks for working on it! I will try it this weekend.

Member:

Yeah, thanks!

|to generate the TPCDS data locally (preferably with a scale factor of 5 for benchmarking).
|Thereafter, the value of <TPCDS data location> needs to be set to the location where the generated data is stored.
""".stripMargin)
// scalastyle:on
System.exit(1)
}

// List of all TPC-DS queries
val tpcdsQueries = Seq(
@@ -113,11 +123,7 @@ object TPCDSQueryBenchmark {
"q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
"q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")

// In order to run this benchmark, please follow the instructions at
// https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
// locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
// dataLocation below needs to be set to the location where the generated data is stored.
val dataLocation = ""
val dataLocation = args(0)

tpcdsAll(dataLocation, queries = tpcdsQueries)
}