Intel-bigdata · carsonwang · Mar 31, 2020 · Mar 19, 2020 · Mar 19, 2020 · Mar 23, 2020
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@
 ---
 ### OVERVIEW ###
 
-HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Flink, Storm and Gearpump.
+HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Repartition, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Flink, Storm and Gearpump.
 
 ### Getting Started ###
  * [Build HiBench](docs/build-hibench.md)
@@ -38,12 +38,16 @@ There are totally 19 workloads in HiBench. The workloads are divided into 6 cate
 3. TeraSort (terasort)
 
     TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by Hadoop TeraGen example program.
+
+4. Repartition (micro/repartition)
+
+    This workload benchmarks shuffle performance. Input data is generated by Hadoop TeraGen. It is firstly cached in memory by default, then shuffle write and read in order to repartition. The last 2 stages solely reflects shuffle's performance, excluding I/O and other compute. Note: The parameter hibench.repartition.cacheinmemory(default is true) is provided, to allow reading from storage in the 1st stage without caching
 
-4. Sleep (sleep)
+5. Sleep (sleep)
 
     This workload sleep an amount of seconds in each task to test framework scheduler.
 
-5. enhanced DFSIO (dfsioe)
+6. enhanced DFSIO (dfsioe)
 
     Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of HDFS cluster. Note: this benchmark doesn't have Spark corresponding implementation.
 
@@ -124,9 +128,9 @@ There are totally 19 workloads in HiBench. The workloads are divided into 6 cate
 
     This workload reads input data from Kafka and then writes result to Kafka immediately, there is no complex business logic involved.
 
-2. Repartition (repartition)
+2. Repartition (streaming/repartition)
 
-    This workload reads input data from Kafka and changes the level of parallelism by creating more or fewer partitionstests. It tests the efficiency of data shuffle in the streaming frameworks.
+    This workload reads input data from Kafka and changes the level of parallelism by creating more or fewer partitions. It tests the efficiency of data shuffle in the streaming frameworks.
 
 3. Stateful Wordcount (wordcount)
 

diff --git a/bin/functions/hibench_prop_env_mapping.py b/bin/functions/hibench_prop_env_mapping.py
@@ -67,8 +67,10 @@
     MAP_SLEEP_TIME="hibench.sleep.mapper.seconds",
     RED_SLEEP_TIME="hibench.sleep.reducer.seconds",
     HADOOP_SLEEP_JAR="hibench.sleep.job.jar",
-    # For Sort, Terasort, Wordcount
+    # For Sort, Terasort, Wordcount, Repartition
     DATASIZE="hibench.workload.datasize",
+    # For repartition
+    CACHE_IN_MEMORY="hibench.repartition.cacheinmemory",
 
     # For hive related workload, data scale
     PAGES="hibench.workload.pages",

diff --git a/bin/workloads/micro/repartition/prepare/prepare.sh b/bin/workloads/micro/repartition/prepare/prepare.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+current_dir=`dirname "$0"`
+current_dir=`cd "$current_dir"; pwd`
+root_dir=${current_dir}/../../../../..
+workload_config=${root_dir}/conf/workloads/micro/repartition.conf
+. "${root_dir}/bin/functions/load_bench_config.sh"
+
+enter_bench HadoopPrepareRepartition ${workload_config} ${current_dir}
+show_bannar start
+
+rmr_hdfs $INPUT_HDFS || true
+START_TIME=`timestamp`
+run_hadoop_job ${HADOOP_EXAMPLES_JAR} teragen \
+    -D mapreduce.job.maps=${NUM_MAPS} \
+    -D mapreduce.job.reduces=${NUM_REDS} \
+    ${DATASIZE} ${INPUT_HDFS}
+END_TIME=`timestamp`
+
+show_bannar finish
+leave_bench
diff --git a/bin/workloads/micro/repartition/spark/run.sh b/bin/workloads/micro/repartition/spark/run.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+current_dir=`dirname "$0"`
+current_dir=`cd "$current_dir"; pwd`
+root_dir=${current_dir}/../../../../..
+workload_config=${root_dir}/conf/workloads/micro/repartition.conf
+. "${root_dir}/bin/functions/load_bench_config.sh"
+
+enter_bench ScalaRepartition ${workload_config} ${current_dir}
+show_bannar start
+
+rmr_hdfs $OUTPUT_HDFS || true
+
+SIZE=`dir_size $INPUT_HDFS`
+START_TIME=`timestamp`
+run_spark_job com.intel.hibench.sparkbench.micro.ScalaRepartition $INPUT_HDFS $OUTPUT_HDFS $CACHE_IN_MEMORY
+END_TIME=`timestamp`
+
+gen_report ${START_TIME} ${END_TIME} ${SIZE}
+show_bannar finish
+leave_bench
+
diff --git a/conf/benchmarks.lst b/conf/benchmarks.lst
@@ -2,6 +2,7 @@ micro.sleep
 micro.sort
 micro.terasort
 micro.wordcount
+micro.repartition
 micro.dfsioe
 
 sql.aggregation

diff --git a/conf/workloads/micro/repartition.conf b/conf/workloads/micro/repartition.conf
@@ -0,0 +1,15 @@
+#datagen
+hibench.repartition.tiny.datasize			100000
+hibench.repartition.small.datasize			10000000
+hibench.repartition.large.datasize			100000000
+hibench.repartition.huge.datasize			1000000000
+hibench.repartition.gigantic.datasize		10000000000
+hibench.repartition.bigdata.datasize		60000000000
+
+hibench.workload.datasize		${hibench.repartition.${hibench.scale.profile}.datasize}
+
+# export for shell script
+hibench.workload.input			${hibench.hdfs.data.dir}/Repartition/Input
+hibench.workload.output			${hibench.hdfs.data.dir}/Repartition/Output
+
+hibench.repartition.cacheinmemory           true
diff --git a/conf/workloads/micro/terasort.conf b/conf/workloads/micro/terasort.conf
@@ -10,4 +10,4 @@ hibench.workload.datasize		${hibench.terasort.${hibench.scale.profile}.datasize}
 
 # export for shell script
 hibench.workload.input			${hibench.hdfs.data.dir}/Terasort/Input
-hibench.workload.output			${hibench.hdfs.data.dir}/Terasort/Output
+hibench.workload.output			${hibench.hdfs.data.dir}/Terasort/Output
diff --git a/sparkbench/micro/src/main/scala/com/intel/sparkbench/micro/ScalaRepartition.scala b/sparkbench/micro/src/main/scala/com/intel/sparkbench/micro/ScalaRepartition.scala
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.intel.hibench.sparkbench.micro
+
+import java.util.Random
+
+import com.intel.hibench.sparkbench.common.IOCommon
+import org.apache.hadoop.examples.terasort.TeraInputFormat
+import org.apache.hadoop.io.Text
+import org.apache.spark._
+import org.apache.spark.rdd.{CoalescedRDD, RDD, ShuffledRDD}
+import org.apache.spark.storage.StorageLevel
+
+object ScalaRepartition {
+
+  def main(args: Array[String]) {
+    if (args.length != 3) {
+      System.err.println(
+        s"Usage: $ScalaRepartition <INPUT_HDFS> <OUTPUT_HDFS> <CACHE_IN_MEMORY>"
+      )
+      System.exit(1)
+    }
+    val sparkConf = new SparkConf().setAppName("ScalaRepartition")
+    val sc = new SparkContext(sparkConf)
+
+    val data = sc.newAPIHadoopFile[Text, Text, TeraInputFormat](args(0)).map {
+      case (k,v) => k.copyBytes ++ v.copyBytes
+    }
+
+    if (args(2) == "true") {
+      data.persist(StorageLevel.MEMORY_ONLY)
+      data.count()
+    } else if (args(2) != "false") {
+      throw new IllegalArgumentException(
+        s"Unrecognizable parameter CACHE_IN_MEMORY: ${args(2)}, should be true or false")
+    }
+
+    val mapParallelism = sc.getConf.getInt("spark.default.parallelism", sc.defaultParallelism)
+    val reduceParallelism  = IOCommon.getProperty("hibench.default.shuffle.parallelism")
+      .getOrElse((mapParallelism / 2).toString).toInt
+
+    reparition(data, reduceParallelism).foreach(_ => {})
+
+    sc.stop()
+  }
+
+  // Save a CoalescedRDD than RDD.repartition API
+  private def reparition(previous: RDD[Array[Byte]], numReducers: Int): ShuffledRDD[Int, Array[Byte], Array[Byte]] = {
+    /** Distributes elements evenly across output partitions, starting from a random partition. */
+    val distributePartition = (index: Int, items: Iterator[Array[Byte]]) => {
+      var position = (new Random(index)).nextInt(numReducers)
+      items.map { t =>
+        // Note that the hash code of the key will just be the key itself. The HashPartitioner
+        // will mod it with the number of total partitions.
+        position = position + 1
+        (position, t)
+      }
+    } : Iterator[(Int, Array[Byte])]
+
+    // include a shuffle step so that our upstream tasks are still distributed
+    new ShuffledRDD[Int, Array[Byte], Array[Byte]](previous.mapPartitionsWithIndex(distributePartition),
+      new HashPartitioner(numReducers))
+  }
+
+}