@@ -337,7 +337,7 @@ modify `config.yaml` to have your server configurations,
and build the application with `ss-a`, send the JAR to your cluster
with `ss-sy`, and start Spindle with `ss-st`.

- # Benchmarking Results
+ # Experimental Results
All experiments leverage a homogeneous six-node production cluster
of HP ProLiant DL360p Gen8 blades.
Each node has 32GB of DDR3 memory at 1333MHz,
@@ -354,6 +354,39 @@ in Parquet.
The YAML-formatted results, scripts, and resulting figures
are in the [benchmark-scripts][benchmark-scripts] directory.

+ ## Scaling HDFS and Spark workers.
+ Predicting the optimal resource allocation to minimize query latency for
+ distributed applications is difficult; no production software can accurately
+ predict the optimal number of Spark and HDFS nodes for a given application.
+ This experiment observes query execution time as the number of Spark
+ and HDFS workers is increased. We manually scale the workers and rebalance
+ the HDFS data.
+
+ The following figure shows the time to load all of the columns the queries
+ use for the week of data as the Spark and HDFS workers are scaled. The data is
+ loaded by caching the Spark RDDs and performing a null operation on them, such
+ as `rdd.cache.foreach{x => {}}` (a sketch of this pattern follows the figure).
+ The downward trend of the load times indicates that adding Spark or HDFS
+ workers decreases the time to load the data.
+
+ ![](https://raw.githubusercontent.com/adobe-research/spindle/master/benchmark-scripts/scaling/dataLoad.png)
+
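+ As a rough illustration of that load-and-force pattern, the sketch below loads
+ a few columns from Parquet, caches the resulting RDD, and times the null
+ operation. It is not Spindle's actual benchmark code: the Parquet path and
+ column names are placeholders, and it uses the current `SparkSession` API
+ rather than the Spark 1.x API Spindle was built against.
+
+ ```scala
+ import org.apache.spark.sql.SparkSession
+
+ object DataLoadTiming {
+   def main(args: Array[String]): Unit = {
+     val spark = SparkSession.builder().appName("data-load-timing").getOrCreate()
+
+     // Load only the columns the benchmark queries read, then cache the RDD.
+     // The path and column names below are placeholders.
+     val rows = spark.read
+       .parquet("hdfs:///spindle/clickstream-week")
+       .select("post_visid_high", "post_visid_low", "hit_time_gmt")
+       .rdd
+       .cache()
+
+     // The null operation forces the lazy load and fills the cache;
+     // timing it gives the data-load time plotted above.
+     val start = System.nanoTime()
+     rows.foreach { _ => () }
+     println(f"Data load took ${(System.nanoTime() - start) / 1e9}%.2f s")
+
+     spark.stop()
+   }
+ }
+ ```
+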
+ The following table shows the execution time of the queries over cached data
+ as the HDFS and Spark workers are scaled. The bold entries mark where adding a
+ Spark and HDFS worker hurts performance. Surprisingly, adding a single Spark
+ and HDFS worker commonly hurts query performance, and, interestingly, no query
+ reaches its minimum execution time when using all six workers. Our future work
+ is to tune Spark further to understand the performance degradation, which
+ might be caused by network traffic or imbalanced workloads.
+
+ Q2 and Q3 are similar queries and consequently show similar performance as the
+ Spark and HDFS workers are scaled, but there is an anomaly at three workers,
+ where Q2 executes in 17.10s and Q3 executes in 55.15s. Q6's execution time
+ increases by 10.67 seconds between three and six Spark and HDFS workers.
+
+ ![](https://github.com/adobe-research/spindle/raw/master/images/scaling-spark-hdfs.png)
+
## Intermediate data partitioning.
Spark cannot optimize the number of records in the partitions
because counting the number of records in the initial and
@@ -458,6 +491,11 @@ The slowdowns for two concurrent queries indicate further query optimizations
could better balance the work between all Spark workers and
likely result in better query execution time.

+ # Contributing and Development Status
+ Spindle is not currently under active development by Adobe.
+ However, we are happy to review and respond to issues,
+ questions, and pull requests.
+
# License
Bundled applications are copyright their respective owners.
[Twitter Bootstrap][bootstrap] and