
Commit 0ff2c5f

Update the README in response to #1.
1 parent d8e50a2 commit 0ff2c5f

File tree

2 files changed: +39 -1 lines changed


README.md

Lines changed: 39 additions & 1 deletion
@@ -337,7 +337,7 @@ modify `config.yaml` to have your server configurations,
and build the application with `ss-a`, send the JAR to your cluster
with `ss-sy`, and start Spindle with `ss-st`.

-# Benchmarking Results
+# Experimental Results
All experiments leverage a homogeneous six-node production cluster
of HP ProLiant DL360p Gen8 blades.
Each node has 32GB of DDR3 memory at 1333MHz,
@@ -354,6 +354,39 @@ in Parquet.
The YAML-formatted results, scripts, and resulting figures
are in the [benchmark-scripts][benchmark-scripts] directory.

## Scaling HDFS and Spark workers.
Predicting the optimal resource allocation to minimize query latency for
distributed applications is difficult. No production software can accurately
predict the optimal number of Spark and HDFS nodes for a given application.
This experiment observes the execution time of queries as the number of Spark
and HDFS workers is increased. We manually scale and rebalance the HDFS data.

The following figure shows the time to load all columns used by the queries
for the week of data as the Spark and HDFS workers are scaled. The data is
loaded by caching the Spark RDDs and performing a null operation on them, such
as `rdd.cache.foreach{x => {}}`. The downward trend of the data load times
indicates that using more Spark or HDFS workers decreases the time to load
data.

![](https://raw.githubusercontent.com/adobe-research/spindle/master/benchmark-scripts/scaling/dataLoad.png)
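
To make the null-operation load concrete, here is a minimal, self-contained
Scala sketch of the caching technique described above. The master URL,
application name, and input path are hypothetical placeholders, not code or
configuration from the Spindle repository.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: force a dataset into the Spark cache by marking the RDD
// as cached and running a null operation over every record, so that every
// partition is actually computed and materialized in memory.
object CacheWarmup {
  def main(args: Array[String]): Unit = {
    // Hypothetical master URL and app name, for illustration only.
    val conf = new SparkConf()
      .setAppName("cache-warmup")
      .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)

    // Hypothetical HDFS path standing in for a week of analytics data.
    val rdd = sc.textFile("hdfs:///data/clickstream-week")

    // Null operation: does nothing per record, but triggers the full
    // computation and caches each partition as a side effect.
    rdd.cache.foreach { x => {} }

    sc.stop()
  }
}
```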

The following table shows the execution time of the queries
with cached data when scaling the HDFS and Spark workers.
The bold data indicates where adding a
Spark and HDFS worker hurts performance. The surprising results show that
adding a single Spark or HDFS worker commonly hurts query performance, and,
interestingly, no query achieves its minimum execution time when using all 6
workers. Our future work is to further experiment by tuning Spark to understand
the performance degradation, which might be caused by network traffic or
imbalanced workloads.

Q2 and Q3 are similar queries and consequently have similar performance when
scaling the Spark and HDFS workers, but there is an anomaly when using 3
workers, where Q2 executes in 17.10s and Q3 executes in 55.15s. Q6’s execution
time increases by 10.67 seconds between three and six Spark and HDFS workers.

![](https://github.com/adobe-research/spindle/raw/master/images/scaling-spark-hdfs.png)

## Intermediate data partitioning.
Spark cannot optimize the number of records in the partitions
because counting the number of records in the initial and
@@ -458,6 +491,11 @@ The slowdowns for two concurrent queries indicate further query optimizations
could better balance the work between all Spark workers and
likely result in better query execution time.

# Contributing and Development Status
Spindle is not currently under active development by Adobe.
However, we are happy to review and respond to issues,
questions, and pull requests.

# License
Bundled applications are copyright their respective owners.
[Twitter Bootstrap][bootstrap] and

images/scaling-spark-hdfs.png

68.9 KB
