# Spark Cluster with Docker & docker-compose (2021 ver.)

# General
A simple Spark standalone cluster for your testing environment purposes. Your Spark development environment is just a *docker-compose up* away.

The docker-compose file will create the following containers:

container|Exposed ports
---|---
spark-master|9090, 7077
spark-worker-1|9091
spark-worker-2|9092
demo-database|5432
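
Once the cluster is up you can double-check the published ports with plain docker commands. A small sketch, assuming the container names match the table above:

```sh
# List the host port mappings for each container
docker port spark-master
docker port spark-worker-1
docker port spark-worker-2
docker port demo-database
```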
# Installation
The following steps will get your Spark cluster's containers up and running.

* Docker compose installed
## Build the image
```sh
docker build -t cluster-apache-spark:3.0.2 .
```
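
If you want to confirm the image was built with the expected tag, you can list it, for example:

```sh
# Should show the image tagged in the build command above
docker image ls cluster-apache-spark:3.0.2
```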
## Run the docker-compose
The final step to create your test cluster will be to run the compose file:
```sh
docker-compose up -d
```
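
A quick sanity check after bringing the cluster up (plain docker-compose commands; the service name below is assumed to be spark-master):

```sh
# All services should be in the "Up" state
docker-compose ps

# Follow the master logs if something looks off
docker-compose logs -f spark-master
```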
## Validate your cluster
Just validate your cluster by accessing the Spark UI on each worker & master URL.
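
Assuming the exposed ports from the table above are published on localhost, the web UIs should be reachable as sketched below (adjust if your compose file maps them differently):

```sh
curl -sI http://localhost:9090 | head -n 1   # Spark master web UI
curl -sI http://localhost:9091 | head -n 1   # spark-worker-1 web UI
curl -sI http://localhost:9092 | head -n 1   # spark-worker-2 web UI
```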

To make running apps easier I've shipped two volume mounts, described in the following chart:

Host Mount|Container Mount|Purpose
---|---|---
apps|/opt/spark-apps|Used to make your app's jars available on all workers & the master
data|/opt/spark-data|Used to make your app's data available on all workers & the master

This is basically a dummy DFS created from docker Volumes...(maybe not...)
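
For example, to stage an application jar and an input file on the cluster (the file names here are purely illustrative):

```sh
# "apps" and "data" are the host folders from the table above
cp target/my-spark-app.jar apps/
cp my-input-file.csv data/
```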
# Run sample applications
## NY Bus Stops Data [Pyspark]
This program just loads archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html), applies basic filters using Spark SQL, and persists the results into a PostgreSQL table.

The loaded table will contain the following structure:

To submit the app, connect to one of the workers or the master and execute:
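
A rough sketch of such a submit, run from the master container; the script name, Spark installation path and resource options are illustrative, not taken verbatim from this project:

```sh
# Submit the PySpark job to the standalone master (paths and names are placeholders)
docker exec -it spark-master \
  /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --driver-memory 1G \
  --executor-memory 1G \
  /opt/spark-apps/ny_bus_stops.py
```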
## MTA Bus Analytics [Scala]
This program takes the archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and performs some aggregations on it; the calculated results are persisted into PostgreSQL tables.

Each persisted table corresponds to a particular aggregation:

Table|Aggregation
---|---
day_summary|A summary of vehicles reporting, stops visited, average speed and distance traveled (all vehicles)
speed_excesses|Speed excesses calculated in a 5-minute window
average_speed|Average speed by vehicle
distance_traveled|Total distance traveled by vehicle

To submit the app, connect to one of the workers or the master and execute:
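
Again only a sketch; the jar name, main class and Spark installation path are placeholders:

```sh
# Submit the Scala job in cluster deploy mode (jar name and class are placeholders)
docker exec -it spark-master \
  /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --class com.example.MtaBusAnalytics \
  --driver-memory 1G \
  --executor-memory 1G \
  /opt/spark-apps/mta-bus-analytics.jar
```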

You will notice on the Spark UI a driver program and an executor program running (in Scala we can use deploy-mode cluster).
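
Once the job finishes you can peek at the persisted tables in the demo database. The user and database name below are placeholders, so check the compose file for the real credentials:

```sh
# Query one of the result tables (user and database are placeholders)
docker exec -it demo-database \
  psql -U postgres -d mta_data -c "SELECT * FROM day_summary LIMIT 10;"
```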
# Summary
* We compiled the necessary docker image to run spark master and worker containers.
* We created a spark standalone cluster with 2 worker nodes and 1 master node using docker && docker-compose.
* Copied the resources necessary to run the demo applications.
* We ran a distributed application at home (you just need enough CPU cores and RAM to do so).

# Why a standalone cluster?
* This is intended to be used for test purposes, basically a way of running distributed spark apps on your laptop or desktop.
* This will be useful for building CI/CD pipelines for your spark apps (a really difficult and hot topic).

# Steps to connect and use a pyspark shell interactively
* Follow the steps to run the docker-compose file; you can scale down to 1 worker if needed (see the sketch below).
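
A minimal sketch of opening the shell once the cluster is running; the Spark installation path inside the image is an assumption:

```sh
# Start an interactive pyspark shell on the master container, attached to the cluster
docker exec -it spark-master \
  /opt/spark/bin/pyspark --master spark://spark-master:7077
```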