Commit f2f8b17

Merge pull request mvillarrealb#36 from mvillarrealb/feature/spark-version
Feature/spark version merged
2 parents 77729ef + 4ffe666 commit f2f8b17

30 files changed: +526 −278 lines

.gitignore

Lines changed: 2 additions & 0 deletions
```
# Ignore data files
*.csv
```

Dockerfile

Lines changed: 45 additions & 0 deletions
```dockerfile
FROM openjdk:11.0.11-jre-slim-buster as builder

# Add Dependencies for PySpark
RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy

RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1

# Fix the value of PYTHONHASHSEED
# Note: this is needed when you use Python 3.3 or greater
ENV SPARK_VERSION=3.0.2 \
    HADOOP_VERSION=3.2 \
    SPARK_HOME=/opt/spark \
    PYTHONHASHSEED=1

# Download the Spark distribution and unpack it into /opt/spark
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    && mkdir -p /opt/spark \
    && tar -xf apache-spark.tgz -C /opt/spark --strip-components=1 \
    && rm apache-spark.tgz


FROM builder as apache-spark

WORKDIR /opt/spark

ENV SPARK_MASTER_PORT=7077 \
    SPARK_MASTER_WEBUI_PORT=8080 \
    SPARK_LOG_DIR=/opt/spark/logs \
    SPARK_MASTER_LOG=/opt/spark/logs/spark-master.out \
    SPARK_WORKER_LOG=/opt/spark/logs/spark-worker.out \
    SPARK_WORKER_WEBUI_PORT=8080 \
    SPARK_WORKER_PORT=7000 \
    SPARK_MASTER="spark://spark-master:7077" \
    SPARK_WORKLOAD="master"

EXPOSE 8080 7077 7000

# Create the log directory and redirect master/worker logs to stdout
RUN mkdir -p $SPARK_LOG_DIR && \
    touch $SPARK_MASTER_LOG && \
    touch $SPARK_WORKER_LOG && \
    ln -sf /dev/stdout $SPARK_MASTER_LOG && \
    ln -sf /dev/stdout $SPARK_WORKER_LOG

COPY start-spark.sh /

CMD ["/bin/bash", "/start-spark.sh"]
```
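Not part of the commit, but a quick way to sanity-check an image built from this Dockerfile is a tiny local-mode PySpark job submitted through the bundled `spark-submit`. The file name `smoke_test.py`, the mount path, and the image tag `cluster-apache-spark:3.0.2` (taken from the README's build command) are only illustrative:

```python
# smoke_test.py -- hypothetical helper, not included in the repository.
# Run it inside a container built from this image, for example:
#   docker run --rm -v "$PWD":/tmp cluster-apache-spark:3.0.2 \
#     /opt/spark/bin/spark-submit --master local[1] /tmp/smoke_test.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-smoke-test").getOrCreate()

# Build a trivial DataFrame and run an action to confirm Spark and Python work together.
df = spark.createDataFrame([(1, "ok"), (2, "ok")], ["id", "status"])
print("row count:", df.count())         # expected: 2
print("spark version:", spark.version)  # expected: 3.0.2, per the ENV above

spark.stop()
```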

README.md

Lines changed: 63 additions & 113 deletions
````diff
@@ -1,17 +1,17 @@
-# Spark Cluster with Docker & docker-compose
+# Spark Cluster with Docker & docker-compose (2021 ver.)
 
 # General
 
 A simple Spark standalone cluster for your testing environment purposes. You are just a *docker-compose up* away from your Spark development environment.
 
 The Docker compose will create the following containers:
 
-container|Ip address
+container|Exposed ports
 ---|---
-spark-master|10.5.0.2
-spark-worker-1|10.5.0.3
-spark-worker-2|10.5.0.4
-spark-worker-3|10.5.0.5
+spark-master|9090 7077
+spark-worker-1|9091
+spark-worker-2|9092
+demo-database|5432
 
 # Installation
 
````
````diff
@@ -23,35 +23,19 @@ The following steps will make you run your spark cluster's containers.
 
 * Docker compose installed
 
-* A spark Application Jar to play with(Optional)
+## Build the image
 
-## Build the images
-
-The first step to deploy the cluster will be the build of the custom images, these builds can be performed with the *build-images.sh* script.
-
-The executions is as simple as the following steps:
 
 ```sh
-chmod +x build-images.sh
-./build-images.sh
+docker build -t cluster-apache-spark:3.0.2 .
 ```
 
-This will create the following docker images:
-
-* spark-base:2.3.1: A base image based on java:alpine-jdk-8 wich ships scala, python3 and spark 2.3.1
-
-* spark-master:2.3.1: A image based on the previously created spark image, used to create a spark master containers.
-
-* spark-worker:2.3.1: A image based on the previously created spark image, used to create spark worker containers.
-
-* spark-submit:2.3.1: A image based on the previously created spark image, used to create spark submit containers(run, deliver driver and die gracefully).
-
 ## Run the docker-compose
 
 The final step to create your test cluster will be to run the compose file:
 
 ```sh
-docker-compose up --scale spark-worker=3
+docker-compose up -d
 ```
 
 ## Validate your cluster
````
````diff
@@ -60,27 +44,22 @@ Just validate your cluster accessing the Spark UI on each worker & master URL.
 
 ### Spark Master
 
-http://10.5.0.2:8080/
+http://localhost:9090/
 
 ![alt text](docs/spark-master.png "Spark master UI")
 
 ### Spark Worker 1
 
-http://10.5.0.3:8081/
+http://localhost:9091/
 
 ![alt text](docs/spark-worker-1.png "Spark worker 1 UI")
 
 ### Spark Worker 2
 
-http://10.5.0.4:8081/
+http://localhost:9092/
 
 ![alt text](docs/spark-worker-2.png "Spark worker 2 UI")
 
-### Spark Worker 3
-
-http://10.5.0.5:8081/
-
-![alt text](docs/spark-worker-3.png "Spark worker 3 UI")
 
 # Resource Allocation
 
````
````diff
@@ -102,130 +81,101 @@ To make app running easier I've shipped two volume mounts described in the following chart:
 
 Host Mount|Container Mount|Purpose
 ---|---|---
-/mnt/spark-apps|/opt/spark-apps|Used to make available your app's jars on all workers & master
-/mnt/spark-data|/opt/spark-data| Used to make available your app's data on all workers & master
+apps|/opt/spark-apps|Used to make your app's jars available on all workers & the master
+data|/opt/spark-data|Used to make your app's data available on all workers & the master
 
 This is basically a dummy DFS created from docker Volumes...(maybe not...)
 
-# Run a sample application
-
-Now let`s make a **wild spark submit** to validate the distributed nature of our new toy following these steps:
-
-## Create a Scala spark app
+# Run Sample Applications
 
-The first thing you need to do is to make a spark application. Our spark-submit image is designed to run scala code (soon will ship pyspark support guess I was just lazy to do so..).
 
-In my case I am using an app called [crimes-app](https://). You can make or use your own scala app, I 've just used this one because I had it at hand.
+## NY Bus Stops Data [Pyspark]
 
+This program loads archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and applies basic filters using Spark SQL; the results are persisted into a PostgreSQL table.
 
-## Ship your jar & dependencies on the Workers and Master
+The loaded table will contain the following structure:
 
-A necesary step to make a **spark-submit** is to copy your application bundle into all workers, also any configuration file or input file you need.
+latitude|longitude|time_received|vehicle_id|distance_along_trip|inferred_direction_id|inferred_phase|inferred_route_id|inferred_trip_id|next_scheduled_stop_distance|next_scheduled_stop_id|report_hour|report_date
+---|---|---|---|---|---|---|---|---|---|---|---|---
+40.668602|-73.986697|2014-08-01 04:00:01|469|4135.34710710144|1|IN_PROGRESS|MTA NYCT_B63|MTA NYCT_JG_C4-Weekday-141500_B63_123|2.63183804205619|MTA_305423|2014-08-01 04:00:00|2014-08-01
 
-Luckily for us we are using docker volumes so, you just have to copy your app and configs into /mnt/spark-apps, and your input files into /mnt/spark-files.
+To submit the app, connect to one of the workers or the master and execute:
 
-```bash
-#Copy spark application into all workers's app folder
-cp /home/workspace/crimes-app/build/libs/crimes-app.jar /mnt/spark-apps
-
-#Copy spark application configs into all workers's app folder
-cp -r /home/workspace/crimes-app/config /mnt/spark-apps
-
-# Copy the file to be processed to all workers's data folder
-cp /home/Crimes_-_2001_to_present.csv /mnt/spark-files
+```sh
+/opt/spark/bin/spark-submit --master spark://spark-master:7077 \
+--jars /opt/spark-apps/postgresql-42.2.22.jar \
+--driver-memory 1G \
+--executor-memory 1G \
+/opt/spark-apps/main.py
 ```
 
-## Check the successful copy of the data and app jar (Optional)
+![alt text](./articles/images/pyspark-demo.png "Spark UI with pyspark program running")
 
-This is not a necessary step, just if you are curious you can check if your app code and files are in place before running the spark-submit.
+## MTA Bus Analytics [Scala]
 
-```sh
-# Worker 1 Validations
-docker exec -ti spark-worker-1 ls -l /opt/spark-apps
+This program takes the archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and runs some aggregations on it; the calculated results are persisted into PostgreSQL tables.
 
-docker exec -ti spark-worker-1 ls -l /opt/spark-data
+Each persisted table corresponds to a particular aggregation:
 
-# Worker 2 Validations
-docker exec -ti spark-worker-2 ls -l /opt/spark-apps
+Table|Aggregation
+---|---
+day_summary|A summary of vehicles reporting, stops visited, average speed and distance traveled (all vehicles)
+speed_excesses|Speed excesses calculated in a 5-minute window
+average_speed|Average speed by vehicle
+distance_traveled|Total distance traveled by vehicle
 
-docker exec -ti spark-worker-2 ls -l /opt/spark-data
 
-# Worker 3 Validations
-docker exec -ti spark-worker-3 ls -l /opt/spark-apps
+To submit the app, connect to one of the workers or the master and execute:
 
-docker exec -ti spark-worker-3 ls -l /opt/spark-data
+```sh
+/opt/spark/bin/spark-submit --deploy-mode cluster \
+--master spark://spark-master:7077 \
+--total-executor-cores 1 \
+--class mta.processing.MTAStatisticsApp \
+--driver-memory 1G \
+--executor-memory 1G \
+--jars /opt/spark-apps/postgresql-42.2.22.jar \
+--conf spark.driver.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
+--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
+/opt/spark-apps/mta-processing.jar
 ```
-After running one of this commands you have to see your app's jar and files.
-
-
-## Use docker spark-submit
-
-```bash
-#Creating some variables to make the docker run command more readable
-#App jar environment used by the spark-submit image
-SPARK_APPLICATION_JAR_LOCATION="/opt/spark-apps/crimes-app.jar"
-#App main class environment used by the spark-submit image
-SPARK_APPLICATION_MAIN_CLASS="org.mvb.applications.CrimesApp"
-#Extra submit args used by the spark-submit image
-SPARK_SUBMIT_ARGS="--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/dev/config.conf'"
-
-#We have to use the same network as the spark cluster(internally the image resolves spark master as spark://spark-master:7077)
-docker run --network docker-spark-cluster_spark-network \
--v /mnt/spark-apps:/opt/spark-apps \
---env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION \
---env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS \
-spark-submit:2.3.1
 
-```
+You will notice on the Spark UI a driver program and an executor program running (in Scala we can use deploy-mode cluster).
 
-After running this you will see an output pretty much like this:
-
-```bash
-Running Spark using the REST application submission protocol.
-2018-09-23 15:17:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://spark-master:6066.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submission successfully created as driver-20180923151753-0000. Polling submission state...
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180923151753-0000 in spark://spark-master:6066.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - State of driver driver-20180923151753-0000 is now RUNNING.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Driver is running on worker worker-20180923151711-10.5.0.4-45381 at 10.5.0.4:45381.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
-{
-  "action" : "CreateSubmissionResponse",
-  "message" : "Driver successfully submitted as driver-20180923151753-0000",
-  "serverSparkVersion" : "2.3.1",
-  "submissionId" : "driver-20180923151753-0000",
-  "success" : true
-}
-```
+![alt text](./articles/images/stats-app.png "Spark UI with scala program running")
 
-# Summary (What have I done :O?)
 
-* We compiled the necessary docker images to run spark master and worker containers.
+# Summary
 
-* We created a spark standalone cluster using 3 worker nodes and 1 master node using docker && docker-compose.
+* We compiled the necessary docker image to run Spark master and worker containers.
 
-* Copied the resources necessary to run a sample application.
+* We created a Spark standalone cluster with 2 worker nodes and 1 master node using docker && docker-compose.
 
-* Submitted an application to the cluster using a **spark-submit** docker image.
+* Copied the resources necessary to run the demo applications.
 
 * We ran a distributed application at home (you just need enough CPU cores and RAM to do so).
 
 # Why a standalone cluster?
 
 * This is intended to be used for test purposes, basically a way of running distributed Spark apps on your laptop or desktop.
 
-* Right now I don't have enough resources to make a Yarn, Mesos or Kubernetes based cluster :(.
-
 * This will be useful to use CI/CD pipelines for your Spark apps (a really difficult and hot topic).
 
 # Steps to connect and use a pyspark shell interactively
 
 * Follow the steps to run the docker-compose file. You can scale this down if needed to 1 worker.
 
-```bash
+```sh
 docker-compose up --scale spark-worker=1
 docker exec -it docker-spark-cluster_spark-worker_1 bash
 apt update
 apt install python3-pip
 pip3 install pyspark
 pyspark
 ```
+
+# What's left to do?
+
+* Right now, to run applications in deploy-mode cluster it is necessary to specify an arbitrary driver port.
+
+* The spark-submit entry in start-spark.sh is unimplemented; the submits used in the demos can be triggered from any worker.
````
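The Scala analytics app referenced in the README diff above ships only as the binary `mta-processing.jar`, so its aggregation code is not visible in this commit. Purely as an illustration of the kind of per-vehicle aggregation the `distance_traveled` table describes, here is a PySpark sketch over the same TSV input; the grouping columns and the use of `distance_along_trip` are assumptions, not the actual implementation:

```python
# Illustrative sketch only -- the real aggregations live in the (binary) Scala app and may differ.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mta-aggregation-sketch").getOrCreate()

# Same archived MTA feed used by the demos (tab-separated, with a header row).
df = (spark.read
      .option("header", "true")
      .option("sep", "\t")
      .option("inferSchema", "true")
      .csv("/opt/spark-data/MTA_2014_08_01.csv"))

# One plausible reading of "total distance traveled by vehicle": the spread of
# distance_along_trip within each trip, summed per vehicle.
distance_traveled = (df
    .groupBy("vehicle_id", "inferred_trip_id")
    .agg((F.max("distance_along_trip") - F.min("distance_along_trip")).alias("trip_distance"))
    .groupBy("vehicle_id")
    .agg(F.sum("trip_distance").alias("total_distance")))

distance_traveled.show(10, truncate=False)
```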

apps/main.py

Lines changed: 35 additions & 0 deletions
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format


def init_spark():
    # Build a SparkSession with the PostgreSQL JDBC driver on the classpath.
    sql = SparkSession.builder \
        .appName("trip-app") \
        .config("spark.jars", "/opt/spark-apps/postgresql-42.2.22.jar") \
        .getOrCreate()
    sc = sql.sparkContext
    return sql, sc


def main():
    url = "jdbc:postgresql://demo-database:5432/mta_data"
    properties = {
        "user": "postgres",
        "password": "casa1234",
        "driver": "org.postgresql.Driver",
    }
    file = "/opt/spark-data/MTA_2014_08_01.csv"
    sql, sc = init_spark()

    # Load the tab-separated MTA feed and derive report_hour / report_date columns.
    df = sql.read.load(file, format="csv", inferSchema="true", sep="\t", header="true") \
        .withColumn("report_hour", date_format(col("time_received"), "yyyy-MM-dd HH:00:00")) \
        .withColumn("report_date", date_format(col("time_received"), "yyyy-MM-dd"))

    # Filter invalid coordinates and append the result to the mta_reports table.
    # DataFrameWriter.jdbc() performs the write itself, so no separate save() call is needed.
    df.where("latitude <= 90 AND latitude >= -90 AND longitude <= 180 AND longitude >= -180") \
        .where("latitude != 0.000000 OR longitude != 0.000000") \
        .write \
        .jdbc(url=url, table="mta_reports", mode='append', properties=properties)


if __name__ == '__main__':
    main()
```
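A simple way to confirm that `main.py` actually landed rows in PostgreSQL is to read the table back over the same JDBC connection. The script below is a hypothetical helper (not part of the commit) that reuses the connection details shown above; submit it the same way as `main.py` so the driver jar is on the classpath:

```python
# check_load.py -- hypothetical verification helper, not included in the commit.
# Submit with:
#   /opt/spark/bin/spark-submit --master spark://spark-master:7077 \
#     --jars /opt/spark-apps/postgresql-42.2.22.jar /opt/spark-apps/check_load.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mta-load-check")
         .config("spark.jars", "/opt/spark-apps/postgresql-42.2.22.jar")
         .getOrCreate())

# Read back the table written by main.py using the same credentials.
df = spark.read.jdbc(
    url="jdbc:postgresql://demo-database:5432/mta_data",
    table="mta_reports",
    properties={"user": "postgres", "password": "casa1234", "driver": "org.postgresql.Driver"},
)

print("rows loaded:", df.count())
df.select("vehicle_id", "time_received", "report_date").show(5, truncate=False)

spark.stop()
```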

apps/mta-processing.jar

286 KB (binary file not shown)

apps/mta.conf

Lines changed: 16 additions & 0 deletions
```hocon
app {
  input {
    file="/opt/spark-data/MTA_2014_08_01.csv"
    options {
      header=true
      delimiter="\t"
      nullValue="null"
    }
  }

  spark {
    conf {
      "spark.driver.port": "50243"
    }
  }
}
```
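The Scala job receives this file through `-Dconfig-path=/opt/spark-apps/mta.conf` (see the spark-submit command in the README diff above), presumably parsing it with a HOCON library. For quick inspection from Python you could parse the same file with the third-party `pyhocon` package; this is only a sketch, and `pyhocon` is not part of the image:

```python
# Requires: pip install pyhocon (not installed in the image by default).
from pyhocon import ConfigFactory

conf = ConfigFactory.parse_file("apps/mta.conf")

print(conf.get_string("app.input.file"))                # /opt/spark-data/MTA_2014_08_01.csv
print(conf.get_bool("app.input.options.header"))        # True
print(conf.get_string("app.input.options.delimiter"))   # tab character
```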

apps/postgresql-42.2.22.jar

982 KB (binary file not shown)
