Commit f2f8b17

Merge pull request mvillarrealb#36 from mvillarrealb/feature/spark-version
Feature/spark version merged
2 parents 77729ef + 4ffe666 commit f2f8b17

30 files changed: +526 −278 lines

.gitignore

Lines changed: 2 additions & 0 deletions
```
# Ignore data files
*.csv
```

Dockerfile

Lines changed: 45 additions & 0 deletions
```dockerfile
FROM openjdk:11.0.11-jre-slim-buster as builder

# Add Dependencies for PySpark
RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy

RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1

# Fix the value of PYTHONHASHSEED
# Note: this is needed when you use Python 3.3 or greater
ENV SPARK_VERSION=3.0.2 \
    HADOOP_VERSION=3.2 \
    SPARK_HOME=/opt/spark \
    PYTHONHASHSEED=1

# Download the Spark distribution and unpack it into /opt/spark
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    && mkdir -p /opt/spark \
    && tar -xf apache-spark.tgz -C /opt/spark --strip-components=1 \
    && rm apache-spark.tgz


FROM builder as apache-spark

WORKDIR /opt/spark

ENV SPARK_MASTER_PORT=7077 \
    SPARK_MASTER_WEBUI_PORT=8080 \
    SPARK_LOG_DIR=/opt/spark/logs \
    SPARK_MASTER_LOG=/opt/spark/logs/spark-master.out \
    SPARK_WORKER_LOG=/opt/spark/logs/spark-worker.out \
    SPARK_WORKER_WEBUI_PORT=8080 \
    SPARK_WORKER_PORT=7000 \
    SPARK_MASTER="spark://spark-master:7077" \
    SPARK_WORKLOAD="master"

EXPOSE 8080 7077 7000

# Create the log directory and redirect master/worker logs to stdout
RUN mkdir -p $SPARK_LOG_DIR && \
    touch $SPARK_MASTER_LOG && \
    touch $SPARK_WORKER_LOG && \
    ln -sf /dev/stdout $SPARK_MASTER_LOG && \
    ln -sf /dev/stdout $SPARK_WORKER_LOG

COPY start-spark.sh /

CMD ["/bin/bash", "/start-spark.sh"]
```
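Not part of the commit, but a quick way to sanity-check an image built from this Dockerfile is a tiny local-mode PySpark job submitted through the bundled `spark-submit`. The file name `smoke_test.py`, the mount path, and the image tag `cluster-apache-spark:3.0.2` (taken from the README's build command) are only illustrative:

```python
# smoke_test.py -- hypothetical helper, not included in the repository.
# Run it inside a container built from this image, for example:
#   docker run --rm -v "$PWD":/tmp cluster-apache-spark:3.0.2 \
#     /opt/spark/bin/spark-submit --master local[1] /tmp/smoke_test.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-smoke-test").getOrCreate()

# Build a trivial DataFrame and run an action to confirm Spark and Python work together.
df = spark.createDataFrame([(1, "ok"), (2, "ok")], ["id", "status"])
print("row count:", df.count())         # expected: 2
print("spark version:", spark.version)  # expected: 3.0.2, per the ENV above

spark.stop()
```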

README.md

Lines changed: 63 additions & 113 deletions
````diff
@@ -1,17 +1,17 @@
-# Spark Cluster with Docker & docker-compose
+# Spark Cluster with Docker & docker-compose (2021 ver.)
 
 # General
 
 A simple Spark standalone cluster for your testing environment purposes. You are just a *docker-compose up* away from your Spark development environment.
 
 The Docker compose will create the following containers:
 
-container|Ip address
+container|Exposed ports
 ---|---
-spark-master|10.5.0.2
-spark-worker-1|10.5.0.3
-spark-worker-2|10.5.0.4
-spark-worker-3|10.5.0.5
+spark-master|9090 7077
+spark-worker-1|9091
+spark-worker-2|9092
+demo-database|5432
 
 # Installation
 
````
````diff
@@ -23,35 +23,19 @@ The following steps will make you run your spark cluster's containers.
 
 * Docker compose installed
 
-* A spark Application Jar to play with(Optional)
+## Build the image
 
-## Build the images
-
-The first step to deploy the cluster will be the build of the custom images, these builds can be performed with the *build-images.sh* script.
-
-The executions is as simple as the following steps:
 
 ```sh
-chmod +x build-images.sh
-./build-images.sh
+docker build -t cluster-apache-spark:3.0.2 .
 ```
 
-This will create the following docker images:
-
-* spark-base:2.3.1: A base image based on java:alpine-jdk-8 wich ships scala, python3 and spark 2.3.1
-
-* spark-master:2.3.1: A image based on the previously created spark image, used to create a spark master containers.
-
-* spark-worker:2.3.1: A image based on the previously created spark image, used to create spark worker containers.
-
-* spark-submit:2.3.1: A image based on the previously created spark image, used to create spark submit containers(run, deliver driver and die gracefully).
-
 ## Run the docker-compose
 
 The final step to create your test cluster will be to run the compose file:
 
 ```sh
-docker-compose up --scale spark-worker=3
+docker-compose up -d
 ```
 
 ## Validate your cluster
````
````diff
@@ -60,27 +44,22 @@ Just validate your cluster accessing the Spark UI on each worker & master URL.
 
 ### Spark Master
 
-http://10.5.0.2:8080/
+http://localhost:9090/
 
 ![alt text](docs/spark-master.png "Spark master UI")
 
 ### Spark Worker 1
 
-http://10.5.0.3:8081/
+http://localhost:9091/
 
 ![alt text](docs/spark-worker-1.png "Spark worker 1 UI")
 
 ### Spark Worker 2
 
-http://10.5.0.4:8081/
+http://localhost:9092/
 
 ![alt text](docs/spark-worker-2.png "Spark worker 2 UI")
 
-### Spark Worker 3
-
-http://10.5.0.5:8081/
-
-![alt text](docs/spark-worker-3.png "Spark worker 3 UI")
 
 # Resource Allocation
 
````
````diff
@@ -102,130 +81,101 @@ To make app running easier I've shipped two volume mounts described in the following chart:
 
 Host Mount|Container Mount|Purpose
 ---|---|---
-/mnt/spark-apps|/opt/spark-apps|Used to make available your app's jars on all workers & master
-/mnt/spark-data|/opt/spark-data| Used to make available your app's data on all workers & master
+apps|/opt/spark-apps|Used to make your app's jars available on all workers & the master
+data|/opt/spark-data|Used to make your app's data available on all workers & the master
 
 This is basically a dummy DFS created from docker Volumes...(maybe not...)
 
-# Run a sample application
-
-Now let`s make a **wild spark submit** to validate the distributed nature of our new toy following these steps:
-
-## Create a Scala spark app
+# Run Sample Applications
 
-The first thing you need to do is to make a spark application. Our spark-submit image is designed to run scala code (soon will ship pyspark support guess I was just lazy to do so..).
 
-In my case I am using an app called [crimes-app](https://). You can make or use your own scala app, I 've just used this one because I had it at hand.
+## NY Bus Stops Data [Pyspark]
 
+This program loads archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and applies basic filters using Spark SQL; the results are persisted into a PostgreSQL table.
 
-## Ship your jar & dependencies on the Workers and Master
+The loaded table will contain the following structure:
 
-A necesary step to make a **spark-submit** is to copy your application bundle into all workers, also any configuration file or input file you need.
+latitude|longitude|time_received|vehicle_id|distance_along_trip|inferred_direction_id|inferred_phase|inferred_route_id|inferred_trip_id|next_scheduled_stop_distance|next_scheduled_stop_id|report_hour|report_date
+---|---|---|---|---|---|---|---|---|---|---|---|---
+40.668602|-73.986697|2014-08-01 04:00:01|469|4135.34710710144|1|IN_PROGRESS|MTA NYCT_B63|MTA NYCT_JG_C4-Weekday-141500_B63_123|2.63183804205619|MTA_305423|2014-08-01 04:00:00|2014-08-01
 
-Luckily for us we are using docker volumes so, you just have to copy your app and configs into /mnt/spark-apps, and your input files into /mnt/spark-files.
+To submit the app, connect to one of the workers or the master and execute:
 
-```bash
-#Copy spark application into all workers's app folder
-cp /home/workspace/crimes-app/build/libs/crimes-app.jar /mnt/spark-apps
-
-#Copy spark application configs into all workers's app folder
-cp -r /home/workspace/crimes-app/config /mnt/spark-apps
-
-# Copy the file to be processed to all workers's data folder
-cp /home/Crimes_-_2001_to_present.csv /mnt/spark-files
+```sh
+/opt/spark/bin/spark-submit --master spark://spark-master:7077 \
+--jars /opt/spark-apps/postgresql-42.2.22.jar \
+--driver-memory 1G \
+--executor-memory 1G \
+/opt/spark-apps/main.py
 ```
 
-## Check the successful copy of the data and app jar (Optional)
+![alt text](./articles/images/pyspark-demo.png "Spark UI with pyspark program running")
 
-This is not a necessary step, just if you are curious you can check if your app code and files are in place before running the spark-submit.
+## MTA Bus Analytics [Scala]
 
-```sh
-# Worker 1 Validations
-docker exec -ti spark-worker-1 ls -l /opt/spark-apps
+This program takes the archived data from [MTA Bus Time](http://web.mta.info/developers/MTA-Bus-Time-historical-data.html) and runs some aggregations on it; the calculated results are persisted into PostgreSQL tables.
 
-docker exec -ti spark-worker-1 ls -l /opt/spark-data
+Each persisted table corresponds to a particular aggregation:
 
-# Worker 2 Validations
-docker exec -ti spark-worker-2 ls -l /opt/spark-apps
+Table|Aggregation
+---|---
+day_summary|A summary of vehicles reporting, stops visited, average speed and distance traveled (all vehicles)
+speed_excesses|Speed excesses calculated in a 5-minute window
+average_speed|Average speed by vehicle
+distance_traveled|Total distance traveled by vehicle
 
-docker exec -ti spark-worker-2 ls -l /opt/spark-data
 
-# Worker 3 Validations
-docker exec -ti spark-worker-3 ls -l /opt/spark-apps
+To submit the app, connect to one of the workers or the master and execute:
 
-docker exec -ti spark-worker-3 ls -l /opt/spark-data
+```sh
+/opt/spark/bin/spark-submit --deploy-mode cluster \
+--master spark://spark-master:7077 \
+--total-executor-cores 1 \
+--class mta.processing.MTAStatisticsApp \
+--driver-memory 1G \
+--executor-memory 1G \
+--jars /opt/spark-apps/postgresql-42.2.22.jar \
+--conf spark.driver.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
+--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/mta.conf' \
+/opt/spark-apps/mta-processing.jar
 ```
-After running one of this commands you have to see your app's jar and files.
-
-
-## Use docker spark-submit
-
-```bash
-#Creating some variables to make the docker run command more readable
-#App jar environment used by the spark-submit image
-SPARK_APPLICATION_JAR_LOCATION="/opt/spark-apps/crimes-app.jar"
-#App main class environment used by the spark-submit image
-SPARK_APPLICATION_MAIN_CLASS="org.mvb.applications.CrimesApp"
-#Extra submit args used by the spark-submit image
-SPARK_SUBMIT_ARGS="--conf spark.executor.extraJavaOptions='-Dconfig-path=/opt/spark-apps/dev/config.conf'"
-
-#We have to use the same network as the spark cluster(internally the image resolves spark master as spark://spark-master:7077)
-docker run --network docker-spark-cluster_spark-network \
--v /mnt/spark-apps:/opt/spark-apps \
---env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION \
---env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS \
-spark-submit:2.3.1
 
-```
+You will notice on the Spark UI a driver program and an executor program running (in Scala we can use deploy-mode cluster).
 
-After running this you will see an output pretty much like this:
-
-```bash
-Running Spark using the REST application submission protocol.
-2018-09-23 15:17:52 INFO RestSubmissionClient:54 - Submitting a request to launch an application in spark://spark-master:6066.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submission successfully created as driver-20180923151753-0000. Polling submission state...
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180923151753-0000 in spark://spark-master:6066.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - State of driver driver-20180923151753-0000 is now RUNNING.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Driver is running on worker worker-20180923151711-10.5.0.4-45381 at 10.5.0.4:45381.
-2018-09-23 15:17:53 INFO RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
-{
-  "action" : "CreateSubmissionResponse",
-  "message" : "Driver successfully submitted as driver-20180923151753-0000",
-  "serverSparkVersion" : "2.3.1",
-  "submissionId" : "driver-20180923151753-0000",
-  "success" : true
-}
-```
+![alt text](./articles/images/stats-app.png "Spark UI with scala program running")
 
-# Summary (What have I done :O?)
 
-* We compiled the necessary docker images to run spark master and worker containers.
+# Summary
 
-* We created a spark standalone cluster using 3 worker nodes and 1 master node using docker && docker-compose.
+* We compiled the necessary docker image to run Spark master and worker containers.
 
-* Copied the resources necessary to run a sample application.
+* We created a Spark standalone cluster with 2 worker nodes and 1 master node using docker && docker-compose.
 
-* Submitted an application to the cluster using a **spark-submit** docker image.
+* Copied the resources necessary to run the demo applications.
 
 * We ran a distributed application at home (you just need enough CPU cores and RAM to do so).
 
 # Why a standalone cluster?
 
 * This is intended to be used for test purposes, basically a way of running distributed Spark apps on your laptop or desktop.
 
-* Right now I don't have enough resources to make a Yarn, Mesos or Kubernetes based cluster :(.
-
 * This will be useful to use CI/CD pipelines for your Spark apps (a really difficult and hot topic).
 
 # Steps to connect and use a pyspark shell interactively
 
 * Follow the steps to run the docker-compose file. You can scale this down if needed to 1 worker.
 
-```bash
+```sh
 docker-compose up --scale spark-worker=1
 docker exec -it docker-spark-cluster_spark-worker_1 bash
 apt update
 apt install python3-pip
 pip3 install pyspark
 pyspark
 ```
+
+# What's left to do?
+
+* Right now, to run applications in deploy-mode cluster it is necessary to specify an arbitrary driver port.
+
+* The spark-submit entry in start-spark.sh is unimplemented; the submits used in the demos can be triggered from any worker.
````
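The Scala analytics app referenced in the README diff above ships only as the binary `mta-processing.jar`, so its aggregation code is not visible in this commit. Purely as an illustration of the kind of per-vehicle aggregation the `distance_traveled` table describes, here is a PySpark sketch over the same TSV input; the grouping columns and the use of `distance_along_trip` are assumptions, not the actual implementation:

```python
# Illustrative sketch only -- the real aggregations live in the (binary) Scala app and may differ.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mta-aggregation-sketch").getOrCreate()

# Same archived MTA feed used by the demos (tab-separated, with a header row).
df = (spark.read
      .option("header", "true")
      .option("sep", "\t")
      .option("inferSchema", "true")
      .csv("/opt/spark-data/MTA_2014_08_01.csv"))

# One plausible reading of "total distance traveled by vehicle": the spread of
# distance_along_trip within each trip, summed per vehicle.
distance_traveled = (df
    .groupBy("vehicle_id", "inferred_trip_id")
    .agg((F.max("distance_along_trip") - F.min("distance_along_trip")).alias("trip_distance"))
    .groupBy("vehicle_id")
    .agg(F.sum("trip_distance").alias("total_distance")))

distance_traveled.show(10, truncate=False)
```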

apps/main.py

Lines changed: 35 additions & 0 deletions
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format


def init_spark():
    # Build a SparkSession with the PostgreSQL JDBC driver on the classpath.
    sql = SparkSession.builder \
        .appName("trip-app") \
        .config("spark.jars", "/opt/spark-apps/postgresql-42.2.22.jar") \
        .getOrCreate()
    sc = sql.sparkContext
    return sql, sc


def main():
    url = "jdbc:postgresql://demo-database:5432/mta_data"
    properties = {
        "user": "postgres",
        "password": "casa1234",
        "driver": "org.postgresql.Driver",
    }
    file = "/opt/spark-data/MTA_2014_08_01.csv"
    sql, sc = init_spark()

    # Load the tab-separated MTA feed and derive report_hour / report_date columns.
    df = sql.read.load(file, format="csv", inferSchema="true", sep="\t", header="true") \
        .withColumn("report_hour", date_format(col("time_received"), "yyyy-MM-dd HH:00:00")) \
        .withColumn("report_date", date_format(col("time_received"), "yyyy-MM-dd"))

    # Filter invalid coordinates and append the result to the mta_reports table.
    # DataFrameWriter.jdbc() performs the write itself, so no separate save() call is needed.
    df.where("latitude <= 90 AND latitude >= -90 AND longitude <= 180 AND longitude >= -180") \
        .where("latitude != 0.000000 OR longitude != 0.000000") \
        .write \
        .jdbc(url=url, table="mta_reports", mode='append', properties=properties)


if __name__ == '__main__':
    main()
```
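A simple way to confirm that `main.py` actually landed rows in PostgreSQL is to read the table back over the same JDBC connection. The script below is a hypothetical helper (not part of the commit) that reuses the connection details shown above; submit it the same way as `main.py` so the driver jar is on the classpath:

```python
# check_load.py -- hypothetical verification helper, not included in the commit.
# Submit with:
#   /opt/spark/bin/spark-submit --master spark://spark-master:7077 \
#     --jars /opt/spark-apps/postgresql-42.2.22.jar /opt/spark-apps/check_load.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mta-load-check")
         .config("spark.jars", "/opt/spark-apps/postgresql-42.2.22.jar")
         .getOrCreate())

# Read back the table written by main.py using the same credentials.
df = spark.read.jdbc(
    url="jdbc:postgresql://demo-database:5432/mta_data",
    table="mta_reports",
    properties={"user": "postgres", "password": "casa1234", "driver": "org.postgresql.Driver"},
)

print("rows loaded:", df.count())
df.select("vehicle_id", "time_received", "report_date").show(5, truncate=False)

spark.stop()
```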

apps/mta-processing.jar

286 KB (binary file not shown)

apps/mta.conf

Lines changed: 16 additions & 0 deletions
```hocon
app {
  input {
    file="/opt/spark-data/MTA_2014_08_01.csv"
    options {
      header=true
      delimiter="\t"
      nullValue="null"
    }
  }

  spark {
    conf {
      "spark.driver.port": "50243"
    }
  }
}
```
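The Scala job receives this file through `-Dconfig-path=/opt/spark-apps/mta.conf` (see the spark-submit command in the README diff above), presumably parsing it with a HOCON library. For quick inspection from Python you could parse the same file with the third-party `pyhocon` package; this is only a sketch, and `pyhocon` is not part of the image:

```python
# Requires: pip install pyhocon (not installed in the image by default).
from pyhocon import ConfigFactory

conf = ConfigFactory.parse_file("apps/mta.conf")

print(conf.get_string("app.input.file"))                # /opt/spark-data/MTA_2014_08_01.csv
print(conf.get_bool("app.input.options.header"))        # True
print(conf.get_string("app.input.options.delimiter"))   # tab character
```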

apps/postgresql-42.2.22.jar

982 KB (binary file not shown)
