
Commit d7768ba

Merge pull request #1 from Fabryprog/spark-3.4.0

Spark 3.4.0

2 parents: f2f8b17 + 199b1d0

File tree: 6 files changed (+17 −19 lines)


.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,2 +1,2 @@
 # Ignore data files
-*.csv
+#*.csv
```

Dockerfile

Lines changed: 2 additions & 2 deletions
```diff
@@ -7,8 +7,8 @@ RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)"
 
 # Fix the value of PYTHONHASHSEED
 # Note: this is needed when you use Python 3.3 or greater
-ENV SPARK_VERSION=3.0.2 \
-HADOOP_VERSION=3.2 \
+ENV SPARK_VERSION=3.4.0 \
+HADOOP_VERSION=3 \
 SPARK_HOME=/opt/spark \
 PYTHONHASHSEED=1
 
```
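Since Spark 3.3, the prebuilt binaries are published with a plain `hadoop3` suffix rather than `hadoop3.2`, which is why `HADOOP_VERSION` drops from `3.2` to `3` here. A minimal shell sketch of how these two variables typically combine into the downloaded artifact name (the exact download step inside this Dockerfile is not shown in the diff, so this naming is an assumption based on the standard Apache release layout):

```shell
#!/bin/sh
# Compose the Spark tarball name from the version variables set in the Dockerfile.
SPARK_VERSION=3.4.0
HADOOP_VERSION=3

TARBALL="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
echo "${TARBALL}"   # → spark-3.4.0-bin-hadoop3.tgz
```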

README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -27,7 +27,7 @@ The following steps will make you run your spark cluster's containers.
 
 
 ```sh
-docker build -t cluster-apache-spark:3.0.2 .
+docker build -t cluster-apache-spark:3.4.0 .
 ```
 
 ## Run the docker-compose
````

apps/main.py

Lines changed: 4 additions & 9 deletions
```diff
@@ -10,7 +10,7 @@ def init_spark():
     return sql,sc
 
 def main():
-    url = "jdbc:postgresql://demo-database:5432/mta_data"
+    url = "jdbc:postgresql://demo-database:5432/postgres"
     properties = {
         "user": "postgres",
         "password": "casa1234",
@@ -19,17 +19,12 @@ def main():
     file = "/opt/spark-data/MTA_2014_08_01.csv"
     sql,sc = init_spark()
 
-    df = sql.read.load(file,format = "csv", inferSchema="true", sep="\t", header="true"
-    ) \
-        .withColumn("report_hour",date_format(col("time_received"),"yyyy-MM-dd HH:00:00")) \
-        .withColumn("report_date",date_format(col("time_received"),"yyyy-MM-dd"))
+    df = sql.read.load(file,format = "csv", inferSchema="true", sep="\t", header="true")
 
     # Filter invalid coordinates
-    df.where("latitude <= 90 AND latitude >= -90 AND longitude <= 180 AND longitude >= -180") \
-        .where("latitude != 0.000000 OR longitude != 0.000000 ") \
-        .write \
+    df.write \
         .jdbc(url=url, table="mta_reports", mode='append', properties=properties) \
         .save()
 
 if __name__ == '__main__':
-    main()
+    main()
```

data/MTA_2014_08_01.csv

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+Nome;Cognome
+Mario;Rossi
+Giacomo;Bianchi
```
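Note that this committed sample file is semicolon-delimited, while `apps/main.py` still reads it with `sep="\t"`, so Spark would load each row as a single column. A minimal stdlib sketch (pure Python, independent of Spark) showing how the sample parses once the correct delimiter is used:

```python
import csv
import io

# Same contents as the data/MTA_2014_08_01.csv added in this commit.
sample = "Nome;Cognome\nMario;Rossi\nGiacomo;Bianchi\n"

# Parse with ';' as the field separator, matching the file's actual format.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
print(rows[0])  # {'Nome': 'Mario', 'Cognome': 'Rossi'}
```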

docker-compose.yml

Lines changed: 6 additions & 6 deletions
```diff
@@ -1,7 +1,7 @@
 version: "3.3"
 services:
   spark-master:
-    image: cluster-apache-spark:3.0.2
+    image: cluster-apache-spark:3.4.0
     ports:
       - "9090:8080"
       - "7077:7077"
@@ -12,10 +12,10 @@ services:
       - SPARK_LOCAL_IP=spark-master
       - SPARK_WORKLOAD=master
   spark-worker-a:
-    image: cluster-apache-spark:3.0.2
+    image: cluster-apache-spark:3.4.0
     ports:
       - "9091:8080"
-      - "7000:7000"
+      - "7001:7000"
     depends_on:
       - spark-master
     environment:
@@ -30,10 +30,10 @@ services:
       - ./apps:/opt/spark-apps
       - ./data:/opt/spark-data
   spark-worker-b:
-    image: cluster-apache-spark:3.0.2
+    image: cluster-apache-spark:3.4.0
     ports:
       - "9092:8080"
-      - "7001:7000"
+      - "7002:7000"
     depends_on:
       - spark-master
     environment:
@@ -48,7 +48,7 @@ services:
      - ./apps:/opt/spark-apps
      - ./data:/opt/spark-data
   demo-database:
-    image: postgres:11.7-alpine
+    image: postgres:15
     ports:
       - "5432:5432"
     environment:
```
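The worker port remappings (host 7000→7001 for `spark-worker-a`, 7001→7002 for `spark-worker-b`) give each container's internal port 7000 a distinct host port, avoiding bind conflicts when both workers run on one machine. A small stdlib sketch (a hypothetical helper, not part of this repo) for checking whether a published host port is already taken before running `docker-compose up`:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# Check the host ports this compose file publishes.
for p in (9090, 9091, 9092, 7001, 7002, 7077, 5432):
    print(p, "in use" if port_in_use(p) else "free")
```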

0 commit comments
