Enables rapid development of Python packages for use with PySpark by uploading a local package to the Spark cluster, making it importable inside UDFs running on the executors.
Author: Scott Hajek
Assuming you have a development module or package on your PYTHONPATH (e.g. dev_pkg), a SparkSession instance named spark_session, and a DataFrame df with a 'text' column, you can do the following:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

from dev_pkg.text import clean_text
from pyspark_uploader.udf import udf_from_module

# Upload the package containing clean_text to the cluster and wrap it as a UDF
clean_text_udf = udf_from_module(clean_text, StringType(), spark_session)

# Apply the UDF to the 'text' column and save the result as a table
df2 = df.withColumn('cleaned', clean_text_udf(F.col('text')))
df2.write.saveAsTable('result_table')
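
For intuition, a helper like udf_from_module has to make the local package importable on the executors before wrapping the function as a UDF. Below is a minimal sketch of that idea using only standard-library and PySpark calls; the helper name make_module_udf and the zip-then-addPyFile approach are illustrative assumptions, not the actual pyspark_uploader implementation:

import inspect
import os
import shutil
import tempfile

import pyspark.sql.functions as F


def make_module_udf(func, return_type, spark):
    # Hypothetical sketch: ship func's top-level package to the cluster,
    # then wrap func as an ordinary PySpark UDF.
    module = inspect.getmodule(func)
    pkg_name = module.__name__.split('.')[0]
    pkg_dir = os.path.dirname(__import__(pkg_name).__file__)

    # Zip the package with pkg_name at the archive root and distribute it
    # to every executor's PYTHONPATH via addPyFile.
    archive = shutil.make_archive(
        os.path.join(tempfile.mkdtemp(), pkg_name), 'zip',
        root_dir=os.path.dirname(pkg_dir), base_dir=pkg_name)
    spark.sparkContext.addPyFile(archive)

    return F.udf(func, return_type)

The real library may additionally handle re-uploading when the code changes, dependency packaging, or other bookkeeping; the sketch only shows the core upload-and-wrap step.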