[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s #21092
Config.scala

```diff
@@ -181,7 +181,18 @@ private[spark] object Config extends Logging {
       .doc("This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs " +
         "which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs")
       .doubleConf
-      .createWithDefault(0.10)
+      .checkValue(mem_overhead => mem_overhead >= 0 && mem_overhead < 1,
+        "Ensure that memory overhead is a double between 0 --> 1.0")
+      .createOptional
+
+  val PYSPARK_PYTHON_VERSION =
+    ConfigBuilder("spark.kubernetes.pyspark.pythonversion")
```
Member:
Sorry for leaving a comment in an ancient PR but I couldn't hold it. Why did we add a configuration to control the Python version instead of using the existing PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables? Doing this in a configuration breaks or disables many things, for example PEX (https://medium.com/criteo-labs/packaging-code-with-pex-a-pyspark-example-9057f9f144f3), which requires setting PYSPARK_PYTHON.

Member:
cc @dongjoon-hyun too FYI. Conda / virtualenv support enabled by #30486 wouldn't work in Kubernetes because of this.

Contributor:
@HyukjinKwon sounds reasonable to include support for that, we just need to agree on a policy for which takes precedence.
```diff
+      .doc("This sets the python version. Either 2 or 3. (Python2 or Python3)")
+      .stringConf
+      .checkValue(pv => List("2", "3").contains(pv),
+        "Ensure that Python Version is either Python2 or Python3")
+      .createWithDefault("2")
```
Am I reading this right that the default is Python 2? Is there a reason for that? Thanks!

Contributor (Author):
No particular reason. I just thought that the major version should default to 2.

There is only ~18 months of support left for Python 2. Python 3 has been around for 10 years and unless there's a good reason, I think it should be the default.

Contributor (Author):
I am willing to do that: thoughts @holdenk?

Contributor:
I'm fine with either as the default. While Py2 is officially EOL, I think we'll still see PySpark Py2 apps for a while after.
```diff
 
   val KUBERNETES_AUTH_SUBMISSION_CONF_PREFIX =
     "spark.kubernetes.authenticate.submission"
```
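For context, a minimal sketch of how a user would opt into Python 3 with the new key introduced above. The key name, allowed values, and default come from the hunk; in practice this would usually be passed as `--conf` on spark-submit rather than set in code.

```scala
import org.apache.spark.SparkConf

// Sketch only: selects which Python binary the driver/executor images use.
// Allowed values are "2" and "3"; this commit defaults to "2".
val conf = new SparkConf()
  .set("spark.kubernetes.pyspark.pythonversion", "3")

// Reviewers later pointed out that the existing PYSPARK_PYTHON /
// PYSPARK_DRIVER_PYTHON environment variables are an alternative way to
// pick the interpreter, so the precedence between the two needs a policy.
```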
KubernetesConf.scala

```diff
@@ -76,6 +76,9 @@ private[spark] case class KubernetesConf[T <: KubernetesRoleSpecificConf](
   def pySparkAppArgs(): Option[String] = sparkConf
     .get(KUBERNETES_PYSPARK_APP_ARGS)
 
+  def pySparkPythonVersion(): String = sparkConf
+    .get(PYSPARK_PYTHON_VERSION)
+
   def imagePullPolicy(): String = sparkConf.get(CONTAINER_IMAGE_PULL_POLICY)
 
   def imagePullSecrets(): Seq[LocalObjectReference] = {
```
```diff
@@ -131,7 +134,7 @@ private[spark] object KubernetesConf {
       sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
       sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_APP_ARGS, appArgs.mkString(" "))
     }
-    sparkConfWithMainAppJar.set(MEMORY_OVERHEAD_FACTOR, 0.4)
+    sparkConfWithMainAppJar.setIfMissing(MEMORY_OVERHEAD_FACTOR, 0.4)
```
Contributor:
Do we want to set this in the JVM case?

Contributor (Author):
This is set later in BaseDriverStep.
```diff
   }
 
   val driverCustomLabels = KubernetesUtils.parsePrefixedKeyValuePairs(
```
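To illustrate the behavioural difference the review comment is pointing at, here is a small sketch of `set` vs `setIfMissing` using a plain string key. The literal key name for MEMORY_OVERHEAD_FACTOR is assumed here, since the hunk only shows the constant.

```scala
import org.apache.spark.SparkConf

// Assumed key name for MEMORY_OVERHEAD_FACTOR; only the semantics matter.
val key = "spark.kubernetes.memoryOverheadFactor"

// A user-supplied value survives the PySpark default because of setIfMissing.
val userConf = new SparkConf(loadDefaults = false).set(key, "0.3")
userConf.setIfMissing(key, "0.4")
assert(userConf.get(key) == "0.3")

// With nothing configured, the non-JVM default of 0.4 is applied.
val freshConf = new SparkConf(loadDefaults = false)
freshConf.setIfMissing(key, "0.4")
assert(freshConf.get(key) == "0.4")
```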
BasicDriverFeatureStep.scala

```diff
@@ -53,7 +53,7 @@ private[spark] class BasicDriverFeatureStep(
   private val driverMemoryMiB = conf.get(DRIVER_MEMORY)
   private val memoryOverheadMiB = conf
     .get(DRIVER_MEMORY_OVERHEAD)
-    .getOrElse(math.max((conf.get(MEMORY_OVERHEAD_FACTOR) * driverMemoryMiB).toInt,
+    .getOrElse(math.max((conf.get(MEMORY_OVERHEAD_FACTOR).getOrElse(0.1) * driverMemoryMiB).toInt,
       MEMORY_OVERHEAD_MIN_MIB))
   private val driverMemoryWithOverheadMiB = driverMemoryMiB + memoryOverheadMiB
```
BasicExecutorFeatureStep.scala

```diff
@@ -54,7 +54,8 @@ private[spark] class BasicExecutorFeatureStep(
   private val memoryOverheadMiB = kubernetesConf
     .get(EXECUTOR_MEMORY_OVERHEAD)
-    .getOrElse(math.max((kubernetesConf.get(MEMORY_OVERHEAD_FACTOR) * executorMemoryMiB).toInt,
+    .getOrElse(math.max(
+      (kubernetesConf.get(MEMORY_OVERHEAD_FACTOR).getOrElse(0.1) * executorMemoryMiB).toInt,
       MEMORY_OVERHEAD_MIN_MIB))
   private val executorMemoryWithOverhead = executorMemoryMiB + memoryOverheadMiB
```
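Since both steps now share the same fallback logic, here is a worked sketch of the computation with the optional factor. The 384 MiB floor is assumed to be the value of MEMORY_OVERHEAD_MIN_MIB, and the memory sizes are illustrative only.

```scala
// Mirrors the expression above: overhead = max(factor * memory, floor),
// where the factor falls back to 0.1 when the conf is unset.
val memoryOverheadMinMiB = 384L  // assumed value of MEMORY_OVERHEAD_MIN_MIB

def overheadMiB(memoryMiB: Long, factor: Option[Double]): Long =
  math.max((factor.getOrElse(0.1) * memoryMiB).toInt, memoryOverheadMinMiB)

overheadMiB(1024L, None)       // max(102, 384)  = 384 MiB
overheadMiB(8192L, None)       // max(819, 384)  = 819 MiB
overheadMiB(8192L, Some(0.4))  // max(3276, 384) = 3276 MiB (non-JVM default)
```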
PythonDriverFeatureStep.scala

```diff
@@ -20,7 +20,7 @@ import io.fabric8.kubernetes.api.model.ContainerBuilder
 import io.fabric8.kubernetes.api.model.HasMetadata
 
 import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesRoleSpecificConf, SparkPod}
-import org.apache.spark.deploy.k8s.Constants.{ENV_PYSPARK_ARGS, ENV_PYSPARK_FILES, ENV_PYSPARK_PRIMARY}
+import org.apache.spark.deploy.k8s.Constants._
 import org.apache.spark.deploy.k8s.KubernetesUtils
 import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep
@@ -32,7 +32,7 @@ private[spark] class PythonDriverFeatureStep(
     require(mainResource.isDefined, "PySpark Main Resource must be defined")
     val otherPyFiles = kubernetesConf.pyFiles().map(pyFile =>
       KubernetesUtils.resolveFileUrisAndPath(pyFile.split(","))
-        .mkString(",")).getOrElse("")
+        .mkString(":")).getOrElse("")
     val withPythonPrimaryFileContainer = new ContainerBuilder(pod.container)
       .addNewEnv()
         .withName(ENV_PYSPARK_ARGS)
@@ -46,6 +46,10 @@ private[spark] class PythonDriverFeatureStep(
         .withName(ENV_PYSPARK_FILES)
         .withValue(if (otherPyFiles == "") {""} else otherPyFiles)
         .endEnv()
+      .addNewEnv()
+        .withName(ENV_PYSPARK_PYTHON_VERSION)
+        .withValue(kubernetesConf.pySparkPythonVersion())
+        .endEnv()
       .build()
     SparkPod(pod.pod, withPythonPrimaryFileContainer)
   }
```
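The delimiter change from "," to ":" matters because the PYSPARK_FILES value is appended to PYTHONPATH by entrypoint.sh later in this diff, and PYTHONPATH entries are colon-separated. A tiny sketch of the effect follows; the paths are made up, and the real code additionally resolves file URIs via KubernetesUtils.resolveFileUrisAndPath.

```scala
// Comma-separated --py-files input becomes a colon-separated PYTHONPATH fragment.
def toPythonPathEntry(pyFiles: String): String =
  pyFiles.split(",").mkString(":")

toPythonPathEntry("/opt/spark/examples/dep1.py,/opt/spark/examples/dep2.py")
// => "/opt/spark/examples/dep1.py:/opt/spark/examples/dep2.py"
```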
KubernetesConfSuite.scala

```diff
@@ -93,7 +93,7 @@ class KubernetesConfSuite extends SparkFunSuite {
       None)
     assert(kubernetesConfWithoutMainJar.sparkConf.get("spark.jars").split(",")
       === Array("local:///opt/spark/jar1.jar"))
-    assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.1)
+    assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR).isEmpty)
   }
 
   test("Creating driver conf with a python primary file") {
```
Contributor:
Would also like to see a unit test with a PyFile and an overridden memory overhead.

Contributor (Author):
Defaults are checked on 96 and 117. (But I need to ensure that it is possible to override as well. Will add.)

Contributor:
Just a follow-up: we should have a test with Python and overriding MEMORY_OVERHEAD_FACTOR (e.g. a test to make sure setIfMissing behaves correctly, since we had it the other way earlier in the PR).
```diff
@@ -114,14 +114,15 @@ class KubernetesConfSuite extends SparkFunSuite {
       Some(inputPyFiles.mkString(",")))
     assert(kubernetesConfWithMainResource.sparkConf.get("spark.jars").split(",")
       === Array("local:///opt/spark/jar1.jar"))
-    assert(kubernetesConfWithMainResource.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.4)
+    assert(kubernetesConfWithMainResource.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === Some(0.4))
     assert(kubernetesConfWithMainResource.sparkFiles
       === Array("local:///opt/spark/example4.py", mainResourceFile) ++ inputPyFiles)
   }
 
-  test("Resolve driver labels, annotations, secret mount paths, and envs.") {
+  test("Resolve driver labels, annotations, secret mount paths, envs, and memory overhead") {
     val sparkConf = new SparkConf(false)
+      .set(MEMORY_OVERHEAD_FACTOR, 0.3)
     CUSTOM_LABELS.foreach { case (key, value) =>
       sparkConf.set(s"$KUBERNETES_DRIVER_LABEL_PREFIX$key", value)
     }
@@ -151,6 +152,7 @@ class KubernetesConfSuite extends SparkFunSuite {
     assert(conf.roleAnnotations === CUSTOM_ANNOTATIONS)
     assert(conf.roleSecretNamesToMountPaths === SECRET_NAMES_TO_MOUNT_PATHS)
     assert(conf.roleEnvs === CUSTOM_ENVS)
+    assert(conf.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === Some(0.3))
   }
 
   test("Basic executor translated fields.") {
```
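These assertions changed shape because MEMORY_OVERHEAD_FACTOR moved from createWithDefault(0.10) to createOptional, so reading it now yields an Option[Double]. A small sketch of the three states the suite exercises, with plain Options standing in for the real ConfigEntry plumbing:

```scala
// Nothing configured: JVM jobs see None and the feature steps fall back to 0.1.
val unset: Option[Double] = None
assert(unset.isEmpty)
assert(unset.getOrElse(0.1) == 0.1)

// Python apps: KubernetesConf applies 0.4 via setIfMissing.
val pysparkDefault: Option[Double] = Some(0.4)
assert(pysparkDefault.contains(0.4))

// Explicit user override: survives because setIfMissing does not clobber it.
val userOverride: Option[Double] = Some(0.3)
assert(userOverride.contains(0.3))
```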
Dockerfile (PySpark container image)

```diff
@@ -18,16 +18,17 @@
 ARG base_img
 FROM $base_img
 WORKDIR /
-COPY python /opt/spark/python
+RUN mkdir ${SPARK_HOME}/python
+COPY python/lib ${SPARK_HOME}/python/lib
 RUN apk add --no-cache python && \
+    apk add --no-cache python3 && \
     python -m ensurepip && \
+    python3 -m ensurepip && \
     rm -r /usr/lib/python*/ensurepip && \
```
Contributor:
Can we add a comment about why this part?
```diff
     pip install --upgrade pip setuptools && \
```
Member:
this goes to python2 only, I think?

Contributor (Author):
Correct. Would love recommendations on dependency management with regard to pip, as it's tricky to allow for both pip and pip3 installation, unless I use two separate virtual environments for dependency management.

Contributor:
You can run both pip and pip3 with the same packages and they'll get installed in different directories and shouldn't stomp on top of each other. That being said, long term venvs are probably the way we want to go, but as we've discussed those are probably non-trivial and should go in a second PR.

Contributor:
ping

Contributor (Author):
I will include pip3.6 installation for now until we figure out a long-term venv solution in the next PR.
```diff
     rm -r /root/.cache
```
Contributor:
Is this just being done for space reasons?

Contributor (Author):
Yes.
```diff
-ENV PYTHON_VERSION 2.7.13
-ENV PYSPARK_PYTHON python
-ENV PYSPARK_DRIVER_PYTHON python
-ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:${PYTHONPATH}
+ENV PYTHONPATH ${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip
 
 WORKDIR /opt/spark/work-dir
 ENTRYPOINT [ "/opt/entrypoint.sh" ]
```
entrypoint.sh

```diff
@@ -53,12 +53,21 @@ if [ -n "$SPARK_MOUNTED_FILES_DIR" ]; then
   cp -R "$SPARK_MOUNTED_FILES_DIR/." .
 fi
 
-PYSPARK_SECONDARY="$PYSPARK_APP_ARGS"
-if [ ! -z "$PYSPARK_FILES" ]; then
-  PYSPARK_SECONDARY="$PYSPARK_FILES $PYSPARK_APP_ARGS"
+if [ -n "$PYSPARK_FILES" ]; then
+  PYTHONPATH="$PYTHONPATH:$PYSPARK_FILES"
 fi
 
+if [ "$PYSPARK_PYTHON_VERSION" == "2" ]; then
+    pyv="$(python -V 2>&1)"
+    export PYTHON_VERSION="${pyv:7}"
+    export PYSPARK_PYTHON="python"
+    export PYSPARK_DRIVER_PYTHON="python"
+elif [ "$PYSPARK_PYTHON_VERSION" == "3" ]; then
+    pyv3="$(python3 -V 2>&1)"
+    export PYTHON_VERSION="${pyv3:7}"
+    export PYSPARK_PYTHON="python3"
+    export PYSPARK_DRIVER_PYTHON="python3"
+fi
+
 case "$SPARK_K8S_CMD" in
   driver)
@@ -74,7 +83,7 @@ case "$SPARK_K8S_CMD" in
       "$SPARK_HOME/bin/spark-submit"
       --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
       --deploy-mode client
-      "$@" $PYSPARK_PRIMARY $PYSPARK_SECONDARY
+      "$@" $PYSPARK_PRIMARY $PYSPARK_APP_ARGS
     )
     ;;
```
+1 to this, thanks for adding this.