25 commits
fb5b9ed
initial architecture for PySpark w/o dockerfile work
ifilonenko Apr 16, 2018
b7b3db0
included entrypoint logic
ifilonenko Apr 17, 2018
98cef8c
satisfying integration tests
ifilonenko Apr 18, 2018
dc670dc
end-to-end working pyspark
ifilonenko Apr 18, 2018
eabe4b9
Merge pull request #1 from ifilonenko/py-spark
ifilonenko Apr 18, 2018
8d3debb
resolved comments and fixed --pyfiles issue and allowed for python2 o…
ifilonenko May 2, 2018
91e2a2c
Merge pull request #2 from ifilonenko/py-spark
ifilonenko May 2, 2018
5761ee8
Merge branch 'master' of https://github.com/ifilonenko/spark
ifilonenko May 2, 2018
98cc044
restructured step based pattern to resolve comments
ifilonenko May 7, 2018
678d381
Merge pull request #3 from ifilonenko/py-spark
ifilonenko May 7, 2018
bf738dc
resolved comments
ifilonenko May 8, 2018
c59068d
Merge pull request #4 from ifilonenko/py-spark
ifilonenko May 8, 2018
0344f90
resolving style issues
ifilonenko May 9, 2018
306f3ed
Merge pull request #5 from ifilonenko/py-spark
ifilonenko May 9, 2018
f2fc53e
resolved commits
ifilonenko May 13, 2018
6f66d60
merge conflicts
ifilonenko May 13, 2018
914ff75
resolve rebase conflicts
ifilonenko May 11, 2018
d400607
import statements refactoring
ifilonenko May 13, 2018
72953a3
Merge branch 'py-spark'
ifilonenko May 13, 2018
7bedeb6
resolved comments
ifilonenko May 31, 2018
1801e96
merge conflicts
ifilonenko May 31, 2018
24a704e
setIfMissing
ifilonenko Jun 1, 2018
6a6d69d
added e2e tests on --py-files and inclusion of docs on config values
ifilonenko Jun 7, 2018
ab92913
style issues
ifilonenko Jun 7, 2018
a61d897
resolve comments on docs and addition of unit test
ifilonenko Jun 8, 2018
included entrypoint logic
ifilonenko committed Apr 17, 2018
commit b7b3db0abfbf425120fa21cc61e603c5d766f8af
23 changes: 17 additions & 6 deletions bin/docker-image-tool.sh
@@ -63,12 +63,20 @@ function build {
if [ ! -d "$IMG_PATH" ]; then
error "Cannot find docker image. This script must be run from a runnable distribution of Apache Spark."
fi

local DOCKERFILE=${DOCKERFILE:-"$IMG_PATH/spark/Dockerfile"}
local BINDING_BUILD_ARGS=(
--build-arg
base_img=$(image_ref spark)
)
local BASEDOCKERFILE=${BASEDOCKERFILE:-"$IMG_PATH/spark/Dockerfile"}
local PYDOCKERFILE=${PYDOCKERFILE:-"$IMG_PATH/spark/bindings/python/Dockerfile"}

docker build "${BUILD_ARGS[@]}" \
-t $(image_ref spark) \
-f "$DOCKERFILE" .
-f "$BASEDOCKERFILE" .

docker build "${BINDING_BUILD_ARGS[@]}" \
-t $(image_ref spark-py) \
-f "$PYDOCKERFILE" .
}

function push {
@@ -86,7 +94,8 @@ Commands:
push Push a pre-built image to a registry. Requires a repository address to be provided.

Options:
-f file Dockerfile to build. By default builds the Dockerfile shipped with Spark.
-f file Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
-p file Dockerfile with Python baked in. By default builds the Dockerfile shipped with Spark.
Contributor: One (future) concern is how we would handle the overlay with both Python and R at the same time.

-r repo Repository address.
-t tag Tag to apply to the built image, or to identify the image to be pushed.
-m Use minikube's Docker daemon.
@@ -116,12 +125,14 @@ fi

REPO=
TAG=
DOCKERFILE=
BASEDOCKERFILE=
PYDOCKERFILE=
while getopts f:p:mr:t: option
do
case "${option}"
in
f) DOCKERFILE=${OPTARG};;
f) BASEDOCKERFILE=${OPTARG};;
p) PYDOCKERFILE=${OPTARG};;
r) REPO=${OPTARG};;
t) TAG=${OPTARG};;
m)
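For illustration only, a hypothetical invocation of the updated script might look like the following. The repository, tag, and Dockerfile path are placeholders; -p is only needed when overriding the default spark/bindings/python/Dockerfile location, and image_ref is assumed to resolve to <repo>/<name>:<tag> as elsewhere in the script.

    # Builds both images: <repo>/spark:<tag> (JVM) and <repo>/spark-py:<tag> (Python binding).
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.4.0 \
        -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

    # Push the built image(s) to the registry named by -r.
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.4.0 push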
BasicDriverFeatureStep.scala
@@ -25,6 +25,7 @@ import org.apache.spark.SparkException
import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesDriverSpecificConf, KubernetesUtils, SparkPod}
import org.apache.spark.deploy.k8s.Config._
import org.apache.spark.deploy.k8s.Constants._
import org.apache.spark.deploy.k8s.submit._
import org.apache.spark.internal.config._
import org.apache.spark.launcher.SparkLauncher

@@ -44,6 +45,10 @@ private[spark] class BasicDriverFeatureStep(
private val driverCpuCores = conf.get("spark.driver.cores", "1")
private val driverLimitCores = conf.get(KUBERNETES_DRIVER_LIMIT_CORES)

private val driverDockerContainer = conf.roleSpecificConf.mainAppResource.map {
case JavaMainAppResource(_) => "driver"
case PythonMainAppResource(_) => "driver-py"
}.getOrElse(throw new SparkException("Must specify a JVM or Python Resource"))
Contributor Author: Should I therefore not throw an error here @mccheah and move this logic into the steps?

// Memory settings
private val driverMemoryMiB = conf.get(DRIVER_MEMORY)
private val memoryOverheadMiB = conf
@@ -89,7 +94,7 @@ private[spark] class BasicDriverFeatureStep(
.addToRequests("memory", driverMemoryQuantity)
.addToLimits("memory", driverMemoryQuantity)
.endResources()
.addToArgs("driver")
.addToArgs(driverDockerContainer)
.addToArgs("--properties-file", SPARK_CONF_PATH)
.addToArgs("--class", conf.roleSpecificConf.mainClass)
// The user application jar is merged into the spark.jars list and managed through that
BasicDriverFeatureStepSuite.scala
@@ -24,6 +24,8 @@ import org.apache.spark.{SparkConf, SparkFunSuite}
import org.apache.spark.deploy.k8s.{KubernetesConf, KubernetesDriverSpecificConf, SparkPod}
import org.apache.spark.deploy.k8s.Config._
import org.apache.spark.deploy.k8s.Constants._
import org.apache.spark.deploy.k8s.submit.JavaMainAppResource
import org.apache.spark.deploy.k8s.submit.PythonMainAppResource

class BasicDriverFeatureStepSuite extends SparkFunSuite {

@@ -33,8 +35,7 @@ class BasicDriverFeatureStepSuite extends SparkFunSuite {
private val CONTAINER_IMAGE_PULL_POLICY = "IfNotPresent"
private val APP_NAME = "spark-test"
private val MAIN_CLASS = "org.apache.spark.examples.SparkPi"
private val PYTHON_MAIN_CLASS = "example.py"
private val EXAMPLE_PYTHON_FILES = Seq("example2.py", "example3.py")
private val PY_MAIN_CLASS = "org.apache.spark.deploy.PythonRunner"
private val APP_ARGS = Array("arg1", "arg2", "\"arg 3\"")
private val CUSTOM_ANNOTATION_KEY = "customAnnotation"
private val CUSTOM_ANNOTATION_VALUE = "customAnnotationValue"
@@ -62,7 +63,7 @@ class BasicDriverFeatureStepSuite extends SparkFunSuite {
val kubernetesConf = KubernetesConf(
sparkConf,
KubernetesDriverSpecificConf(
None,
Some(JavaMainAppResource("")),
APP_NAME,
MAIN_CLASS,
APP_ARGS),
Expand Down Expand Up @@ -112,7 +113,6 @@ class BasicDriverFeatureStepSuite extends SparkFunSuite {
assert(driverPodMetadata.getLabels.asScala === DRIVER_LABELS)
assert(driverPodMetadata.getAnnotations.asScala === DRIVER_ANNOTATIONS)
assert(configuredPod.pod.getSpec.getRestartPolicy === "Never")

val expectedSparkConf = Map(
KUBERNETES_DRIVER_POD_NAME.key -> "spark-driver-pod",
"spark.app.id" -> APP_ID,
@@ -121,6 +121,47 @@
assert(featureStep.getAdditionalPodSystemProperties() === expectedSparkConf)
}

test("Check appropriate entrypoint rerouting for various bindings") {
val sparkConf = new SparkConf()
.set(org.apache.spark.internal.config.DRIVER_MEMORY.key, "4g")
.set(CONTAINER_IMAGE, "spark-driver:latest")
val javaKubernetesConf = KubernetesConf(
sparkConf,
KubernetesDriverSpecificConf(
Some(JavaMainAppResource("")),
APP_NAME,
PY_MAIN_CLASS,
APP_ARGS),
RESOURCE_NAME_PREFIX,
APP_ID,
DRIVER_LABELS,
DRIVER_ANNOTATIONS,
Map.empty,
DRIVER_ENVS,
Seq.empty[String])
val pythonKubernetesConf = KubernetesConf(
sparkConf,
KubernetesDriverSpecificConf(
Some(PythonMainAppResource("")),
APP_NAME,
PY_MAIN_CLASS,
APP_ARGS),
RESOURCE_NAME_PREFIX,
APP_ID,
DRIVER_LABELS,
DRIVER_ANNOTATIONS,
Map.empty,
DRIVER_ENVS,
Seq.empty[String])
val javaFeatureStep = new BasicDriverFeatureStep(javaKubernetesConf)
val pythonFeatureStep = new BasicDriverFeatureStep(pythonKubernetesConf)
val basePod = SparkPod.initialPod()
val configuredJavaPod = javaFeatureStep.configurePod(basePod)
val configuredPythonPod = pythonFeatureStep.configurePod(basePod)
assert(configuredJavaPod.container.getArgs.get(0) === "driver")
assert(configuredPythonPod.container.getArgs.get(0) === "driver-py")
}

test("Additional system properties resolve jars and set cluster-mode confs.") {
val allJars = Seq("local:///opt/spark/jar1.jar", "hdfs:///opt/spark/jar2.jar")
val allFiles = Seq("https://localhost:9000/file1.txt", "local:///opt/spark/file2.txt")
@@ -132,7 +173,7 @@ class BasicDriverFeatureStepSuite extends SparkFunSuite {
val kubernetesConf = KubernetesConf(
sparkConf,
KubernetesDriverSpecificConf(
None,
Some(JavaMainAppResource("")),
APP_NAME,
MAIN_CLASS,
APP_ARGS),
New file: spark/bindings/python/Dockerfile
@@ -0,0 +1,33 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

ARG base_img
FROM $base_img
WORKDIR /
COPY python /opt/spark/python
RUN apk add --no-cache python && \
python -m ensurepip && \
rm -r /usr/lib/python*/ensurepip && \
Contributor: Can we add a comment about why this part?

pip install --upgrade pip setuptools && \
Member: this goes to python2 only, I think?

Contributor Author: Correct. I would love recommendations on dependency management with regard to pip, as it's tricky to allow for both pip and pip3 installation, unless I use two separate virtual environments for dependency management.

Contributor: You can run both pip and pip3 with the same packages and they'll get installed in different directories and shouldn't stomp on each other. That being said, long-term venvs are probably the way we want to go, but as we've discussed those are non-trivial and should go in a second PR.

Contributor: ping

Contributor Author: I will include pip3.6 installation for now until we figure out a long-term venv solution in the next PR

rm -r /root/.cache
Contributor: Is this just being done for space reasons?

Contributor Author: Yes

ENV PYTHON_VERSION 2.7.13
Contributor: If we set this, are we implicitly imposing a contract on the base image to have this particular version of python installed?

Contributor Author: That is what I brought up in the PR description, and why this is still a WIP. I need to investigate the proper way to determine whether we ship these containers with Python 2 or Python 3.

Member: in some OSes, python and python3 symlink to the installed versions of Python 2.x and 3.x respectively; is that a better approach than hardcoding the version number?

Contributor: So I think it might make sense to build the container with both 2 & 3, since the container might be built by a vendor or cluster administrator and then used by a variety of people. What do folks think?

As for figuring out the env, if we wanted to do it that way, we can call the current user's python and ask it for its version information (based on the Spark Python environment variables).

Contributor: I think a canonical container should include both. My instinct is that a user should be able to "force" the use of one or the other. If someone is invoking spark-submit in cluster-mode, with a supplied python file, some kind of CLI argument (--conf or otherwise) seems like the only totally foolproof way to identify that for the eventual pod construction, but maybe there is a better way?

Member: perhaps re-use PYSPARK_PYTHON?

ENV PYSPARK_PYTHON python
ENV PYSPARK_DRIVER_PYTHON python
ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:${PYTHONPATH}
Contributor: We're going to need to mention that the Py4J zip file needs to be updated here as well :( Also an open question whether we want the PySpark.zip file in here instead of python/, and, if we're trying to make "slim" images, whether we want to delete that zip file.


WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
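A minimal sketch of the reviewers' "ship both interpreters" suggestion, mirroring the RUN step above; the python3 Alpine package name and the pip/pip3 pairing are assumptions and not part of this commit:

    # Illustrative only: install Python 2 and Python 3 side by side.
    RUN apk add --no-cache python python3 && \
        python -m ensurepip && \
        python3 -m ensurepip && \
        # ensurepip is only needed for bootstrapping; drop it to keep the image small
        rm -r /usr/lib/python*/ensurepip && \
        pip install --upgrade pip setuptools && \
        pip3 install --upgrade pip setuptools && \
        rm -r /root/.cache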
entrypoint.sh
@@ -62,6 +62,14 @@ case "$SPARK_K8S_CMD" in
"$@"
)
;;
driver-py)
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--deploy-mode client
$PYSPARK_PRIMARY $PYSPARK_FILES "$@"
)
;;

executor)
CMD=(
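End-to-end, a hypothetical cluster-mode submission against the spark-py image would exercise the driver-py branch above. The master URL, image name, and file paths below are placeholders, and the PYSPARK_PYTHON overrides only illustrate the reviewers' "--conf to force an interpreter" idea using config names assumed from the Kubernetes docs; none of that is wired up by this commit itself.

    bin/spark-submit \
        --master k8s://https://<k8s-apiserver-host>:<port> \
        --deploy-mode cluster \
        --name pyspark-pi \
        --conf spark.kubernetes.container.image=docker.io/myrepo/spark-py:v2.4.0 \
        # Hypothetical: force a specific interpreter (only meaningful if the image ships it).
        --conf spark.kubernetes.driverEnv.PYSPARK_PYTHON=python3 \
        --conf spark.executorEnv.PYSPARK_PYTHON=python3 \
        --py-files local:///opt/spark/examples/src/main/python/sort.py \
        local:///opt/spark/examples/src/main/python/pi.py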