[SPARK-1267][PYSPARK] Adds pip installer for pyspark
Auberon Lopez committed Aug 19, 2015
commit a288923d7600055b9af346ed74f88c7be598fbb1
27 changes: 27 additions & 0 deletions python/pyspark/__init__.py
@@ -36,6 +36,33 @@
Finer-grained cache persistence levels.

"""
import os
import sys

import xml.etree.ElementTree as ET

if (os.environ.get("SPARK_HOME", "not found") == "not found"):

What about starting to add some of the logic from findspark to autodetect SPARK_HOME?

+1

Author

Currently, the only autodetection logic that findspark has is to check where homebrew installs Spark on OSX. I think that for pyspark this is overly specific and brittle, and can lead to confusion if the user wants to use pyspark with a different version than the one installed by homebrew. Having the user set SPARK_HOME themselves makes the process unambiguous.
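For context on what that autodetection might amount to, here is a rough sketch of a findspark-style helper. It is not part of the diff, and the candidate paths are illustrative guesses rather than an authoritative list:

import os
import glob

def _guess_spark_home():
    """Best-effort guess at SPARK_HOME; returns None when nothing is found."""
    # An explicit setting always wins over guessing.
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    # Illustrative candidate locations only (a homebrew Cellar path, or a
    # spark-* unpack in the home directory); real installs vary widely.
    candidates = (glob.glob("/usr/local/Cellar/apache-spark/*/libexec") +
                  glob.glob(os.path.expanduser("~/spark-*")))
    return candidates[0] if candidates else None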

I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically.

Are people launching python apps frequently without using spark-submit?

Author

One of the common use cases for pyspark without spark-submit is running from a notebook environment. I think a decent number of people do this, and more will once it's easier to do so.

That's possible via the following:

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook'  spark-1.4.0-bin-hadoop2.4/bin/pyspark

Not completely discoverable, but it works =)

I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like

./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

contains a lot of logic, especially when deploying against YARN, that seems important.

Contributor

Just to clarify, it's been fairly easy to get the pyspark launcher to launch via a notebook, either by doing what @justinuang said or by just setting IPYTHON=1 (which basically does the same thing).

But there are scenarios where it's useful to forgo the pyspark launcher entirely, i.e. launch with ipython and then do all the Spark-related stuff. One key use case is in containerized notebook deployments (like tmpnb) where we want a way to launch/deploy notebooks in a generic way (e.g. with ipython), but still let someone import and launch a SparkContext after the fact.

This PR is a great step; we could get closer to that goal by adding more autodetection / path-setting logic (as pointed out by @Carreau). But I agree with @alope107 that it might be too brittle, and it would definitely be some work to support all the config / deployment modes that SparkSubmit handles now (which themselves change across releases). I suspect that's why the core devs have tried to force the language APIs to go through a common launcher, but it would be good to get more input.
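For illustration, the notebook-first workflow being described boils down to something like this once pyspark is importable from any Python process (the local[*] master and app name are just placeholders for the example):

from pyspark import SparkConf, SparkContext

# Works from a plain ipython/notebook kernel, provided the environment
# (e.g. SPARK_HOME under the current proposal) is already set up.
conf = SparkConf().setAppName("notebook-example").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())
sc.stop()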

Contributor

if os.environ.get("SPARK_HOME") is None:

raise ImportError("Environment variable SPARK_HOME is undefined.")

spark_home = os.environ['SPARK_HOME']
pom_xml_file_path = spark_home + '/pom.xml'

os.path.join
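Presumably the suggestion amounts to something along these lines (a sketch, not part of the diff):

pom_xml_file_path = os.path.join(spark_home, "pom.xml")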


try:
    tree = ET.parse(pom_xml_file_path)
    root = tree.getroot()
    version_tag = root[4].text
    snapshot_version = version_tag[:5]
except:
    raise ImportError("Could not read the spark version, because pom.xml file" +
                      " is not found in SPARK_HOME(%s) directory." % (spark_home))

from pyspark.pyspark_version import __version__
if (snapshot_version != __version__):
    raise ImportError("Incompatible version of Spark(%s) and PySpark(%s)." %
                      (snapshot_version, __version__))

sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))

We don't need this anymore; presumably, if they pip installed the package, then py4j will already be installed in site-packages.
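If a fallback were still wanted for users running against a plain Spark distribution rather than a pip install, a guarded version could look roughly like this (reusing the os and sys imports above; the py4j zip name is the one hard-coded in the diff and changes between Spark releases):

try:
    import py4j  # a pip install would pull py4j in as a regular dependency
except ImportError:
    # Fall back to the copy bundled with the Spark distribution.
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"],
                                    "python", "lib", "py4j-0.8.1-src.zip"))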



from pyspark.conf import SparkConf
from pyspark.context import SparkContext
17 changes: 17 additions & 0 deletions python/pyspark/pyspark_version.py
@@ -0,0 +1,17 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
__version__ = '1.5.0'
Contributor

Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.

Author

I'm not seeing any version that's specific to pyspark, only a version for spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.

An alternative, but trickier, idea would be to have mvn's pom.xml version be the authoritative one, and have the build process somehow add or modify that file to match the version (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there is no risk of them going out of sync in the build.

Author

I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as a part of the build process? If so, how does this affect a user who installs from PyPI?

We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally such that our Python and Java code both get semantic versions based off of the latest tag and the git hash.
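A rough sketch of the generate-at-packaging-time idea, assuming setup.py runs from python/ with the Maven pom.xml one directory up (the paths and the helper name are assumptions, not anything agreed in this thread):

import os
import xml.etree.ElementTree as ET

def write_version_file():
    # Read the authoritative version from the top-level Maven pom.xml
    # and materialize it as pyspark/pyspark_version.py before packaging.
    pom = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "pom.xml")
    root = ET.parse(pom).getroot()
    # Look the <version> element up by name (namespace-aware) rather than
    # by position, so reordering the pom does not break the build.
    ns = root.tag.split("}")[0].lstrip("{")
    version = root.find("{%s}version" % ns).text
    with open("pyspark/pyspark_version.py", "w") as f:
        f.write("__version__ = '%s'\n" % version)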

Contributor

I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo).

I'd vote for generating the version while generating the PyPI package. If PySpark comes along with Spark, we don't need this check (at least it shouldn't fail or slow things down).

Author

So we remove the version checks entirely in the bundled version, and include them for the package uploaded to PyPI? I agree that this reduces the chance for maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark, and then later point their SPARK_HOME at a newer version of Spark. There would then be a version mismatch that wouldn't be detected.
Maybe a middle ground could be to include the version checks in both bundled and pip installations, but to include a check during PyPI package generation that the version has been properly set.
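The packaging-time check floated here could stay very small, e.g. something like the following (the function name and the pom-derived version string passed in are assumptions for illustration):

from pyspark.pyspark_version import __version__ as pyspark_version

def check_version_was_bumped(pom_version):
    # Fail the PyPI packaging step loudly if pyspark_version.py was not updated.
    if not pom_version.startswith(pyspark_version):
        raise RuntimeError("pyspark_version.py (%s) does not match pom.xml (%s)"
                           % (pyspark_version, pom_version))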

How is the version number specified for the scala side now?

Author

I'm not sure. Could someone with more experience with that side of the project chime in?


I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural and in line with other pip-installable libraries. I have experience packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.
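For what a jar-bundling layout might look like, one common setuptools pattern is roughly the following (the jars/ directory and its contents are hypothetical; this PR does not ship any jars):

from setuptools import setup

setup(name='pyspark',
      version='1.5.0',
      packages=['pyspark'],
      # Ship the Spark assembly jar(s) inside the package itself so that a
      # plain pip install brings a matching Spark along, no SPARK_HOME needed.
      package_data={'pyspark': ['jars/*.jar']},
      include_package_data=True)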

19 changes: 19 additions & 0 deletions python/setup.py
@@ -0,0 +1,19 @@
#!/usr/bin/env python

from setuptools import setup

exec(compile(open("pyspark/pyspark_version.py").read(),
             "pyspark/pyspark_version.py", 'exec'))
VERSION = __version__

setup(name = 'pyspark',
   version = VERSION,

why are we using three spaces for indentation?

also, pep8, there should be no spaces around the =

name='pyspark',
version=VERSION

   description = 'Apache Spark Python API',
   author = 'Prabin Banka',
   author_email = '[email protected]',
Contributor

If we want this in the Apache repo, then these individual author tags should be replaced with references to Apache.
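That could be as simple as swapping the metadata lines for something like the following (the exact strings are a project decision, not something settled in this thread):

   author = 'Apache Software Foundation',
   author_email = 'dev@spark.apache.org',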

   url = 'https://github.com/apache/spark/tree/master/python',
   packages = ['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
   data_files = [('pyspark', ['pyspark/pyspark_version.py'])],
   install_requires = ['numpy>=1.7', 'py4j==0.8.2.1', 'pandas'],
   license = 'http://www.apache.org/licenses/LICENSE-2.0',
   )