[SPARK-1267][PYSPARK] Adds pip installer for pyspark #8318
@@ -36,6 +36,33 @@
Finer-grained cache persistence levels.
"""
import os
import sys

import xml.etree.ElementTree as ET

if (os.environ.get("SPARK_HOME", "not found") == "not found"):
    raise ImportError("Environment variable SPARK_HOME is undefined.")

spark_home = os.environ['SPARK_HOME']
pom_xml_file_path = spark_home + '/pom.xml'
Comment: Use os.path.join here instead of concatenating the path with '/'.
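A minimal sketch of that suggestion, applied to the two lines above (assuming SPARK_HOME is set, as the diff already requires):

import os

spark_home = os.environ['SPARK_HOME']
# Build the path portably instead of concatenating with '/'.
pom_xml_file_path = os.path.join(spark_home, 'pom.xml')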
try:
    tree = ET.parse(pom_xml_file_path)
    root = tree.getroot()
    version_tag = root[4].text
    snapshot_version = version_tag[:5]
except:
    raise ImportError("Could not read the spark version, because pom.xml file" +
                      " is not found in SPARK_HOME(%s) directory." % (spark_home))

from pyspark.pyspark_version import __version__
if (snapshot_version != __version__):
    raise ImportError("Incompatible version of Spark(%s) and PySpark(%s)." %
                      (snapshot_version, __version__))

sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))
Comment: We don't need this anymore; presumably, if they pip installed the package, then py4j will already be installed in site-packages.
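A hedged sketch of that idea: only fall back to the bundled py4j sources when a pip-installed py4j is absent (the zip path is the one used in the diff above and varies by release):

import os
import sys

try:
    import py4j  # already in site-packages if pyspark was pip-installed
except ImportError:
    sys.path.insert(
        0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))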
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
@@ -0,0 +1,17 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
__version__ = '1.5.0'
Contributor: Is there a way to source this from some existing place? That way we don't have to update the version string in multiple places. I forget where, but there should already be a central place where the version is set.

Author: I'm not seeing any version that's specific to pyspark, only a version for Spark as a whole. I agree that we don't want to set a version in multiple places, but I think the one I introduced is the only version unique to pyspark.

Comment: An alternative, but trickier, idea would be to have mvn's pom.xml version be the authoritative one, and have the build process add or modify that file to match the version (maybe using mvn resource filtering?). This would break being able to just "pip install -e python" in development mode, since people would have to remember to run the mvn command to sync the file over, but at least there is no risk of the versions going out of sync in the build.

Author: I'm not sure I entirely follow. Are you suggesting that when Spark is built, Maven creates this pyspark_version file as part of the build process? If so, how does this affect a user who installs from PyPI?

Comment: We still need to build an sdist and wheel, so we can just make sure that whatever process we use adds that file in. Not sure if it's really worth the complexity at this moment, but my team does something internally so that our Python and Java code both get semantic versions based off the latest tag and the git hash.

Contributor: I think it's error-prone to have multiple copies of the version in different places; if someone forgets to update one of them, PySpark will break (even within the repo). I'd vote for generating the version when generating the PyPI package. If PySpark comes along with Spark, we don't need this check (at least it shouldn't fail or be slow).

Author: So we remove the version checks entirely in the bundled version, and include them only for the package uploaded to PyPI? I agree that this reduces the chance of maintainer error, but I'm worried about users upgrading versions of Spark. A user could install a bundled version of pyspark and then later point their SPARK_HOME at a newer version of Spark; there would then be a version mismatch that wouldn't be detected.

Comment: How is the version number specified for the Scala side now?

Author: I'm not sure. Could someone with more experience with that side of the project chime in?

Comment: I am in favor of pyspark packaging the corresponding version of Spark. As a user experience, this is cleaner, requires fewer steps, and is more natural / in line with other pip-installable libraries. I have experience in packaging jars with Python libraries in platform-independent ways and would be happy to help if wanted.
@@ -0,0 +1,19 @@
#!/usr/bin/env python

from setuptools import setup

exec(compile(open("pyspark/pyspark_version.py").read(),
             "pyspark/pyspark_version.py", 'exec'))
VERSION = __version__

setup(name = 'pyspark',
   version = VERSION,
Comment: Why are we using three spaces for indentation? Also, per PEP 8, there should be no spaces around the = in keyword arguments: name='pyspark', (see the restyled sketch after this diff).
   description = 'Apache Spark Python API',
   author = 'Prabin Banka',
   author_email = '[email protected]',
Contributor: If we want this in the Apache repo, then these individual author tags should be replaced with references to Apache.
   url = 'https://github.com/apache/spark/tree/master/python',
   packages = ['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql', 'pyspark.streaming'],
   data_files = [('pyspark', ['pyspark/pyspark_version.py'])],
   install_requires = ['numpy>=1.7', 'py4j==0.8.2.1', 'pandas'],
   license = 'http://www.apache.org/licenses/LICENSE-2.0',
   )
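As referenced in the indentation comment above, here is the same setup() call restyled with four-space indentation and no spaces around keyword '='. It is only a formatting sketch: the field values are copied from the diff, VERSION is a stand-in for the value the diff reads from pyspark/pyspark_version.py, and the author fields are omitted (see the separate comment about replacing them with references to Apache).

from setuptools import setup

VERSION = '1.5.0'  # stand-in; the diff derives this from pyspark/pyspark_version.py

setup(
    name='pyspark',
    version=VERSION,
    description='Apache Spark Python API',
    url='https://github.com/apache/spark/tree/master/python',
    packages=['pyspark', 'pyspark.mllib', 'pyspark.ml', 'pyspark.sql',
              'pyspark.streaming'],
    data_files=[('pyspark', ['pyspark/pyspark_version.py'])],
    install_requires=['numpy>=1.7', 'py4j==0.8.2.1', 'pandas'],
    license='http://www.apache.org/licenses/LICENSE-2.0',
)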
Comment: What about starting to add some of the logic from findspark to autodetect SPARK_HOME?
Comment: +1
Comment: findspark repo: https://github.com/minrk/findspark
Comment: Currently, the only autodetection logic that findspark has is to check where Homebrew installs Spark on OS X. I think that for pyspark this is overly specific and brittle, and can lead to confusion if the user wanted to use pyspark with a different version than the one installed by Homebrew. Having the user set SPARK_HOME themselves makes the process unambiguous.
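For reference, the kind of autodetection being discussed amounts to roughly the sketch below. The helper name and the candidate paths are hypothetical (the Homebrew path is the one findspark is said to check; the rest are illustrative), which also shows why reviewers call it brittle: the fallbacks are guesses about the local install.

import os

def find_spark_home():
    """Hypothetical findspark-style lookup: env var first, then known locations."""
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        return spark_home
    candidates = [
        "/usr/local/opt/apache-spark/libexec",  # Homebrew on OS X
        "/usr/lib/spark",                       # example system-wide install
    ]
    for path in candidates:
        if os.path.isdir(path):
            return path
    raise ImportError("Could not autodetect SPARK_HOME; please set it explicitly.")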
Comment: I agree with @alope107. In addition, if people are using spark-submit, then this isn't necessary, right? spark-submit sets up SPARK_HOME automatically. Are people launching Python apps frequently without using spark-submit?
Comment: One of the common use cases for pyspark without spark-submit is running from a notebook environment. I think there is a decent number of people that do this, and that more will once it's easier to do so.
Comment: That's possible via the following:
Not completely discoverable, but it works =)
Comment: I don't really understand this part of the code, so it would be nice to get some core devs to chime in, but it looks like it does a lot of logic, especially when deploying against YARN, which seems important.
Comment: Just to clarify, it's been fairly easy to get the pyspark launcher to launch via a notebook, either by doing what @justinuang said or by just setting IPYTHON=1 (which basically does the same). But there are scenarios where it's useful to forgo the pyspark launcher entirely, i.e. launch with ipython and then do all the Spark-related stuff. One key use case is containerized notebook deployments (like tmpnb), where we want a way to launch/deploy notebooks in a generic way (e.g. with ipython) but still let someone import and launch a SparkContext after the fact.

This PR is a great step; we could get closer to that goal by adding more autodetection / path-setting logic (as pointed out by @Carreau). But I agree with @alope107 that it might be too brittle, and it would definitely be some work to support all the config / deployment modes that SparkSubmit handles now (which themselves change across releases). I suspect that's why the core devs have tried to force the language APIs to go through a common launcher, but it would be good to get more input.
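To illustrate the "forgo the launcher" workflow described above: a rough sketch of bootstrapping Spark from a plain ipython or notebook session. The install path and the py4j zip name are placeholders that depend on the local Spark build; this is not code from the PR.

import os
import sys

# Point at a local Spark install and put its Python sources on the path.
os.environ["SPARK_HOME"] = "/path/to/spark"  # placeholder
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"],
                                "python/lib/py4j-0.8.2.1-src.zip"))  # zip name varies by release

from pyspark import SparkContext
sc = SparkContext("local[*]", "notebook-session")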
Comment: if os.environ.get("SPARK_HOME") is None:
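Spelled out, that suggestion would replace the "not found" sentinel comparison in the diff with something like the following sketch, keeping the original error message:

import os

if os.environ.get("SPARK_HOME") is None:
    raise ImportError("Environment variable SPARK_HOME is undefined.")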