# SPARK-1004. PySpark on YARN #30
A `.gitignore`-style ignore file gains entries for Python packaging artifacts (reconstructed from the flattened diff table; which lines are additions is inferred from the `-1,2 +1,5` hunk header):

```diff
@@ -1,2 +1,5 @@
 *.pyc
 docs/
+pyspark.egg-info
+build/
+dist/
```
A second file in the changeset was deleted (its name was lost in the scrape).
In the Py4J gateway launcher (reconstructed from the side-by-side diff; the file header was lost in the scrape), the `SPARK_HOME` lookup moves inside `launch_gateway`, which now also calls the new `set_env_vars_for_yarn()`:

```diff
@@ -24,10 +24,11 @@
 from py4j.java_gateway import java_import, JavaGateway, GatewayClient

-SPARK_HOME = os.environ["SPARK_HOME"]
 def launch_gateway():
+    SPARK_HOME = os.environ["SPARK_HOME"]
+    set_env_vars_for_yarn()
     # Launch the Py4j gateway using Spark's run command so that we pick up the
     # proper classpath and settings from spark-env.sh
     on_windows = platform.system() == "Windows"
```
@@ -70,3 +71,27 @@ def run(self): | |
| java_import(gateway.jvm, "org.apache.spark.sql.hive.TestHiveContext") | ||
| java_import(gateway.jvm, "scala.Tuple2") | ||
| return gateway | ||
|
|
||
| def set_env_vars_for_yarn(): | ||
| # Add the spark jar, which includes the pyspark files, to the python path | ||
| env_map = parse_env(os.environ.get("SPARK_YARN_USER_ENV", "")) | ||
| if "PYTHONPATH" in env_map: | ||
| env_map["PYTHONPATH"] += ":spark.jar" | ||
| else: | ||
| env_map["PYTHONPATH"] = "spark.jar" | ||
|
|
||
| os.environ["SPARK_YARN_USER_ENV"] = ",".join(k + '=' + v for (k, v) in env_map.items()) | ||
|
|
||
| def parse_env(env_str): | ||
| # Turns a comma-separated of env settings into a dict that maps env vars to | ||
| # their values. | ||
| env = {} | ||
| for var_str in env_str.split(","): | ||
| parts = var_str.split("=") | ||
| if len(parts) == 2: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it would be worth it to crash or throw an error when passed an invalid env string? |
||
| env[parts[0]] = parts[1] | ||
| elif len(var_str) > 0: | ||
| print "Invalid entry in SPARK_YARN_USER_ENV: " + var_str | ||
| sys.exit(1) | ||
|
|
||
| return env | ||
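The `parse_env` / `set_env_vars_for_yarn` pair is easy to exercise outside Spark. Below is a Python 3 sketch of the same logic; as one possible answer to the reviewer's question it raises `ValueError` on malformed entries instead of calling `sys.exit(1)` (`spark.jar` is the literal value used in the diff):

```python
import os

def parse_env(env_str):
    # Turn a comma-separated list of VAR=value settings into a dict.
    # Unlike the diff, malformed entries raise ValueError rather than
    # printing and calling sys.exit(1).
    env = {}
    for var_str in env_str.split(","):
        parts = var_str.split("=")
        if len(parts) == 2:
            env[parts[0]] = parts[1]
        elif len(var_str) > 0:
            raise ValueError("Invalid entry in SPARK_YARN_USER_ENV: " + var_str)
    return env

def set_env_vars_for_yarn():
    # Append "spark.jar" to any PYTHONPATH entry inside SPARK_YARN_USER_ENV
    # (or create one), then write the merged string back to the environment.
    env_map = parse_env(os.environ.get("SPARK_YARN_USER_ENV", ""))
    if "PYTHONPATH" in env_map:
        env_map["PYTHONPATH"] += ":spark.jar"
    else:
        env_map["PYTHONPATH"] = "spark.jar"
    os.environ["SPARK_YARN_USER_ENV"] = ",".join(
        k + "=" + v for (k, v) in env_map.items())
```

For example, starting from `SPARK_YARN_USER_ENV="FOO=bar"`, the merged result parses back to `{"FOO": "bar", "PYTHONPATH": "spark.jar"}`.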
In what appears to be `spark-config.sh` (the filename was lost in the scrape), the PySpark sources and the bundled Py4J zip are added to `PYTHONPATH`:

```diff
@@ -34,3 +34,6 @@ this="$config_bin/$script"
 export SPARK_PREFIX=`dirname "$this"`/..
 export SPARK_HOME=${SPARK_PREFIX}
 export SPARK_CONF_DIR="$SPARK_HOME/conf"
+# Add the PySpark classes to the PYTHONPATH:
+export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
+export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.1-src.zip:$PYTHONPATH
```

> **Contributor:** So this script I think only gets called when launching the standalone daemons. Would it make more sense to put this in …
>
> **Contributor:** Good point; I think we should move these lines to …
>
> **Contributor:** Looks like we never addressed this. Should we move this into …

(The code references at the end of these comments were lost in the scrape.)
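Because each `export` prepends to the existing value, the net ordering is: Py4J zip first, then the PySpark sources, then whatever `PYTHONPATH` already held. A small Python sketch of that string construction (`build_pythonpath` and the `/opt/spark` path are illustrative, not part of the patch):

```python
import os

def build_pythonpath(spark_home, existing=""):
    # Reproduce the net effect of the two `export PYTHONPATH=...` lines:
    # the py4j zip ends up first, then the PySpark sources, then any
    # pre-existing PYTHONPATH value.
    entries = [
        os.path.join(spark_home, "python", "lib", "py4j-0.8.1-src.zip"),
        os.path.join(spark_home, "python"),
    ]
    if existing:
        entries.append(existing)
    return ":".join(entries)
```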
> **Contributor:** If you make the proposed changes this could be simplified to just saying that you can run it in yarn-client mode.