Closed
Changes from 1 commit
Commits
75 commits
d86325f
Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop …
MLnick Dec 9, 2013
4b0a43f
Refactoring utils into own objects. Cleaning up old commented-out code
MLnick Dec 12, 2013
c304cc8
Adding supporting sequencefiles for tests. Cleaning up
MLnick Dec 15, 2013
4e7c9e3
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Dec 15, 2013
818a1e6
Add sequencefile and Hadoop InputFormat support to PythonRDD
MLnick Dec 15, 2013
4294cbb
Add old Hadoop api methods. Clean up and expand comments. Clean up ar…
MLnick Dec 19, 2013
0f5cd84
Remove unused pair UTF8 class. Add comments to msgpack deserializer
MLnick Dec 19, 2013
f1d73e3
mergeConfs returns a copy rather than mutating one of the input argum…
MLnick Dec 19, 2013
4d7ef2e
Fix indentation
MLnick Dec 19, 2013
eb40036
Remove unused comment lines
MLnick Dec 19, 2013
1c8efbc
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 13, 2014
619c0fa
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 20, 2014
703ee65
Add back msgpack
MLnick Jan 20, 2014
174f520
Add back graphx settings
MLnick Jan 20, 2014
795a763
Change name to WriteInputFormatTestDataGenerator. Cleanup some var na…
MLnick Jan 20, 2014
2beeedb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Feb 8, 2014
97ef708
Remove old writeToStream
MLnick Feb 14, 2014
41856a5
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
f2d76a0
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
e67212a
Add back msgpack dependency
MLnick Mar 19, 2014
dd57922
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 10, 2014
d72bf18
msgpack
MLnick Apr 10, 2014
0c612e5
Merge branch 'master' into pyspark-inputformats
MLnick Apr 12, 2014
65360d5
Adding test SequenceFiles
MLnick Apr 18, 2014
25da1ca
Add generator for nulls, bools, bytes and maps
MLnick Apr 18, 2014
7237263
Add back msgpack serializer and hadoop file code lost during merging
MLnick Apr 18, 2014
a67dfad
Clean up Msgpack serialization and registering
MLnick Apr 18, 2014
1bbbfb0
Clean up SparkBuild from merge
MLnick Apr 18, 2014
9d2256e
Merge branch 'master' into pyspark-inputformats
MLnick Apr 18, 2014
f6aac55
Bring back msgpack
MLnick Apr 18, 2014
951c117
Merge branch 'master' into pyspark-inputformats
MLnick Apr 19, 2014
b20ec7e
Clean up merge duplicate dependencies
MLnick Apr 19, 2014
4e08983
Clean up docs for PySpark context methods
MLnick Apr 19, 2014
fc5099e
Add Apache license headers
MLnick Apr 19, 2014
31a2fff
Scalastyle fixes
MLnick Apr 21, 2014
450e0a2
Merge branch 'master' into pyspark-inputformats
MLnick Apr 21, 2014
f60959e
Remove msgpack dependency and serializer from PySpark
MLnick Apr 21, 2014
17a656b
remove binary sequencefile for tests
MLnick Apr 21, 2014
1d7c17c
Amend tests to auto-generate sequencefile data in temp dir
MLnick Apr 21, 2014
c0ebfb6
Change sequencefile test data generator to easily be called from PySp…
MLnick Apr 21, 2014
44f2857
Remove msgpack dependency and switch serialization to Pyrolite, plus …
MLnick Apr 21, 2014
e7552fa
Merge branch 'master' into pyspark-inputformats
MLnick Apr 22, 2014
64eb051
Scalastyle fix
MLnick Apr 22, 2014
78978d9
Add doc for SequenceFile and InputFormat support to Python programmin…
MLnick Apr 22, 2014
e001b94
Fix test failures due to ordering
MLnick Apr 23, 2014
bef3afb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 23, 2014
35b8e3a
Another fix for test ordering
MLnick Apr 23, 2014
5af4770
Merge branch 'master' into pyspark-inputformats
MLnick Apr 23, 2014
077ecb2
Recover earlier changes lost in previous merge for context.py
MLnick Apr 23, 2014
9ef1896
Recover earlier changes lost in previous merge for serializers.py
MLnick Apr 23, 2014
93ef995
Add back context.py changes
MLnick Apr 23, 2014
7caa73a
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 23, 2014
d0f52b6
Python programming guide
MLnick May 23, 2014
84fe8e3
Python programming guide space formatting
MLnick May 23, 2014
9fe6bd5
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 31, 2014
15a7d07
Remove default args for key/value classes. Arg names to camelCase
MLnick Jun 3, 2014
01e0813
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 3, 2014
1a4a1d6
Address @mateiz style comments
MLnick Jun 3, 2014
94beedc
Clean up args in PythonRDD. Set key/value converter defaults to None …
MLnick Jun 3, 2014
43eb728
PySpark InputFormats docs into programming guide
MLnick Jun 3, 2014
085b55f
Move input format tests to tests.py and clean up docs
MLnick Jun 3, 2014
5757f6e
Default key/value classes for sequenceFile are None
MLnick Jun 3, 2014
b65606f
Add converter interface
MLnick Jun 4, 2014
2c18513
Add examples for reading HBase and Cassandra InputFormats from Python
MLnick Jun 4, 2014
3f90c3e
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 4, 2014
1eaa08b
HBase -> Cassandra app name oversight
MLnick Jun 4, 2014
eeb8205
Fix path relative to SPARK_HOME in tests
MLnick Jun 4, 2014
365d0be
Make classes private[python]. Add docs and @Experimental annotation t…
MLnick Jun 5, 2014
a985492
Move Converter examples to own package
MLnick Jun 5, 2014
5ebacfa
Update docs for PySpark input formats
MLnick Jun 5, 2014
cde6af9
Parameterize converter trait
MLnick Jun 6, 2014
d150431
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 6, 2014
4c972d8
Add license headers
MLnick Jun 6, 2014
761269b
Address @pwendell comments, simplify default writable conversions and…
MLnick Jun 7, 2014
268df7e
Documentation changes per @pwendell comments
MLnick Jun 8, 2014
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
Conflicts:
	docs/python-programming-guide.md
MLnick committed May 23, 2014
commit 7caa73a2e9fe162c672432dbbd8d79e45d9c5c64
47 changes: 24 additions & 23 deletions docs/python-programming-guide.md
@@ -45,7 +45,7 @@ errors = logData.filter(is_error)

PySpark will automatically ship these functions to executors, along with any objects that they reference.
Instances of classes will be serialized and shipped to executors by PySpark, but classes themselves cannot be automatically distributed to executors.
The [Standalone Use](#standalone-programs) section describes how to ship code dependencies to executors.
The [Standalone Use](#standalone-use) section describes how to ship code dependencies to executors.
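For example, an object created in the driver and referenced from a closure is pickled and sent with each task, while its class definition must be made available to the executors separately (via the dependency-shipping mechanisms described in that section). A minimal sketch, assuming `logData` is an existing RDD of log lines:

{% highlight python %}
class ErrorMatcher(object):
    def __init__(self, tag):
        self.tag = tag

    def matches(self, line):
        return self.tag in line

matcher = ErrorMatcher("ERROR")
# The matcher instance is serialized and shipped with the closure below;
# the ErrorMatcher class itself is not automatically distributed.
errors = logData.filter(lambda line: matcher.matches(line))
{% endhighlight %}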

In addition, PySpark fully supports interactive use---simply run `./bin/pyspark` to launch an interactive shell.

@@ -62,7 +62,7 @@ All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.

# Interactive Use

The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line:
The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options:

{% highlight bash %}
$ sbt/sbt assembly
@@ -79,28 +79,24 @@ The Python shell can be used to explore data interactively and is a simple way to l
{% endhighlight %}

By default, the `bin/pyspark` shell creates a SparkContext that runs applications locally on all of
your machine's logical cores. To connect to a non-local cluster, or to specify a number of cores,
set the `--master` flag. For example, to use the `bin/pyspark` shell with a
[standalone Spark cluster](spark-standalone.html):
your machine's logical cores.
To connect to a non-local cluster, or to specify a number of cores, set the `MASTER` environment variable.
For example, to use the `bin/pyspark` shell with a [standalone Spark cluster](spark-standalone.html):

{% highlight bash %}
$ ./bin/pyspark --master spark://1.2.3.4:7077
$ MASTER=spark://IP:PORT ./bin/pyspark
{% endhighlight %}

Or, to use exactly four cores on the local machine:

{% highlight bash %}
$ ./bin/pyspark --master local[4]
$ MASTER=local[4] ./bin/pyspark
{% endhighlight %}

Under the hood `bin/pyspark` is a wrapper around the
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit), so these
two scripts share the same list of options. For a complete list of options, run `bin/pyspark` with
the `--help` option.

## IPython

It is also possible to launch the PySpark shell in [IPython](http://ipython.org), the
It is also possible to launch PySpark in [IPython](http://ipython.org), the
enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To
use IPython, set the `IPYTHON` variable to `1` when running `bin/pyspark`:

@@ -115,23 +111,23 @@ the [IPython Notebook](http://ipython.org/notebook.html) with PyLab graphing sup
$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
{% endhighlight %}

IPython also works on a cluster or on multiple cores if you set the `--master` flag.
IPython also works on a cluster or on multiple cores if you set the `MASTER` environment variable.


# Standalone Programs

PySpark can also be used from standalone Python scripts by creating a SparkContext in your script
and running the script using `bin/spark-submit`. The Quick Start guide includes a
[complete example](quick-start.html#standalone-applications) of a standalone Python application.
PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `bin/pyspark`.
The Quick Start guide includes a [complete example](quick-start.html#a-standalone-app-in-python) of a standalone Python application.

Code dependencies can be deployed by passing .zip or .egg files in the `--py-files` option of `spark-submit`:
Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:

{% highlight bash %}
./bin/spark-submit --py-files lib1.zip,lib2.zip my_script.py
{% highlight python %}
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
{% endhighlight %}

Files listed here will be added to the `PYTHONPATH` and shipped to remote worker machines.
Code dependencies can also be added to an existing SparkContext at runtime using its `addPyFile()` method.
Code dependencies can be added to an existing SparkContext using its `addPyFile()` method.
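For example, a minimal sketch of adding a dependency after the context has been created (the file path is illustrative):

{% highlight python %}
# Illustrative path: the archive is shipped to the workers and placed on
# their PYTHONPATH for tasks submitted after this call.
sc.addPyFile("deps/extra_helpers.zip")
{% endhighlight %}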

You can set [configuration properties](configuration.html#spark-properties) by passing a
[SparkConf](api/python/pyspark.conf.SparkConf-class.html) object to SparkContext:
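For example (a minimal sketch; the property values are illustrative):

{% highlight python %}
from pyspark import SparkConf, SparkContext

# Illustrative settings only; any Spark property can be set this way.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("My App")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
{% endhighlight %}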
@@ -218,6 +214,11 @@ Future support for 'wrapper' functions for keys/values that allows this to be wr
and called from Python, as well as support for writing data out as SequenceFile format
and other OutputFormats, is forthcoming.
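In the meantime, reading a SequenceFile into an RDD of Python objects might look like the following sketch (the path and Writable class names are illustrative):

{% highlight python %}
# Illustrative path and Writable classes; keys and values are converted
# to Python objects (here, unicode strings and ints).
rdd = sc.sequenceFile("hdfs:///data/counts.seq",
                      keyClass="org.apache.hadoop.io.Text",
                      valueClass="org.apache.hadoop.io.IntWritable")
rdd.take(5)
{% endhighlight %}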

`spark-submit` supports launching Python applications on standalone, Mesos or YARN clusters, through
its `--master` argument. However, it currently requires the Python driver program to run on the local
machine, not the cluster (i.e. the `--deploy-mode` parameter cannot be `cluster`).


# API Docs

[API documentation](api/python/index.html) for PySpark is available as Epydoc.
@@ -231,9 +232,9 @@ some example applications.

# Where to Go from Here

PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
PySpark also includes several sample programs in the [`python/examples` folder](https://github.com/apache/spark/tree/master/python/examples).
You can run them by passing the files to `pyspark`; e.g.:

./bin/spark-submit examples/src/main/python/wordcount.py README.md
./bin/pyspark python/examples/wordcount.py

Each program prints usage help when run without the sufficient arguments.
Each program prints usage help when run without arguments.
You are viewing a condensed version of this merge commit.