Closed
Changes from 1 commit
Commits
75 commits
d86325f
Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop …
MLnick Dec 9, 2013
4b0a43f
Refactoring utils into own objects. Cleaning up old commented-out code
MLnick Dec 12, 2013
c304cc8
Adding supporting sequencefiles for tests. Cleaning up
MLnick Dec 15, 2013
4e7c9e3
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Dec 15, 2013
818a1e6
Add sequencefile and Hadoop InputFormat support to PythonRDD
MLnick Dec 15, 2013
4294cbb
Add old Hadoop api methods. Clean up and expand comments. Clean up ar…
MLnick Dec 19, 2013
0f5cd84
Remove unused pair UTF8 class. Add comments to msgpack deserializer
MLnick Dec 19, 2013
f1d73e3
mergeConfs returns a copy rather than mutating one of the input argum…
MLnick Dec 19, 2013
4d7ef2e
Fix indentation
MLnick Dec 19, 2013
eb40036
Remove unused comment lines
MLnick Dec 19, 2013
1c8efbc
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 13, 2014
619c0fa
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 20, 2014
703ee65
Add back msgpack
MLnick Jan 20, 2014
174f520
Add back graphx settings
MLnick Jan 20, 2014
795a763
Change name to WriteInputFormatTestDataGenerator. Cleanup some var na…
MLnick Jan 20, 2014
2beeedb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Feb 8, 2014
97ef708
Remove old writeToStream
MLnick Feb 14, 2014
41856a5
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
f2d76a0
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
e67212a
Add back msgpack dependency
MLnick Mar 19, 2014
dd57922
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 10, 2014
d72bf18
msgpack
MLnick Apr 10, 2014
0c612e5
Merge branch 'master' into pyspark-inputformats
MLnick Apr 12, 2014
65360d5
Adding test SequenceFiles
MLnick Apr 18, 2014
25da1ca
Add generator for nulls, bools, bytes and maps
MLnick Apr 18, 2014
7237263
Add back msgpack serializer and hadoop file code lost during merging
MLnick Apr 18, 2014
a67dfad
Clean up Msgpack serialization and registering
MLnick Apr 18, 2014
1bbbfb0
Clean up SparkBuild from merge
MLnick Apr 18, 2014
9d2256e
Merge branch 'master' into pyspark-inputformats
MLnick Apr 18, 2014
f6aac55
Bring back msgpack
MLnick Apr 18, 2014
951c117
Merge branch 'master' into pyspark-inputformats
MLnick Apr 19, 2014
b20ec7e
Clean up merge duplicate dependencies
MLnick Apr 19, 2014
4e08983
Clean up docs for PySpark context methods
MLnick Apr 19, 2014
fc5099e
Add Apache license headers
MLnick Apr 19, 2014
31a2fff
Scalastyle fixes
MLnick Apr 21, 2014
450e0a2
Merge branch 'master' into pyspark-inputformats
MLnick Apr 21, 2014
f60959e
Remove msgpack dependency and serializer from PySpark
MLnick Apr 21, 2014
17a656b
remove binary sequencefile for tests
MLnick Apr 21, 2014
1d7c17c
Amend tests to auto-generate sequencefile data in temp dir
MLnick Apr 21, 2014
c0ebfb6
Change sequencefile test data generator to easily be called from PySp…
MLnick Apr 21, 2014
44f2857
Remove msgpack dependency and switch serialization to Pyrolite, plus …
MLnick Apr 21, 2014
e7552fa
Merge branch 'master' into pyspark-inputformats
MLnick Apr 22, 2014
64eb051
Scalastyle fix
MLnick Apr 22, 2014
78978d9
Add doc for SequenceFile and InputFormat support to Python programmin…
MLnick Apr 22, 2014
e001b94
Fix test failures due to ordering
MLnick Apr 23, 2014
bef3afb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 23, 2014
35b8e3a
Another fix for test ordering
MLnick Apr 23, 2014
5af4770
Merge branch 'master' into pyspark-inputformats
MLnick Apr 23, 2014
077ecb2
Recover earlier changes lost in previous merge for context.py
MLnick Apr 23, 2014
9ef1896
Recover earlier changes lost in previous merge for serializers.py
MLnick Apr 23, 2014
93ef995
Add back context.py changes
MLnick Apr 23, 2014
7caa73a
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 23, 2014
d0f52b6
Python programming guide
MLnick May 23, 2014
84fe8e3
Python programming guide space formatting
MLnick May 23, 2014
9fe6bd5
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 31, 2014
15a7d07
Remove default args for key/value classes. Arg names to camelCase
MLnick Jun 3, 2014
01e0813
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 3, 2014
1a4a1d6
Address @mateiz style comments
MLnick Jun 3, 2014
94beedc
Clean up args in PythonRDD. Set key/value converter defaults to None …
MLnick Jun 3, 2014
43eb728
PySpark InputFormats docs into programming guide
MLnick Jun 3, 2014
085b55f
Move input format tests to tests.py and clean up docs
MLnick Jun 3, 2014
5757f6e
Default key/value classes for sequenceFile are None
MLnick Jun 3, 2014
b65606f
Add converter interface
MLnick Jun 4, 2014
2c18513
Add examples for reading HBase and Cassandra InputFormats from Python
MLnick Jun 4, 2014
3f90c3e
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 4, 2014
1eaa08b
HBase -> Cassandra app name oversight
MLnick Jun 4, 2014
eeb8205
Fix path relative to SPARK_HOME in tests
MLnick Jun 4, 2014
365d0be
Make classes private[python]. Add docs and @Experimental annotation t…
MLnick Jun 5, 2014
a985492
Move Converter examples to own package
MLnick Jun 5, 2014
5ebacfa
Update docs for PySpark input formats
MLnick Jun 5, 2014
cde6af9
Parameterize converter trait
MLnick Jun 6, 2014
d150431
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 6, 2014
4c972d8
Add license headers
MLnick Jun 6, 2014
761269b
Address @pwendell comments, simplify default writable conversions and…
MLnick Jun 7, 2014
268df7e
Documentation changes per @pwendell comments
MLnick Jun 8, 2014
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
Conflicts:
	docs/python-programming-guide.md
MLnick committed May 23, 2014
commit 7caa73a2e9fe162c672432dbbd8d79e45d9c5c64
47 changes: 24 additions & 23 deletions docs/python-programming-guide.md
@@ -45,7 +45,7 @@ errors = logData.filter(is_error)

PySpark will automatically ship these functions to executors, along with any objects that they reference.
Instances of classes will be serialized and shipped to executors by PySpark, but classes themselves cannot be automatically distributed to executors.
The [Standalone Use](#standalone-programs) section describes how to ship code dependencies to executors.
The [Standalone Use](#standalone-use) section describes how to ship code dependencies to executors.
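For example, an object created in the driver and referenced from a closure is pickled and sent with each task, while its class definition must be made available to the executors separately (via the dependency-shipping mechanisms described in that section). A minimal sketch, assuming `logData` is an existing RDD of log lines:

{% highlight python %}
class ErrorMatcher(object):
    def __init__(self, tag):
        self.tag = tag

    def matches(self, line):
        return self.tag in line

matcher = ErrorMatcher("ERROR")
# The matcher instance is serialized and shipped with the closure below;
# the ErrorMatcher class itself is not automatically distributed.
errors = logData.filter(lambda line: matcher.matches(line))
{% endhighlight %}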

In addition, PySpark fully supports interactive use---simply run `./bin/pyspark` to launch an interactive shell.

@@ -62,7 +62,7 @@ All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.

# Interactive Use

The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line:
The `bin/pyspark` script launches a Python interpreter that is configured to run PySpark applications. To use `pyspark` interactively, first build Spark, then launch it directly from the command line without any options:

{% highlight bash %}
$ sbt/sbt assembly
@@ -79,28 +79,24 @@ The Python shell can be used to explore data interactively and is a simple way to l
{% endhighlight %}

By default, the `bin/pyspark` shell creates a SparkContext that runs applications locally on all of
your machine's logical cores. To connect to a non-local cluster, or to specify a number of cores,
set the `--master` flag. For example, to use the `bin/pyspark` shell with a
[standalone Spark cluster](spark-standalone.html):
your machine's logical cores.
To connect to a non-local cluster, or to specify a number of cores, set the `MASTER` environment variable.
For example, to use the `bin/pyspark` shell with a [standalone Spark cluster](spark-standalone.html):

{% highlight bash %}
$ ./bin/pyspark --master spark://1.2.3.4:7077
$ MASTER=spark://IP:PORT ./bin/pyspark
{% endhighlight %}

Or, to use exactly four cores on the local machine:

{% highlight bash %}
$ ./bin/pyspark --master local[4]
$ MASTER=local[4] ./bin/pyspark
{% endhighlight %}

Under the hood `bin/pyspark` is a wrapper around the
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit), so these
two scripts share the same list of options. For a complete list of options, run `bin/pyspark` with
the `--help` option.

## IPython

It is also possible to launch the PySpark shell in [IPython](http://ipython.org), the
It is also possible to launch PySpark in [IPython](http://ipython.org), the
enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To
use IPython, set the `IPYTHON` variable to `1` when running `bin/pyspark`:

@@ -115,23 +111,23 @@ the [IPython Notebook](http://ipython.org/notebook.html) with PyLab graphing sup
$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
{% endhighlight %}

IPython also works on a cluster or on multiple cores if you set the `--master` flag.
IPython also works on a cluster or on multiple cores if you set the `MASTER` environment variable.


# Standalone Programs

PySpark can also be used from standalone Python scripts by creating a SparkContext in your script
and running the script using `bin/spark-submit`. The Quick Start guide includes a
[complete example](quick-start.html#standalone-applications) of a standalone Python application.
PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `bin/pyspark`.
The Quick Start guide includes a [complete example](quick-start.html#a-standalone-app-in-python) of a standalone Python application.

Code dependencies can be deployed by passing .zip or .egg files in the `--py-files` option of `spark-submit`:
Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:

{% highlight bash %}
./bin/spark-submit --py-files lib1.zip,lib2.zip my_script.py
{% highlight python %}
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
{% endhighlight %}

Files listed here will be added to the `PYTHONPATH` and shipped to remote worker machines.
Code dependencies can also be added to an existing SparkContext at runtime using its `addPyFile()` method.
Code dependencies can be added to an existing SparkContext using its `addPyFile()` method.
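For example, a minimal sketch of adding a dependency after the context has been created (the file path is illustrative):

{% highlight python %}
# Illustrative path: the archive is shipped to the workers and placed on
# their PYTHONPATH for tasks submitted after this call.
sc.addPyFile("deps/extra_helpers.zip")
{% endhighlight %}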

You can set [configuration properties](configuration.html#spark-properties) by passing a
[SparkConf](api/python/pyspark.conf.SparkConf-class.html) object to SparkContext:
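For example (a minimal sketch; the property values are illustrative):

{% highlight python %}
from pyspark import SparkConf, SparkContext

# Illustrative settings only; any Spark property can be set this way.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("My App")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
{% endhighlight %}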
@@ -218,6 +214,11 @@ Future support for 'wrapper' functions for keys/values that allows this to be wr
and called from Python, as well as support for writing data out as SequenceFile format
and other OutputFormats, is forthcoming.
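In the meantime, reading a SequenceFile into an RDD of Python objects might look like the following sketch (the path and Writable class names are illustrative):

{% highlight python %}
# Illustrative path and Writable classes; keys and values are converted
# to Python objects (here, unicode strings and ints).
rdd = sc.sequenceFile("hdfs:///data/counts.seq",
                      keyClass="org.apache.hadoop.io.Text",
                      valueClass="org.apache.hadoop.io.IntWritable")
rdd.take(5)
{% endhighlight %}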

`spark-submit` supports launching Python applications on standalone, Mesos or YARN clusters, through
its `--master` argument. However, it currently requires the Python driver program to run on the local
machine, not the cluster (i.e. the `--deploy-mode` parameter cannot be `cluster`).


# API Docs

[API documentation](api/python/index.html) for PySpark is available as Epydoc.
@@ -231,9 +232,9 @@ some example applications.

# Where to Go from Here

PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
PySpark also includes several sample programs in the [`python/examples` folder](https://github.com/apache/spark/tree/master/python/examples).
You can run them by passing the files to `pyspark`; e.g.:

./bin/spark-submit examples/src/main/python/wordcount.py README.md
./bin/pyspark python/examples/wordcount.py

Each program prints usage help when run without the sufficient arguments.
Each program prints usage help when run without arguments.
You are viewing a condensed version of this merge commit.