Closed
Changes from 1 commit · 75 commits
d86325f
Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop …
MLnick Dec 9, 2013
4b0a43f
Refactoring utils into own objects. Cleaning up old commented-out code
MLnick Dec 12, 2013
c304cc8
Adding supporting sequencefiles for tests. Cleaning up
MLnick Dec 15, 2013
4e7c9e3
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Dec 15, 2013
818a1e6
Add sequencefile and Hadoop InputFormat support to PythonRDD
MLnick Dec 15, 2013
4294cbb
Add old Hadoop api methods. Clean up and expand comments. Clean up ar…
MLnick Dec 19, 2013
0f5cd84
Remove unused pair UTF8 class. Add comments to msgpack deserializer
MLnick Dec 19, 2013
f1d73e3
mergeConfs returns a copy rather than mutating one of the input argum…
MLnick Dec 19, 2013
4d7ef2e
Fix indentation
MLnick Dec 19, 2013
eb40036
Remove unused comment lines
MLnick Dec 19, 2013
1c8efbc
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 13, 2014
619c0fa
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jan 20, 2014
703ee65
Add back msgpack
MLnick Jan 20, 2014
174f520
Add back graphx settings
MLnick Jan 20, 2014
795a763
Change name to WriteInputFormatTestDataGenerator. Cleanup some var na…
MLnick Jan 20, 2014
2beeedb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Feb 8, 2014
97ef708
Remove old writeToStream
MLnick Feb 14, 2014
41856a5
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
f2d76a0
Merge branch 'master' into pyspark-inputformats
MLnick Mar 19, 2014
e67212a
Add back msgpack dependency
MLnick Mar 19, 2014
dd57922
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 10, 2014
d72bf18
msgpack
MLnick Apr 10, 2014
0c612e5
Merge branch 'master' into pyspark-inputformats
MLnick Apr 12, 2014
65360d5
Adding test SequenceFiles
MLnick Apr 18, 2014
25da1ca
Add generator for nulls, bools, bytes and maps
MLnick Apr 18, 2014
7237263
Add back msgpack serializer and hadoop file code lost during merging
MLnick Apr 18, 2014
a67dfad
Clean up Msgpack serialization and registering
MLnick Apr 18, 2014
1bbbfb0
Clean up SparkBuild from merge
MLnick Apr 18, 2014
9d2256e
Merge branch 'master' into pyspark-inputformats
MLnick Apr 18, 2014
f6aac55
Bring back msgpack
MLnick Apr 18, 2014
951c117
Merge branch 'master' into pyspark-inputformats
MLnick Apr 19, 2014
b20ec7e
Clean up merge duplicate dependencies
MLnick Apr 19, 2014
4e08983
Clean up docs for PySpark context methods
MLnick Apr 19, 2014
fc5099e
Add Apache license headers
MLnick Apr 19, 2014
31a2fff
Scalastyle fixes
MLnick Apr 21, 2014
450e0a2
Merge branch 'master' into pyspark-inputformats
MLnick Apr 21, 2014
f60959e
Remove msgpack dependency and serializer from PySpark
MLnick Apr 21, 2014
17a656b
remove binary sequencefile for tests
MLnick Apr 21, 2014
1d7c17c
Amend tests to auto-generate sequencefile data in temp dir
MLnick Apr 21, 2014
c0ebfb6
Change sequencefile test data generator to easily be called from PySp…
MLnick Apr 21, 2014
44f2857
Remove msgpack dependency and switch serialization to Pyrolite, plus …
MLnick Apr 21, 2014
e7552fa
Merge branch 'master' into pyspark-inputformats
MLnick Apr 22, 2014
64eb051
Scalastyle fix
MLnick Apr 22, 2014
78978d9
Add doc for SequenceFile and InputFormat support to Python programmin…
MLnick Apr 22, 2014
e001b94
Fix test failures due to ordering
MLnick Apr 23, 2014
bef3afb
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Apr 23, 2014
35b8e3a
Another fix for test ordering
MLnick Apr 23, 2014
5af4770
Merge branch 'master' into pyspark-inputformats
MLnick Apr 23, 2014
077ecb2
Recover earlier changes lost in previous merge for context.py
MLnick Apr 23, 2014
9ef1896
Recover earlier changes lost in previous merge for serializers.py
MLnick Apr 23, 2014
93ef995
Add back context.py changes
MLnick Apr 23, 2014
7caa73a
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 23, 2014
d0f52b6
Python programming guide
MLnick May 23, 2014
84fe8e3
Python programming guide space formatting
MLnick May 23, 2014
9fe6bd5
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick May 31, 2014
15a7d07
Remove default args for key/value classes. Arg names to camelCase
MLnick Jun 3, 2014
01e0813
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 3, 2014
1a4a1d6
Address @mateiz style comments
MLnick Jun 3, 2014
94beedc
Clean up args in PythonRDD. Set key/value converter defaults to None …
MLnick Jun 3, 2014
43eb728
PySpark InputFormats docs into programming guide
MLnick Jun 3, 2014
085b55f
Move input format tests to tests.py and clean up docs
MLnick Jun 3, 2014
5757f6e
Default key/value classes for sequenceFile are None
MLnick Jun 3, 2014
b65606f
Add converter interface
MLnick Jun 4, 2014
2c18513
Add examples for reading HBase and Cassandra InputFormats from Python
MLnick Jun 4, 2014
3f90c3e
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 4, 2014
1eaa08b
HBase -> Cassandra app name oversight
MLnick Jun 4, 2014
eeb8205
Fix path relative to SPARK_HOME in tests
MLnick Jun 4, 2014
365d0be
Make classes private[python]. Add docs and @Experimental annotation t…
MLnick Jun 5, 2014
a985492
Move Converter examples to own package
MLnick Jun 5, 2014
5ebacfa
Update docs for PySpark input formats
MLnick Jun 5, 2014
cde6af9
Parameterize converter trait
MLnick Jun 6, 2014
d150431
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
MLnick Jun 6, 2014
4c972d8
Add license headers
MLnick Jun 6, 2014
761269b
Address @pwendell comments, simplify default writable conversions and…
MLnick Jun 7, 2014
268df7e
Documentation changes per @pwendell comments
MLnick Jun 8, 2014
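Taken together, the commit series above adds SequenceFile and arbitrary Hadoop InputFormat reading to PySpark. As a minimal usage sketch of the resulting API (the paths, Writable classes, and InputFormat name are illustrative, and the keyword names reflect the later "Arg names to camelCase" commit rather than this merge):

from pyspark import SparkContext

sc = SparkContext(appName="InputFormatExample")

# Read a SequenceFile; Writable keys/values are deserialized into Python objects.
pairs = sc.sequenceFile("/tmp/test/seqfile",
                        keyClass="org.apache.hadoop.io.IntWritable",
                        valueClass="org.apache.hadoop.io.Text")

# Read any new-API Hadoop InputFormat by fully qualified class name.
lines = sc.newAPIHadoopFile(
    "/tmp/test/textfile",
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text")

print(pairs.collect())  # assumes the illustrative input paths actually exist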
Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
Conflicts:
	project/SparkBuild.scala
	python/pyspark/context.py
MLnick committed Jan 13, 2014
commit 1c8efbc4a87cd7d719d8fef4e41781c67b414a6f
10 changes: 6 additions & 4 deletions project/SparkBuild.scala
@@ -245,12 +245,12 @@ object SparkBuild extends Build {
         "org.slf4j" % "slf4j-api" % slf4jVersion,
         "org.slf4j" % "slf4j-log4j12" % slf4jVersion,
         "commons-daemon" % "commons-daemon" % "1.0.10", // workaround for bug HADOOP-9407
-        "com.ning" % "compress-lzf" % "1.0.0",
+        "com.ning" % "compress-lzf" % "0.8.4",
         "org.xerial.snappy" % "snappy-java" % "1.0.5",
         "org.ow2.asm" % "asm" % "4.0",
-        "org.spark-project.akka" %% "akka-remote" % "2.2.3-shaded-protobuf" excludeAll(excludeNetty),
-        "org.spark-project.akka" %% "akka-slf4j" % "2.2.3-shaded-protobuf" excludeAll(excludeNetty),
-        "org.spark-project.akka" %% "akka-testkit" % "2.2.3-shaded-protobuf" % "test",
+        "com.google.protobuf" % "protobuf-java" % "2.4.1",
+        "com.typesafe.akka" %% "akka-remote" % "2.2.3" excludeAll(excludeNetty),
+        "com.typesafe.akka" %% "akka-slf4j" % "2.2.3" excludeAll(excludeNetty),
         "net.liftweb" %% "lift-json" % "2.5.1" excludeAll(excludeNetty),
         "it.unimi.dsi" % "fastutil" % "6.4.4",
         "colt" % "colt" % "1.2.0",
@@ -268,6 +268,8 @@ object SparkBuild extends Build {
         "com.codahale.metrics" % "metrics-graphite" % "3.0.0",
         "com.twitter" %% "chill" % "0.3.1",
         "com.twitter" % "chill-java" % "0.3.1",
+        "com.typesafe" % "config" % "1.0.2",
+        "com.clearspring.analytics" % "stream" % "2.5.1",
         "org.msgpack" %% "msgpack-scala" % "0.6.8"
       )
     )
14 changes: 6 additions & 8 deletions python/pyspark/context.py
@@ -50,8 +50,8 @@ class SparkContext(object):
     _python_includes = None # zip and egg files that need to be added to PYTHONPATH


-    def __init__(self, master, jobName, sparkHome=None, pyFiles=None,
-            environment=None, batchSize=1024, serializer=PickleSerializer()):
+    def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
+        environment=None, batchSize=1024, serializer=PickleSerializer(), conf=None):
         """
         Create a new SparkContext. At least the master and app name should be set,
         either through the named parameters here or through C{conf}.
@@ -120,17 +120,15 @@ def __init__(self, master, jobName, sparkHome=None, pyFiles=None,
                 self.environment[varName] = v

         # Create the Java SparkContext through Py4J
-        empty_string_array = self._gateway.new_array(self._jvm.String, 0)
-        self._jsc = self._jvm.JavaSparkContext(master, jobName, sparkHome,
-                                               empty_string_array)
+        self._jsc = self._jvm.JavaSparkContext(self._conf._jconf)

         # Create a single Accumulator in Java that we'll send all our updates through;
         # they will be passed back to us through a TCP server
         self._accumulatorServer = accumulators._start_update_server()
         (host, port) = self._accumulatorServer.server_address
         self._javaAccumulator = self._jsc.accumulator(
-            self._jvm.java.util.ArrayList(),
-            self._jvm.PythonAccumulatorParam(host, port))
+                self._jvm.java.util.ArrayList(),
+                self._jvm.PythonAccumulatorParam(host, port))

         self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python')

@@ -465,7 +463,7 @@ def _getJavaStorageLevel(self, storageLevel):

         newStorageLevel = self._jvm.org.apache.spark.storage.StorageLevel
         return newStorageLevel(storageLevel.useDisk, storageLevel.useMemory,
-            storageLevel.deserialized, storageLevel.replication)
+                               storageLevel.deserialized, storageLevel.replication)
Contributor comment on the line above:

This is also reverting an earlier change and will break; you need to keep it the way it was.
def _test():
You are viewing a condensed version of this merge commit.
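For context on the context.py hunks above: the merge adopts the SparkConf-based constructor from upstream master, building the JavaSparkContext from a wrapped SparkConf rather than from positional arguments. A minimal sketch of the two construction styles the new signature allows (master URL and app name are illustrative):

from pyspark import SparkConf, SparkContext

# Style 1: pass master and app name directly; both are now optional keywords.
sc = SparkContext(master="local[2]", appName="ConfExample")
sc.stop()  # only one SparkContext may be active at a time

# Style 2: gather settings in a SparkConf and hand it over via the new conf parameter.
conf = SparkConf().setMaster("local[2]").setAppName("ConfExample")
sc = SparkContext(conf=conf)
sc.stop()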