Merged
Changes from 1 commit
Commits
76 commits
711356b
[SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregat…
jkbradley Sep 8, 2014
e16a8e7
SPARK-3337 Paranoid quoting in shell to allow install dirs with space…
ScrapCodes Sep 8, 2014
16a73c2
SPARK-2978. Transformation with MR shuffle semantics
sryza Sep 8, 2014
386bc24
Provide a default PYSPARK_PYTHON for python/run_tests
Sep 8, 2014
26bc765
[SQL] Minor edits to sql programming guide.
hcook Sep 8, 2014
939a322
[SPARK-3417] Use new-style classes in PySpark
mrocklin Sep 8, 2014
08ce188
[SPARK-3019] Pluggable block transfer interface (BlockTransferService)
rxin Sep 8, 2014
7db5339
[SPARK-3349][SQL] Output partitioning of limit should not be inherite…
Sep 8, 2014
50a4fa7
[SPARK-3443][MLLIB] update default values of tree:
mengxr Sep 9, 2014
ca0348e
SPARK-3423: [SQL] Implement BETWEEN for SQLParser
willb Sep 9, 2014
dc1dbf2
[SPARK-3414][SQL] Stores analyzed logical plan when registering a tem…
liancheng Sep 9, 2014
2b7ab81
[SPARK-3329][SQL] Don't depend on Hive SET pair ordering in tests.
willb Sep 9, 2014
092e2f1
SPARK-2425 Don't kill a still-running Application because of some mis…
markhamstra Sep 9, 2014
ce5cb32
[Build] Removed -Phive-thriftserver since this profile has been removed
liancheng Sep 9, 2014
c419e4f
[Docs] actorStream storageLevel default is MEMORY_AND_DISK_SER_2
melrief Sep 9, 2014
1e03cf7
[SPARK-3455] [SQL] **HOT FIX** Fix the unit test failure
chenghao-intel Sep 9, 2014
88547a0
SPARK-3422. JavaAPISuite.getHadoopInputSplits isn't used anywhere.
sryza Sep 9, 2014
f0f1ba0
SPARK-3404 [BUILD] SparkSubmitSuite fails with "spark-submit exits wi…
srowen Sep 9, 2014
2686233
[SPARK-3193]output errer info when Process exit code is not zero in t…
scwf Sep 9, 2014
02b5ac7
Minor - Fix trivial compilation warnings.
ScrapCodes Sep 9, 2014
07ee4a2
[SPARK-3176] Implement 'ABS and 'LAST' for sql
Sep 9, 2014
c110614
[SPARK-3448][SQL] Check for null in SpecificMutableRow.update
liancheng Sep 10, 2014
25b5b86
[SPARK-3458] enable python "with" statements for SparkContext
Sep 10, 2014
b734ed0
[SPARK-3395] [SQL] DSL sometimes incorrectly reuses attribute ids, br…
Sep 10, 2014
6f7a768
[SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url schem…
Sep 10, 2014
a028330
[SPARK-3362][SQL] Fix resolution for casewhen with nulls.
adrian-wang Sep 10, 2014
f0c87dc
[SPARK-3363][SQL] Type Coercion should promote null to all other types.
adrian-wang Sep 10, 2014
26503fd
[HOTFIX] Fix scala style issue introduced by #2276.
JoshRosen Sep 10, 2014
1f4a648
SPARK-1713. Use a thread pool for launching executors.
sryza Sep 10, 2014
e4f4886
[SPARK-2096][SQL] Correctly parse dot notations
cloud-fan Sep 10, 2014
558962a
[SPARK-3411] Improve load-balancing of concurrently-submitted drivers…
WangTaoTheTonic Sep 10, 2014
79cdb9b
[SPARK-2207][SPARK-3272][MLLib]Add minimum information gain and minim…
Sep 10, 2014
84e2c8b
[SQL] Add test case with workaround for reading partitioned Avro files
marmbrus Sep 11, 2014
f92cde2
[SPARK-3447][SQL] Remove explicit conversion with JListWrapper to avo…
marmbrus Sep 11, 2014
c27718f
[SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer.
staple Sep 11, 2014
ed1980f
[SPARK-2140] Updating heap memory calculation for YARN stable and alpha.
Sep 11, 2014
1ef656e
[SPARK-3047] [PySpark] add an option to use str in textFileRDD
davies Sep 11, 2014
ca83f1e
[SPARK-2917] [SQL] Avoid table creation in logical plan analyzing for…
chenghao-intel Sep 11, 2014
4bc9e04
[SPARK-3390][SQL] sqlContext.jsonRDD fails on a complex structure of …
yhuai Sep 11, 2014
6324eb7
[Spark-3490] Disable SparkUI for tests
andrewor14 Sep 12, 2014
ce59725
[SPARK-3429] Don't include the empty string "" as a defaultAclUser
ash211 Sep 12, 2014
f858f46
SPARK-3462 push down filters and projections into Unions
Sep 12, 2014
33c7a73
SPARK-2482: Resolve sbt warnings during build
witgo Sep 12, 2014
42904b8
[SPARK-3465] fix task metrics aggregation in local mode
davies Sep 12, 2014
b8634df
[SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-alloca…
jkbradley Sep 12, 2014
f116f76
[SPARK-2558][DOCS] Add --queue example to YARN doc
kramimus Sep 12, 2014
5333776
[PySpark] Add blank line so that Python RDD.top() docstring renders c…
rnowling Sep 12, 2014
8194fc6
[SPARK-3481] [SQL] Eliminate the error log in local Hive comparison test
chenghao-intel Sep 12, 2014
eae81b0
MAINTENANCE: Automated closing of pull requests.
pwendell Sep 12, 2014
15a5645
[SPARK-3427] [GraphX] Avoid active vertex tracking in static PageRank
ankurdave Sep 12, 2014
1d76796
SPARK-3014. Log a more informative messages in a couple failure scena…
sryza Sep 12, 2014
af25838
[SPARK-3217] Add Guava to classpath when SPARK_PREPEND_CLASSES is set.
Sep 12, 2014
25311c2
[SPARK-3456] YarnAllocator on alpha can lose container requests to RM
tgravescs Sep 13, 2014
71af030
[SPARK-3094] [PySpark] compatitable with PyPy
davies Sep 13, 2014
885d162
[SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd
davies Sep 13, 2014
6d887db
[SPARK-3515][SQL] Moves test suite setup code to beforeAll rather tha…
liancheng Sep 13, 2014
2584ea5
[SPARK-3469] Make sure all TaskCompletionListener are called even wit…
rxin Sep 13, 2014
e11eeb7
[SQL][Docs] Update SQL programming guide to show the correct default …
yhuai Sep 13, 2014
feaa370
SPARK-3470 [CORE] [STREAMING] Add Closeable / close() to Java context…
srowen Sep 13, 2014
b4dded4
Proper indent for the previous commit.
rxin Sep 13, 2014
a523cea
[SQL] [Docs] typo fixes
nchammas Sep 13, 2014
184cd51
[SPARK-3481][SQL] Removes the evil MINOR HACK
liancheng Sep 13, 2014
7404924
[SPARK-3294][SQL] Eliminates boxing costs from in-memory columnar sto…
liancheng Sep 13, 2014
0f8c4ed
[SQL] Decrease partitions when testing
marmbrus Sep 13, 2014
2aea0da
[SPARK-3030] [PySpark] Reuse Python worker
davies Sep 13, 2014
4e3fbe8
[SPARK-3463] [PySpark] aggregate and show spilled bytes in Python
davies Sep 14, 2014
c243b21
SPARK-3039: Allow spark to be built using avro-mapred for hadoop2
bbossy Sep 15, 2014
f493f79
[SPARK-3452] Maven build should skip publishing artifacts people shou…
ScrapCodes Sep 15, 2014
cc14644
[SPARK-3410] The priority of shutdownhook for ApplicationMaster shoul…
sarutak Sep 15, 2014
fe2b1d6
[SPARK-3425] do not set MaxPermSize for OpenJDK 1.8
Sep 15, 2014
e59fac1
[SPARK-3518] Remove wasted statement in JsonProtocol
sarutak Sep 15, 2014
37d9252
[SPARK-2714] DAGScheduler logs jobid when runJob finishes
YanTangZhai Sep 15, 2014
3b93128
[SPARK-3396][MLLIB] Use SquaredL2Updater in LogisticRegressionWithSGD
BigCrunsh Sep 16, 2014
983d6a9
[MLlib] Update SVD documentation in IndexedRowMatrix
rezazadeh Sep 16, 2014
fdb302f
[SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGa…
Sep 16, 2014
da33acb
[SPARK-2951] [PySpark] support unpickle array.array for Python 2.6
davies Sep 16, 2014
[SPARK-3094] [PySpark] compatitable with PyPy
After this patch, we can run PySpark on PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:

```
PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
```
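
To double-check which interpreter the executors actually picked up, a quick sanity check is to ask a task for its `platform.python_implementation()`. The snippet below is only an illustration (it assumes the `sc` SparkContext of a PySpark shell started with `PYSPARK_PYTHON=pypy`), not part of the patch:

```python
import platform

# Interpreter running the driver, e.g. 'PyPy' or 'CPython'
print(platform.python_implementation())

def worker_impl(_):
    import platform
    return platform.python_implementation()

# Interpreter running the executors; should also report 'PyPy'
print(sc.parallelize([0], 1).map(worker_impl).first())
```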

The performance speedup depends on the workload (ranging from 20% to 3000%). Here are some benchmarks:

 Job | CPython 2.7 | PyPy 2.3.1  | Speed up
 ------- | ------------ | ------------- | -------
 Word Count | 41s   | 15s  | 2.7x
 Sort | 46s |  44s | 1.05x
 Stats | 174s | 3.6s | 48x

Here is the code used for the benchmarks:

```python
rdd = sc.textFile("text")
def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
        .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
    rdd.sortBy(lambda x:x, 1).count()
def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
```
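
The timing harness itself is not included in the commit message; a minimal sketch of how the numbers above could be reproduced is shown below (the `bench` helper is ours, not part of the patch, and assumes the `wordcount`, `sort`, and `stats` functions defined above):

```python
import time

def bench(label, fn, runs=3):
    # Run each job a few times and report the best wall-clock time.
    times = []
    for _ in range(runs):
        start = time.time()
        fn()
        times.append(time.time() - start)
    print("%s: %.1fs" % (label, min(times)))

bench("Word Count", wordcount)
bench("Sort", sort)
bench("Stats", stats)
```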

Author: Davies Liu <[email protected]>

Closes apache#2144 from davies/pypy and squashes the following commits:

9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
4bc1f04 [Davies Liu] refactor
b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
3ca2351 [Davies Liu] Merge branch 'master' into pypy
fae8b19 [Davies Liu] improve attrgetter, add tests
591f830 [Davies Liu] try to run tests with PyPy in run-tests
c8d62ba [Davies Liu] cleanup
f651fd0 [Davies Liu] fix tests using array with PyPy
1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
3c1dbfe [Davies Liu] Merge branch 'master' into pypy
42fb5fa [Davies Liu] Merge branch 'master' into pypy
cb2d724 [Davies Liu] fix tests
9986692 [Davies Liu] Merge branch 'master' into pypy
25b4ca7 [Davies Liu] support PyPy
davies authored and JoshRosen committed Sep 13, 2014
commit 71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7
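
One of the key changes in the cloudpickle.py diff below is replacing the ctypes-based `itemgetter` serializer with a pure-Python probe: a dummy object records which indices the getter touches, and the getter is rebuilt from that record. A minimal standalone sketch of the same idea follows (our own illustration; the `reconstruct_itemgetter` helper is hypothetical and not part of the patch):

```python
import operator

def reconstruct_itemgetter(getter):
    """Recover the arguments of an operator.itemgetter by probing it."""
    class Probe(object):
        def __getitem__(self, item):
            return item
    items = getter(Probe())
    if not isinstance(items, tuple):  # single-argument itemgetter
        items = (items,)
    return operator.itemgetter(*items)

g = operator.itemgetter(2, 5)
g2 = reconstruct_itemgetter(g)
print(g2([0, 1, 2, 3, 4, 5]))  # (2, 5), same result as g would give
```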
168 changes: 67 additions & 101 deletions python/pyspark/cloudpickle.py
@@ -52,35 +52,19 @@
import itertools
from copy_reg import _extension_registry, _inverted_registry, _extension_cache
import new
import dis
import traceback
import platform

#relevant opcodes
STORE_GLOBAL = chr(dis.opname.index('STORE_GLOBAL'))
DELETE_GLOBAL = chr(dis.opname.index('DELETE_GLOBAL'))
LOAD_GLOBAL = chr(dis.opname.index('LOAD_GLOBAL'))
GLOBAL_OPS = [STORE_GLOBAL, DELETE_GLOBAL, LOAD_GLOBAL]
PyImp = platform.python_implementation()

HAVE_ARGUMENT = chr(dis.HAVE_ARGUMENT)
EXTENDED_ARG = chr(dis.EXTENDED_ARG)

import logging
cloudLog = logging.getLogger("Cloud.Transport")

try:
import ctypes
except (MemoryError, ImportError):
logging.warning('Exception raised on importing ctypes. Likely python bug.. some functionality will be disabled', exc_info = True)
ctypes = None
PyObject_HEAD = None
else:

# for reading internal structures
PyObject_HEAD = [
('ob_refcnt', ctypes.c_size_t),
('ob_type', ctypes.c_void_p),
]

if PyImp == "PyPy":
# register builtin type in `new`
new.method = types.MethodType

try:
from cStringIO import StringIO
@@ -225,6 +209,8 @@ def save_function(self, obj, name=None, pack=struct.pack):

if themodule:
self.modules.add(themodule)
if getattr(themodule, name, None) is obj:
return self.save_global(obj, name)

if not self.savedDjangoEnv:
#hack for django - if we detect the settings module, we transport it
@@ -306,44 +292,28 @@ def save_function_tuple(self, func, forced_imports):

# create a skeleton function object and memoize it
save(_make_skel_func)
save((code, len(closure), base_globals))
save((code, closure, base_globals))
write(pickle.REDUCE)
self.memoize(func)

# save the rest of the func data needed by _fill_function
save(f_globals)
save(defaults)
save(closure)
save(dct)
write(pickle.TUPLE)
write(pickle.REDUCE) # applies _fill_function on the tuple

@staticmethod
def extract_code_globals(co):
def extract_code_globals(code):
"""
Find all globals names read or written to by codeblock co
"""
code = co.co_code
names = co.co_names
out_names = set()

n = len(code)
i = 0
extended_arg = 0
while i < n:
op = code[i]

i = i+1
if op >= HAVE_ARGUMENT:
oparg = ord(code[i]) + ord(code[i+1])*256 + extended_arg
extended_arg = 0
i = i+2
if op == EXTENDED_ARG:
extended_arg = oparg*65536L
if op in GLOBAL_OPS:
out_names.add(names[oparg])
#print 'extracted', out_names, ' from ', names
return out_names
names = set(code.co_names)
if code.co_consts: # see if nested function have any global refs
for const in code.co_consts:
if type(const) is types.CodeType:
names |= CloudPickler.extract_code_globals(const)
return names

def extract_func_data(self, func):
"""
@@ -354,10 +324,7 @@ def extract_func_data(self, func):

# extract all global ref's
func_global_refs = CloudPickler.extract_code_globals(code)
if code.co_consts: # see if nested function have any global refs
for const in code.co_consts:
if type(const) is types.CodeType and const.co_names:
func_global_refs = func_global_refs.union( CloudPickler.extract_code_globals(const))

# process all variables referenced by global environment
f_globals = {}
for var in func_global_refs:
@@ -396,6 +363,12 @@ def get_contents(cell):

return (code, f_globals, defaults, closure, dct, base_globals)

def save_builtin_function(self, obj):
if obj.__module__ is "__builtin__":
return self.save_global(obj)
return self.save_function(obj)
dispatch[types.BuiltinFunctionType] = save_builtin_function

def save_global(self, obj, name=None, pack=struct.pack):
write = self.write
memo = self.memo
@@ -435,7 +408,7 @@ def save_global(self, obj, name=None, pack=struct.pack):
try:
klass = getattr(themodule, name)
except AttributeError, a:
#print themodule, name, obj, type(obj)
# print themodule, name, obj, type(obj)
raise pickle.PicklingError("Can't pickle builtin %s" % obj)
else:
raise
@@ -480,7 +453,6 @@ def save_global(self, obj, name=None, pack=struct.pack):
write(pickle.GLOBAL + modname + '\n' + name + '\n')
self.memoize(obj)
dispatch[types.ClassType] = save_global
dispatch[types.BuiltinFunctionType] = save_global
dispatch[types.TypeType] = save_global

def save_instancemethod(self, obj):
@@ -551,23 +523,39 @@ def save_property(self, obj):
dispatch[property] = save_property

def save_itemgetter(self, obj):
"""itemgetter serializer (needed for namedtuple support)
a bit of a pain as we need to read ctypes internals"""
class ItemGetterType(ctypes.Structure):
_fields_ = PyObject_HEAD + [
('nitems', ctypes.c_size_t),
('item', ctypes.py_object)
]


obj = ctypes.cast(ctypes.c_void_p(id(obj)), ctypes.POINTER(ItemGetterType)).contents
return self.save_reduce(operator.itemgetter,
obj.item if obj.nitems > 1 else (obj.item,))

if PyObject_HEAD:
"""itemgetter serializer (needed for namedtuple support)"""
class Dummy:
def __getitem__(self, item):
return item
items = obj(Dummy())
if not isinstance(items, tuple):
items = (items, )
return self.save_reduce(operator.itemgetter, items)

if type(operator.itemgetter) is type:
dispatch[operator.itemgetter] = save_itemgetter

def save_attrgetter(self, obj):
"""attrgetter serializer"""
class Dummy(object):
def __init__(self, attrs, index=None):
self.attrs = attrs
self.index = index
def __getattribute__(self, item):
attrs = object.__getattribute__(self, "attrs")
index = object.__getattribute__(self, "index")
if index is None:
index = len(attrs)
attrs.append(item)
else:
attrs[index] = ".".join([attrs[index], item])
return type(self)(attrs, index)
attrs = []
obj(Dummy(attrs))
return self.save_reduce(operator.attrgetter, tuple(attrs))

if type(operator.attrgetter) is type:
dispatch[operator.attrgetter] = save_attrgetter

def save_reduce(self, func, args, state=None,
listitems=None, dictitems=None, obj=None):
@@ -660,11 +648,11 @@ def save_file(self, obj):

if not hasattr(obj, 'name') or not hasattr(obj, 'mode'):
raise pickle.PicklingError("Cannot pickle files that do not map to an actual file")
if obj.name == '<stdout>':
if obj is sys.stdout:
return self.save_reduce(getattr, (sys,'stdout'), obj=obj)
if obj.name == '<stderr>':
if obj is sys.stderr:
return self.save_reduce(getattr, (sys,'stderr'), obj=obj)
if obj.name == '<stdin>':
if obj is sys.stdin:
raise pickle.PicklingError("Cannot pickle standard input")
if hasattr(obj, 'isatty') and obj.isatty():
raise pickle.PicklingError("Cannot pickle files that map to tty objects")
@@ -873,58 +861,36 @@ def _genpartial(func, args, kwds):
kwds = {}
return partial(func, *args, **kwds)


def _fill_function(func, globals, defaults, closure, dict):
def _fill_function(func, globals, defaults, dict):
""" Fills in the rest of function data into the skeleton function object
that were created via _make_skel_func().
"""
func.func_globals.update(globals)
func.func_defaults = defaults
func.func_dict = dict

if len(closure) != len(func.func_closure):
raise pickle.UnpicklingError("closure lengths don't match up")
for i in range(len(closure)):
_change_cell_value(func.func_closure[i], closure[i])

return func

def _make_skel_func(code, num_closures, base_globals = None):
def _make_cell(value):
return (lambda: value).func_closure[0]

def _reconstruct_closure(values):
return tuple([_make_cell(v) for v in values])

def _make_skel_func(code, closures, base_globals = None):
""" Creates a skeleton function object that contains just the provided
code and the correct number of cells in func_closure. All other
func attributes (e.g. func_globals) are empty.
"""
#build closure (cells):
if not ctypes:
raise Exception('ctypes failed to import; cannot build function')

cellnew = ctypes.pythonapi.PyCell_New
cellnew.restype = ctypes.py_object
cellnew.argtypes = (ctypes.py_object,)
dummy_closure = tuple(map(lambda i: cellnew(None), range(num_closures)))
closure = _reconstruct_closure(closures) if closures else None

if base_globals is None:
base_globals = {}
base_globals['__builtins__'] = __builtins__

return types.FunctionType(code, base_globals,
None, None, dummy_closure)

# this piece of opaque code is needed below to modify 'cell' contents
cell_changer_code = new.code(
1, 1, 2, 0,
''.join([
chr(dis.opmap['LOAD_FAST']), '\x00\x00',
chr(dis.opmap['DUP_TOP']),
chr(dis.opmap['STORE_DEREF']), '\x00\x00',
chr(dis.opmap['RETURN_VALUE'])
]),
(), (), ('newval',), '<nowhere>', 'cell_changer', 1, '', ('c',), ()
)

def _change_cell_value(cell, newval):
""" Changes the contents of 'cell' object to newval """
return new.function(cell_changer_code, {}, None, (), (cell,))(newval)
None, None, closure)


"""Constructors for 3rd party libraries
Note: These can never be renamed due to client compatibility issues"""
6 changes: 1 addition & 5 deletions python/pyspark/daemon.py
@@ -42,10 +42,6 @@ def worker(sock):
"""
Called by a worker process after the fork().
"""
# Redirect stdout to stderr
os.dup2(2, 1)
sys.stdout = sys.stderr # The sys.stdout object is different from file descriptor 1

signal.signal(SIGHUP, SIG_DFL)
signal.signal(SIGCHLD, SIG_DFL)
signal.signal(SIGTERM, SIG_DFL)
@@ -102,6 +98,7 @@ def manager():
listen_sock.listen(max(1024, SOMAXCONN))
listen_host, listen_port = listen_sock.getsockname()
write_int(listen_port, sys.stdout)
sys.stdout.flush()

def shutdown(code):
signal.signal(SIGTERM, SIG_DFL)
@@ -115,7 +112,6 @@ def handle_sigterm(*args):
signal.signal(SIGHUP, SIG_IGN) # Don't die on SIGHUP

# Initialization complete
sys.stdout.close()
try:
while True:
try:
10 changes: 7 additions & 3 deletions python/pyspark/serializers.py
@@ -355,7 +355,8 @@ class PickleSerializer(FramedSerializer):
def dumps(self, obj):
return cPickle.dumps(obj, 2)

loads = cPickle.loads
def loads(self, obj):
return cPickle.loads(obj)


class CloudPickleSerializer(PickleSerializer):
@@ -374,8 +375,11 @@ class MarshalSerializer(FramedSerializer):
This serializer is faster than PickleSerializer but supports fewer datatypes.
"""

dumps = marshal.dumps
loads = marshal.loads
def dumps(self, obj):
return marshal.dumps(obj)

def loads(self, obj):
return marshal.loads(obj)


class AutoSerializer(FramedSerializer):