add comp argument for RDD.max() and RDD.min()
davies committed Aug 22, 2014
commit dd91e08a92ebace863506cdfe52114ffeec894c9
36 changes: 29 additions & 7 deletions python/pyspark/rdd.py
@@ -810,23 +810,45 @@ def func(iterator):
 
         return self.mapPartitions(func).fold(zeroValue, combOp)
 
-    def max(self):
+    def max(self, comp=None):
         """
         Find the maximum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
+        @param comp: A function used to compare two elements, the builtin `cmp`

nit - the builtin 'max'

Contributor Author
I think cmp is the function used in max or min, so cmp is the default value for comp.

cmp may be used in max, but for this func the default is on line 829. Either way, a minor nitpick.

Contributor Author
Yes, using comp here is a bit confusing. The builtin min uses key, which would be better for Python programmers, but it would be different from the Scala API.

cc @mateiz @rxin @JoshRosen

Contributor Author
We already use key in Python instead of Ordering in Scala, so I have changed it to key.

Also, I would like to add key to top(); it would be helpful, for example:

rdd.map(lambda x: (x, 1)).reduce(add).top(20, key=itemgetter(1))

We already have ord in Scala. Should I add this in this PR?

+            will be used by default.
+
+        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
+        >>> rdd.max()
         43.0
+        >>> rdd.max(lambda a, b: cmp(str(a), str(b)))
+        5.0
         """
-        return self.reduce(max)
+        if comp is not None:
+            func = lambda a, b: a if comp(a, b) >= 0 else b
+        else:
+            func = max
+
+        return self.reduce(func)
 
-    def min(self):
+    def min(self, comp=None):
         """
         Find the minimum item in this RDD.
 
-        >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).min()
-        1.0
+        @param comp: A function used to compare two elements, the builtin `cmp`
nit - the builtin 'min'

+            will be used by default.
+
+        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
+        >>> rdd.min()
+        2.0
+        >>> rdd.min(lambda a, b: cmp(str(a), str(b)))
+        10.0
         """
-        return self.reduce(min)
+        if comp is not None:

consider a default of comp=min in the arg list and test for comp is not min

same for the max method

Contributor Author
min and comp have different meanings:

>>> min(1, 2)
1
>>> cmp(1, 2)
-1

+            func = lambda a, b: a if comp(a, b) <= 0 else b
+        else:
+            func = min
+
+        return self.reduce(func)

     def sum(self):
         """
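
Separately, callers who already have a cmp-style comparator, like the cmp(str(a), str(b)) lambdas in the doctests, can bridge to a key function with functools.cmp_to_key. The snippet below is a plain-Python illustration of the semantics (not Spark API); the cmp helper is spelled out so it also runs on Python 3:

from functools import cmp_to_key

def cmp(a, b):
    # Python 2's builtin three-way comparison: -1, 0, or 1
    return (a > b) - (a < b)

data = [2.0, 5.0, 43.0, 10.0]

# A comparator returns -1/0/1 for a pair; a key returns the value to order by.
# cmp_to_key wraps the former so builtins that accept key= can use it.
print(min(data, key=cmp_to_key(lambda a, b: cmp(str(a), str(b)))))  # 10.0

# The point made in the last thread: min picks an element, cmp compares two.
print(min(1, 2))  # 1
print(cmp(1, 2))  # -1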