[SPARK-2871] [PySpark] add key argument for max(), min() and top(n)
#2094
Changes from 1 commit
comp argument for RDD.max() and RDD.min()
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -810,23 +810,45 @@ def func(iterator): | |
|
|
||
| return self.mapPartitions(func).fold(zeroValue, combOp) | ||
|
|
||
| def max(self): | ||
| def max(self, comp=None): | ||
| """ | ||
| Find the maximum item in this RDD. | ||
|
|
||
| >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max() | ||
| @param comp: A function used to compare two elements, the builtin `cmp` | ||
| will be used by default. | ||
|
|
||
| >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0]) | ||
| >>> rdd.max() | ||
| 43.0 | ||
| >>> rdd.max(lambda a, b: cmp(str(a), str(b))) | ||
| 5.0 | ||
| """ | ||
| return self.reduce(max) | ||
| if comp is not None: | ||
| func = lambda a, b: a if comp(a, b) >= 0 else b | ||
| else: | ||
| func = max | ||
|
|
||
| def min(self): | ||
| return self.reduce(func) | ||
|
|
||
| def min(self, comp=None): | ||
| """ | ||
| Find the minimum item in this RDD. | ||
|
|
||
| >>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).min() | ||
| 1.0 | ||
| @param comp: A function used to compare two elements, the builtin `cmp` | ||
|
nit - the builtin 'min' |
||
| will be used by default. | ||
|
|
||
| >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0]) | ||
| >>> rdd.min() | ||
| 2.0 | ||
| >>> rdd.min(lambda a, b: cmp(str(a), str(b))) | ||
| 10.0 | ||
| """ | ||
| return self.reduce(min) | ||
| if comp is not None: | ||
|
Consider a default of `comp=min` in the argument list and a test for `comp is not min`; the same applies to the max method.
Contributor
Author
`min` and `comp` have different meanings: |
||
| func = lambda a, b: a if comp(a, b) <= 0 else b | ||
| else: | ||
| func = min | ||
|
|
||
| return self.reduce(func) | ||
|
|
||
| def sum(self): | ||
| """ | ||
|
|
||
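The behavior described in the two docstrings can be checked outside Spark. The following is a minimal sketch, not the PR's code: `functools.reduce` over a plain list stands in for `RDD.reduce`, the helper names `rdd_max` and `rdd_min` are hypothetical, and `cmp` is redefined because Python 3 no longer provides it as a builtin.

```python
from functools import reduce


def cmp(a, b):
    # Python 3 has no builtin cmp; this mirrors the Python 2 behavior.
    return (a > b) - (a < b)


def rdd_max(items, comp=None):
    # Hypothetical stand-in for RDD.max(comp): keep `a` when comp(a, b) >= 0.
    if comp is not None:
        func = lambda a, b: a if comp(a, b) >= 0 else b
    else:
        func = max
    return reduce(func, items)


def rdd_min(items, comp=None):
    # Hypothetical stand-in for RDD.min(comp): keep `a` when comp(a, b) <= 0.
    if comp is not None:
        func = lambda a, b: a if comp(a, b) <= 0 else b
    else:
        func = min
    return reduce(func, items)


# Compare elements by their string form, as in the doctests.
by_str = lambda a, b: cmp(str(a), str(b))

print(rdd_max([1.0, 5.0, 43.0, 10.0]))          # 43.0
print(rdd_max([1.0, 5.0, 43.0, 10.0], by_str))  # 5.0
print(rdd_min([2.0, 5.0, 43.0, 10.0]))          # 2.0
print(rdd_min([2.0, 5.0, 43.0, 10.0], by_str))  # 10.0
```

Because the comparator only ever picks one of its two arguments, the resulting function is safe to use as a pairwise reducer.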
nit - the builtin 'max'
I think `cmp` is the function used in `max` or `min`, so `cmp` is the default value for `comp`.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmp may be used in max, but for this func the default is on line 829. either way, a minor nitpick.
Yes, using `comp` here is a bit confusing. The builtin `min` uses `key`, which would be better for Python programmers, but it would differ from the Scala API.

cc @mateiz @rxin @JoshRosen
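To make the `comp` versus `key` trade-off concrete, here is a small sketch using only the Python standard library (no Spark). It shows that the builtin `min` and `max` take a one-argument `key` function, and that a two-argument comparator can still be adapted with `functools.cmp_to_key`. The name `compare_as_str` is made up for illustration.

```python
from functools import cmp_to_key

values = [2.0, 5.0, 43.0, 10.0]


def compare_as_str(a, b):
    # Two-argument comparator: negative, zero, or positive, like Python 2's cmp.
    sa, sb = str(a), str(b)
    return (sa > sb) - (sa < sb)


# key style: a one-argument function mapping each element to a sort key.
smallest_by_key = min(values, key=str)
largest_by_key = max(values, key=str)

# comparator style, adapted to the key interface.
smallest_by_cmp = min(values, key=cmp_to_key(compare_as_str))

print(smallest_by_key, smallest_by_cmp, largest_by_key)  # 10.0 10.0 5.0
```

The `key` form is usually shorter to write and matches what Python users expect from the builtins, which is the argument made in the comment above.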
We already use `key` in Python instead of `Ordering` in Scala, so I have changed it to `key`.

Also, I would like to add `key` to top(); it would be helpful, for example:

`rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))`

We already have `ord` in Scala. Should I add this in this PR?
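As a rough illustration of what a `key` argument on top() could look like, the sketch below takes the top n of each partition with `heapq.nlargest` and then merges the partial results pairwise. It is an assumption-laden stand-in rather than the PR's implementation: the `top` helper is hypothetical, and a list of lists plays the role of an RDD's partitions.

```python
import heapq
from functools import reduce
from operator import itemgetter


def top(partitions, n, key=None):
    # Hypothetical helper: `partitions` is a list of lists standing in for an RDD.
    # Step 1: keep only the n largest items of each partition.
    per_partition = [heapq.nlargest(n, part, key=key) for part in partitions]

    # Step 2: merge two partial results and keep the n largest again.
    def merge(a, b):
        return heapq.nlargest(n, a + b, key=key)

    # The empty-list initializer makes an input with no partitions return [].
    return reduce(merge, per_partition, [])


# (word, count) pairs spread over three pretend partitions.
counts = [[("a", 3), ("b", 7)], [("c", 1), ("d", 9)], [("e", 5)]]
top2 = top(counts, 2, key=itemgetter(1))
print(top2)  # [('d', 9), ('b', 7)]
```

Applying the same `key` in both the per-partition step and the merge step keeps the ordering consistent across partitions.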