[SPARK-2871] [PySpark] Add missing API #1791
Closed · +91 −12

Changes from 1 commit (24 commits in total):
ff2cbe3  add missing API in SparkContext (davies)
e0b3d30  add histogram() (davies)
5d5be95  change histogram API (davies)
a95eca0  add zipWithIndex and zipWithUniqueId (davies)
4ffae00  collectPartitions() (davies)
7a9ea0a  update docs of histogram (davies)
53640be  histogram() in pure Python, better support for int (davies)
9a01ac3  fix docs (davies)
7ba5f88  refactor (davies)
a25c34e  fix bug of countApproxDistinct (davies)
1218b3b  add countApprox and countApproxDistinct (davies)
034124f  Merge branch 'master' into api (davies)
9132456  fix pep8 (davies)
977e474  address comments: improve docs (davies)
ac606ca  comment out not implemented APIs (davies)
f0158e4  comment out not implemented API in SparkContext (davies)
cb4f712  Mark SparkConf as read-only after initialization (davies)
96713fa  Merge branch 'master' into api (davies)
e9e1037  Merge branch 'master' into api (davies)
63c013d  address all the comments: (davies)
1213aca  Merge branch 'master' into api (davies)
28fd368  Merge branch 'master' into api (davies)
1ac98d6  remove splitted changes (davies)
657a09b  remove countApproxDistinct() (davies)
add zipWithIndex and zipWithUniqueId
commit a95eca01ebfd023a5b016015b49d98abbd658287

The diff below shows the changes from this commit.
```diff
@@ -907,10 +907,10 @@ def histogram(self, buckets=None, even=False):

         >>> rdd = sc.parallelize(range(51))
         >>> rdd.histogram(2)
-        ([0L, 25L, 50L], [25L, 26L])
-        >>> rdd.histogram(3, [0, 5, 25, 50])
-        [5L, 20L, 25L]
-        >>> rdd.histogram(4, [0, 15, 30, 45, 60], True)
+        ([0.0, 25.0, 50.0], [25L, 26L])
+        >>> rdd.histogram([0, 5, 25, 50])
+        [5L, 20L, 26L]
+        >>> rdd.histogram([0, 15, 30, 45, 60], True)
         [15L, 15L, 15L, 6L]
         """
```
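The doctest outputs above follow from the usual bucketing semantics: every bucket is half-open on the right except the last one, which also contains the upper boundary. A minimal pure-Python sketch of that counting logic, for illustration only (the name `histogram_local` is made up here; at this point in the PR the real work is still delegated to the JVM histogram via Py4J):

```python
def histogram_local(values, buckets):
    """Count values into [b0, b1), [b1, b2), ..., [bn-1, bn] (last bucket closed)."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue                      # values outside the bucket range are ignored
        if v == buckets[-1]:
            counts[-1] += 1               # the right edge falls into the last bucket
            continue
        # a linear scan keeps the sketch short; a real version would bisect
        counts[sum(1 for b in buckets[1:-1] if v >= b)] += 1
    return counts

print(histogram_local(range(51), [0, 5, 25, 50]))       # [5, 20, 26]
print(histogram_local(range(51), [0, 15, 30, 45, 60]))  # [15, 15, 15, 6]
```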
```diff
@@ -923,12 +923,12 @@ def histogram(self, buckets=None, even=False):
                 raise ValueError("buckets should be greater than 1")

             r = jdrdd.histogram(buckets)
-            return list(r._r1()), list(r._2())
+            return list(r._1()), list(r._2())

         jbuckets = self.ctx._gateway.new_array(self.ctx._gateway.jvm.java.lang.Double, len(buckets))
         for i in range(len(buckets)):
             jbuckets[i] = float(buckets[i])
-        return list(jdrdd.histogram(jbuckets, evenBuckets))
+        return list(jdrdd.histogram(jbuckets, even))

     def mean(self):
         """
```
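For context on the `new_array` call kept above: it builds a Java `Double[]` through the Py4J gateway owned by the SparkContext, so the Python bucket boundaries can be handed to the JVM histogram. The same pattern in isolation, as a sketch (assumes an active SparkContext `sc`; the variable names are illustrative):

```python
gateway = sc._gateway                      # the Py4J JavaGateway held by the SparkContext
jbuckets = gateway.new_array(gateway.jvm.java.lang.Double, 3)   # a java.lang.Double[3]
for i, b in enumerate([0, 5, 25]):
    jbuckets[i] = float(b)                 # Py4J wraps Python floats as java.lang.Double
```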
```diff
@@ -1750,29 +1750,56 @@ def zip(self, other):
         >>> x.zip(y).collect()
         [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
         """
         if self.getNumPartitions() != other.getNumPartitions():
             raise ValueError("the number of partitions dose not match"
                              " with each other")
         pairRDD = self._jrdd.zip(other._jrdd)
         deserializer = PairDeserializer(self._jrdd_deserializer,
                                         other._jrdd_deserializer)
         return RDD(pairRDD, self.ctx, deserializer)

-    def zipPartitions(self, other, f):
+    def zipPartitions(self, other, f, preservesPartitioning=False):
         """
         Zip this RDD's partitions with one (or more) RDD(s) and return a
         new RDD by applying a function to the zipped partitions.

         Not implemented.
         """
         raise NotImplementedError

     def zipWithIndex(self):
         """
         Zips this RDD with its element indices.

         >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
         [(0, 0), (1, 1), (2, 2), (3, 3)]
         """
-        raise NotImplementedError
+        nums = self.glom().map(lambda it: sum(1 for i in it)).collect()
+        starts = [0]
+        for i in range(len(nums) - 1):
+            starts.append(starts[-1] + nums[i])
+
+        def func(k, it):
+            for i, v in enumerate(it):
+                yield starts[k] + i, v
+
+        return self.mapPartitionsWithIndex(func)
```
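The `zipWithIndex` body above first runs a small job (the `glom().map(...).collect()`) to learn each partition's size, then gives every element its partition's starting offset plus its position within the partition. The same bookkeeping over plain lists, as an illustration (the variable names mirror the diff; the data is made up):

```python
partitions = [["a", "b"], ["c", "d", "e"]]          # stand-ins for two RDD partitions
nums = [len(p) for p in partitions]                 # what the extra glom/collect job computes
starts = [0]
for n in nums[:-1]:
    starts.append(starts[-1] + n)                   # starting index of each partition: [0, 2]
pairs = [(starts[k] + i, v)
         for k, part in enumerate(partitions)
         for i, v in enumerate(part)]
print(pairs)   # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
```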
```diff
     def zipWithUniqueId(self):
         """
         Zips this RDD with generated unique Long ids.

         >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
         [(0, 0), (2, 1), (1, 2), (3, 3)]
         """
-        raise NotImplementedError
+        n = self.getNumPartitions()
+
+        def func(k, it):
+            for i, v in enumerate(it):
+                yield i * n + k, v
+
+        return self.mapPartitionsWithIndex(func)

     def name(self):
         """
```

Contributor (inline comment on the line "Zips this RDD with generated unique Long ids."):

Same case here: this should be similarly descriptive to the Scala docs:

```scala
/**
 * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
 * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
 * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
 */
def zipWithUniqueId(): RDD[(T, Long)] = {
```
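A quick local check of the id scheme the quoted Scala doc describes, using the `i * n + k` formula from the diff: with n partitions, partition k hands out ids k, n+k, 2n+k, and so on. The data below is illustrative; it corresponds to how `sc.parallelize(range(4), 2)` splits into two partitions, as the doctest output confirms:

```python
n = 2                                               # number of partitions
partitions = [[0, 1], [2, 3]]                       # the two partitions of range(4)
pairs = [(i * n + k, v)
         for k, part in enumerate(partitions)
         for i, v in enumerate(part)]
print(pairs)   # [(0, 0), (2, 1), (1, 2), (3, 3)], matching the doctest above
```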
```diff
@@ -1842,63 +1869,78 @@ def _defaultReducePartitions(self):
     def lookup(self, key):
         """
         Return the list of values in the RDD for key key.

         Not Implemented
         """
         raise NotImplementedError

-    def countApprox(self, timeout, confidence=1.0):
+    def countApprox(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Approximate version of count() that returns a potentially incomplete
         result within a timeout, even if not all tasks have finished.

         Not implemented.
         """
         raise NotImplementedError
```
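These stubs only pin down signatures at this point (note the confidence default moving from 1.0 to 0.95). For intuition only, a toy version of the idea behind an approximate count that may return before all tasks finish: scale the count from whichever partitions completed in time up to the full number of partitions. The names and data here are made up and this is not the PR's implementation:

```python
def count_approx_local(partitions, finished):
    """Estimate the total count from the subset of partitions that finished in time."""
    counted = sum(len(partitions[k]) for k in finished)
    return counted * len(partitions) / float(len(finished))

parts = [list(range(10)), list(range(20)), list(range(30))]   # 60 elements in 3 partitions
print(count_approx_local(parts, finished=[0, 1]))             # 45.0, a rough estimate of 60
```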
```diff
-    def countApproxDistinct(self, timeout, confidence=1.0):
+    def countApproxDistinct(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Return approximate number of distinct elements in the RDD.

         Not implemented.
         """
         raise NotImplementedError

-    def countByValueApprox(self, timeout, confidence=1.0):
+    def countByValueApprox(self, timeout, confidence=0.95):
         """
         :: Experimental::
         Approximate version of countByValue().

         Not implemented.
         """
         raise NotImplementedError

-    def sumApprox(self, timeout, confidence=1.0):
+    def sumApprox(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Approximate operation to return the sum within a timeout
         or meet the confidence.

         Not implemented.
         """
         raise NotImplementedError

-    def meanApprox(self, timeout, confidence=1.0):
+    def meanApprox(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Approximate operation to return the mean within a timeout
         or meet the confidence.

         Not implemented.
         """
         raise NotImplementedError

-    def countApproxDistinctByKey(self):
+    def countApproxDistinctByKey(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Return approximate number of distinct values for each key in this RDD.

         Not implemented.
         """
         raise NotImplementedError

-    def countByKeyApprox(self, timeout, confidence=1.0):
+    def countByKeyApprox(self, timeout, confidence=0.95):
         """
         :: Experimental ::
         Approximate version of countByKey that can return a partial result if it does not finish within a timeout.

         Not implemented.
         """
         raise NotImplementedError


 class PipelinedRDD(RDD):

     """
```
Review comment:

The Scala documentation is much more descriptive about what this method does; the Python documentation should explain these subtleties, too.