[SPARK-4417] New API: sample RDD to fixed number of items #3723
Conversation
…from the RDD and returns it as an RDD.
Test build #24549 has started for PR 3723 at commit
Test build #24549 has finished for PR 3723 at commit
Test PASSed.
Hello - can anyone please take a look at this patch and provide feedback on the approach?
So the emphasis is on RDD, right? You can already sample to an Array on the driver, and you could make the same argument for several other methods. Put differently, how about just taking the Array from takeSample and parallelizing it back into an RDD? If you really want a huge sample of a much huge-r RDD, sample by a fraction instead. So I think this may not quite fit in with how other similar API methods work, for better or worse, although maybe you don't have to have this method to do what you want.
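For concreteness, here is a minimal sketch of that two-step route (assuming a live SparkContext `sc`, as in spark-shell, and that the requested sample fits in driver memory; the RDD contents are placeholders):

```scala
val rdd = sc.parallelize(1 to 1000000)

// takeSample is an action: it returns exactly `num` items as an Array on
// the driver, so it is only viable when the sample fits in driver memory.
val exact: Array[Int] = rdd.takeSample(withReplacement = false, num = 10000)

// Parallelizing the Array turns the exact-size sample back into an RDD.
val sampleRDD = sc.parallelize(exact)
```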
Hi Sean - my concern with the take/collect() approach is that it puts a hard cap on what is tractable, due to driver memory limitations. I wanted an implementation whose memory use is independent of the sample size, even if it is less efficient. I've run into several use cases now where I'd like to operate on a large subset of a parent RDD that is too big to fit into memory; once you're doing complex operations on datasets of several hundred million entries, you have to batch-process the data to keep things tractable.

A sampling function that samples by number (versus splitting the RDD into multiple RDDs, as randomSplit() does) provides functionality that isn't presently available. I've found that randomSplit does not handle a large number of splits of a larger dataset well - partly due to memory problems and partly due to shuffle issues.

The resampling "over and over" will only happen a very small fraction of the time (when we're at the very tail end of the statistical distribution used to do the sampling). In general, this approach makes only a couple of passes over the data: one to sample it, and then, since the sampling is an approximation, a final pass to pare the result down to the exact number if we have too many samples.
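A rough sketch of that sample-then-pare-down idea, built only from public APIs (the helper name and the 10% oversampling margin are my own illustrative assumptions, not this patch's actual code):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def sampleExactAsRDD[T: ClassTag](rdd: RDD[T], n: Long, seed: Long = 42L): RDD[T] = {
  val total = rdd.count()                                  // pass 1: size the parent RDD
  require(n <= total, s"asked for $n items but the RDD holds only $total")
  val fraction = math.min(1.0, n.toDouble * 1.1 / total)   // oversample slightly
  // In the rare case the sample still comes up short, a real implementation
  // would resample; this sketch only pares an oversized sample down to n.
  rdd.sample(withReplacement = false, fraction = fraction, seed = seed)
     .zipWithIndex()
     .filter { case (_, i) => i < n }
     .map { case (x, _) => x }
}
```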
Hello - could anyone please provide more feedback on this patch and, ideally, get it merged? Thanks!
My biggest problem with this is that, while the existing sample is an ordinary lazy transformation and takeSample is an action, this new method is an eager transformation: it returns an RDD, but it has to run one or more jobs to produce it. This is by no means the only eager transformation (or whatever we end up calling these unholy beasts), since a handful of others already exist in Spark; but I am really hesitant to add another. What we need is a larger strategy and re-organization to properly handle, name and document eager transformations, but that is well beyond the scope of this single PR. In the meantime, eager transformations are just conveniences (inconveniences if you are trying to launch jobs asynchronously) that package up one or more actions. They can always be broken up into multiple explicit and ordinary transformations and actions (as Sean was effectively suggesting earlier), so none of them is strictly necessary to achieve its functionality. I'm really hesitant to add another one.
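To make the lazy-versus-eager distinction concrete (an illustration of the general point, not code from this PR; assumes a live SparkContext `sc`):

```scala
val nums = sc.parallelize(1 to 1000)

// An ordinary transformation is lazy: this only records lineage, and no
// Spark job runs until an action such as count() or collect() is called.
val doubled = nums.map(_ * 2)

// zipWithIndex is one of the eager transformations that already exist: it
// must run a job to learn each partition's size before it can return the
// new RDD, even though no action has been called yet.
val indexed = nums.zipWithIndex()
```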
I agree with Mark about this. This method doesn't seem worth adding to the default API, especially if it will be tricky to implement. For extracting small samples, takeSample already lets you specify an exact number, and for downsampling large RDDs, most users probably don't need an exact count (and wouldn't want to pay an extra pass over the data for it). This and other advanced sampling methods could make a good external package, though.
Mark and Matei - I hear you and understand what you're saying. Does it make sense to create a new JIRA to refactor the RDD interface and move the advanced sampling methods into a separate package? This would obviously involve deprecating the existing functions, so I presume it wouldn't see the light of day for a while.
By package, we meant an external library (e.g. on http://spark-packages.org). We shouldn't break or deprecate the methods in the current API. But a utility class with helper methods should be easy to maintain as a separate package, and then if many of them are widely used, we can also move some into Spark itself. |
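One hypothetical shape for such an external package (every name below is invented for illustration; the implicit-enrichment pattern lets users opt in with an import instead of changing the core RDD API):

```scala
package org.example.sparksampling

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object implicits {
  implicit class RichSamplingRDD[T: ClassTag](rdd: RDD[T]) {
    /** Oversample by a fraction, then pare down to exactly n items,
      * as sketched earlier in this thread. */
    def sampleExactAsRDD(n: Long, seed: Long = 42L): RDD[T] = {
      val fraction = math.min(1.0, n.toDouble * 1.1 / rdd.count())
      rdd.sample(withReplacement = false, fraction = fraction, seed = seed)
         .zipWithIndex()
         .filter { case (_, i) => i < n }
         .map { case (x, _) => x }
    }
  }
}

// Usage, after adding the package as a dependency:
//   import org.example.sparksampling.implicits._
//   val s = bigRDD.sampleExactAsRDD(100000L)
```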
Okay, sounds like we'll close this issue as a won't-fix.
Hi all - I've added an interface to split an RDD by a count of elements (instead of simply by percentage). I've also added new tests to validate this functionality, and I've updated a previously existing function interface to reuse common code.
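For illustration, here is one way a split-by-count interface could be built from public APIs (the function name is mine, randomization is omitted for brevity, and this is not the patch's actual implementation):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Split an RDD into consecutive pieces of the requested sizes. A real
// implementation would also randomize which elements land in which piece;
// this sketch only shows the count-based bookkeeping.
def splitByCounts[T: ClassTag](rdd: RDD[T], counts: Seq[Long]): Seq[RDD[T]] = {
  val indexed = rdd.zipWithIndex().cache()   // one job to assign global indices
  val bounds = counts.scanLeft(0L)(_ + _)    // cumulative offsets: 0, c0, c0+c1, ...
  bounds.sliding(2).map { case Seq(lo, hi) =>
    indexed.filter { case (_, i) => i >= lo && i < hi }.map(_._1)
  }.toSeq
}
```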