Conversation

@ilganeli

Hi all - I've added an interface to split an RDD by a count of elements (instead of simply by percentage). I've also added new tests to validate this functionality, and I've updated an existing function interface to reuse common code.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24549 has started for PR 3723 at commit 8d411c3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24549 has finished for PR 3723 at commit 8d411c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ParquetTest
    • protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24549/

@ilganeli
Author

Hello - can anyone please take a look at this patch and provide feedback on the approach?

@srowen
Member

srowen commented Dec 18, 2014

So the emphasis is on RDD, right? You can already sample to an Array on the driver. You could make the same argument for several other methods. take(100) can't be used to make an RDD. I think the logic is that it's for taking a smallish number of things. Likewise for sampling.

Put differently, how about just taking the Array and using parallelize() to make an RDD again?

If you really want a huge sample of a much larger RDD, you probably need to sample by partition, using a different approach, to do it efficiently. Here you're sampling over and over until you get enough.

So I think this may not quite fit in with how other similar API methods work, for better or worse, although maybe you don't have to have this method to do what you want.
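The alternative suggested above can be sketched in a few lines. This is an illustration, not code from the patch: a plain Python list stands in for the RDD, and the function name is hypothetical; the rough PySpark equivalents are noted in the docstring.

```python
import random

def take_sample_sketch(data, n, rng=random):
    """Exact-size sample pulled to the driver, suitable for re-parallelizing.

    The rough PySpark equivalent of the suggestion above:
        sampled = rdd.takeSample(withReplacement=False, num=n)
        new_rdd = sc.parallelize(sampled)
    """
    return rng.sample(data, n)  # n distinct elements, chosen uniformly at random
```

This keeps the exact-count guarantee, but only works while n items fit in driver memory, which is exactly the limitation raised in the reply below.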

@ilganeli
Author

Hi Sean - my concern with using take/collect() as in the previous approach is that there is essentially a hard cap on what is tractable due to memory limitations. I wanted to build an implementation that is independent of memory, even if it is less efficient.

I've run into several use cases now where I'd like to operate on a large subset of a parent RDD that is too big to fit into memory. When you start talking about complex operations on datasets of several hundred million entries, it becomes necessary to batch-process the data to keep things tractable. Having a sampling function that samples by number (versus splitting up the RDD into multiple RDDs as randomSplit() does) provides functionality that isn't presently available. I've found that randomSplit does not handle a large number of splits of a larger dataset well - partly due to memory problems and partly due to shuffle issues.

The sampling "over and over" will only happen a very small fraction of the time (when we're at the very tail end of the statistical distribution used to do the sampling). In general, this approach will only make a couple of passes over the data (once to sample the data and then at the end, if we have too many samples since the sampling is an approximation, pare down to the exact number).
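The approach described above can be sketched as follows. This is an illustrative reconstruction, not the patch itself: a plain Python list stands in for the RDD, and in Spark the per-element coin flips would be a single `rdd.sample(False, fraction)` pass followed by a trim.

```python
import random

def sample_by_count_sketch(data, n, rng=random, oversample=1.1):
    """Approximate-then-trim sampling to an exact count.

    Bernoulli-sample with a slightly inflated fraction (one pass over the
    data), retry in the rare case the sample falls short, then pare down
    to exactly n.
    """
    fraction = min(1.0, oversample * n / len(data))
    picked = [x for x in data if rng.random() < fraction]
    while len(picked) < n:  # tail of the distribution: resample (rare)
        picked = [x for x in data if rng.random() < fraction]
    return rng.sample(picked, n)  # final pare-down to the exact count
```

Note that the final trim only ever touches the over-sampled subset (roughly 1.1 * n elements here), not the full dataset.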

@ilganeli
Author

Hello, could anyone please provide any more feedback on this patch and ideally get this merged? Thanks!

@markhamstra
Contributor

My biggest problem with this is that, while the existing sample is an action, sampleByCount is another one of those unholy beasts that is neither an action nor a transformation -- meaning that, while it transforms an RDD into another RDD, it isn't lazy while doing so, but rather embeds several actions (count) and makes use of another unholy beast (zipWithIndex), all of which means that invoking sampleByCount eagerly launches several jobs in order to create the new RDD.

This is by no means the only eager transformation (or whatever we end up calling these unholy beasts), since there is a handful of others that already exist in Spark; but I am really hesitant to add another. What we need is a larger strategy and re-organization to properly handle, name and document eager transformations, but that is well beyond the scope of this single PR. In the meantime, eager transformations are just conveniences (inconveniences if you are trying to launch jobs asynchronously) that packages up one or more actions. They can always be broken up into multiple explicit and ordinary transformations and actions (as Sean was effectively suggesting earlier), so none of them are strictly necessary to achieve their functionality.

I'm really hesitant to add sampleByCount to the Spark API and thereby to the list of eager transformations that we need to somehow fix in the future. Perhaps a better way to handle such convenience packaging of transformations and actions on RDDs is to include them in Spark Packages.
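The point above, that an eager transformation can always be broken into explicit steps at the call site, might look like this. A plain Python list again stands in for the RDD, and the function name is illustrative, not Spark API; comments mark which step would be an eagerly-executed action versus a lazy transformation in Spark.

```python
import random

def explicit_sample_steps(data, n, rng):
    """Roughly the result of an eager sampleByCount, but with the embedded
    actions made explicit, so the caller controls when jobs are launched."""
    total = len(data)                     # ACTION: rdd.count() launches a job
    fraction = min(1.0, 1.2 * n / total)  # driver-side arithmetic, no job
    # lazy in Spark: rdd.sample(False, fraction)
    sampled = [x for x in data if rng.random() < fraction]
    # in Spark: zipWithIndex (itself partly eager, per the note above)
    # plus a filter on the index, so the trim need not collect anything
    return sampled[:n]  # may fall short in the rare undershoot case
```

Decomposed this way, none of the individual pieces is new API surface, which is the sense in which the eager method is a convenience rather than a necessity.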

@mateiz
Contributor

mateiz commented Dec 29, 2014

I agree with Mark about this. This method doesn't seem worth adding an API for by default, especially if it will be tricky to implement. For extracting small samples, takeSample already lets you specify an exact number, and for downsampling large RDDs, most users probably don't need an exact count (and wouldn't want to pay an extra pass over the data for it). This and other advanced sampling methods could make a good external package, though.

@ilganeli
Author

Mark and Matei - I hear you guys and understand what you're saying. Does it make sense to create a new JIRA to refactor the RDD interface and move the advanced sampling methods into a packages class? This would obviously involve deprecating the existing functions, so I presume this wouldn't see the light of day for a while.

@mateiz
Contributor

mateiz commented Jan 1, 2015

By package, we meant an external library (e.g. on http://spark-packages.org). We shouldn't break or deprecate the methods in the current API. But a utility class with helper methods should be easy to maintain as a separate package, and then if many of them are widely used, we can also move some into Spark itself.

@pwendell
Contributor

Okay, sounds like we'll close this issue as a won't-fix.

@asfgit asfgit closed this in 1ac1c1d Jan 19, 2015