Conversation

@ilganeli

Hi all - I've added an interface to split an RDD by a count of elements (instead of simply by percentage). I've also added new tests to validate this functionality, and I've updated an existing function interface to reuse common code.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24549 has started for PR 3723 at commit 8d411c3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 17, 2014

Test build #24549 has finished for PR 3723 at commit 8d411c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait ParquetTest
    • protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24549/

@ilganeli
Author

Hello - can anyone please take a look at this patch and provide feedback on the approach?

@srowen
Member

srowen commented Dec 18, 2014

So the emphasis is on RDD, right? You can already sample to an Array on the driver. You could make the same argument for several other methods. take(100) can't be used to make an RDD. I think the logic is that it's for taking a smallish number of things. Likewise for sampling.

Put differently, how about just taking the Array and using parallelize() to make an RDD again?

If you really want a huge sample of a much larger RDD, you probably need to sample by partition, using a different approach, to do it efficiently. Here you're sampling over and over until you get enough.

So I think this may not quite fit in with how other similar API methods work, for better or worse, although maybe you don't have to have this method to do what you want.
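The alternative suggested above can be sketched in a few lines. This is an illustration, not code from the patch: a plain Python list stands in for the RDD, and the function name is hypothetical; the rough PySpark equivalents are noted in the docstring.

```python
import random

def take_sample_sketch(data, n, rng=random):
    """Exact-size sample pulled to the driver, suitable for re-parallelizing.

    The rough PySpark equivalent of the suggestion above:
        sampled = rdd.takeSample(withReplacement=False, num=n)
        new_rdd = sc.parallelize(sampled)
    """
    return rng.sample(data, n)  # n distinct elements, chosen uniformly at random
```

This keeps the exact-count guarantee, but only works while n items fit in driver memory, which is exactly the limitation raised in the reply below.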

@ilganeli
Author

Hi Sean - my concern with using take/collect() as in the previous approach is that there is essentially a hard cap on what is tractable due to memory limitations. I wanted to build an implementation that is independent of memory, even if it is less efficient.

I've run into several use cases now where I'd like to operate on a large subset of a parent RDD that is too big to fit into memory. When you start talking about complex operations on datasets of several hundred million entries, it becomes necessary to batch-process the data to keep things tractable. Having a sampling function that samples by number (versus splitting up the RDD into multiple RDDs as randomSplit() does) provides functionality that isn't presently available. I've found that randomSplit does not handle a large number of splits of a larger dataset well - partly due to memory problems and partly due to shuffle issues.

The sampling "over and over" will only happen a very small fraction of the time (when we're at the very tail end of the statistical distribution used to do the sampling). In general, this approach will only make a couple of passes over the data (once to sample the data and then at the end, if we have too many samples since the sampling is an approximation, pare down to the exact number).
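The approach described above can be sketched as follows. This is an illustrative reconstruction, not the patch itself: a plain Python list stands in for the RDD, and in Spark the per-element coin flips would be a single `rdd.sample(False, fraction)` pass followed by a trim.

```python
import random

def sample_by_count_sketch(data, n, rng=random, oversample=1.1):
    """Approximate-then-trim sampling to an exact count.

    Bernoulli-sample with a slightly inflated fraction (one pass over the
    data), retry in the rare case the sample falls short, then pare down
    to exactly n.
    """
    fraction = min(1.0, oversample * n / len(data))
    picked = [x for x in data if rng.random() < fraction]
    while len(picked) < n:  # tail of the distribution: resample (rare)
        picked = [x for x in data if rng.random() < fraction]
    return rng.sample(picked, n)  # final pare-down to the exact count
```

Note that the final trim only ever touches the over-sampled subset (roughly 1.1 * n elements here), not the full dataset.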

@ilganeli
Author

Hello, could anyone please provide any more feedback on this patch and ideally get this merged? Thanks!

@markhamstra
Contributor

My biggest problem with this is that, while the existing sample is an action, sampleByCount is another one of those unholy beasts that is neither an action nor a transformation -- meaning that, while it transforms an RDD into another RDD, it isn't lazy while doing so, but rather embeds several actions (count) and makes use of another unholy beast (zipWithIndex), all of which means that invoking sampleByCount eagerly launches several jobs in order to create the new RDD.

This is by no means the only eager transformation (or whatever we end up calling these unholy beasts), since there is a handful of others that already exist in Spark; but I am really hesitant to add another. What we need is a larger strategy and re-organization to properly handle, name and document eager transformations, but that is well beyond the scope of this single PR. In the meantime, eager transformations are just conveniences (inconveniences if you are trying to launch jobs asynchronously) that packages up one or more actions. They can always be broken up into multiple explicit and ordinary transformations and actions (as Sean was effectively suggesting earlier), so none of them are strictly necessary to achieve their functionality.

I'm really hesitant to add sampleByCount to the Spark API and thereby to the list of eager transformations that we need to somehow fix in the future. Perhaps a better way to handle such convenience packaging of transformations and actions on RDDs is to include them in Spark Packages.
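The point above, that an eager transformation can always be broken into explicit steps at the call site, might look like this. A plain Python list again stands in for the RDD, and the function name is illustrative, not Spark API; comments mark which step would be an eagerly-executed action versus a lazy transformation in Spark.

```python
import random

def explicit_sample_steps(data, n, rng):
    """Roughly the result of an eager sampleByCount, but with the embedded
    actions made explicit, so the caller controls when jobs are launched."""
    total = len(data)                     # ACTION: rdd.count() launches a job
    fraction = min(1.0, 1.2 * n / total)  # driver-side arithmetic, no job
    # lazy in Spark: rdd.sample(False, fraction)
    sampled = [x for x in data if rng.random() < fraction]
    # in Spark: zipWithIndex (itself partly eager, per the note above)
    # plus a filter on the index, so the trim need not collect anything
    return sampled[:n]  # may fall short in the rare undershoot case
```

Decomposed this way, none of the individual pieces is new API surface, which is the sense in which the eager method is a convenience rather than a necessity.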

@mateiz
Contributor

mateiz commented Dec 29, 2014

I agree with Mark about this. This method doesn't seem worth adding an API for by default, especially if it will be tricky to implement. For extracting small samples, takeSample already lets you specify an exact number, and for downsampling large RDDs, most users probably don't need an exact count (and wouldn't want to pay an extra pass over the data for it). This and other advanced sampling methods could make a good external package, though.

@ilganeli
Author

Mark and Matei - I hear you guys and understand what you're saying. Does it make sense to create a new JIRA to refactor the RDD interface and move the advanced sampling methods into a packages class? This would obviously involve deprecating the existing functions, so I presume this wouldn't see the light of day for a while.

@mateiz
Contributor

mateiz commented Jan 1, 2015

By package, we meant an external library (e.g. on http://spark-packages.org). We shouldn't break or deprecate the methods in the current API. But a utility class with helper methods should be easy to maintain as a separate package, and then if many of them are widely used, we can also move some into Spark itself.

@pwendell
Contributor

Okay, sounds like we'll close this issue as a won't-fix.

@asfgit asfgit closed this in 1ac1c1d Jan 19, 2015