[DOCS][SPARK-18365] Improve Sample Method Documentation #15815
Conversation
@anabranch I don't see how the documentation was wrong. The second argument doesn't take the seed as a parameter; therefore the seed is random.
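To make the seeded versus unseeded distinction concrete, here is a minimal plain-Python sketch. The helper `bernoulli_sample` is hypothetical and is not Spark's implementation; it only assumes that each row is kept independently with probability `fraction`, and that omitting the seed means a random one is chosen.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only, not Spark's code.
    When `seed` is None, Python seeds the generator from OS entropy,
    so repeated calls may return different subsets.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))

# Same user-supplied seed: the sample is reproducible.
first = bernoulli_sample(rows, 0.1, seed=42)
second = bernoulli_sample(rows, 0.1, seed=42)
same_with_seed = first == second

# No seed supplied: a fresh random seed is used on every call,
# so these two samples will usually differ.
unseeded_a = bernoulli_sample(rows, 0.1)
unseeded_b = bernoulli_sample(rows, 0.1)
```

Under these assumptions, "using a user-supplied seed" and "using a random seed" describe the same sampling procedure; only the reproducibility differs.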
LGTM
Test build #68371 has finished for PR 15815 at commit
 * Returns a new Dataset by sampling a fraction of rows.
 * Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
 * Note: this is NOT guaranteed to provide exactly the fraction specified of the
 * Dataset.
Should we change this to [[Dataset]] instead, so that it renders nicely in the generated docs?
Done in next commit.
How about the ones in Python and R? (If we should change them too, don't forget mark down
/**
 * Returns a new Dataset by sampling a fraction of rows.
 * Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
 * Note: this is NOT guaranteed to provide exactly the fraction specified of the
Maybe we need a blank line between the description and Note:, just to be consistent with the other docs here and in places such as functions.scala, if more commits are going to be pushed anyway.
+1
Added a new line.
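The note under discussion, that the result is not guaranteed to contain exactly the requested fraction, can be illustrated with a small simulation. This is a sketch under the assumption that sampling without replacement keeps each row independently with probability `fraction`; the function `sample_size` and the row counts are made up for illustration and are not taken from Spark.

```python
import random


def sample_size(n_rows, fraction, seed):
    """Count how many of `n_rows` rows survive an independent coin flip.

    Hypothetical stand-in for a per-row sampler; not Spark's code.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


n_rows, fraction = 10_000, 0.1
expected = n_rows * fraction  # the size a reader might assume is guaranteed

# Repeat the sampling with 20 different seeds and record each sample size.
sizes = [sample_size(n_rows, fraction, seed) for seed in range(20)]

print("expected size:", expected)
print("smallest and largest observed:", min(sizes), max(sizes))
```

The observed sizes scatter around the expected value rather than matching it exactly, which is the behavior the Note: line is meant to warn about.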
Test build #68363 has finished for PR 15815 at commit
Test build #68364 has finished for PR 15815 at commit
@anabranch OK, the purpose of this JIRA and PR has changed significantly. Could you update all the titles and descriptions? I think the right thing to do is to make the documentation consistent across all sample methods, including RDD and DataFrame, and in Python and R. They all describe this a bit differently. For example, the RDD method covers this, in a way, by talking about expected sample size. I'd merge this if it made every related method's docs consistent.
Sounds good to me. I will update it shortly.
@srowen I think this is probably ready.
R/pkg/R/DataFrame.R
Outdated
#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note that this is not guaranteed to provide exactly the fraction specified
@anabranch this is probably already fine (I hope this does not sound like nitpicking), but maybe we should just match the format below (there are some examples of it in https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/functions.R#L2299-L2300 and https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/mllib.R#L606-L607):
#' Return a sampled subset of this SparkDataFrame using a random seed.
#'
#' Note: that this is not guaranteed to provide exactly the fraction specified
I understand Note: and NOTE: are mixed across the documentation (including annotations and indentation). Let me try to deal with this across all documentation in a separate PR.
Yeah, I see what you're saying, but I think it might be a bit out of scope for this PR. Your goal definitely makes sense, as there should be a standard, but I don't think I'm degrading anything significantly. I will add the : character.
Test build #68566 has finished for PR 15815 at commit
The test failure seems quite unrelated but we'll see if it happens again.
Test build #68573 has finished for PR 15815 at commit
R/pkg/R/DataFrame.R
Outdated
#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note: that this is not guaranteed to provide exactly the fraction specified
nit: change to Note: this is not ... from Note: that this is not...?
Test build #68590 has finished for PR 15815 at commit
LGTM thanks!
srowen
left a comment
I think we should give RDD, JavaRDD and rdd.py a similar treatment for consistency. They also have a sample method.
Test build #68691 has finished for PR 15815 at commit
Test build #3427 has finished for PR 15815 at commit
def sample(self, withReplacement, fraction, seed=None):
    """Returns a sampled subset of this :class:`DataFrame`.
    .. note::
Tiny question about this syntax (I don't know it): I see that other instances of this in the code base have to use a line continuation \ when the text spans several lines, and then indent the continuation lines flush with ".."?
I've found that this syntax works quite well. I'm not familiar with the line continuation syntax that you're referring to but this will display appropriately (on one line as a sentence).
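For readers unfamiliar with the syntax in question, here is a small sketch of a multi-line `.. note::` directive in a Python docstring. It assumes the usual reStructuredText rule that a directive's content is the indented block following the directive marker, so no backslash continuation is used. The stub function below is hypothetical and only mirrors the signature shown in the diff; the exact wording and indentation in the merged docstring may differ.

```python
import inspect


def sample(with_replacement, fraction, seed=None):
    """Returns a sampled subset of the input (documentation-only stub).

    .. note:: This is not guaranteed to provide exactly the fraction
        specified of the total count of the input.
    """
    raise NotImplementedError("stub used only to show docstring layout")


# Normalize the docstring indentation the way documentation tools do,
# then locate the directive and the line that continues it.
doc_lines = inspect.cleandoc(sample.__doc__).splitlines()
note_index = next(
    i for i, line in enumerate(doc_lines) if line.startswith(".. note::")
)
continuation = doc_lines[note_index + 1]

# The continuation line is indented relative to the "..", which is what
# lets a reStructuredText parser treat it as part of the same note.
continuation_is_indented = continuation.startswith(" ")
```

Whether a given Sphinx setup renders this as a single sentence is worth confirming by building the docs locally.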
/**
 * Return a sampled subset of this RDD.
 * Note: this is NOT guaranteed to provide exactly the fraction of the count
There are a couple overloads of sample here; update them all and maybe apply the clarification about seed you added below?
Yes. Will add this.
There is also another method I forgot in the python RDD that I will fix now as well.
Fixed. The Python one didn't need it once I re-read the docs.
Test build #3428 has finished for PR 15815 at commit
Failures also seem unrelated.
Test build #68742 has finished for PR 15815 at commit
Merged to master/2.1
## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing; this adds more clarification across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #15815 from anabranch/SPARK-18365.

(cherry picked from commit 49b6f45)
Signed-off-by: Sean Owen <[email protected]>