[DOCS][SPARK-18365] Improve Sample Method Documentation #15815

bllchmbrs · 2016-11-08T21:28:55Z

What changes were proposed in this pull request?

I found the documentation for the sample method confusing; this adds clarification across all languages.

How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

brkyvz · 2016-11-08T21:31:44Z

@anabranch I don't see how the documentation was wrong. The second argument doesn't take the seed as a parameter, therefore the seed is random
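To make the point about the seed-less overload concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: `bernoulli_sample` is a hypothetical helper, and the fallback to a randomly drawn seed is an assumption made to mirror the behaviour described in the comment above.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; it mimics a seed-less
    overload by drawing a random seed when none is given.
    """
    if seed is None:
        # No seed supplied: pick one at random, so results vary per call.
        seed = random.randrange(2**63)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
first = bernoulli_sample(rows, 0.1, seed=42)
second = bernoulli_sample(rows, 0.1, seed=42)
print(first == second)  # prints True: same seed, same sample
```

Calling `bernoulli_sample(rows, 0.1)` without a seed will generally return a different subset each time, which is why the documentation needs to say which overload uses a user-supplied seed.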

brkyvz · 2016-11-08T21:38:11Z

LGTM

SparkQA · 2016-11-08T23:20:59Z

Test build #68371 has finished for PR 15815 at commit ce2bb90.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-11-08T23:59:21Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-   * Returns a new Dataset by sampling a fraction of rows.
+   * Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
+   * Note: this is NOT guaranteed to provide exactly the fraction specified of the 
+   * Dataset.


Should we change this to [[Dataset]] instead to mark down pretty?

Done in next commit.

HyukjinKwon · 2016-11-09T00:03:17Z

How about the ones in Python and R? (If we should change them too, don't forget to mark up Note: as .. note:: and Dataset as :class:`DataFrame` so that they render nicely for Python.)

HyukjinKwon · 2016-11-09T00:05:33Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

  /**
-   * Returns a new Dataset by sampling a fraction of rows.
+   * Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
+   * Note: this is NOT guaranteed to provide exactly the fraction specified of the 


Maybe we need a blank line between the description and Note:, to be consistent with the rest of this file and with other places such as functions.scala, if more commits are going to be pushed anyway.

Added a new line.

SparkQA · 2016-11-09T00:14:42Z

Test build #68363 has finished for PR 15815 at commit c6e1000.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-09T00:27:12Z

Test build #68364 has finished for PR 15815 at commit e46e6a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-11-12T09:37:20Z

@anabranch OK, the purpose of this JIRA and PR has changed significantly. Could you update all the titles and descriptions?

I think the right thing to do is make the documentation consistent across all sample methods, including RDD and DataFrame and in Python and R. They all describe this a bit differently. For example, the RDD method covers this, in a way, by talking about expected sample size.

I'd merge this if it were making every related method's docs consistent.
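A short sketch of why the docs talk about an expected sample size rather than an exact one. It assumes each row is kept independently with probability `fraction`, which is one common way to sample without replacement; `sample_size` is a hypothetical helper, not Spark code.

```python
import random


def sample_size(n, fraction, seed):
    """Count rows kept when each of n rows is kept with probability `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if rng.random() < fraction)


n, fraction = 10_000, 0.1
sizes = [sample_size(n, fraction, seed) for seed in range(5)]
print("expected size:", n * fraction)
print("observed sizes:", sizes)
```

Under this assumption the observed sizes cluster around `n * fraction` but rarely equal it, which is the behaviour the added notes are meant to document.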

bllchmbrs · 2016-11-12T16:45:54Z

Sounds good to me. I will update it shortly.

bllchmbrs · 2016-11-12T18:00:06Z

@srowen Think this is probably ready.

- Updated all languages
- Updated ticket description

HyukjinKwon · 2016-11-12T18:21:03Z

R/pkg/R/DataFrame.R

 #'
-#' Return a sampled subset of this SparkDataFrame using a random seed.
+#' Return a sampled subset of this SparkDataFrame using a random seed. 
+#' Note that this is not guaranteed to provide exactly the fraction specified


@anabranch This is probably fine already (I hope this does not sound like nitpicking), but maybe we had better match it up with the form below (there are some examples of this in https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/functions.R#L2299-L2300 and https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/mllib.R#L606-L607):

#' Return a sampled subset of this SparkDataFrame using a random seed.
#'
#' Note: that this is not guaranteed to provide exactly the fraction specified

I understand Note: and NOTE: are mixed across the documentation (including annotations and indentation). Let me try to deal with this across all documentation in a separate PR.

Yeah, I see what you're saying, but I think it might be a bit out of scope for this PR. Your goal definitely makes sense, as there should be a standard, but I don't think I'm degrading anything significantly. I will add the : character.

HyukjinKwon · 2016-11-12T18:21:38Z

For Python documentation, it seems fine.

SparkQA · 2016-11-12T19:53:53Z

Test build #68566 has finished for PR 15815 at commit b4f2611.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

bllchmbrs · 2016-11-12T20:47:33Z

The test failure seems quite unrelated but we'll see if it happens again.

SparkQA · 2016-11-12T22:48:29Z

Test build #68573 has finished for PR 15815 at commit fae4a80.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-11-13T22:28:52Z

R/pkg/R/DataFrame.R

 #'
-#' Return a sampled subset of this SparkDataFrame using a random seed.
+#' Return a sampled subset of this SparkDataFrame using a random seed. 
+#' Note: that this is not guaranteed to provide exactly the fraction specified


nit: change to Note: this is not ... from Note: that this is not...?

SparkQA · 2016-11-14T00:55:19Z

Test build #68590 has finished for PR 15815 at commit f29acb4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-11-14T09:15:53Z

LGTM thanks!

srowen

I think we should give RDD, JavaRDD and rdd.py a similar treatment for consistency. They also have a sample method.

SparkQA · 2016-11-16T06:58:42Z

Test build #68691 has finished for PR 15815 at commit f464b6e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-16T16:16:03Z

Test build #3427 has finished for PR 15815 at commit f464b6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-11-16T16:21:13Z

python/pyspark/sql/dataframe.py

    def sample(self, withReplacement, fraction, seed=None):
        """Returns a sampled subset of this :class:`DataFrame`.

+        .. note::


Tiny question about this syntax (I don't know it) -- I see other instances of this in the code base have to use a line continuation \ when it spans several lines? and then indent continuation lines flush with "..".

I've found that this syntax works quite well. I'm not familiar with the line continuation syntax that you're referring to but this will display appropriately (on one line as a sentence).
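For reference, a minimal sketch of the reStructuredText form under discussion. `sample_rows` is a hypothetical function used only to carry the docstring, and the wording is illustrative rather than the merged text. As far as I know, continuation lines of a directive only need to be indented relative to the `..` marker for the note to render as a single sentence.

```python
def sample_rows(rows, fraction, seed=None):
    """Return a sampled subset of ``rows``.

    .. note:: This is not guaranteed to return exactly the specified
        fraction of the rows.
    """
    # Documentation-only sketch: the body is intentionally not implemented.
    raise NotImplementedError("documentation-only sketch")


print(sample_rows.__doc__)
```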

srowen · 2016-11-16T16:22:10Z

core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala


  /**
   * Return a sampled subset of this RDD.
+   * Note: this is NOT guaranteed to provide exactly the fraction of the count


There are a couple overloads of sample here; update them all and maybe apply the clarification about seed you added below?

Yes. Will add this.

There is also another method I forgot in the python RDD that I will fix now as well.

Fixed; the Python one didn't need it once I re-read the docs.

SparkQA · 2016-11-16T18:13:26Z

Test build #3428 has finished for PR 15815 at commit f464b6e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

bllchmbrs · 2016-11-17T03:50:18Z

These failures also seem unrelated.

SparkQA · 2016-11-17T06:24:45Z

Test build #68742 has finished for PR 15815 at commit 0d7cde8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-11-17T11:35:11Z

Merged to master/2.1

## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing, this adds more clarification across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #15815 from anabranch/SPARK-18365.

(cherry picked from commit 49b6f45)
Signed-off-by: Sean Owen <[email protected]>

fix error in docs

c6e1000

Bill Chambers added 2 commits November 8, 2016 13:34

updated with feedback

9f04fa8

wording

e46e6a7

even more clarification

ce2bb90

HyukjinKwon reviewed Nov 8, 2016

HyukjinKwon reviewed Nov 9, 2016

bllchmbrs changed the title ~~[DOCS][SPARK-18365] Documentation is Switched on Sample Methods~~ [DOCS][SPARK-18365] Improve Sample Documentation Nov 12, 2016

bllchmbrs added 4 commits November 12, 2016 09:47

added links, spacing

a257d81

add python docs

19c4828

added R

c94b75c

made slightly more consistent

b4f2611

bllchmbrs changed the title ~~[DOCS][SPARK-18365] Improve Sample Documentation~~ [DOCS][SPARK-18365] Improve Sample Method Documentation Nov 12, 2016

HyukjinKwon reviewed Nov 12, 2016

added colon

fae4a80

felixcheung reviewed Nov 13, 2016

fix nit

f29acb4

srowen requested changes Nov 14, 2016

bllchmbrs added 2 commits November 15, 2016 19:21

updated python rdd

064c653

rdd & javardd

f464b6e

srowen reviewed Nov 16, 2016

improved javaRDD

0d7cde8

asfgit closed this in 49b6f45 Nov 17, 2016
