Skip to content

Conversation

@bllchmbrs
Copy link
Contributor

@bllchmbrs bllchmbrs commented Nov 8, 2016

What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing, this adds more clarification across all languages.

  • Scala
  • Python
  • R
  • RDD Scala
  • RDD Python with SEED
  • RDD Java
  • RDD Java with SEED
  • RDD Python

How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@brkyvz
Copy link
Contributor

brkyvz commented Nov 8, 2016

@anabranch I don't see how the documentation was wrong. The second argument doesn't take the seed as a parameter, therefore the seed is random

@brkyvz
Copy link
Contributor

brkyvz commented Nov 8, 2016

LGTM

@SparkQA
Copy link

SparkQA commented Nov 8, 2016

Test build #68371 has finished for PR 15815 at commit ce2bb90.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns a new Dataset by sampling a fraction of rows.
* Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
* Note: this is NOT guaranteed to provide exactly the fraction specified of the
* Dataset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this to [[Dataset]] instead to mark down pretty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in next commit.

@HyukjinKwon
Copy link
Member

How about the ones in Python and R? (If we should change them too, don't forget mark down Note: to ..note: and Dataset to :class:DataFrame`` to mark down pretty for Python).

/**
* Returns a new Dataset by sampling a fraction of rows.
* Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
* Note: this is NOT guaranteed to provide exactly the fraction specified of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we need a newline between description and Note: just to be consistent with the others in here and other places such as functions.scala if more commits should be pushed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new line.

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68363 has finished for PR 15815 at commit c6e1000.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68364 has finished for PR 15815 at commit e46e6a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 12, 2016

@anabranch OK, the purpose of this JIRA and PR have changed significantly. Could you update all the titles and descriptions?

I think the right thing to do is make the documentation consistent across all sample methods, including RDD and DataFrame and in Python and R. They all describe this a bit differently. For example, the RDD method covers this, in a way, by talking about expected sample size.

I'd merge this if it were making every related method's docs consistent.

@bllchmbrs
Copy link
Contributor Author

Sounds good to me. I will update it shortly.

@bllchmbrs bllchmbrs changed the title [DOCS][SPARK-18365] Documentation is Switched on Sample Methods [DOCS][SPARK-18365] Improve Sample Documentation Nov 12, 2016
@bllchmbrs bllchmbrs changed the title [DOCS][SPARK-18365] Improve Sample Documentation [DOCS][SPARK-18365] Improve Sample Method Documentation Nov 12, 2016
@bllchmbrs
Copy link
Contributor Author

bllchmbrs commented Nov 12, 2016

@srowen Think this is probably ready.

  • Updated All Languages
  • Updated Ticket Description

#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note that this is not guaranteed to provide exactly the fraction specified
Copy link
Member

@HyukjinKwon HyukjinKwon Nov 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anabranch it might be probably already fine (I hope this does not sound like nitpicking) but maybe we would better just match this up as below (there are some examples for this in https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/functions.R#L2299-L2300 and https://github.com/anabranch/spark/blob/b4f2611214e00e11040d78fa33b8772588897264/R/pkg/R/mllib.R#L606-L607):

#' Return a sampled subset of this SparkDataFrame using a random seed. 
#' 
#' Note: that this is not guaranteed to provide exactly the fraction specified

I understand Note: and NOTE: are mixed across the documentation (including annotations and indentation). Let me try to deal with this across all documentation in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see what you're saying but I think it might be a bit out of scope for this PR. What your goal is definitely makes sense as there should be a standard, but I don't think I'm degrading anything significantly. I will add the : character.

@HyukjinKwon
Copy link
Member

2016-11-13 3 09 02

For Python documenation, it seems fine.

@SparkQA
Copy link

SparkQA commented Nov 12, 2016

Test build #68566 has finished for PR 15815 at commit b4f2611.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bllchmbrs
Copy link
Contributor Author

The test failure seems quite unrelated but we'll see if it happens again.

@SparkQA
Copy link

SparkQA commented Nov 12, 2016

Test build #68573 has finished for PR 15815 at commit fae4a80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note: that this is not guaranteed to provide exactly the fraction specified
Copy link
Member

@felixcheung felixcheung Nov 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change to Note: this is not ... from Note: that this is not...?

@SparkQA
Copy link

SparkQA commented Nov 14, 2016

Test build #68590 has finished for PR 15815 at commit f29acb4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

LGTM thanks!

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should give RDD, JavaRDD and rdd.py a similar treatment for consistency. They also have a sample method.

@SparkQA
Copy link

SparkQA commented Nov 16, 2016

Test build #68691 has finished for PR 15815 at commit f464b6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2016

Test build #3427 has finished for PR 15815 at commit f464b6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def sample(self, withReplacement, fraction, seed=None):
"""Returns a sampled subset of this :class:`DataFrame`.
.. note::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny question about this syntax (I don't know it) -- I see other instances of this in the code base have to use a line continuation \ when it spans several lines? and then indent continuation lines flush with "..".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found that this syntax works quite well. I'm not familiar with the line continuation syntax that you're referring to but this will display appropriately (on one line as a sentence).


/**
* Return a sampled subset of this RDD.
* Note: this is NOT guaranteed to provide exactly the fraction of the count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple overloads of sample here; update them all and maybe apply the clarification about seed you added below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Will add this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also another method I forgot in the python RDD that I will fix now as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, the Python one didn't need it once i re-read the docs.

@SparkQA
Copy link

SparkQA commented Nov 16, 2016

Test build #3428 has finished for PR 15815 at commit f464b6e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bllchmbrs
Copy link
Contributor Author

failures also seem unrelated.

@SparkQA
Copy link

SparkQA commented Nov 17, 2016

Test build #68742 has finished for PR 15815 at commit 0d7cde8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Nov 17, 2016

Merged to master/2.1

asfgit pushed a commit that referenced this pull request Nov 17, 2016
## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing, this adds more clarification across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #15815 from anabranch/SPARK-18365.

(cherry picked from commit 49b6f45)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in 49b6f45 Nov 17, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

I found the documentation for the sample method to be confusing, this adds more clarification across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [X] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: anabranch <[email protected]>
Author: Bill Chambers <[email protected]>

Closes apache#15815 from anabranch/SPARK-18365.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants