Closed
4 changes: 3 additions & 1 deletion R/pkg/R/DataFrame.R
@@ -936,7 +936,9 @@ setMethod("unique",

#' Sample
#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note: this is not guaranteed to provide exactly the fraction specified
#' of the total count of the given SparkDataFrame.
#'
#' @param x A SparkDataFrame
#' @param withReplacement Sampling with replacement or not
8 changes: 6 additions & 2 deletions core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala
@@ -98,7 +98,9 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
def repartition(numPartitions: Int): JavaRDD[T] = rdd.repartition(numPartitions)

/**
* Return a sampled subset of this RDD.
* Return a sampled subset of this RDD with a random seed.
* Note: this is NOT guaranteed to provide exactly the fraction of the count
Member:
There are a couple overloads of sample here; update them all and maybe apply the clarification about seed you added below?

Contributor Author:
Yes. Will add this.

Contributor Author:
There is also another method I forgot in the python RDD that I will fix now as well.

Contributor Author:
Fixed; the Python one didn't need it once I re-read the docs.

* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
@@ -109,7 +111,9 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
sample(withReplacement, fraction, Utils.random.nextLong)

/**
* Return a sampled subset of this RDD.
* Return a sampled subset of this RDD, with a user-supplied seed.
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
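
The two overloads above differ only in where the seed comes from: one delegates with a freshly generated seed, the other takes a user-supplied one. A minimal Python sketch of that idea follows. The `sample` function here is a hypothetical stand-in written for illustration, not Spark's implementation, and it only covers sampling without replacement.

```python
import random


def sample(items, fraction, seed=None):
    """Hypothetical Bernoulli sampler; not Spark's implementation.

    When no seed is given, one is drawn at random, mirroring the overload
    that delegates with a freshly generated seed.
    """
    if seed is None:
        seed = random.getrandbits(63)
    rng = random.Random(seed)
    # Keep each element independently with probability `fraction`.
    return [x for x in items if rng.random() < fraction]


data = list(range(1000))

# Same user-supplied seed: the same elements are selected every time.
first = sample(data, 0.1, seed=42)
second = sample(data, 0.1, seed=42)

# A different seed generally selects different elements.
other = sample(data, 0.1, seed=43)

print(first == second)
print(first == other)
```

A fixed seed makes a sample reproducible, but it does not make the sample size equal to `fraction` times the input size.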
3 changes: 3 additions & 0 deletions core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -466,6 +466,9 @@ abstract class RDD[T: ClassTag](
/**
* Return a sampled subset of this RDD.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
5 changes: 5 additions & 0 deletions python/pyspark/rdd.py
@@ -386,6 +386,11 @@ def sample(self, withReplacement, fraction, seed=None):
with replacement: expected number of times each element is chosen; fraction must be >= 0
:param seed: seed for the random number generator

.. note::

This is not guaranteed to provide exactly the fraction specified of the total count
of the given :class:`RDD`.

>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
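
To illustrate why the doctest above accepts any count from 6 to 14 rather than exactly 10, here is a rough pure-Python sketch. It assumes sampling without replacement keeps each element independently with probability `fraction`, as the parameter documentation describes. `bernoulli_sample` is a hypothetical helper, not Spark's code.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`.

    Hypothetical stand-in for sampling without replacement; this is not
    Spark's implementation.
    """
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


data = list(range(100))
fraction = 0.1

# Sample sizes for 1000 different seeds: they cluster around
# fraction * len(data) = 10 but are not always exactly 10.
counts = [len(bernoulli_sample(data, fraction, seed)) for seed in range(1000)]

mean_count = sum(counts) / len(counts)
# Binomial standard deviation: sqrt(n * p * (1 - p)) = sqrt(100 * 0.1 * 0.9)
std_dev = (len(data) * fraction * (1 - fraction)) ** 0.5

print("distinct sample sizes seen:", sorted(set(counts)))
print("mean sample size:", mean_count)
print("theoretical std dev:", std_dev)
```

Under this model the sample size follows a binomial distribution with mean 10 and standard deviation 3, so individual samples routinely land a few elements above or below 10.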
5 changes: 5 additions & 0 deletions python/pyspark/sql/dataframe.py
@@ -549,6 +549,11 @@ def distinct(self):
def sample(self, withReplacement, fraction, seed=None):
"""Returns a sampled subset of this :class:`DataFrame`.

.. note::
Member:
Tiny question about this syntax (I don't know it): I see that other instances of this in the code base use a line continuation `\` when the note spans several lines, and then indent the continuation lines flush with "..". Is that needed here?

Contributor Author:
I've found that this syntax works quite well. I'm not familiar with the line continuation syntax that you're referring to but this will display appropriately (on one line as a sentence).
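
For reference, a small sketch of the layout being discussed. In reStructuredText, consecutive lines with the same indentation inside a directive body form a single paragraph, which is consistent with the reply above. The stub function and its shortened wording are hypothetical, and the snippet only inspects the docstring's indentation; it does not render it.

```python
import inspect


def sample(self, withReplacement, fraction, seed=None):
    """Returns a sampled subset (hypothetical stub for illustration).

    .. note::

        This is not guaranteed to provide exactly the fraction specified
        of the total count.
    """
    raise NotImplementedError


# cleandoc strips the docstring's common leading indentation.
doc_lines = inspect.cleandoc(sample.__doc__).splitlines()
start = doc_lines.index(".. note::")
body = [line for line in doc_lines[start + 1:] if line.strip()]

# Both body lines carry the same indentation relative to the directive,
# so they belong to one paragraph of the note.
indents = {len(line) - len(line.lstrip()) for line in body}
print(indents)
```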


This is not guaranteed to provide exactly the fraction specified of the total count
of the given :class:`DataFrame`.

>>> df.sample(False, 0.5, 42).count()
2
"""
10 changes: 8 additions & 2 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -1612,7 +1612,10 @@ class Dataset[T] private[sql](
}

/**
* Returns a new Dataset by sampling a fraction of rows.
* Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[Dataset]].
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.
@@ -1631,7 +1634,10 @@ class Dataset[T] private[sql](
}

/**
* Returns a new Dataset by sampling a fraction of rows, using a random seed.
* Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the total count
* of the given [[Dataset]].
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.