doc
felixcheung committed Feb 15, 2017
commit 3593794710423c2699188c5d22f34f3eb79cd43e
3 changes: 2 additions & 1 deletion R/pkg/R/DataFrame.R
@@ -683,7 +683,8 @@ setMethod("storageLevel",
#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
#' the current partitions.
#' the current partitions. If a larger number of partitions is requested, it will stay at the
#' current number of partitions.
#'
#' @param numPartitions the number of partitions to use.
#'
3 changes: 2 additions & 1 deletion R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -2491,7 +2491,7 @@ test_that("repartition by columns on DataFrame", {
("Please, specify the number of partitions and/or a column\\(s\\)", retError), TRUE)

# repartition by column and number of partitions
actual <- repartition(df, 3L, col = df$"a")
actual <- repartition(df, 3, col = df$"a")

# Checking that at least the dimensions are identical
expect_identical(dim(df), dim(actual))
@@ -2502,6 +2502,7 @@ test_that("repartition by columns on DataFrame", {
expect_identical(dim(df), dim(actual))
expect_equal(getNumPartitions(actual), 13L)

expect_equal(getNumPartitions(coalesce(actual, 14)), 13L)
expect_equal(getNumPartitions(coalesce(actual, 1L)), 1L)

# a test case with a column and dapply
3 changes: 2 additions & 1 deletion core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -423,7 +423,8 @@ abstract class RDD[T: ClassTag](
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions.
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
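The RDD doc above describes two behaviors: shrinking the partition count is a narrow dependency with no shuffle, and asking for more partitions without a shuffle leaves the count unchanged. A minimal sketch of both, assuming a local `SparkContext` named `sc` (illustration only, not part of this diff):

```scala
val rdd = sc.parallelize(1 to 1000, 10)   // start with 10 partitions

// Shrinking is a narrow dependency: no shuffle is triggered.
println(rdd.coalesce(5).getNumPartitions)                   // 5

// Requesting more partitions without a shuffle is a no-op on the count.
println(rdd.coalesce(20).getNumPartitions)                  // 10, not 20

// Passing shuffle = true actually redistributes to the requested number.
println(rdd.coalesce(20, shuffle = true).getNumPartitions)  // 20
```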
3 changes: 2 additions & 1 deletion python/pyspark/sql/dataframe.py
@@ -515,7 +515,8 @@ def coalesce(self, numPartitions):
Similar to coalesce defined on an :class:`RDD`, this operation results in a
narrow dependency, e.g. if you go from 1000 partitions to 100 partitions,
there will not be a shuffle, instead each of the 100 new partitions will
claim 10 of the current partitions.
claim 10 of the current partitions. If a larger number of partitions is requested,
it will stay at the current number of partitions.

>>> df.coalesce(1).rdd.getNumPartitions()
1
3 changes: 2 additions & 1 deletion sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2432,7 +2432,8 @@ class Dataset[T] private[sql](
* Returns a new Dataset that has exactly `numPartitions` partitions.
* Similar to coalesce defined on an `RDD`, this operation results in a narrow dependency, e.g.
* if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
* the 100 new partitions will claim 10 of the current partitions.
* the 100 new partitions will claim 10 of the current partitions. If a larger number of
* partitions is requested, it will stay at the current number of partitions.
A contributor left a review comment on this line:
So we seem to have left out the warning from RDD about drastic coalesces in the Dataset coalesce. Since we are updating the docstrings now anyway, would it maybe make sense to include that warning here as well? (Looking at the implementation of CoalesceExec, it seems like it would still apply unless I'm missing something.)

*
* @group typedrel
* @since 1.6.0
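The Dataset variant documented above behaves the same way. A hedged sketch using the Scala API, assuming a `SparkSession` named `spark` (illustration only, not part of this diff):

```scala
val ds = spark.range(0, 100, 1, 10)            // Dataset[java.lang.Long] with 10 partitions

// Coalescing down avoids a shuffle, as in the Python doctest above.
println(ds.coalesce(1).rdd.getNumPartitions)   // 1

// Requesting more partitions leaves the count unchanged; use repartition(20),
// which shuffles, when more partitions are genuinely needed.
println(ds.coalesce(20).rdd.getNumPartitions)  // 10
```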
@@ -541,7 +541,8 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
* Physical plan for returning a new RDD that has exactly `numPartitions` partitions.
* Similar to coalesce defined on an [[RDD]], this operation results in a narrow dependency, e.g.
* if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
* the 100 new partitions will claim 10 of the current partitions.
* the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions
* is requested, it will stay at the current number of partitions.
*/
case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
override def output: Seq[Attribute] = child.output