[SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column #16739
Changes from all commits
```diff
@@ -678,14 +678,53 @@ setMethod("storageLevel",
             storageLevelToString(callJMethod(x@sdf, "storageLevel"))
           })
 
+#' Coalesce
+#'
+#' Returns a new SparkDataFrame that has exactly \code{numPartitions} partitions.
+#' This operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100
+#' partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of
+#' the current partitions. If a larger number of partitions is requested, it will stay at the
+#' current number of partitions.
+#'
+#' However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1,
+#' this may result in your computation taking place on fewer nodes than
+#' you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+#' call \code{repartition}. This will add a shuffle step, but means the
+#' current upstream partitions will be executed in parallel (per whatever
+#' the current partitioning is).
+#'
+#' @param numPartitions the number of partitions to use.
+#'
+#' @family SparkDataFrame functions
+#' @rdname coalesce
+#' @name coalesce
+#' @aliases coalesce,SparkDataFrame-method
+#' @seealso \link{repartition}
+#' @export
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.json"
+#' df <- read.json(path)
+#' newDF <- coalesce(df, 1L)
+#'}
+#' @note coalesce(SparkDataFrame) since 2.1.1
+setMethod("coalesce",
+          signature(x = "SparkDataFrame"),
+          function(x, numPartitions) {
+            stopifnot(is.numeric(numPartitions))
+            sdf <- callJMethod(x@sdf, "coalesce", numToInt(numPartitions))
+            dataFrame(sdf)
+          })
+
 #' Repartition
 #'
 #' The following options for repartition are possible:
 #' \itemize{
-#'  \item{1.} {Return a new SparkDataFrame partitioned by
+#'  \item{1.} {Return a new SparkDataFrame that has exactly \code{numPartitions}.}
+#'  \item{2.} {Return a new SparkDataFrame hash partitioned by
 #'             the given columns into \code{numPartitions}.}
-#'  \item{2.} {Return a new SparkDataFrame that has exactly \code{numPartitions}.}
-#'  \item{3.} {Return a new SparkDataFrame partitioned by the given column(s),
+#'  \item{3.} {Return a new SparkDataFrame hash partitioned by the given column(s),
 #'             using \code{spark.sql.shuffle.partitions} as number of partitions.}
 #'}
 #' @param x a SparkDataFrame.
```
```diff
@@ -697,6 +736,7 @@ setMethod("storageLevel",
 #' @rdname repartition
 #' @name repartition
 #' @aliases repartition,SparkDataFrame-method
+#' @seealso \link{coalesce}
 #' @export
 #' @examples
 #'\dontrun{
```
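The narrow-dependency behaviour documented above, where each new partition claims a group of existing partitions and the count never exceeds the current number, can be sketched in plain Python. This is an illustration only, not Spark code; `coalesce_plan` is a hypothetical helper name.

```python
def coalesce_plan(num_old, num_requested):
    """Sketch of a narrow-dependency coalesce: each new partition claims
    a contiguous group of old partitions, so no data is shuffled.
    Requesting more partitions than exist leaves the count unchanged."""
    num_new = min(num_old, num_requested)  # coalesce never increases partitions
    # Distribute the old partitions as evenly as possible over the new ones.
    base, extra = divmod(num_old, num_new)
    plan, start = [], 0
    for i in range(num_new):
        size = base + (1 if i < extra else 0)
        plan.append(list(range(start, start + size)))
        start += size
    return plan

plan = coalesce_plan(1000, 100)
print(len(plan))                       # 100 new partitions
print(len(plan[0]))                    # each claims 10 of the old 1000
print(len(coalesce_plan(100, 1000)))   # stays at 100, per the doc above
```

Going from 1000 to 100 partitions, each new partition claims exactly 10 old ones; asking for 1000 when only 100 exist is a no-op, matching the "it will stay at the current number of partitions" sentence added by this PR.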
```diff
@@ -2432,7 +2432,15 @@ class Dataset[T] private[sql](
    * Returns a new Dataset that has exactly `numPartitions` partitions.
    * Similar to coalesce defined on an `RDD`, this operation results in a narrow dependency, e.g.
    * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
-   * the 100 new partitions will claim 10 of the current partitions.
+   * the 100 new partitions will claim 10 of the current partitions. If a larger number of
+   * partitions is requested, it will stay at the current number of partitions.
+   *
+   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
+   * this may result in your computation taking place on fewer nodes than
+   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+   * you can call repartition. This will add a shuffle step, but means the
+   * current upstream partitions will be executed in parallel (per whatever
+   * the current partitioning is).
    *
    * @group typedrel
    * @since 1.6.0
```
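Unlike coalesce, repartition performs a full shuffle: when hash partitioning by given columns (option 2 in the R documentation above), every row is routed to the partition given by the hash of its key modulo the partition count. A minimal sketch in plain Python, not Spark API; `hash_repartition`, `rows`, and `key` are illustrative names.

```python
def hash_repartition(rows, key, num_partitions):
    """Sketch of hash repartitioning: each row goes to partition
    hash(row[key]) % num_partitions, which requires moving (shuffling)
    data between partitions, unlike a narrow-dependency coalesce."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"dept": d, "n": i} for i, d in enumerate(["a", "b", "c", "a", "b"])]
parts = hash_repartition(rows, "dept", 4)
# Rows with equal keys always land in the same partition.
assert sum(len(p) for p in parts) == len(rows)
```

Because the target partition depends only on the key's hash, all rows sharing a key end up co-located, which is what makes subsequent per-key operations cheap; the cost is the shuffle step the documentation warns about.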
Review comments:

> If there are more partitions, then there will be a shuffle, right? Might be useful to add that.

> Actually, no: coalesce is set to `min(prev partitions, numPartitions)` according to `CoalescedRDD`, so it will be unchanged then.

> Oh well, I guess that's worth mentioning then?
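The point settled in this exchange, that coalesce clamps the requested count rather than shuffling up, amounts to a one-line rule. A trivial sketch in plain Python arithmetic, not Spark API; `effective_partitions` is a hypothetical name:

```python
def effective_partitions(current, requested):
    # Per CoalescedRDD, the result is capped at the current partition
    # count: asking coalesce for more partitions than exist is a no-op.
    return min(current, requested)

print(effective_partitions(100, 1000))  # 100, unchanged (no shuffle)
print(effective_partitions(1000, 100))  # 100, reduced via narrow dependency
```

This is exactly the sentence the PR adds to both the R and Scala docs: "If a larger number of partitions is requested, it will stay at the current number of partitions."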