8 changes: 8 additions & 0 deletions R/pkg/R/DataFrame.R
@@ -767,6 +767,14 @@ setMethod("repartition",
#' using \code{spark.sql.shuffle.partitions} as number of partitions.}
#'}
#'
#' At least one partition-by expression must be specified.
Member:
this won't be formatted correctly in the R doc, because "empty lines" are significant there. L769 should be removed to ensure this text ends up in the description.

Member:
I see. What about line 761? I see several docs around here with empty lines (829, 831 below). Are those different? These comments are secondary, but I guess they belong in the public docs as much as anything.

Member (@felixcheung, Nov 28, 2018):
761 is also significant, but correct.

essentially:

  1. the first line of the blob is the title (L760)
  2. the second block of text, after an "empty line", is the description (L762)
  3. the third, after another "empty line", is the "details" note, which is stashed all the way at the bottom of the doc page

so generally you want the "important" part of the description on top, not in the "details" section, because it is easily missed.

this is the most common pattern in this code base. there's another, where multiple functions are documented together as a group, e.g. the collection SQL functions (in functions.R). other, finer control is possible as well, but is not used in this code base today.

similarly, L829 is good; L831 is a bit fuzzy - I'd personally prefer it without L831, to keep the whole text in the description section of the doc. generally, if the doc text starts with "Note that", I'm OK with it going in the "details" section.
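To make the empty-line rule concrete, here is a minimal roxygen2 sketch (a hypothetical function, not part of this PR):

```r
#' This first line becomes the title of the generated doc page
#'
#' Text after the first empty roxygen line becomes the description;
#' put the "important" caveats here, near the top of the page where
#' readers will actually see them.
#'
#' Text after a second empty roxygen line lands in the "Details"
#' section, which is rendered near the bottom of the page and is
#' easily missed.
#' @param x any value, returned unchanged
#' @export
docStructureExample <- function(x) x
```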

Member:
@felixcheung have a look at #23167

Contributor Author:
@felixcheung Thanks, I did not know about this strict doc formatting rule in R.

@srowen Thanks for taking care of the fix!

#' When no explicit sort order is specified, "ascending nulls first" is assumed.
#'
#' Note that due to performance reasons this method uses sampling to estimate the ranges.
#' Hence, the output may not be consistent, since sampling can return different values.
#' The sample size can be controlled by the config
#' \code{spark.sql.execution.rangeExchange.sampleSizePerPartition}.
#'
#' @param x a SparkDataFrame.
#' @param numPartitions the number of partitions to use.
#' @param col the column by which the range partitioning will be performed.
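For context on the doc text above, a hedged SparkR sketch of the behavior it describes (assumes a SparkR build that includes repartitionByRange; the toy data and the sample-size value of 200 are illustrative, not from this PR):

```r
library(SparkR)
sparkR.session(master = "local[2]")

df <- createDataFrame(data.frame(age = c(52, 17, 33, 28, 61, 40)))

# Range boundaries are estimated by sampling, so repeated runs may place
# borderline rows in different partitions. Raising the per-partition
# sample size makes the estimated boundaries more stable.
sql("SET spark.sql.execution.rangeExchange.sampleSizePerPartition=200")

# No explicit sort order is given, so "ascending nulls first" is assumed.
partitioned <- repartitionByRange(df, 2L, df$age)
head(partitioned)

sparkR.session.stop()
```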
5 changes: 5 additions & 0 deletions python/pyspark/sql/dataframe.py
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
At least one partition-by expression must be specified.
When no explicit sort order is specified, "ascending nulls first" is assumed.

Note that due to performance reasons this method uses sampling to estimate the ranges.
Member:
Besides Python, we also have a repartitionByRange API in R. Can you update it as well?

Contributor Author:
Oh right, I missed it! Pushed.

Hence, the output may not be consistent, since sampling can return different values.
The sample size can be controlled by the config
`spark.sql.execution.rangeExchange.sampleSizePerPartition`.

>>> df.repartitionByRange(2, "age").rdd.getNumPartitions()
2
>>> df.show()
11 changes: 11 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
* Note that due to performance reasons this method uses sampling to estimate the ranges.
* Hence, the output may not be consistent, since sampling can return different values.
* The sample size can be controlled by the config
* `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
Contributor:
It's not a parameter but a config, so I'd like to propose:

The sample size can be controlled by the config `xxx`

Contributor Author:
@cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).

*
* @group typedrel
* @since 2.3.0
*/
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
* When no explicit sort order is specified, "ascending nulls first" is assumed.
* Note, the rows are not sorted in each partition of the resulting Dataset.
*
* Note that due to performance reasons this method uses sampling to estimate the ranges.
* Hence, the output may not be consistent, since sampling can return different values.
* The sample size can be controlled by the config
* `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
*
* @group typedrel
* @since 2.3.0
*/