[SPARK-26024][SQL]: Update documentation for repartitionByRange #23025
Changes from all commits:

- b47b6d0
- 5a50282
- 5bce520
- 654fed9
- f829dfe
- 7ca4821
```diff
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
         At least one partition-by expression must be specified.
         When no explicit sort order is specified, "ascending nulls first" is assumed.
+
+        Note that due to performance reasons this method uses sampling to estimate the ranges.
```
> **Member:** Besides Python, we also have

> **Contributor (author):** Oh right, I missed it! Pushed.
```diff
+        Hence, the output may not be consistent, since sampling can return different values.
+        The sample size can be controlled by the config
+        `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
+
         >>> df.repartitionByRange(2, "age").rdd.getNumPartitions()
         2
         >>> df.show()
```
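The nondeterminism the new docstring warns about comes from the sampling step: range partitioning first samples the data to choose split points, so different samples can yield different partition bounds. A minimal plain-Python sketch of the idea (not Spark's actual implementation; `range_bounds` and its parameters are illustrative):

```python
import random

def range_bounds(rows, num_partitions, sample_size, seed):
    """Estimate range-partition split points from a random sample (simplified sketch)."""
    sample = sorted(random.Random(seed).sample(rows, sample_size))
    # Pick num_partitions - 1 split points at evenly spaced positions in the sorted sample.
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

rows = list(range(1000))
# Different samples (seeds) can produce different bounds, hence inconsistent output.
print(range_bounds(rows, 4, sample_size=20, seed=1))
print(range_bounds(rows, 4, sample_size=20, seed=2))
```

This is only meant to show why the doctest above asserts the partition *count* (always 2) rather than partition *contents*, which can vary between runs.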
```diff
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
   * When no explicit sort order is specified, "ascending nulls first" is assumed.
   * Note, the rows are not sorted in each partition of the resulting Dataset.
   *
+  * Note that due to performance reasons this method uses sampling to estimate the ranges.
+  * Hence, the output may not be consistent, since sampling can return different values.
+  * The sample size can be controlled by the config
+  * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
```
> **Contributor:** It's not a parameter but a config. So I'd like to propose

> **Contributor (author):** @cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).
```diff
+  *
   * @group typedrel
   * @since 2.3.0
   */
@@ -2813,6 +2819,11 @@ class Dataset[T] private[sql](
   * When no explicit sort order is specified, "ascending nulls first" is assumed.
   * Note, the rows are not sorted in each partition of the resulting Dataset.
   *
+  * Note that due to performance reasons this method uses sampling to estimate the ranges.
+  * Hence, the output may not be consistent, since sampling can return different values.
+  * The sample size can be controlled by the config
+  * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
+  *
   * @group typedrel
   * @since 2.3.0
   */
```
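The config named in the doc text controls how many rows are sampled per partition; a larger sample makes the estimated range bounds more stable from run to run. A rough plain-Python illustration of that trade-off (again a sketch, not Spark code; `estimate_split` is a made-up helper):

```python
import random
import statistics

def estimate_split(rows, sample_size, seed):
    """Median of a random sample: the simplest estimate of a 2-way range split."""
    sample = sorted(random.Random(seed).sample(rows, sample_size))
    return sample[len(sample) // 2]

rows = list(range(10_000))
for sample_size in (10, 1000):
    # Re-estimate the split point across many runs; larger samples vary less.
    splits = [estimate_split(rows, sample_size, seed) for seed in range(50)]
    print(sample_size, round(statistics.stdev(splits), 1))
```

Raising `spark.sql.execution.rangeExchange.sampleSizePerPartition` trades extra sampling work for more consistent partition boundaries, which is exactly the behavior the added doc sentence describes.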
> **Contributor:** This won't be formatted correctly in the R doc, because an "empty line" is significant there. L769 should be removed to ensure it is in the description.
> **Contributor (author):** I see. What about on 761? I see several docs around here with empty lines (829, 831 below). Are those different? These comments are secondary, but I guess they belong in the public docs as much as anything.
> **Contributor:** 761 is significant also, but correct.
>
> Essentially, you want the "important" part of the description on top and not in the "detail" section, because the detail section is easily missed.
>
> This is the most common pattern in this code base. There is another, where multiple functions are documented together as a group, e.g. the collection SQL functions (in functions.R). Other, finer control is possible as well but is not used in this code base today.
>
> Similarly, L829 is good; L831 is a bit fuzzy. I'd personally prefer it without L831, to keep the whole text in the description section of the doc. Generally, if the doc text starts with "Note that", I'm OK with it being in the "detail" section.
> **Contributor:** @felixcheung have a look at #23167
> **Contributor:** @felixcheung Thanks, I did not know about this strict doc formatting rule in R.
>
> @srowen Thanks for taking care of the fix!