Commit 1e29f0a

HyukjinKwon authored and gatorsmile committed
[SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and improve documentation
## What changes were proposed in this pull request?

This PR proposes to change the documentation for functions. Please refer to the discussion in #15513.

The changes include:

- Re-indenting the documentation
- Adding examples/arguments in `extended` where the arguments are multiple or in a specific format (e.g. xml/json)

For example, the documentation was updated as below:

### Functions with single-line usage

**Before**

- `pow`

  ``` sql
  Usage: pow(x1, x2) - Raise x1 to the power of x2.
  Extended Usage:
  > SELECT pow(2, 3);
  8.0
  ```

- `current_timestamp`

  ``` sql
  Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
  Extended Usage:
  No example for current_timestamp.
  ```

**After**

- `pow`

  ``` sql
  Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
  Extended Usage:
      Examples:
        > SELECT pow(2, 3);
         8.0
  ```

- `current_timestamp`

  ``` sql
  Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
  Extended Usage:
      No example/argument for current_timestamp.
  ```

### Functions with (already) multi-line usage

**Before**

- `approx_count_distinct`

  ``` sql
  Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
      approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by
        HyperLogLog++ with relativeSD, the maximum estimation error allowed.
  Extended Usage:
  No example for approx_count_distinct.
  ```

- `percentile_approx`

  ``` sql
  Usage:
      percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
        column `col` at the given percentage. The value of percentage must be between 0.0
        and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
        controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
        better accuracy, `1.0/accuracy` is the relative error of the approximation.

      percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
        percentile array of column `col` at the given percentage array. Each value of the
        percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
        a positive integer literal which controls approximation accuracy at the cost of memory.
        Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
        the approximation.
  Extended Usage:
  No example for percentile_approx.
  ```

**After**

- `approx_count_distinct`

  ``` sql
  Usage:
      approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
        `relativeSD` defines the maximum estimation error allowed.
  Extended Usage:
      No example/argument for approx_count_distinct.
  ```

- `percentile_approx`

  ``` sql
  Usage:
      percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
        column `col` at the given percentage. The value of percentage must be between 0.0
        and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
        controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
        better accuracy, `1.0/accuracy` is the relative error of the approximation.
      When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
        In this case, returns the approximate percentile array of column `col` at the given
        percentage array.
  Extended Usage:
      Examples:
        > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
         [10.0,10.0,10.0]
        > SELECT percentile_approx(10.0, 0.5, 100);
         10.0
  ```

## How was this patch tested?

Manually tested.

**When examples are multiple**

``` sql
spark-sql> describe function extended reflect;
Function: reflect
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
Extended Usage:
    Examples:
      > SELECT reflect('java.util.UUID', 'randomUUID');
       c33fb387-8500-4bfa-81d2-6e0e3e930df2
      > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
       a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
```

**When `Usage` is in a single line**

``` sql
spark-sql> describe function extended min;
Function: min
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Usage: min(expr) - Returns the minimum value of `expr`.
Extended Usage:
    No example/argument for min.
```

**When `Usage` is already in multiple lines**

``` sql
spark-sql> describe function extended percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
    percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
      column `col` at the given percentage. The value of percentage must be between 0.0
      and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
      controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
      better accuracy, `1.0/accuracy` is the relative error of the approximation.
    When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
      In this case, returns the approximate percentile array of column `col` at the given
      percentage array.
Extended Usage:
    Examples:
      > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
       [10.0,10.0,10.0]
      > SELECT percentile_approx(10.0, 0.5, 100);
       10.0
```

**When example/argument is missing**

``` sql
spark-sql> describe function extended rank;
Function: rank
Class: org.apache.spark.sql.catalyst.expressions.Rank
Usage:
    rank() - Computes the rank of a value in a group of values. The result is one plus the number
      of rows preceding or equal to the current row in the ordering of the partition. The values
      will produce gaps in the sequence.
Extended Usage:
    No example/argument for rank.
```

Author: hyukjinkwon <[email protected]>

Closes #15677 from HyukjinKwon/SPARK-17963-1.

(cherry picked from commit 7eb2ca8)
Signed-off-by: gatorsmile <[email protected]>
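Throughout these docstrings, `_FUNC_` is a placeholder that is substituted with the registered function name before `DESCRIBE FUNCTION` displays the text (so one annotation serves every alias). A minimal, illustrative sketch of that substitution — the class and method names below are hypothetical, not Spark's actual API:

```java
// Illustrative only: mimics replacing the `_FUNC_` placeholder in a
// usage string with the concrete registered function name for display.
public class UsageFormatter {
    // Replace every occurrence of `_FUNC_` with the function's name.
    public static String replaceFunctionName(String usage, String functionName) {
        if (usage == null) {
            return null;
        }
        return usage.replaceAll("_FUNC_", functionName);
    }

    public static void main(String[] args) {
        String usage = "_FUNC_(expr1, expr2) - Raises `expr1` to the power of `expr2`.";
        // prints: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
        System.out.println(replaceFunctionName(usage, "pow"));
    }
}
```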
1 parent 5ea2f9e commit 1e29f0a

40 files changed: +1256 −451 lines

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java

Lines changed: 1 addition & 1 deletion
@@ -39,5 +39,5 @@
 @Retention(RetentionPolicy.RUNTIME)
 public @interface ExpressionDescription {
   String usage() default "_FUNC_ is undocumented";
-  String extended() default "No example for _FUNC_.";
+  String extended() default "\n    No example/argument for _FUNC_.\n";
 }
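The hunk above only changes the default text of `extended()`. Because the annotation has `RUNTIME` retention, that default is what reflection reports whenever an expression class omits the element. A small self-contained sketch (simplified stand-ins, not Spark's classes) showing that fallback behavior:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class AnnotationDefaultDemo {
    // Simplified stand-in for Spark's ExpressionDescription annotation:
    // RUNTIME retention makes it readable via reflection, and the
    // `default` values apply when an annotated class omits an element.
    @Retention(RetentionPolicy.RUNTIME)
    public @interface Description {
        String usage() default "_FUNC_ is undocumented";
        String extended() default "\n    No example/argument for _FUNC_.\n";
    }

    // Sets only `usage`; `extended` falls back to the annotation default.
    @Description(usage = "_FUNC_(expr) - Returns the minimum value of `expr`.")
    public static class Min {}

    // Read the extended() text of any annotated class via reflection.
    public static String extendedOf(Class<?> c) {
        return c.getAnnotation(Description.class).extended();
    }

    public static void main(String[] args) {
        // Prints the default: "No example/argument for _FUNC_." (indented).
        System.out.println(extendedOf(Min.class));
    }
}
```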

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala

Lines changed: 8 additions & 4 deletions
@@ -43,11 +43,15 @@ import org.apache.spark.util.Utils
  * and the second element should be a literal string for the method name,
  * and the remaining are input arguments to the Java method.
  */
-// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(class,method[,arg1[,arg2..]]) calls method with reflection",
-  extended = "> SELECT _FUNC_('java.util.UUID', 'randomUUID');\n c33fb387-8500-4bfa-81d2-6e0e3e930df2")
-// scalastyle:on line.size.limit
+  usage = "_FUNC_(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_('java.util.UUID', 'randomUUID');
+       c33fb387-8500-4bfa-81d2-6e0e3e930df2
+      > SELECT _FUNC_('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
+       a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
+  """)
 case class CallMethodViaReflection(children: Seq[Expression])
   extends Expression with CodegenFallback {
5357

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

Lines changed: 6 additions & 2 deletions
@@ -114,8 +114,12 @@ object Cast {

 /** Cast the child expression to the target data type. */
 @ExpressionDescription(
-  usage = " - Cast value v to the target data type.",
-  extended = "> SELECT _FUNC_('10' as int);\n 10")
+  usage = "_FUNC_(expr AS type) - Casts the value `expr` to the target data type `type`.",
+  extended = """
+    Examples:
+      > SELECT _FUNC_('10' as int);
+       10
+  """)
 case class Cast(child: Expression, dataType: DataType) extends UnaryExpression with NullIntolerant {

   override def toString: String = s"cast($child as ${dataType.simpleString})"

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InputFileName.scala

Lines changed: 1 addition & 2 deletions
@@ -27,8 +27,7 @@ import org.apache.spark.unsafe.types.UTF8String
  * Expression that returns the name of the current file being read.
  */
 @ExpressionDescription(
-  usage = "_FUNC_() - Returns the name of the current file being read if available",
-  extended = "> SELECT _FUNC_();\n ''")
+  usage = "_FUNC_() - Returns the name of the current file being read if available.")
 case class InputFileName() extends LeafExpression with Nondeterministic {

   override def nullable: Boolean = true

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala

Lines changed: 7 additions & 7 deletions
@@ -33,13 +33,13 @@ import org.apache.spark.sql.types.{DataType, LongType}
  * Since this expression is stateful, it cannot be a case object.
  */
 @ExpressionDescription(
-  usage =
-    """_FUNC_() - Returns monotonically increasing 64-bit integers.
-      The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
-      The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits
-      represent the record number within each partition. The assumption is that the data frame has
-      less than 1 billion partitions, and each partition has less than 8 billion records.""",
-  extended = "> SELECT _FUNC_();\n 0")
+  usage = """
+    _FUNC_() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed
+      to be monotonically increasing and unique, but not consecutive. The current implementation
+      puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number
+      within each partition. The assumption is that the data frame has less than 1 billion
+      partitions, and each partition has less than 8 billion records.
+  """)
 case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic {

 /**
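The reworded usage string describes the bit layout: partition ID in the upper 31 bits, per-partition record number in the lower 33 bits. A sketch of that packing (illustrative only, not Spark's internal code):

```java
public class MonotonicIdDemo {
    // Pack a partition ID (upper 31 bits) and a per-partition record
    // number (lower 33 bits) into one 64-bit ID, as the docs describe.
    public static long compose(long partitionId, long recordNumber) {
        return (partitionId << 33) | recordNumber;
    }

    public static void main(String[] args) {
        // Partition 0 starts at 0; partition 1 starts at 2^33 = 8589934592.
        System.out.println(compose(0, 0));  // 0
        System.out.println(compose(1, 0));  // 8589934592
        System.out.println(compose(1, 7));  // 8589934599
    }
}
```

This also makes the documented limits concrete: 31 bits allow just under 2^31 (about 2 billion, hence the conservative "less than 1 billion") partitions, and 33 bits allow up to 2^33 (about 8 billion) records per partition.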

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SparkPartitionID.scala

Lines changed: 1 addition & 2 deletions
@@ -25,8 +25,7 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
  * Expression that returns the current partition id.
  */
 @ExpressionDescription(
-  usage = "_FUNC_() - Returns the current partition id",
-  extended = "> SELECT _FUNC_();\n 0")
+  usage = "_FUNC_() - Returns the current partition id.")
 case class SparkPartitionID() extends LeafExpression with Nondeterministic {

   override def nullable: Boolean = false

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala

Lines changed: 14 additions & 12 deletions
@@ -49,21 +49,23 @@ import org.apache.spark.sql.types._
  * DEFAULT_PERCENTILE_ACCURACY.
  */
 @ExpressionDescription(
-  usage =
-    """
-      _FUNC_(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
+  usage = """
+    _FUNC_(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
       column `col` at the given percentage. The value of percentage must be between 0.0
-      and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
+      and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
       controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
       better accuracy, `1.0/accuracy` is the relative error of the approximation.
-
-      _FUNC_(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
-      percentile array of column `col` at the given percentage array. Each value of the
-      percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
-      a positive integer literal which controls approximation accuracy at the cost of memory.
-      Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
-      the approximation.
-    """)
+    When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
+      In this case, returns the approximate percentile array of column `col` at the given
+      percentage array.
+  """,
+  extended = """
+    Examples:
+      > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
+       [10.0,10.0,10.0]
+      > SELECT percentile_approx(10.0, 0.5, 100);
+       10.0
+  """)
 case class ApproximatePercentile(
     child: Expression,
     percentageExpression: Expression,
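The rewritten usage covers both the scalar and array forms of `percentage`. To illustrate just those semantics, here is a naive exact nearest-rank percentile; Spark's `ApproximatePercentile` actually uses an approximate quantile summary, so this sketch is for intuition only:

```java
import java.util.Arrays;

public class PercentileDemo {
    // Exact percentile by nearest rank on a sorted copy; `percentage`
    // must be in [0.0, 1.0], mirroring the documented constraint.
    public static double percentile(double[] values, double percentage) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentage * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    // Array form: one result per requested percentage, like
    // percentile_approx(col, array(p1, p2, ...)).
    public static double[] percentiles(double[] values, double[] percentages) {
        double[] out = new double[percentages.length];
        for (int i = 0; i < percentages.length; i++) {
            out[i] = percentile(values, percentages[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] col = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.println(percentile(col, 0.5));  // 3.0
        System.out.println(Arrays.toString(
            percentiles(col, new double[]{0.5, 0.9})));  // [3.0, 5.0]
    }
}
```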

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ import org.apache.spark.sql.catalyst.util.TypeUtils
 import org.apache.spark.sql.types._

 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the mean calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
 case class Average(child: Expression) extends DeclarativeAggregate {

   override def prettyName: String = "avg"

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala

Lines changed: 8 additions & 6 deletions
@@ -132,7 +132,7 @@ abstract class CentralMomentAgg(child: Expression) extends DeclarativeAggregate
 // Compute the population standard deviation of a column
 // scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the population standard deviation calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the population standard deviation calculated from values of a group.")
 // scalastyle:on line.size.limit
 case class StddevPop(child: Expression) extends CentralMomentAgg(child) {

@@ -147,8 +147,10 @@ case class StddevPop(child: Expression) extends CentralMomentAgg(child) {
 }

 // Compute the sample standard deviation of a column
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the sample standard deviation calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the sample standard deviation calculated from values of a group.")
+// scalastyle:on line.size.limit
 case class StddevSamp(child: Expression) extends CentralMomentAgg(child) {

   override protected def momentOrder = 2
@@ -164,7 +166,7 @@ case class StddevSamp(child: Expression) extends CentralMomentAgg(child) {

 // Compute the population variance of a column
 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the population variance calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the population variance calculated from values of a group.")
 case class VariancePop(child: Expression) extends CentralMomentAgg(child) {

   override protected def momentOrder = 2
@@ -179,7 +181,7 @@ case class VariancePop(child: Expression) extends CentralMomentAgg(child) {

 // Compute the sample variance of a column
 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the sample variance calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the sample variance calculated from values of a group.")
 case class VarianceSamp(child: Expression) extends CentralMomentAgg(child) {

   override protected def momentOrder = 2
@@ -194,7 +196,7 @@ case class VarianceSamp(child: Expression) extends CentralMomentAgg(child) {
 }

 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the Skewness value calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the skewness value calculated from values of a group.")
 case class Skewness(child: Expression) extends CentralMomentAgg(child) {

   override def prettyName: String = "skewness"
@@ -209,7 +211,7 @@ case class Skewness(child: Expression) extends CentralMomentAgg(child) {
 }

 @ExpressionDescription(
-  usage = "_FUNC_(x) - Returns the Kurtosis value calculated from values of a group.")
+  usage = "_FUNC_(expr) - Returns the kurtosis value calculated from values of a group.")
 case class Kurtosis(child: Expression) extends CentralMomentAgg(child) {

   override protected def momentOrder = 4
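The hunks above only rename the documented argument from `x` to `expr`; the population/sample distinction the function names encode is the usual n versus n − 1 divisor. A quick illustrative sketch (Spark's `CentralMomentAgg` computes moments incrementally, not in two passes like this):

```java
public class VarianceDemo {
    // Population variance: mean squared deviation, divided by n.
    public static double variancePop(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double ss = 0;
        for (double x : xs) ss += (x - mean) * (x - mean);
        return ss / xs.length;
    }

    // Sample variance: divide by n - 1 (Bessel's correction).
    public static double varianceSamp(double[] xs) {
        return variancePop(xs) * xs.length / (xs.length - 1);
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0, 4.0};
        System.out.println(variancePop(xs));   // 1.25
        System.out.println(varianceSamp(xs));  // 1.666...
    }
}
```

The standard deviations (`stddev_pop`, `stddev_samp`) are just the square roots of these two quantities.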

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala

Lines changed: 3 additions & 1 deletion
@@ -28,8 +28,10 @@ import org.apache.spark.sql.types._
  * Definition of Pearson correlation can be found at
  * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
  */
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(x,y) - Returns Pearson coefficient of correlation between a set of number pairs.")
+  usage = "_FUNC_(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.")
+// scalastyle:on line.size.limit
 case class Corr(x: Expression, y: Expression) extends DeclarativeAggregate {

   override def children: Seq[Expression] = Seq(x, y)
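For reference, the Pearson coefficient documented here is the covariance of the two columns divided by the product of their standard deviations. A straightforward two-pass sketch of that definition (Spark's `DeclarativeAggregate` implementation computes it incrementally instead):

```java
public class CorrDemo {
    // Pearson correlation: covariance over the product of standard
    // deviations, per the definition linked in the scaladoc above.
    public static double corr(double[] xs, double[] ys) {
        int n = xs.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += xs[i]; my += ys[i]; }
        mx /= n;
        my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (xs[i] - mx) * (ys[i] - my);
            vx += (xs[i] - mx) * (xs[i] - mx);
            vy += (ys[i] - my) * (ys[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] y = {2.0, 4.0, 6.0, 8.0};  // perfectly linear in x
        System.out.println(corr(x, y));     // 1.0
    }
}
```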
