diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml index b8c6e50e53fb..bc8509374c26 100644 --- a/docs/_data/menu-sql.yaml +++ b/docs/_data/menu-sql.yaml @@ -242,15 +242,13 @@ - text: Functions url: sql-ref-functions.html subitems: - - text: Build-in Functions + - text: Built-in Functions url: sql-ref-functions-builtin.html subitems: - - text: Build-in Aggregate Functions + - text: Aggregate Functions url: sql-ref-functions-builtin-aggregate.html - - text: Build-in Array Functions - url: sql-ref-functions-builtin-array.html - - text: Build-in Date Time Functions - url: sql-ref-functions-builtin-date-time.html + - text: Window Functions + url: sql-ref-functions-builtin-window.html - text: UDFs (User-Defined Functions) url: sql-ref-functions-udf.html subitems: diff --git a/docs/sql-ref-functions-builtin-array.md b/docs/sql-ref-functions-builtin-array.md deleted file mode 100644 index 599d10c23b61..000000000000 --- a/docs/sql-ref-functions-builtin-array.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -layout: global -title: Built-in Array Functions -displayTitle: Built-in Array Functions -license: | - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. 
---- - -Array Functions diff --git a/docs/sql-ref-functions-builtin-date-time.md b/docs/sql-ref-functions-builtin-date-time.md deleted file mode 100644 index 10421324a344..000000000000 --- a/docs/sql-ref-functions-builtin-date-time.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -layout: global -title: Built-in Date and Time Functions -displayTitle: Built-in Date and Time Functions -license: | - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---- - -Date-Time Functions diff --git a/docs/sql-ref-functions-builtin-scalar.md b/docs/sql-ref-functions-builtin-scalar.md deleted file mode 100644 index 1d818a25c4ac..000000000000 --- a/docs/sql-ref-functions-builtin-scalar.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -layout: global -title: Builtin Scalar Functions -displayTitle: Builtin Scalar Functions -license: | - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. 
You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---- - -**This page is under construction** diff --git a/docs/sql-ref-functions-builtin-window.md b/docs/sql-ref-functions-builtin-window.md new file mode 100644 index 000000000000..68a6557dfb7b --- /dev/null +++ b/docs/sql-ref-functions-builtin-window.md @@ -0,0 +1,161 @@ +--- +layout: global +title: Window Functions +displayTitle: Window Functions +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +Like aggregate functions, window functions operate on a group of rows. Unlike aggregate functions, however, window functions do not collapse the group into a single row: they calculate a return value for each row in the group. Window functions are useful for tasks such as calculating a moving average, computing a cumulative statistic, or accessing the value of rows at a given position relative to the current row. 
Spark SQL supports three types of window functions: + * Ranking Functions + * Analytic Functions + * Aggregate Functions + +### How to Use Window Functions + + * Mark a function as a window function by using `over`. + - SQL: Add an OVER clause after the window function, e.g. avg ( ... ) OVER ( ... ); + - DataFrame API: Call the window function's `over` method, e.g. rank ( ).over ( ... ) + * Define the window specification associated with this function. A window specification consists of a partitioning specification, an ordering specification, and a frame specification. + - Partitioning Specification: + - SQL: PARTITION BY + - DataFrame API: Window.partitionBy ( ... ) + - Ordering Specification: + - SQL: ORDER BY + - DataFrame API: Window.orderBy ( ... ) + - Frame Specification: + - SQL: ROWS ( for ROW frame ), RANGE ( for RANGE frame ) + - DataFrame API: WindowSpec.rowsBetween ( for ROW frame ), WindowSpec.rangeBetween ( for RANGE frame ) + +### Examples + +{% highlight scala %} + + import spark.implicits._ + + val data = Seq(("Lisa", "Sales", 10000), + ("Evan", "Sales", 32000), + ("Fred", "Engineering", 21000), + ("Helen", "Marketing", 29000), + ("Alex", "Sales", 30000), + ("Tom", "Engineering", 23000), + ("Jane", "Marketing", 29000), + ("Jeff", "Marketing", 35000), + ("Paul", "Engineering", 29000), + ("Chloe", "Engineering", 23000) + ) + val df = data.toDF("name", "dept", "salary") + df.show() + +-----+-----------+------+ + | name| dept|salary| + +-----+-----------+------+ + | Lisa| Sales| 10000| + | Evan| Sales| 32000| + | Fred|Engineering| 21000| + |Helen| Marketing| 29000| + | Alex| Sales| 30000| + | Tom|Engineering| 23000| + | Jane| Marketing| 29000| + | Jeff| Marketing| 35000| + | Paul|Engineering| 29000| + |Chloe|Engineering| 23000| + +-----+-----------+------+ + + val windowSpec = Window.partitionBy("dept").orderBy("salary") + windowSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) + + // Using Ranking Functions + df.withColumn("rank", 
rank().over(windowSpec)).show() + +-----+-----------+------+----+ + | name| dept|salary|rank| + +-----+-----------+------+----+ + |Helen| Marketing| 29000| 1| + | Jane| Marketing| 29000| 1| + | Jeff| Marketing| 35000| 3| + | Fred|Engineering| 21000| 1| + | Tom|Engineering| 23000| 2| + |Chloe|Engineering| 23000| 2| + | Paul|Engineering| 29000| 4| + | Lisa| Sales| 10000| 1| + | Alex| Sales| 30000| 2| + | Evan| Sales| 32000| 3| + +-----+-----------+------+----+ + + df.withColumn("dense_rank", dense_rank().over(windowSpec)).show() + +-----+-----------+------+----------+ + | name| dept|salary|dense_rank| + +-----+-----------+------+----------+ + |Helen| Marketing| 29000| 1| + | Jane| Marketing| 29000| 1| + | Jeff| Marketing| 35000| 2| + | Fred|Engineering| 21000| 1| + | Tom|Engineering| 23000| 2| + |Chloe|Engineering| 23000| 2| + | Paul|Engineering| 29000| 3| + | Lisa| Sales| 10000| 1| + | Alex| Sales| 30000| 2| + | Evan| Sales| 32000| 3| + +-----+-----------+------+----------+ + + // Using Analytic Functions + df.withColumn("cume_dist", cume_dist().over(windowSpec)).show() + +-----+-----------+------+------------------+ + | name| dept|salary| cume_dist| + +-----+-----------+------+------------------+ + |Helen| Marketing| 29000|0.6666666666666666| + | Jane| Marketing| 29000|0.6666666666666666| + | Jeff| Marketing| 35000| 1.0| + | Fred|Engineering| 21000| 0.25| + | Tom|Engineering| 23000| 0.75| + |Chloe|Engineering| 23000| 0.75| + | Paul|Engineering| 29000| 1.0| + | Lisa| Sales| 10000|0.3333333333333333| + | Alex| Sales| 30000|0.6666666666666666| + | Evan| Sales| 32000| 1.0| + +-----+-----------+------+------------------+ + + df.withColumn("lag", lag("salary", 2).over(windowSpec)).show() + +-----+-----------+------+-----+ + |Helen| Marketing| 29000| null| + | Jane| Marketing| 29000| null| + | Jeff| Marketing| 35000|29000| + | Fred|Engineering| 21000| null| + | Tom|Engineering| 23000| null| + |Chloe|Engineering| 23000|21000| + | Paul|Engineering| 29000|23000| + | Lisa| 
Sales| 10000| null| + | Alex| Sales| 30000| null| + | Evan| Sales| 32000|10000| + +-----+-----------+------+-----+ + + // Using Aggregate Functions + df.withColumn("min", min(col("salary")).over(windowSpec)).show() + +-----+-----------+------+-----+ + | name| dept|salary| min| + +-----+-----------+------+-----+ + |Helen| Marketing| 29000|29000| + | Jane| Marketing| 29000|29000| + | Jeff| Marketing| 35000|29000| + | Fred|Engineering| 21000|21000| + | Tom|Engineering| 23000|21000| + |Chloe|Engineering| 23000|21000| + | Paul|Engineering| 29000|21000| + | Lisa| Sales| 10000|10000| + | Alex| Sales| 30000|10000| + | Evan| Sales| 32000|10000| + +-----+-----------+------+-----+ + +{% endhighlight %} diff --git a/docs/sql-ref-functions-builtin.md b/docs/sql-ref-functions-builtin.md index 917c081adb7a..149f5de2c5b2 100644 --- a/docs/sql-ref-functions-builtin.md +++ b/docs/sql-ref-functions-builtin.md @@ -19,8 +19,6 @@ license: | limitations under the License. --- -Spark SQL defines built-in functions to use, a complete list of which can be found [here](api/sql/). Among them, Spark SQL has several special categories of built-in functions: [Aggregate Functions](sql-ref-functions-builtin-aggregate.html) to operate on a group of rows, [Array Functions](sql-ref-functions-builtin-array.html) to operate on Array columns, and [Date and Time Functions](sql-ref-functions-builtin-date-time.html) to operate on Date and Time. - - * [Aggregate Functions](sql-ref-functions-builtin-aggregate.html) - * [Array Functions](sql-ref-functions-builtin-array.html) - * [Date and Time Functions](sql-ref-functions-builtin-date-time.html) +Spark SQL defines built-in functions to use, a complete list of which can be found [here](api/sql/). 
Among them, Spark SQL has several special categories of built-in functions: [Aggregate Functions](sql-ref-functions-builtin-aggregate.html) operate on a group of rows and return a single value, while [Window Functions](sql-ref-functions-builtin-window.html) operate on a group of rows but return a value for each row in the group. + * [Aggregate Functions](sql-ref-functions-builtin-aggregate.html) + * [Window Functions](sql-ref-functions-builtin-window.html)
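
Review note: the new window functions page shows `rank`, `dense_rank`, and `cume_dist` output without spelling out how the three differ on ties. The following is a minimal plain-Scala sketch of those semantics, using only the standard library and the sample rows from the page. It is not how Spark implements these functions; `Employee`, `WindowSketch`, and the helper names are hypothetical and exist only for illustration. Each helper takes one partition (the rows of a single `dept`) ordered by `salary`, which mirrors `Window.partitionBy("dept").orderBy("salary")`.

```scala
// Plain-Scala sketch (no Spark dependency) of per-partition window semantics.
// All names here are hypothetical helpers, not part of any Spark API.
final case class Employee(name: String, dept: String, salary: Int)

object WindowSketch {
  // Same sample rows as the example on the window functions page.
  val employees: Seq[Employee] = Seq(
    Employee("Lisa", "Sales", 10000),
    Employee("Evan", "Sales", 32000),
    Employee("Fred", "Engineering", 21000),
    Employee("Helen", "Marketing", 29000),
    Employee("Alex", "Sales", 30000),
    Employee("Tom", "Engineering", 23000),
    Employee("Jane", "Marketing", 29000),
    Employee("Jeff", "Marketing", 35000),
    Employee("Paul", "Engineering", 29000),
    Employee("Chloe", "Engineering", 23000)
  )

  // rank: 1 + number of partition rows with a strictly smaller salary,
  // so tied rows share a rank and the next rank skips ahead.
  def rank(partition: Seq[Employee], e: Employee): Int =
    1 + partition.count(_.salary < e.salary)

  // dense_rank: 1 + number of distinct smaller salaries, so no gaps after ties.
  def denseRank(partition: Seq[Employee], e: Employee): Int =
    1 + partition.map(_.salary).filter(_ < e.salary).distinct.size

  // cume_dist: fraction of partition rows whose salary is <= the current one.
  def cumeDist(partition: Seq[Employee], e: Employee): Double =
    partition.count(_.salary <= e.salary).toDouble / partition.size

  def main(args: Array[String]): Unit = {
    // Group rows into partitions by dept, then emit one result per input row.
    val partitions = employees.groupBy(_.dept)
    for {
      (dept, rows) <- partitions.toSeq.sortBy(_._1)
      e <- rows.sortBy(_.salary)
    } println(
      s"${e.name}\t$dept\t${e.salary}\t" +
        s"rank=${rank(rows, e)}\tdense_rank=${denseRank(rows, e)}\tcume_dist=${cumeDist(rows, e)}"
    )
  }
}
```

For the Engineering partition, Paul (29000) gets `rank` 4 but `dense_rank` 3, because Tom and Chloe tie at 23000. This agrees with the tables in the added page. Every input row yields an output row, which is the distinction from aggregate functions that the revised overview sentence draws.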