[SPARK-39876][SQL] Add UNPIVOT to SQL syntax #37407
Conversation
Can one of the admins verify this patch?
cc @maryannxue FYI
@cloud-fan @MaxGekk @HyukjinKwon @gengliangwang @zhengruifeng what do you think?
matchPVals = true is not needed if you are sure that regexps are not needed here. Just in case, could you explain why you replaced the regexps?
good spot, this is not needed anymore, fixed
Column names are slightly different:
scala> df.unpivot(Array($"id" * 2), "var", "val").show()
+--------+----+---+
|(id * 2)| var|val|
+--------+----+---+
| 2| int| 11|
| 2|long| 12|
| 4| int| 21|
| 4|long| 22|
+--------+----+---+
fixed
Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantics.
I have added the syntax and examples from |
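For reference, a rough sketch of the two syntax forms discussed in this thread (a non-authoritative summary pieced together from the review comments and the unpivot.sql tests below, not the exact spec from the PR description):

-- single value column
table_reference UNPIVOT (
    value_column FOR name_column IN (unpivot_column [[AS] alias] [, ...])
) [[AS] alias]

-- multiple value columns
table_reference UNPIVOT (
    (value_column [, ...]) FOR name_column IN ((unpivot_column [, ...]) [[AS] alias] [, ...])
) [[AS] alias]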
this seems incorrect, should be IN ((unpivot_column [, ...]) [[AS] alias] [, ...])
BTW, I think the alias is required here? Otherwise I have no idea what the name of things like (q1, q2) should be.
You are right, fixed.
do we allow IN ((col1 AS a, col2 AS b) AS c)?
Yes, unpivotColumn itself is defined as expression (AS? identifier)?, so each individual column can have an alias, and so can the entire set. However, with an alias on the set, the aliases of the individual columns are hidden. But the syntax is valid.
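For instance, borrowing the courseEarningsAndSales table from the SQL tests further down (the inner aliases e and s are made up here): both levels of alias parse, but as noted above the set alias hides the per-column ones:

SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012 AS e, sales2012 AS s) AS `2012`, (earnings2013, sales2013) AS `2013`)
)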
is it also valid in other systems?
It looks like other systems only allow identifiers available in the FROM clause, not expressions as our unpivot supports. In those systems you would have to move the expressions into a subquery in FROM and reference the named expression.
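For example (a hedged sketch with a hypothetical table t and columns col1, col2): in Spark the expression can sit directly in the IN list, while other systems need it pre-computed in the FROM clause:

-- Spark: expression allowed in the IN list
SELECT * FROM t UNPIVOT (val FOR var IN (col1, col2 * 2 AS doubled))

-- other systems: name the expression in a subquery first
SELECT * FROM (SELECT col1, col2 * 2 AS doubled FROM t)
UNPIVOT (val FOR var IN (col1, doubled))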
If this is Spark-specific syntax, shall we keep it simple and not allow a per-expression alias here?
You mean unpivotColumn should be namedExpression instead of expression (AS? identifier)??
Just expression? People can still write col AS alias, but we just treat it as a normal expression.
expression does not allow for an alias, namedExpression does:
namedExpression
    : expression (AS? (name=errorCapturingIdentifier | identifierList))?
    ;
I have changed that in 004bb692.
I have to come back to this:
It looks like Oracle and BigQuery allow for these aliases:
SELECT * FROM table UNPIVOT (
  val for var in (col1 AS alias1, col2 AS alias2)
)
and
SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1, col2) AS alias1, (col3, col4) AS alias2)
)
but still not
SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1 AS aliasA, col2 AS aliasB) AS alias1, (col3 AS aliasC, col4 AS aliasD) AS alias2)
)
(which was your original question).
https://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF55133
https://blogs.oracle.com/sql/post/how-to-convert-rows-to-columns-and-back-again-with-sql-aka-pivot-and-unpivot
https://hevodata.com/learn/bigquery-columns-to-rows/#u1
I'll add the former alias again.
Added in 11cce9ef33.
docs/sql-ref-syntax-qry-select.md
@cloud-fan this was incorrect I think
.mapValues(exprs => exprs.map(expr => expr.toString.replaceAll("#\\d+", "")).sorted)
.mapValues(exprs => if (exprs.length > 3) exprs.take(3) :+ "..." else exprs)
.toList.sortBy(_._1)
.map { case (className, exprs) => s"$className (${exprs.mkString(", ")})" }
We should not expose too much internal information in the user-facing error message. Can we just put expressions.filterNot(_.isInstanceOf[Attribute]).map(_.sql).mkString(", ")?
Done in 2c8f53d, though .map(toSQLExpr) gives better results
SELECT up.* FROM courseEarnings
UNPIVOT (
    earningsYear FOR year IN (`2012`, `2013`, `2014`)
) AS up
nit: ideally the SQL test should focus on the end-user behavior, instead of tiny details like the optional AS keyword. We can have a UnpivotParserSuite to focus on these details. See DDLParserSuite as an example.
Those parser suites parse the SQL statement and assert the logical plan, but the plan is not fully analyzed. Some of the situations tested in unpivot.sql require full analysis, so they cannot be covered by UnpivotParserSuite.
I could add those to DatasetUnpivotSuite and assert the result of spark.sql("...").
I'll sketch that out so we can see what this looks like.
Parser tests should check the unresolved plan (the raw parsed plan). Feel free to add the parser test suite in a follow-up PR.
I have cleaned up unpivot.sql: 84a02b6
Will add the removed tests to UnpivotParserSuite and DatasetUnpivotSuite in a follow-up PR.
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    values FOR year IN ()
ditto for this one, we can create a UnpivotParserSuite to test this.
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012, sales2012) as `2012`, (earnings2013, sales2013) as `2013`, (earnings2014, sales2014) as `2014`)
ditto, we can test the optional AS keyword in UnpivotParserSuite
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012 as earnings, sales2012 as sales) as `2012`, (earnings2013 as earnings, sales2013 as sales) as `2013`, (earnings2014 as earnings, sales2014 as sales) as `2014`)
ditto
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    () FOR year IN ((earnings2012, sales2012), (earnings2013, sales2013), (earnings2014, sales2014))
ditto
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ()
ditto
},
"UNPIVOT_REQUIRES_ATTRIBUTES" : {
  "message" : [
    "UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]"
| "UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]" | |
| "UNPIVOT requires given {given} expressions to be columns when no {empty} expressions are given, but got: [<expressions>]" |
then the caller side can just pass "id" and "value"
Fixed in 2c8f53d, though but got implies <expressions> is an exhaustive list, while it contains only the non-attributes.
I have rephrased that to
... expressions are given. These are not columns: [<expressions>].
},
"UNPIVOT_VALUE_SIZE_MISMATCH" : {
  "message" : [
    "All unpivot value columns must have the same size as there are value column names (<names>): [<sizes>]"
We probably don't need to put <sizes>, as it's very clear from the SQL statement.
fixed in e6b1bcf
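For context, a hedged example of the mismatch (reusing the courseEarningsAndSales columns from the tests above): two value columns are named, but one tuple in the IN list has only one column, and the offending sizes are indeed visible from the statement itself:

SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012, sales2012) AS `2012`, (earnings2013) AS `2013`)
)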
up.values.isEmpty || !up.values.forall(_.resolved) || !up.valuesTypeCoercioned => up
// once children are resolved, we can determine values from ids and vice versa
// if only either is given, and only AttributeReference are given
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
Suggested change:
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
fixed
val idAttrs = AttributeSet(up.ids.get)
val values = up.child.output.filterNot(idAttrs.contains)
up.copy(values = Some(values.map(Seq(_))))
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
Suggested change:
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
fixed
def unpivotRequiresAttributes(given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
Suggested change:
def unpivotRequiresAttributes(given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
def unpivotRequiresAttributes(
    given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
fixed in 2c8f53d
cloud-fan left a comment:
looks good except for a few comments, thanks for your great work!
All green: https://github.com/G-Research/spark/actions/runs/3196590413/jobs/5220447028 Status update may have failed.
Thanks for the excellent code review and guidance!
thanks, merging to master!
What changes were proposed in this pull request?
This adds the UNPIVOT clause to SQL syntax. It follows the same syntax as BigQuery, T-SQL, and Oracle.
For example:
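A minimal illustrative example, borrowing the courseEarnings query from the unpivot.sql tests discussed above (assuming a courseEarnings table with one earnings column per year):

SELECT up.* FROM courseEarnings
UNPIVOT (
    earningsYear FOR year IN (`2012`, `2013`, `2014`)
) AS up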
Why are the changes needed?
To support Dataset.unpivot in SQL queries.
Does this PR introduce any user-facing change?
Yes, adds UNPIVOT to SQL syntax.
How was this patch tested?
Added end-to-end tests to SQLQueryTestSuite.