Skip to content

Conversation

@EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Aug 4, 2022

What changes were proposed in this pull request?

This adds UNPIVOT clause to SQL syntax. It follows the same syntax as BigQuery, T-SQL, Oracle:

FROM ... [ unpivot_clause ]

unpivot_clause:
    UNPIVOT [ { INCLUDE | EXCLUDE } NULLS ] (
        { single_value_column_unpivot | multi_value_column_unpivot }
    ) [[AS] alias]

single_value_column_unpivot:
    values_column
    FOR name_column
    IN (unpivot_column [[AS] alias] [, ...])

multi_value_column_unpivot:
    (values_column [, ...])
    FOR name_column
    IN ((unpivot_column [, ...]) [[AS] alias] [, ...])

unpivotColumn:
    multipartIdentifier

For example:

CREATE TABLE sales_quarterly (year INT, q1 INT, q2 INT, q3 INT, q4 INT);
INSERT INTO sales_quarterly VALUES
    (2020, null, 1000, 2000, 2500),
    (2021, 2250, 3200, 4200, 5900),
    (2022, 4200, 3100, null, null);

SELECT * FROM sales_quarterly
    UNPIVOT (
        sales FOR quarter IN (q1, q2, q3, q4)
    );

SELECT up.* FROM sales_quarterly
    UNPIVOT INCLUDE NULLS (
        sales FOR quarter IN (q1 AS Q1, q2 AS Q2, q3 AS Q3, q4 AS Q4)
    ) AS up;

SELECT * FROM sales_quarterly
    UNPIVOT EXCLUDE NULLS (
        (first_quarter, second_quarter)
        FOR half_of_the_year IN (
            (q1, q2) AS H1,
            (q3, q4) AS H2
        )
    );

Why are the changes needed?

To support Dataset.unpivot in SQL queries.

Does this PR introduce any user-facing change?

Yes, adds UNPIVOT to SQL syntax.

How was this patch tested?

Added end-to-end tests to SQLQueryTestSuite.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

cc @maryannxue FYI

@github-actions github-actions bot added the PYTHON label Aug 5, 2022
@EnricoMi EnricoMi force-pushed the branch-sql-unpivot branch from 7bb7640 to 6481810 Compare August 5, 2022 13:56
@EnricoMi EnricoMi marked this pull request as ready for review August 5, 2022 18:51
@EnricoMi EnricoMi changed the title [SPARK-39876][SQL][WIP] Add UNPIVOT to SQL syntax [SPARK-39876][SQL] Add UNPIVOT to SQL syntax Aug 5, 2022
@EnricoMi
Copy link
Contributor Author

@cloud-fan @MaxGekk @HyukjinKwon @gengliangwang @zhengruifeng what do you think?

Comment on lines 330 to 333
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matchPVals = true is not needed if you are sure that regexp is not needed, here. Just in case, could you explain why did you replace the regexps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good spot, this is not needed anymore, fixed

Comment on lines 2105 to 2111
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column names are different slightly:

scala> df.unpivot(Array($"id" * 2), "var", "val").show()
+--------+----+---+
|(id * 2)| var|val|
+--------+----+---+
|       2| int| 11|
|       2|long| 12|
|       4| int| 21|
|       4|long| 22|
+--------+----+---+

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@cloud-fan
Copy link
Contributor

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

@EnricoMi
Copy link
Contributor Author

Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantic.

I have added the syntax and examples from docs/sql-ref-syntax-qry-select-unpivot.md to the PR description.

@EnricoMi EnricoMi force-pushed the branch-sql-unpivot branch from 764832a to 196f3e2 Compare August 30, 2022 16:20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems incorrect, should be IN ((unpivot_column [, ...]) [[AS] alias] [, ...])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think alias is required here? otherwise I have no idea what should be the name of things like (q1, q2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we allow IN ((col1 AS a, col2 AS b) AS c)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unpivotColumn itself is defined as expression (AS? identifier)?, so each individual column can have an alias, and the entire set can. However, with an alias for the set, the alias of the column is hidden. But the syntax is valid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it also valid in other systems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like other systems only allow for identifiers available in the FROM, not expressions as our unpivot supports. In those systems you would have to move the expressions into a subquery in FROM and reference the named expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a spark-specific syntax, shall we make it simple and don't allow per-expression alias here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saying unpivotColumn should be namedExpression instead of expression (AS? identifier)??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just expression? people can still use col AS alias but we just treat it as a normal expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expression does not allow for an alias, namedExpression does:

namedExpression
    : expression (AS? (name=errorCapturingIdentifier | identifierList))?
    ;

I have changed that in 004bb692.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have have to come back to this:

It looks like Oracle and BigQuery allow for these aliases:

SELECT * FROM table UNPIVOT (
  val for var in (col1 AS alias1, col2 AS alias2)
)

and

SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1, col2) AS alias1, (col3, col4) AS alias2)
)

but still not

SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1 AS aliasA, col2 AS aliasB) AS alias1, (col3 AS aliasC, col4 AS aliasD) AS alias2)
)

(which was your original question).

https://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF55133
https://blogs.oracle.com/sql/post/how-to-convert-rows-to-columns-and-back-again-with-sql-aka-pivot-and-unpivot
https://hevodata.com/learn/bigquery-columns-to-rows/#u1

I'll add the former alias again.

Copy link
Contributor Author

@EnricoMi EnricoMi Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 11cce9ef33.

Comment on lines -78 to +79
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan this was incorrect I think

@EnricoMi EnricoMi force-pushed the branch-sql-unpivot branch from 70f986a to 340ffee Compare October 5, 2022 09:28
.mapValues(exprs => exprs.map(expr => expr.toString.replaceAll("#\\d+", "")).sorted)
.mapValues(exprs => if (exprs.length > 3) exprs.take(3) :+ "..." else exprs)
.toList.sortBy(_._1)
.map { case (className, exprs) => s"$className (${exprs.mkString(", ")})" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not expose too many internal information in the user-facing error message. Can we just put expressions.filterNot(_.isInstanceOf[Attribute]).map(_.sql).mkString(", ")?

Copy link
Contributor Author

@EnricoMi EnricoMi Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 2c8f53d, though .map(toSQLExpr) gives better results

SELECT up.* FROM courseEarnings
UNPIVOT (
earningsYear FOR year IN (`2012`, `2013`, `2014`)
) AS up
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ideally the SQL test should focus on the end-user behavior, instead of tiny details like the optional AS keyword. We can have a UnpivotParserSuite to focus on these details. See DDLParserSuite as an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those parser suites parse the SQL statement and assert the logical plan, but the plan is not fully analyzed. Some of the situations tested in unpivot.sql require full analysis, so they cannot be covered by UnpivotParserSuite.

I could add those to DatasetUnpivotSuite and assert the result of spark.sql("...").

I'll sketch that out so we can see how this looks like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parser tests should check the unresolved plan (the raw parsed plan). feel free to add the parser test suite in a followup PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have cleaned up unpivot.sql: 84a02b6

Will add the removed tests to UnpivotParserSuite and DatasetUnpivotSuite in a follow-up PR.

-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
values FOR year IN ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for this one, we can create a UnpivotParserSuite to test this.

-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
(earnings, sales) FOR year IN ((earnings2012, sales2012) as `2012`, (earnings2013, sales2013) as `2013`, (earnings2014, sales2014) as `2014`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, we can test optional AS keyword in UnpivotParserSuite

-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
(earnings, sales) FOR year IN ((earnings2012 as earnings, sales2012 as sales) as `2012`, (earnings2013 as earnings, sales2013 as sales) as `2013`, (earnings2014 as earnings, sales2014 as sales) as `2014`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
() FOR year IN ((earnings2012, sales2012), (earnings2013, sales2013), (earnings2014, sales2014))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
(earnings, sales) FOR year IN ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

},
"UNPIVOT_REQUIRES_ATTRIBUTES" : {
"message" : [
"UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]"
Copy link
Contributor

@cloud-fan cloud-fan Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]"
"UNPIVOT requires given {given} expressions to be columns when no {empty} expressions are given, but got: [<expressions>]"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the caller side can just pass "id" and "value"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2c8f53d, though but got implies expressions is an exhaustive list, but it is only the non-attributes.

I have rephrased that to

... expressions are given. These are not columns: [<expressions>].

},
"UNPIVOT_VALUE_SIZE_MISMATCH" : {
"message" : [
"All unpivot value columns must have the same size as there are value column names (<names>): [<sizes>]"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need to put <sizes>, as it's very clear from the SQL statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in e6b1bcf

up.values.isEmpty || !up.values.forall(_.resolved) || !up.valuesTypeCoercioned => up
// once children are resolved, we can determine values from ids and vice versa
// if only either is given, and only AttributeReference are given
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

val idAttrs = AttributeSet(up.ids.get)
val values = up.child.output.filterNot(idAttrs.contains)
up.copy(values = Some(values.map(Seq(_))))
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 104 to 106
def unpivotRequiresAttributes(given: String,
empty: String,
expressions: Seq[NamedExpression]): Throwable = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def unpivotRequiresAttributes(given: String,
empty: String,
expressions: Seq[NamedExpression]): Throwable = {
def unpivotRequiresAttributes(
given: String,
empty: String,
expressions: Seq[NamedExpression]): Throwable = {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 2c8f53d

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good except for a few comments, thanks for your great work!

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Oct 6, 2022

@EnricoMi
Copy link
Contributor Author

EnricoMi commented Oct 6, 2022

Thanks for the excellent code review and guidance!

@cloud-fan
Copy link
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants