[SPARK-39876][SQL] Add UNPIVOT to SQL syntax #37407
Conversation
Can one of the admins verify this patch?
cc @maryannxue FYI
@cloud-fan @MaxGekk @HyukjinKwon @gengliangwang @zhengruifeng what do you think?
matchPVals = true is not needed if you are sure that regexps are not needed here. Just in case, could you explain why you replaced the regexps?
good spot, this is not needed anymore, fixed
Column names are slightly different:
scala> df.unpivot(Array($"id" * 2), "var", "val").show()
+--------+----+---+
|(id * 2)| var|val|
+--------+----+---+
| 2| int| 11|
| 2|long| 12|
| 4| int| 21|
| 4|long| 22|
+--------+----+---+
fixed
Can we write down the SQL spec for this syntax in the PR description? To make it easier for people to review the syntax and understand the semantics.
I have added the syntax and examples from |
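For reference, a rough sketch of the two syntax forms discussed in this thread (a non-authoritative summary pieced together from the review comments and the unpivot.sql tests below, not the exact spec from the PR description):

-- single value column
table_reference UNPIVOT (
    value_column FOR name_column IN (unpivot_column [[AS] alias] [, ...])
) [[AS] alias]

-- multiple value columns
table_reference UNPIVOT (
    (value_column [, ...]) FOR name_column IN ((unpivot_column [, ...]) [[AS] alias] [, ...])
) [[AS] alias]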
this seems incorrect, should be IN ((unpivot_column [, ...]) [[AS] alias] [, ...])
BTW, I think the alias is required here? Otherwise I have no idea what the name of things like (q1, q2) should be.
You are right, fixed.
do we allow IN ((col1 AS a, col2 AS b) AS c)?
Yes, unpivotColumn itself is defined as expression (AS? identifier)?, so each individual column can have an alias, and so can the entire set. However, with an alias on the set, the aliases of the individual columns are hidden. But the syntax is valid.
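For instance, borrowing the courseEarningsAndSales table from the SQL tests further down (the inner aliases e and s are made up here): both levels of alias parse, but as noted above the set alias hides the per-column ones:

SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012 AS e, sales2012 AS s) AS `2012`, (earnings2013, sales2013) AS `2013`)
)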
is it also valid in other systems?
It looks like other systems only allow identifiers available in the FROM clause, not expressions as our unpivot supports. In those systems you would have to move the expressions into a subquery in FROM and reference the named expression.
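For example (a hedged sketch with a hypothetical table t and columns col1, col2): in Spark the expression can sit directly in the IN list, while other systems need it pre-computed in the FROM clause:

-- Spark: expression allowed in the IN list
SELECT * FROM t UNPIVOT (val FOR var IN (col1, col2 * 2 AS doubled))

-- other systems: name the expression in a subquery first
SELECT * FROM (SELECT col1, col2 * 2 AS doubled FROM t)
UNPIVOT (val FOR var IN (col1, doubled))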
If this is Spark-specific syntax, shall we keep it simple and not allow a per-expression alias here?
You mean unpivotColumn should be namedExpression instead of expression (AS? identifier)??
Just expression? People can still write col AS alias, but we just treat it as a normal expression.
expression does not allow for an alias, namedExpression does:
namedExpression
    : expression (AS? (name=errorCapturingIdentifier | identifierList))?
    ;
I have changed that in 004bb692.
I have to come back to this:
It looks like Oracle and BigQuery allow for these aliases:
SELECT * FROM table UNPIVOT (
  val for var in (col1 AS alias1, col2 AS alias2)
)
and
SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1, col2) AS alias1, (col3, col4) AS alias2)
)
but still not
SELECT * FROM table UNPIVOT (
  (val1, val2) for var in ((col1 AS aliasA, col2 AS aliasB) AS alias1, (col3 AS aliasC, col4 AS aliasD) AS alias2)
)
(which was your original question).
https://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_10002.htm#SQLRF55133
https://blogs.oracle.com/sql/post/how-to-convert-rows-to-columns-and-back-again-with-sql-aka-pivot-and-unpivot
https://hevodata.com/learn/bigquery-columns-to-rows/#u1
I'll add the former alias again.
Added in 11cce9ef33.
docs/sql-ref-syntax-qry-select.md
@cloud-fan this was incorrect I think
.mapValues(exprs => exprs.map(expr => expr.toString.replaceAll("#\\d+", "")).sorted)
.mapValues(exprs => if (exprs.length > 3) exprs.take(3) :+ "..." else exprs)
.toList.sortBy(_._1)
.map { case (className, exprs) => s"$className (${exprs.mkString(", ")})" }
We should not expose too much internal information in the user-facing error message. Can we just put expressions.filterNot(_.isInstanceOf[Attribute]).map(_.sql).mkString(", ")?
Done in 2c8f53d, though .map(toSQLExpr) gives better results
SELECT up.* FROM courseEarnings
UNPIVOT (
    earningsYear FOR year IN (`2012`, `2013`, `2014`)
) AS up
nit: ideally the SQL test should focus on the end-user behavior, instead of tiny details like the optional AS keyword. We can have a UnpivotParserSuite to focus on these details. See DDLParserSuite as an example.
Those parser suites parse the SQL statement and assert the logical plan, but the plan is not fully analyzed. Some of the situations tested in unpivot.sql require full analysis, so they cannot be covered by UnpivotParserSuite.
I could add those to DatasetUnpivotSuite and assert the result of spark.sql("...").
I'll sketch that out so we can see what this looks like.
Parser tests should check the unresolved plan (the raw parsed plan). Feel free to add the parser test suite in a follow-up PR.
I have cleaned up unpivot.sql: 84a02b6
Will add the removed tests to UnpivotParserSuite and DatasetUnpivotSuite in a follow-up PR.
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    values FOR year IN ()
ditto for this one, we can create a UnpivotParserSuite to test this.
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012, sales2012) as `2012`, (earnings2013, sales2013) as `2013`, (earnings2014, sales2014) as `2014`)
ditto, we can test the optional AS keyword in UnpivotParserSuite
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012 as earnings, sales2012 as sales) as `2012`, (earnings2013 as earnings, sales2013 as sales) as `2013`, (earnings2014 as earnings, sales2014 as sales) as `2014`)
ditto
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    () FOR year IN ((earnings2012, sales2012), (earnings2013, sales2013), (earnings2014, sales2014))
ditto
-- !query
SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ()
ditto
},
"UNPIVOT_REQUIRES_ATTRIBUTES" : {
  "message" : [
    "UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]"
| "UNPIVOT requires given {given} to be Attributes when no {empty} are given: [<types>]" | |
| "UNPIVOT requires given {given} expressions to be columns when no {empty} expressions are given, but got: [<expressions>]" |
then the caller side can just pass "id" and "value"
Fixed in 2c8f53d, though but got implies <expressions> is an exhaustive list, while it contains only the non-attributes.
I have rephrased that to
... expressions are given. These are not columns: [<expressions>].
},
"UNPIVOT_VALUE_SIZE_MISMATCH" : {
  "message" : [
    "All unpivot value columns must have the same size as there are value column names (<names>): [<sizes>]"
We probably don't need to put <sizes>, as it's very clear from the SQL statement.
fixed in e6b1bcf
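For context, a hedged example of the mismatch (reusing the courseEarningsAndSales columns from the tests above): two value columns are named, but one tuple in the IN list has only one column, and the offending sizes are indeed visible from the statement itself:

SELECT * FROM courseEarningsAndSales
UNPIVOT (
    (earnings, sales) FOR year IN ((earnings2012, sales2012) AS `2012`, (earnings2013) AS `2013`)
)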
up.values.isEmpty || !up.values.forall(_.resolved) || !up.valuesTypeCoercioned => up
// once children are resolved, we can determine values from ids and vice versa
// if only either is given, and only AttributeReference are given
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
Suggested change:
case up @Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(Some(ids), None, _, _, _, _) if up.childrenResolved &&
fixed
val idAttrs = AttributeSet(up.ids.get)
val values = up.child.output.filterNot(idAttrs.contains)
up.copy(values = Some(values.map(Seq(_))))
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
Suggested change:
case up @Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
case up @ Unpivot(None, Some(values), _, _, _, _) if up.childrenResolved &&
fixed
def unpivotRequiresAttributes(given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
Suggested change:
def unpivotRequiresAttributes(given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
def unpivotRequiresAttributes(
    given: String,
    empty: String,
    expressions: Seq[NamedExpression]): Throwable = {
fixed in 2c8f53d
cloud-fan left a comment:
looks good except for a few comments, thanks for your great work!
All green: https://github.com/G-Research/spark/actions/runs/3196590413/jobs/5220447028 Status update may have failed.
Thanks for the excellent code review and guidance!
thanks, merging to master!
What changes were proposed in this pull request?
This adds the UNPIVOT clause to SQL syntax. It follows the same syntax as BigQuery, T-SQL, and Oracle.
For example:
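A minimal illustrative example, borrowing the courseEarnings query from the unpivot.sql tests discussed above (assuming a courseEarnings table with one earnings column per year):

SELECT up.* FROM courseEarnings
UNPIVOT (
    earningsYear FOR year IN (`2012`, `2013`, `2014`)
) AS up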
Why are the changes needed?
To support Dataset.unpivot in SQL queries.
Does this PR introduce any user-facing change?
Yes, adds UNPIVOT to SQL syntax.
How was this patch tested?
Added end-to-end tests to SQLQueryTestSuite.