[SPARK-24035][SQL] SQL syntax for Pivot #21187

maryannxue · 2018-04-28T02:40:51Z

What changes were proposed in this pull request?

Add SQL support for Pivot according to Pivot grammar defined by Oracle (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_clause.htm) with some simplifications, based on our existing functionality and limitations for Pivot at the backend:

1. For pivot_for_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm), the column list form is not supported, which means the pivot column can only be a single column.
2. For pivot_in_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_in_clause.htm), the sub-query form and "ANY" are not supported (the latter is only supported by Oracle for XML anyway).
3. For pivot_in_clause, aliases for the constant values are not supported.

The code changes are:

1. Add parser support for Pivot. Note that according to https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#i2076542, Pivot cannot be used together with lateral views in the FROM clause. This restriction has been implemented in the parser rule.
2. Infer group-by expressions: group-by expressions are not explicitly specified in the SQL Pivot clause and need to be deduced based on this rule: https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#CHDFAFIE, so we have to fill them in afterwards, at the query analysis stage.
3. Override Pivot.resolved as "false": for the reason mentioned in [2], and because the output attributes change after Pivot is replaced by Project or Aggregate, we avoid resolving parent references until after Pivot has been resolved and replaced.
4. Verify aggregate expressions: only aggregate expressions, with or without aliases, can appear in the first part of the Pivot clause, and this check is performed at the analysis stage.
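
The group-by inference described above can be illustrated with a small, Spark-free sketch. It assumes the rule is "group by every input column that the PIVOT clause does not reference", which is how the linked Oracle documentation describes the implicit grouping. The object `PivotSketch`, the `Row` alias and the sample column names are hypothetical and exist only for this illustration; they are not part of the patch.

```scala
// Toy model of SQL PIVOT on in-memory rows. Not Spark code.
object PivotSketch {
  type Row = Map[String, Any]

  // Implicit group-by columns: every input column that is neither the
  // pivot column nor referenced by an aggregate expression.
  def implicitGroupBy(
      inputCols: Seq[String],
      pivotCol: String,
      aggCols: Set[String]): Seq[String] =
    inputCols.filterNot(c => c == pivotCol || aggCols.contains(c))

  // Roughly models:
  //   SELECT * FROM t PIVOT (sum(aggCol) FOR pivotCol IN (values...))
  def pivotSum(
      rows: Seq[Row],
      inputCols: Seq[String],
      pivotCol: String,
      aggCol: String,
      values: Seq[Any]): Seq[Row] = {
    val groupCols = implicitGroupBy(inputCols, pivotCol, Set(aggCol))
    rows.groupBy(r => groupCols.map(r)).toSeq.map { case (key, group) =>
      val keyPart: Row = groupCols.zip(key).toMap
      val pivoted: Row = values.map { v =>
        val matching = group.filter(_(pivotCol) == v)
        // SUM over no rows is NULL in SQL; modelled here as None.
        val sum: Any =
          if (matching.isEmpty) None
          else Some(matching.map(_(aggCol).asInstanceOf[Int]).sum)
        v.toString -> sum
      }.toMap
      keyPart ++ pivoted
    }
  }
}
```

With input columns `course`, `year`, `earnings`, pivot column `year` and aggregate `sum(earnings)`, the only column left for grouping is `course`, so the result has one row per course and one column per listed year.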

How was this patch tested?

A new test suite PivotSuite is added.

SparkQA · 2018-04-28T06:22:56Z

Test build #89950 has finished for PR 21187 at commit c486c6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-04-30T16:59:59Z

Since the original Pivot support was added by @aray , could @aray please take a look at this PR too?

rxin · 2018-04-30T19:28:24Z

sql/core/src/test/scala/org/apache/spark/sql/PivotSuite.scala

+ * limitations under the License.
+ */
+
+package org.apache.spark.sql


can we use the infra for SQLQueryTestSuite?

aray · 2018-05-01T00:05:38Z

LGTM thanks for doing this!

maryannxue · 2018-05-01T00:30:46Z

Thank you, @aray!
Thank you, @rxin, for the nice suggestion! Changes made accordingly in my latest commit.

Tagar · 2018-05-01T05:45:56Z

It would be great to make the FOR section optional.
E.g., make FOR year IN (2012, 2013) optional in one of your examples.
Currently pivot(), when called programmatically, doesn't require the list of values
to be defined up-front.
Otherwise it'll be a big inconvenience/limitation in PIVOT when called from SQL.

Notice that Oracle's PIVOT when used on XML doesn't require list of values either -
https://community.oracle.com/thread/2183084

maryannxue · 2018-05-01T06:27:37Z

Thank you, @Tagar, for your comment! I think by saying "making FOR section optional", you actually mean to support "IN ANY".
As you said and as I have pointed out in my PR description, Oracle's "IN ANY" is only supported on XML. Besides, this version of Pivot SQL support is targeted at enabling a front end that can leverage the existing runtime Pivot support, in which the values in the "IN" clause are only allowed to be literals.
That said, there are a few improvements we can make in the backend/runtime Pivot support. Some of them are easy, like "IN" value aliases; others are less so, like supporting "IN ANY". To allow unspecified values, we might need to run another query at an early stage of this query's compilation to get a list of all possible values, similar to what's been done in RelationalGroupedDataset#pivot(String).
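
The "run another query first" idea can be sketched without Spark as a two-phase computation: first collect the distinct values of the pivot column, then use them as the "IN" list. The object and method names below are hypothetical, and the sorting and the size cap are assumptions of this sketch rather than a description of what a SQL "IN ANY" implementation would have to do.

```scala
// Toy model of discovering pivot values before the pivot itself is planned.
object PivotValuesSketch {
  // Phase 1: a separate pass over the data that collects the distinct pivot
  // values. Sorting keeps the resulting column order deterministic.
  def distinctPivotValues(
      rows: Seq[Map[String, Any]],
      pivotCol: String,
      maxValues: Int): Seq[String] = {
    val values = rows.map(_(pivotCol).toString).distinct.sorted
    // Guard against a pivot column with too many distinct values, since each
    // value becomes an output column (the cap is an assumption of this sketch).
    require(values.size <= maxValues,
      s"pivot column '$pivotCol' has more than $maxValues distinct values")
    values
  }
}
```

The cost is the extra pass over the input, and the output schema is not known until that pass has finished, which is why the values become part of query compilation rather than execution.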

Tagar · 2018-05-01T15:54:13Z

@maryannxue, yep, "IN ANY" is what I meant. I missed it was already in the description - thanks for clarifying this.

gatorsmile · 2018-05-01T16:01:37Z

In-any-subquery in Pivot can be implemented like what we did in the other parts, but let us first leave it as the future item. @maryannxue Maybe you can create a JIRA.

The Oracle-compatible syntax in this PR is good enough. Thanks for your work! @maryannxue

gatorsmile · 2018-05-01T16:04:03Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

 FULL: 'FULL';
 NATURAL: 'NATURAL';
 ON: 'ON';
+PIVOT: 'PIVOT';


Could you add the keywords you added here to nonReserved (line 723)? Also update the suite TableIdentifierParserSuite?

Sure, I'll update TableIdentifierParserSuite. I believe I've added them to nonReserved already. Did I miss something?

gatorsmile · 2018-05-01T16:07:01Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

    aggregates: Seq[Expression],
-    child: LogicalPlan) extends UnaryNode {
+    child: LogicalPlan,
+    groupByExprsImplicit: Boolean = false) extends UnaryNode {


Could you add the parameter descriptions like what we did in GroupingSets (line 663-675)?

Could we just change groupByExprs: Seq[NamedExpression] -> groupByExprs: Option[Seq[NamedExpression]]? Then, we do not need this extra flag. In the pattern matching, we just need to do something like Some, or None.

Yes, I hesitated for a while... trying to figure out which way it would be clearer, using Option or an extra flag. I did not have a preference, so I'll do Option if you think it's better.
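
The trade-off between the extra flag and an Option can be sketched as follows. `PivotNode` is a hypothetical, simplified stand-in in which plain strings replace Catalyst expressions; only the field name `groupByExprsOpt` comes from this thread.

```scala
// Sketch: Option-typed group-by instead of a separate "implicit" flag.
object GroupBySketch {
  final case class PivotNode(
      groupByExprsOpt: Option[Seq[String]], // None = not given, infer later
      pivotColumn: String,
      aggregateRefs: Set[String],
      childOutput: Seq[String])

  def groupByColumns(p: PivotNode): Seq[String] = p.groupByExprsOpt match {
    // Explicit grouping (e.g. from the DataFrame API), possibly empty.
    case Some(explicit) => explicit
    // SQL PIVOT: group by every child column the clause does not reference.
    case None =>
      p.childOutput.filterNot(c => c == p.pivotColumn || p.aggregateRefs.contains(c))
  }
}
```

One point in favor of the Option is that `Some(Seq.empty)` (explicitly no grouping) and `None` (grouping still to be inferred) stay distinct, whereas a Boolean flag next to a sequence allows combinations such as a non-empty sequence with the flag set, whose meaning would have to be documented separately.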

gatorsmile · 2018-05-01T16:10:39Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

-    child: LogicalPlan) extends UnaryNode {
+    child: LogicalPlan,
+    groupByExprsImplicit: Boolean = false) extends UnaryNode {
+  override lazy val resolved = false // Pivot will be replaced after being resolved.


Could you add a negative test case to show which errors we could issue when Pivot can't be resolved? If the error does not make sense to the end users, we can improve the function checkAnalysis in the trait CheckAnalysis

In the ResolvePivot rule, the Pivot node is guaranteed to be resolved and replaced, either with a Project or an Aggregate, except when one of the following two errors occurs:

1. Child expressions or the child node of Pivot cannot be resolved, which is a general Analyzer error and not specific to Pivot. I'll add a test case for this anyway.

2. Pivot's aggregates are not really aggregate expressions (this is not guaranteed by the Parser). I have added this check in the ResolvePivot rule, and the last test in "pivot.sql" is dedicated to this check.

So here, marking Pivot's resolved field as false is to stop its parent from reference-resolving or star-expansion and wait till it has been replaced, otherwise the star-expansion will be incorrect.
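
The gating effect of `resolved = false` can be shown with a toy plan tree. Everything below is a hypothetical model: the node types only borrow names from the discussion, and the real Catalyst classes and rules are considerably more involved. The sketch shows only that a parent's star is left alone while a Pivot sits beneath it, and is expanded once the Pivot has been replaced.

```scala
// Toy plan nodes; not the Catalyst classes.
object ResolvedSketch {
  sealed trait Plan {
    def children: Seq[Plan]
    def output: Seq[String]
    // By default a node is resolved once all of its children are.
    def resolved: Boolean = children.forall(_.resolved)
  }
  final case class Relation(output: Seq[String]) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class Aggregate(output: Seq[String], child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  final case class Pivot(groupBy: Seq[String], pivotValues: Seq[String], child: Plan)
      extends Plan {
    val children: Seq[Plan] = Seq(child)
    def output: Seq[String] = groupBy ++ pivotValues
    // Never resolved: the node must be replaced before parents look at it.
    override def resolved: Boolean = false
  }
  // Models SELECT *; `expanded` is filled in by expandStar.
  final case class StarProject(child: Plan, expanded: Option[Seq[String]] = None)
      extends Plan {
    val children: Seq[Plan] = Seq(child)
    def output: Seq[String] = expanded.getOrElse(Nil)
    override def resolved: Boolean = expanded.isDefined && child.resolved
  }

  // Stand-in for the rule that rewrites Pivot into an Aggregate.
  def replacePivot(p: Plan): Plan = p match {
    case Pivot(g, vs, c)   => Aggregate(g ++ vs, c)
    case StarProject(c, e) => StarProject(replacePivot(c), e)
    case other             => other
  }

  // Stand-in for star expansion: only fires on a resolved child.
  def expandStar(p: Plan): Plan = p match {
    case StarProject(c, None) if c.resolved => StarProject(c, Some(c.output))
    case other                              => other
  }
}
```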

maryannxue · 2018-05-01T17:27:30Z

@gatorsmile "In-any-subquery in Pivot can be implemented like what we did in the other parts", can you make this clearer? The Pivot's "IN" values are special coz they will later become the columns of the relation.

SparkQA · 2018-05-02T01:37:33Z

Test build #90012 has finished for PR 21187 at commit 171c0c2.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-05-02T02:14:44Z

retest this please

SparkQA · 2018-05-02T05:56:36Z

Test build #90024 has finished for PR 21187 at commit 171c0c2.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

LGTM except one minor comment.

gatorsmile · 2018-05-02T21:16:29Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+  override lazy val resolved = false // Pivot will be replaced after being resolved.
+  override def output: Seq[Attribute] =
+    groupByExprsOpt.getOrElse(Seq.empty).map(_.toAttribute) ++ aggregates match {
    case agg :: Nil => pivotValues.map(value => AttributeReference(value.toString, agg.dataType)())


Nit: indent issues. Let us just clean the code at the same time.

override def output: Seq[Attribute] = {
  val pivotAgg = aggregates match {
    case agg :: Nil =>
      pivotValues.map(value => AttributeReference(value.toString, agg.dataType)())
    case _ =>
      pivotValues.flatMap { value =>
        aggregates.map(agg => AttributeReference(value + "_" + agg.sql, agg.dataType)())
      }
  }
  groupByExprsOpt.getOrElse(Seq.empty).map(_.toAttribute) ++ pivotAgg
}
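
The naming scheme in the suggested `output` can be tried out in isolation. The sketch below replaces Catalyst attributes with plain strings; `outputNames` and its parameters are hypothetical, and the aggregate strings stand in for whatever `agg.sql` would actually render.

```scala
// Column naming only: group-by columns first, then one column per pivot value
// (single aggregate) or per (pivot value, aggregate) pair (several aggregates).
object PivotOutputNames {
  def outputNames(
      groupBy: Seq[String],
      pivotValues: Seq[Any],
      aggregateSql: Seq[String]): Seq[String] = {
    val pivotAgg = aggregateSql match {
      case Seq(_) => pivotValues.map(_.toString)
      case _ => pivotValues.flatMap(v => aggregateSql.map(agg => s"${v}_$agg"))
    }
    groupBy ++ pivotAgg
  }
}
```

With one aggregate the pivot values themselves become the column names; with several, each value is suffixed with the aggregate's text so that the columns stay distinguishable.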

gatorsmile · 2018-05-02T22:46:30Z

"In-any-subquery in Pivot can be implemented like what we did in the other parts", can you make this clearer? The Pivot's "IN" values are special coz they will later become the columns of the relation.

Yes. The In subquery needs to be executed before/during query analysis stage. Thus, it is different.

maryannxue · 2018-05-02T23:57:29Z

Thank you, @gatorsmile, for the review and comments! I have opened SPARK-24162, SPARK-24163 and SPARK-24164 as follow-up improvements for this issue. Please feel free to assign them to me.

SparkQA · 2018-05-03T03:10:51Z

Test build #90084 has finished for PR 21187 at commit edd0eb1.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-03T06:59:57Z

Test build #90094 has finished for PR 21187 at commit c7eacf5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrayJoin(
case class Flatten(child: Expression) extends UnaryExpression
case class MonthsBetween(
case class CachedRDDBuilder(
case class InMemoryRelation(
case class WriteToContinuousDataSource(
case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

kiszk · 2018-05-03T16:11:32Z

retest this please

SparkQA · 2018-05-03T19:48:22Z

Test build #90144 has finished for PR 21187 at commit c7eacf5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArrayJoin(
case class Flatten(child: Expression) extends UnaryExpression
case class MonthsBetween(
case class CachedRDDBuilder(
case class InMemoryRelation(
case class WriteToContinuousDataSource(
case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

gatorsmile · 2018-05-04T00:04:39Z

Thanks for your fast and great work! Merged to master.

[SPARK-24035] SQL syntax for Pivot

c486c6b

rxin reviewed Apr 30, 2018

Replace PivotSuite.scala with pivot.sql

d5bd01b

gatorsmile reviewed May 1, 2018

Code refine

171c0c2

gatorsmile approved these changes May 2, 2018

code cleanup

edd0eb1

Merge remote-tracking branch 'origin/master' into spark-24035

c7eacf5

asfgit closed this in e3201e1 May 4, 2018

kevinykuo mentioned this pull request Nov 10, 2018

Provide tidyr functions sparklyr/sparklyr#1231

Closed

maropu mentioned this pull request Jul 16, 2020

[SPARK-32324][SQL]Fix error messages during using PIVOT and lateral view #29126

Closed
