@maryannxue
Contributor

@maryannxue maryannxue commented Apr 28, 2018

What changes were proposed in this pull request?

Add SQL support for Pivot according to Pivot grammar defined by Oracle (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_clause.htm) with some simplifications, based on our existing functionality and limitations for Pivot at the backend:

  1. For pivot_for_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm), the column list form is not supported, which means the pivot column can only be one single column.
  2. For pivot_in_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_in_clause.htm), the sub-query form and "ANY" are not supported (these are only supported by Oracle for XML anyway).
  3. For pivot_in_clause, aliases for the constant values are not supported.

The code changes are:

  1. Add parser support for Pivot. Note that according to https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#i2076542, Pivot cannot be used together with lateral views in the from clause. This restriction has been implemented in the Parser rule.
  2. Infer group-by expressions: group-by expressions are not explicitly specified in the SQL Pivot clause and need to be deduced based on this rule: https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#CHDFAFIE, so we have to fill them in afterwards, at the query analysis stage.
  3. Override Pivot.resolved as "false": for the reason mentioned in item 2, and because the output attributes change once Pivot is replaced by Project or Aggregate, we avoid resolving parent references until after Pivot has been resolved and replaced.
  4. Verify aggregate expressions: only aggregate expressions, with or without aliases, can appear in the first part of the Pivot clause, and this check is performed at the analysis stage.
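
To illustrate item 2, here is a minimal sketch of how implicit group-by columns could be deduced, assuming the rule is "every column of the input that is not referenced in the Pivot clause". It uses plain column names rather than Catalyst expressions; `PivotSpec`, `inferGroupBy`, and the `courseSales` table in the comment are hypothetical names for illustration, not the code in this PR.

```scala
object ImplicitGroupBy {
  // Hypothetical, simplified stand-in for a Pivot clause: names only, no Catalyst types.
  final case class PivotSpec(
      pivotColumn: String,               // the column named after FOR
      aggregateInputs: Seq[Seq[String]]) // columns referenced by each aggregate

  // Assumed rule: group by every child column not referenced in the Pivot clause.
  def inferGroupBy(childOutput: Seq[String], spec: PivotSpec): Seq[String] = {
    val referenced = (spec.pivotColumn +: spec.aggregateInputs.flatten).toSet
    childOutput.filterNot(referenced.contains)
  }

  def main(args: Array[String]): Unit = {
    // Corresponds to a query shaped like:
    //   SELECT * FROM (SELECT year, course, earnings FROM courseSales)
    //   PIVOT (sum(earnings) FOR year IN (2012, 2013))
    val spec = PivotSpec("year", Seq(Seq("earnings")))
    println(inferGroupBy(Seq("year", "course", "earnings"), spec)) // List(course)
  }
}
```

Under this assumption, the only way to control the grouping from SQL is to project the desired columns in the sub-query that feeds the Pivot.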

How was this patch tested?

A new test suite PivotSuite is added.

@SparkQA

SparkQA commented Apr 28, 2018

Test build #89950 has finished for PR 21187 at commit c486c6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Since the original Pivot support was added by @aray, could @aray please take a look at this PR too?

* limitations under the License.
*/

package org.apache.spark.sql
Contributor

can we use the infra for SQLQueryTestSuite?

@aray
Contributor

aray commented May 1, 2018

LGTM thanks for doing this!

@maryannxue
Contributor Author

Thank you, @aray!
Thank you, @rxin, for the nice suggestion! Changes made accordingly in my latest commit.

@Tagar

Tagar commented May 1, 2018

It would be great to make the FOR section optional,
e.g. make FOR year IN (2012, 2013) optional in one of your examples.
Currently pivot(), when called programmatically, doesn't require the list of values
to be defined up front.
Otherwise it will be a big inconvenience/limitation for PIVOT when called from SQL.

Notice that Oracle's PIVOT when used on XML doesn't require list of values either -
https://community.oracle.com/thread/2183084

@maryannxue
Contributor Author

Thank you, @Tagar, for your comment! I think by "making the FOR section optional", you actually mean supporting "IN ANY".
As you said, and as I pointed out in my PR description, Oracle's "IN ANY" is only supported on XML. Besides, this version of Pivot SQL support is meant to provide a front end that leverages the existing runtime Pivot support, in which the values in the "IN" clause can only be literals.
That said, there are a few improvements we can make in the backend/runtime Pivot support. Some of them are easy, like "IN" value aliases; others are less so, like supporting "IN ANY". To allow unspecified values, we might need to run another query at an early stage of this query's compilation to get a list of all possible values, similar to what's been done in RelationalGroupedDataset#pivot(String).
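
As a rough illustration of why unspecified values need an extra query, the sketch below pivots an in-memory collection in two passes: one to discover the distinct pivot values, and one to aggregate with that list. It is plain Scala, not Spark code; `Sale`, `distinctYears`, and `pivotSum` are hypothetical names, and sorting the discovered values is an assumption made here for a stable column order.

```scala
object PivotWithoutValues {
  final case class Sale(year: Int, course: String, earnings: Double)

  // Pass 1: discover the pivot values (the "other query" described above).
  def distinctYears(rows: Seq[Sale]): Seq[Int] =
    rows.map(_.year).distinct.sorted

  // Pass 2: pivot with an explicit value list; one summed cell per (course, year).
  // A cell is None when no input row matches that (course, year) pair.
  def pivotSum(rows: Seq[Sale], years: Seq[Int]): Map[String, Seq[Option[Double]]] =
    rows.groupBy(_.course).map { case (course, group) =>
      val cells = years.map { y =>
        val matching = group.filter(_.year == y)
        if (matching.isEmpty) None else Some(matching.map(_.earnings).sum)
      }
      course -> cells
    }
}
```

The output columns are only known after pass 1 has run, which is why this cannot be handled as an ordinary sub-query evaluated at execution time.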

@Tagar

Tagar commented May 1, 2018

@maryannxue, yep, "IN ANY" is what I meant. I missed it was already in the description - thanks for clarifying this.

@gatorsmile
Member

In-any-subquery in Pivot can be implemented the way we did it in other parts, but let us leave it as a future item for now. @maryannxue Maybe you can create a JIRA.

The Oracle-compatible syntax in this PR is good enough. Thanks for your work! @maryannxue

FULL: 'FULL';
NATURAL: 'NATURAL';
ON: 'ON';
PIVOT: 'PIVOT';
Member

Could you add the keywords you added here to nonReserved (line 723)? Also update the suite TableIdentifierParserSuite?

Contributor Author

Sure, I'll update TableIdentifierParserSuite. I believe I've added them to nonReserved already. Did I miss something?

  aggregates: Seq[Expression],
- child: LogicalPlan) extends UnaryNode {
+ child: LogicalPlan,
+ groupByExprsImplicit: Boolean = false) extends UnaryNode {
Member

Could you add the parameter descriptions like what we did in GroupingSets (line 663-675)?

Member

@gatorsmile gatorsmile May 1, 2018

Could we just change groupByExprs: Seq[NamedExpression] -> groupByExprs: Option[Seq[NamedExpression]]? Then we do not need this extra flag. In the pattern matching, we just need to match on Some or None.

Contributor Author

Yes, I hesitated for a while... trying to figure out which way it would be clearer, using Option or an extra flag. I did not have a preference, so I'll do Option if you think it's better.
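
For reference, the Option-based shape suggested above might look roughly like the following. This is a simplified sketch with strings in place of Catalyst expressions; `PivotNode` and `describe` are hypothetical names, and only `groupByExprsOpt` echoes a name that appears later in this review.

```scala
object GroupByOption {
  // Hypothetical simplified node; the real Pivot carries Catalyst expressions.
  final case class PivotNode(
      groupByExprsOpt: Option[Seq[String]], // None = not given in SQL, infer later
      pivotColumn: String,
      aggregates: Seq[String])

  // A single match replaces the separate Boolean flag: the absence of
  // group-by expressions is encoded in the type itself.
  def describe(p: PivotNode): String = p.groupByExprsOpt match {
    case Some(exprs) => s"explicit group-by: ${exprs.mkString(", ")}"
    case None        => "implicit group-by: infer during analysis"
  }
}
```

With a flag, a caller could construct an inconsistent node (non-empty expressions together with `implicit = true`); the Option form makes that state unrepresentable.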

- child: LogicalPlan) extends UnaryNode {
+ child: LogicalPlan,
+ groupByExprsImplicit: Boolean = false) extends UnaryNode {
+ override lazy val resolved = false // Pivot will be replaced after being resolved.
Member

Could you add a negative test case to show which errors we could issue when Pivot can't be resolved? If the error does not make sense to the end users, we can improve the function checkAnalysis in the trait CheckAnalysis

Contributor Author

In the ResolvePivot rule, the Pivot node is guaranteed to be resolved and replaced, with either a Project or an Aggregate, unless one of these two errors occurs:

  1. Child expressions or child node of Pivot cannot be resolved, which is a general Analyzer error and not specific to Pivot. I'll add a test case for this anyway.
  2. Pivot's aggregates are not really aggregate expressions (this is not guaranteed by the Parser). I have added this check in the ResolvePivot rule, and the last test in "pivot.sql" is dedicated to this check.

So here, marking Pivot's resolved field as false stops its parent from resolving references or expanding stars until Pivot has been replaced; otherwise the star-expansion would be incorrect.
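
The effect of keeping `resolved` false can be sketched with a toy analyzer, shown below. This is not Catalyst: all the plan classes and `analyzeOnce` are hypothetical, much-simplified stand-ins, and the single-aggregate output naming is assumed. The point is only that a parent `SELECT *` waits while its child is an unresolved Pivot, and expands against the replacement's output on a later pass.

```scala
object DeferredStarExpansion {
  sealed trait Plan { def output: Seq[String]; def resolved: Boolean }

  final case class Relation(output: Seq[String]) extends Plan { val resolved = true }

  final case class Aggregate(output: Seq[String], child: Plan) extends Plan {
    def resolved: Boolean = child.resolved
  }

  // Mirrors the idea of `override lazy val resolved = false`.
  final case class Pivot(values: Seq[String], groupBy: Seq[String], child: Plan) extends Plan {
    val resolved = false
    def output: Seq[String] = child.output // not the final schema
  }

  // SELECT * whose column list has not been expanded yet.
  final case class StarProject(child: Plan) extends Plan {
    val resolved = false
    def output: Seq[String] = Nil
  }

  final case class Project(output: Seq[String], child: Plan) extends Plan {
    def resolved: Boolean = child.resolved
  }

  // One analyzer pass: replace a Pivot, or expand a star only over a resolved child.
  def analyzeOnce(plan: Plan): Plan = plan match {
    case Pivot(values, groupBy, child) if child.resolved => Aggregate(groupBy ++ values, child)
    case StarProject(child) if child.resolved            => Project(child.output, child)
    case StarProject(child)                              => StarProject(analyzeOnce(child))
    case other                                           => other
  }
}
```

If Pivot reported itself as resolved, the first pass would expand the star against the pre-pivot columns (`year`, `course`, `earnings` in the usual example) instead of the pivoted ones.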

@maryannxue
Contributor Author

@gatorsmile "In-any-subquery in Pivot can be implemented like what we did in the other parts", can you make this clearer? The Pivot's "IN" values are special coz they will later become the columns of the relation.

@SparkQA

SparkQA commented May 2, 2018

Test build #90012 has finished for PR 21187 at commit 171c0c2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 2, 2018

retest this please

@SparkQA

SparkQA commented May 2, 2018

Test build #90024 has finished for PR 21187 at commit 171c0c2.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@gatorsmile gatorsmile left a comment

LGTM except one minor comment.

override lazy val resolved = false // Pivot will be replaced after being resolved.
override def output: Seq[Attribute] =
groupByExprsOpt.getOrElse(Seq.empty).map(_.toAttribute) ++ aggregates match {
case agg :: Nil => pivotValues.map(value => AttributeReference(value.toString, agg.dataType)())
Member

Nit: indent issues. Let us just clean the code at the same time.

  override def output: Seq[Attribute] = {
    val pivotAgg = aggregates match {
      case agg :: Nil =>
        pivotValues.map(value => AttributeReference(value.toString, agg.dataType)())
      case _ =>
        pivotValues.flatMap { value =>
          aggregates.map(agg => AttributeReference(value + "_" + agg.sql, agg.dataType)())
        }
    }
    groupByExprsOpt.getOrElse(Seq.empty).map(_.toAttribute) ++ pivotAgg
  }
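
To make the naming rule in this snippet concrete, here is a small stand-alone sketch that reproduces it on plain strings. `outputNames` is a hypothetical helper, and `aggSqls` stands in for the `agg.sql` text of each aggregate; data types and attribute IDs are left out.

```scala
object PivotOutputNames {
  // Same branching as the snippet above, on names only:
  // one aggregate  -> one column per pivot value, named after the value;
  // several        -> one column per (value, aggregate), named value_aggSql.
  def outputNames(
      groupBy: Seq[String],
      pivotValues: Seq[Any],
      aggSqls: Seq[String]): Seq[String] = {
    val pivotAgg = aggSqls match {
      case Seq(_) => pivotValues.map(_.toString)
      case _      => pivotValues.flatMap(v => aggSqls.map(agg => s"${v}_$agg"))
    }
    groupBy ++ pivotAgg
  }
}
```

For example, pivot values 2012 and 2013 with the two aggregates `sum(earnings)` and `avg(earnings)` would yield four pivoted columns, ordered by value first and aggregate second.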

@gatorsmile
Member

"In-any-subquery in Pivot can be implemented like what we did in the other parts", can you make this clearer? The Pivot's "IN" values are special coz they will later become the columns of the relation.

Yes. The IN subquery needs to be executed before or during the query analysis stage. Thus, it is different.

@maryannxue
Contributor Author

Thank you, @gatorsmile, for the review and comments! I have opened SPARK-24162, SPARK-24163 and SPARK-24164 as follow-up improvements for this issue. Please feel free to assign them to me.

@SparkQA

SparkQA commented May 3, 2018

Test build #90084 has finished for PR 21187 at commit edd0eb1.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2018

Test build #90094 has finished for PR 21187 at commit c7eacf5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayJoin(
  • case class Flatten(child: Expression) extends UnaryExpression
  • case class MonthsBetween(
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(
  • case class WriteToContinuousDataSource(
  • case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

@kiszk
Member

kiszk commented May 3, 2018

retest this please

@SparkQA

SparkQA commented May 3, 2018

Test build #90144 has finished for PR 21187 at commit c7eacf5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayJoin(
  • case class Flatten(child: Expression) extends UnaryExpression
  • case class MonthsBetween(
  • case class CachedRDDBuilder(
  • case class InMemoryRelation(
  • case class WriteToContinuousDataSource(
  • case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: SparkPlan)

@gatorsmile
Member

Thanks for your fast and great work! Merged to master.
