[SPARK-21274][SQL] Add a new generator function replicate_rows to support EXCEPT ALL and INTERSECT ALL #21240

dilipbiswal · 2018-05-04T22:17:10Z

What changes were proposed in this pull request?

Add a new UDTF, replicate_rows. This function replicates its argument values as rows; the number of rows is given by the first argument. It will be used mainly in the EXCEPT ALL and INTERSECT ALL transformation (future PR) to preserve "retain duplicates" semantics. Please refer to Link for the design. The transformation code changes are in Code.

Example

SELECT replicate_rows(3, 1, 2)

Result

spark-sql> SELECT replicate_rows(3, 1, 2);
3	1	2
3	1	2
3	1	2
Time taken: 0.045 seconds, Fetched 3 row(s)

Returns 3 rows based on the first parameter value.
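
The behaviour shown above can be modelled with plain Scala collections. This is a minimal sketch, not the Spark implementation: `ReplicateRowsModel` and `replicate` are hypothetical names, and returning no rows for a zero or negative count is an assumption based on the `replicate_rows(-1, 2.5)` test case quoted in this thread.

```scala
object ReplicateRowsModel {
  // Hypothetical helper: emit the full argument list (count included) n times.
  // A zero or negative count produces an empty range and therefore no rows.
  def replicate(n: Long, values: Seq[Any]): Seq[Seq[Any]] =
    (0L until n).map(_ => n +: values)
}
```

For example, `ReplicateRowsModel.replicate(2, Seq("val1", "val2"))` yields two identical rows, each starting with the count.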

How was this patch tested?

Added tests in GeneratorFunctionSuite, TypeCoercionSuite, SQLQueryTestSuite


maropu · 2018-05-05T00:03:22Z

Why do we need this? I think it's OK to add a new rewriting rule for EXCEPT ALL and INTERSECT ALL in the analyzer?

maropu · 2018-05-05T00:04:17Z

Also, ISTM this function is less useful for end users.

gatorsmile · 2018-05-05T00:05:37Z

https://issues.apache.org/jira/browse/HIVE-14768 did the same thing.

maropu · 2018-05-05T00:06:34Z

Ah, ok.

SparkQA · 2018-05-05T02:03:44Z

Test build #90219 has finished for PR 21240 at commit 90efeff.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ReplicateRows(children: Seq[Expression]) extends Generator with CodegenFallback

maropu · 2018-05-05T10:12:19Z

sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

      Row(1, null) :: Row(2, null) :: Nil)
  }
+
+  test("ReplicateRows generator") {


Duplicate tests? I feel udtf_replicate_rows.sql is enough for testing.

maropu · 2018-05-05T10:14:18Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

    )
  }
+
+  test("type coercion for ReplicateRows") {


Can we move these tests into sql-tests/inputs/typeCoercion/native?

viirya · 2018-05-05T14:34:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+   * if necessary.
+   */
+  object ReplicateRowsCoercion extends TypeCoercionRule {
+    private val acceptedTypes = Seq(LongType, IntegerType, ShortType, ByteType)


nit: LongType does not seem necessary here. Leaving it out avoids re-entering the following pattern match when the type is already long.

viirya · 2018-05-05T14:35:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

 import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
 import org.apache.spark.sql.types._

+


No need to introduce this blank line.

viirya · 2018-05-05T14:36:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

 }

+/**
+ * Replicate the row based N times. N is specified as the first argument to the function.


nit: "Replicate the row N times."?

Btw, how about using n to match the expression description that follows?

viirya · 2018-05-05T14:57:26Z

sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

+      Row(3, "row1") :: Row(3, "row1") :: Row(3, "row1") :: Nil)
+    checkAnswer(df.selectExpr("replicate_rows(-1, 2.5)"), Nil)
+
+    // The data for the same column should have the same type.


This copied comment can be removed.

viirya · 2018-05-05T14:59:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+ *  }}}
+ */
+@ExpressionDescription(
+usage = "_FUNC_(n, expr1, ..., exprk) - Replicates `expr1`, ..., `exprk` into `n` rows.",


Replicates `n`, `expr1`, ..., `exprk` into `n` rows.?

dilipbiswal · 2018-05-06T01:47:32Z

@maropu @viirya Thanks for the comments. I have made the changes.

viirya · 2018-05-06T02:02:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+    private val acceptedTypes = Seq(IntegerType, ShortType, ByteType)
+    override def coerceTypes(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+      case s @ ReplicateRows(children) if s.childrenResolved &&
+        s.children.head.dataType != LongType && acceptedTypes.contains(s.children.head.dataType) =>


We should check if s.children isn't empty.

viirya · 2018-05-06T02:04:44Z

sql/core/src/test/resources/sql-tests/inputs/udtf_replicate_rows.sql

+    AS tab1(c1, c2, c3);
+
+-- Requires 2 arguments at minimum.
+SELECT replicate_rows(c1) FROM tab1;


Add one case SELECT replicate_rows()?

viirya · 2018-05-06T02:29:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+ *  }}}
+ */
+@ExpressionDescription(
+usage = "_FUNC_(n, expr1, ..., exprk) - Replicates `n`, `expr1`, ..., `exprk` into `n` rows.",


I checked the design doc for INTERSECT ALL and EXCEPT ALL. It looks like n is always stripped and is useless after the Generate operation. So why do we need to keep n in the ReplicateRows output? Can we do it like this:

> SELECT _FUNC_(2, "val1", "val2");
 val1  val2
 val1  val2

@viirya I did think about it, Simon. But then I decided to match the output with Hive.
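
For context on why the count stops being useful once rows have been generated, here is a rough sketch of EXCEPT ALL expressed as "count duplicates, then replicate each surviving row". It is an illustration under stated assumptions, not the rewrite from the design doc: `exceptAll` is a hypothetical helper that works on in-memory sequences, and it drops the count from its output.

```scala
object ExceptAllSketch {
  // Hypothetical model of EXCEPT ALL on in-memory rows:
  // each distinct row survives (leftCount - rightCount) times, if that is positive.
  def exceptAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightCounts = right.groupBy(identity).map { case (row, dups) => row -> dups.size }
    left.distinct.flatMap { row =>
      val n = left.count(_ == row) - rightCounts.getOrElse(row, 0)
      // Replication step: emit the row n times (none if n <= 0).
      Seq.fill(math.max(n, 0))(row)
    }
  }
}
```

In this model the count only drives the replication step, which is the point raised in the comment above; keeping it in the output is a compatibility choice rather than something the replication itself needs.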

SparkQA · 2018-05-06T05:26:33Z

Test build #90262 has finished for PR 21240 at commit 748003a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-05-06T06:07:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+    private val acceptedTypes = Seq(IntegerType, ShortType, ByteType)
+    override def coerceTypes(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+      case s @ ReplicateRows(children) if s.children.nonEmpty && s.childrenResolved &&
+        s.children.head.dataType != LongType && acceptedTypes.contains(s.children.head.dataType) =>


nit: s.children.head.dataType != LongType is redundant because we have acceptedTypes.contains(...).

@viirya Thanks. I will fix.

viirya · 2018-05-06T06:11:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+       2  val1  val2
+       2  val1  val2
+  """)
+case class ReplicateRows(children: Seq[Expression]) extends Generator with CodegenFallback {


This can be easily implemented in codegen so we don't need CodegenFallback. We can deal with it in follow-up if you want.

@viirya If you don't mind, I would like to do it in a follow-up.

viirya · 2018-05-06T06:47:08Z

This generator function implementation itself LGTM. I have other thoughts regarding the rewrite rule but it's better to discuss on JIRA.

cc @cloud-fan @kiszk

SparkQA · 2018-05-06T07:05:01Z

Test build #90266 has finished for PR 21240 at commit 1761068.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-06T07:05:02Z

Test build #90265 has finished for PR 21240 at commit 02ed058.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2018-05-06T07:12:21Z

retest this please

SparkQA · 2018-05-06T10:54:55Z

Test build #90267 has finished for PR 21240 at commit 1761068.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-05-06T13:58:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+
+  override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
+    val numRows = children.head.eval(input).asInstanceOf[Long]
+    val values = children.map(_.eval(input)).toArray


children.head seems to be evaluated twice here; can we avoid it?
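
One way to avoid the double evaluation is to evaluate every child once and read the count from the resulting array. The sketch below uses a stand-in `Expr` class with a call counter instead of Catalyst's `Expression`, so every name in it is hypothetical.

```scala
object EvalOnceSketch {
  // Stand-in for an expression; records how often it is evaluated.
  final class Expr(value: Any) {
    var evalCount = 0
    def eval(): Any = { evalCount += 1; value }
  }

  def generate(children: Seq[Expr]): Seq[Seq[Any]] = {
    // Evaluate each child exactly once, then reuse the first result as the row count.
    val values = children.map(_.eval()).toArray
    val numRows = values.head.asInstanceOf[Long]
    (0L until numRows).map(_ => values.toSeq)
  }
}
```

Because the count is taken from `values.head`, the first child is never evaluated a second time, which matters if the expression is expensive or non-deterministic.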

SparkQA · 2018-05-07T04:19:45Z

Test build #90287 has finished for PR 21240 at commit 4ab3af0.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-05-07T04:21:41Z

retest this please.

SparkQA · 2018-05-07T07:05:01Z

Test build #90295 has finished for PR 21240 at commit 4ab3af0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-05-07T07:17:42Z

retest this please.

SparkQA · 2018-05-07T07:28:55Z

Test build #90303 has finished for PR 21240 at commit 4ab3af0.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-05-07T07:47:06Z

retest this please.

SparkQA · 2018-05-07T11:24:43Z

Test build #90305 has finished for PR 21240 at commit 4ab3af0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-05-08T08:57:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+  object ReplicateRowsCoercion extends TypeCoercionRule {
+    private val acceptedTypes = Seq(IntegerType, ShortType, ByteType)
+    override def coerceTypes(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
+      case s @ ReplicateRows(children) if s.children.nonEmpty && s.childrenResolved &&


children is not used. How about this?

case s @ ReplicateRows(children) if children.nonEmpty && s.childrenResolved &&
    acceptedTypes.contains(children.head.dataType) =>
  ReplicateRows(Cast(children.head, LongType) +: children.tail)
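
The suggested rule boils down to this: if the first child has a narrower integral type, wrap it in a cast to long and leave the remaining children untouched. Below is a toy model of that step using a small hypothetical expression ADT; `Lit`, `Cast`, and the `DataType` objects are stand-ins defined here, not the Catalyst classes.

```scala
object CoercionSketch {
  sealed trait DataType
  case object ByteType extends DataType
  case object ShortType extends DataType
  case object IntegerType extends DataType
  case object LongType extends DataType
  case object StringType extends DataType

  sealed trait Expr { def dataType: DataType }
  case class Lit(value: Any, dataType: DataType) extends Expr
  case class Cast(child: Expr, dataType: DataType) extends Expr

  private val acceptedTypes: Seq[DataType] = Seq(IntegerType, ShortType, ByteType)

  // Cast the head to LongType only when it is a narrower integral type.
  // An empty list, a head that is already long, or a non-integral head is returned unchanged.
  def coerce(children: Seq[Expr]): Seq[Expr] = children match {
    case head +: tail if acceptedTypes.contains(head.dataType) =>
      Cast(head, LongType) +: tail
    case other => other
  }
}
```

Since `LongType` is not in `acceptedTypes`, an already-coerced argument list does not match the first case again, so applying the step repeatedly is harmless.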

maropu · 2018-05-08T09:10:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

+    if (numColumns < 2) {
+      TypeCheckResult.TypeCheckFailure(s"$prettyName requires at least 2 arguments.")
+    } else if (children.head.dataType != LongType) {
+      TypeCheckResult.TypeCheckFailure("The number of rows must be a positive long value.")


How about this message? The first argument type must be byte, short, int, or long, but ${children.head.dataType} found. BTW, it seems we don't reject negative values? (The current message says the number must be positive though...?)
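
A check along the lines proposed here might look like the following sketch. It models data types as plain strings and returns an `Either`, so none of these names are Spark's; the message wording follows the suggestion above. Negative counts are not rejected in this sketch, since it only inspects argument types, not values.

```scala
object TypeCheckSketch {
  // Hypothetical check: argument types are given by name, e.g. "int" or "string".
  private val integralTypes = Set("byte", "short", "int", "long")

  def check(argTypes: Seq[String]): Either[String, Unit] =
    if (argTypes.length < 2) {
      Left("replicate_rows requires at least 2 arguments.")
    } else if (!integralTypes.contains(argTypes.head)) {
      Left(s"The first argument type must be byte, short, int, or long, but ${argTypes.head} found.")
    } else {
      Right(())
    }
}
```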

maropu · 2018-05-08T09:12:07Z

sql/core/src/test/resources/sql-tests/inputs/udtf_replicate_rows.sql

+-- Requires 2 arguments at minimum.
+SELECT replicate_rows(c1) FROM tab1;
+
+-- First argument should be a numeric type.


nit: I think numeric generally includes float and double, too. integral type?

maropu · 2018-05-08T09:13:22Z

sql/core/src/test/resources/sql-tests/inputs/udtf_replicate_rows.sql

+    (1, 'row1', 1.1), 
+    (2, 'row2', 2.2),
+    (0, 'row3', 3.3),
+    (-1,'row4', 4.4),


Is the current behaviour for negative values the same as Hive's?

gatorsmile · 2018-05-08T16:41:50Z

As @maropu commented at the beginning, replicate_rows might be too specific. @dilipbiswal Could you propose more general built-in functions that would also benefit other use cases?

HyukjinKwon · 2018-07-16T03:23:49Z

ping @dilipbiswal for an update.

Closes apache#17422 Closes apache#17619 Closes apache#18034 Closes apache#18229 Closes apache#18268 Closes apache#17973 Closes apache#18125 Closes apache#18918 Closes apache#19274 Closes apache#19456 Closes apache#19510 Closes apache#19420 Closes apache#20090 Closes apache#20177 Closes apache#20304 Closes apache#20319 Closes apache#20543 Closes apache#20437 Closes apache#21261 Closes apache#21726 Closes apache#14653 Closes apache#13143 Closes apache#17894 Closes apache#19758 Closes apache#12951 Closes apache#17092 Closes apache#21240 Closes apache#16910 Closes apache#12904 Closes apache#21731 Closes apache#21095 Added: Closes apache#19233 Closes apache#20100 Closes apache#21453 Closes apache#21455 Closes apache#18477 Added: Closes apache#21812 Closes apache#21787 Author: hyukjinkwon <[email protected]> Closes apache#21781 from HyukjinKwon/closing-prs.

Commits

90efeff  [SPARK-21274] Add a new generator function replicate_rows to support EXCEPT ALL and INTERSECT ALL
748003a  Review comments
02ed058  Review comments
1761068  more comments
4ab3af0  fix

HyukjinKwon mentioned this pull request Jul 16, 2018

[INFRA] Close stale PR #21781

Closed

asfgit closed this in 1a4fda8 Jul 19, 2018
