[SPARK-31826][SQL] Support composed type of case class for typed Scala UDF #28645
Conversation
ping @koertkuipers @cloud-fan Please take a look, thanks!
row: Any => fromRow(row.asInstanceOf[InternalRow])
} else {
  val child = children(i)
  val attrs = new StructType().add(s"$child", child.dataType).toAttributes
`child.toString` can be expensive. How about `"child"`? The name doesn't matter anyway.
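For illustration, a sketch of the change this comment suggests, reusing the names from the diff above (not the final patch):

```scala
// Sketch: use a constant field name instead of child.toString, since the
// generated attribute name is only a placeholder and never user-visible.
val child = children(i)
val attrs = new StructType().add("child", child.dataType).toAttributes
```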
encoder match {
  case Some(enc) =>
    if (enc.isSerializedAsStructForTopLevel) {
      val fromRow = enc.resolveAndBind().createDeserializer()
To be consistent, shall we bind with `child.dataType.asInstanceOf[StructType].toAttributes`?
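A sketch of the suggested binding, assuming the `enc` and `child` values from the diff above:

```scala
// Bind the deserializer against the child's own catalyst schema, mirroring
// the attributes used in the non-struct branch (sketch only).
val attrs = child.dataType.asInstanceOf[StructType].toAttributes
val fromRow = enc.resolveAndBind(attrs).createDeserializer()
```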
checkAnswer(df.select(myUdf(Column("col1"), Column("col2"))), Row(2020) :: Nil)
}

test("case class as element type of Seq/Array") {
Can we also test some special cases (sketched below)?
- the catalyst schema has more fields than the case class, e.g. `struct<key: int, value: string, col: int>` and `case class TestData(key: Int, value: String)`
- the field order doesn't match, e.g. `struct<value: string, key: int>` and `case class TestData(key: Int, value: String)`
- the catalyst schema has missing fields, e.g. `struct<key: int>` and `case class TestData(key: Int, value: String)`
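A hedged sketch of what such tests might look like inside a SQLTestUtils-style suite (assuming `case class TestData(key: Int, value: String)` and the suite's `checkAnswer` helper; the expectations in the comments reflect by-name encoder resolution, not the PR's actual tests):

```scala
val myUdf = udf((d: TestData) => d.key + d.value.length)

// 1. catalyst schema has more fields than the case class: extra fields should be ignored
val extra = Seq((1, "v", 2)).toDF("key", "value", "col").select(struct("key", "value", "col").as("s"))
checkAnswer(extra.select(myUdf(col("s"))), Row(2) :: Nil)

// 2. field order doesn't match: fields should be resolved by name
val reordered = Seq(("v", 1)).toDF("value", "key").select(struct("value", "key").as("s"))
checkAnswer(reordered.select(myUdf(col("s"))), Row(2) :: Nil)

// 3. catalyst schema is missing a field: analysis is expected to fail
val missing = Seq(Tuple1(1)).toDF("key").select(struct("key").as("s"))
intercept[AnalysisException](missing.select(myUdf(col("s"))))
```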
Test build #123125 has finished for PR 28645 at commit
checkAnswer(df.select(myUdf(Column("col1"), Column("col2"))), Row(2020) :: Nil)
}

test("case class as element type of Seq/Array") {
Some Option tests would be good too.
Option in a simple Seq:
Seq(Seq(Some(1), None)).toDF.withColumn("value", udf{ s: Seq[Option[Int]] => s.map(_.map(_ + 1)) }.apply(col("value")))
Option as the function argument:
Seq(None, Some(1), None).toDF.withColumn("value", udf{ o: Option[Int] => o.map(_ + 1) }.apply(col("value")))
Note that a top-level Option to express nullability is a very common use case in particular and is supported by encoders. The equivalent in Dataset is:
Seq(None, Some(1), None).toDS.map{ o: Option[Int] => o.map(_ + 1) }
There's an implementation difference between `udf` and `Dataset.map`. So for the second case you mentioned, it only works in `Dataset.map` but fails in `udf`.
> There's an implementation difference between `udf` and `Dataset.map`. So for the second case you mentioned, it only works in `Dataset.map` but fails in `udf`.
If the goal is to stop people from writing their own implementations of `udf`, then the second case is also needed...
@koertkuipers I've updated to support the second case.
Test build #123192 has finished for PR 28645 at commit
Test build #123209 has finished for PR 28645 at commit
Test build #123267 has finished for PR 28645 at commit
nullable: Boolean = true,
udfDeterministic: Boolean = true)
udfDeterministic: Boolean = true,
inputDeserializers: Seq[Option[Deserializer[_]]] = Nil)
do we need the inputEncoders parameter anymore?
Test build #123478 has finished for PR 28645 at commit
cc @cloud-fan Could you please take another look?
Test build #123550 has finished for PR 28645 at commit
retest this please
Test build #123635 has finished for PR 28645 at commit
Test build #123666 has finished for PR 28645 at commit
retest this please
createToScalaConverter(i, c.dataType)
}.toArray :+ CatalystTypeConverters.createToCatalystConverter(dataType)
scalaConverter(i, c.dataType)
}.toArray :+ createToCatalystConverter(dataType)
does the UDF return type support case class?
yes.
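For illustration, a hedged example (made-up names) of a typed Scala UDF whose return type is a case class; the output column's schema is derived from the case class:

```scala
case class Point(x: Int, y: Int)

// The return schema is inferred from Point, so the result is a struct<x:int,y:int> column.
val makePoint = udf((x: Int, y: Int) => Point(x, y))

Seq((1, 2)).toDF("x", "y")
  .select(makePoint(col("x"), col("y")).as("point"))
  .printSchema()
```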
val initArg = if (CatalystTypeConverters.isPrimitive(dt)) {
  // Check `inputPrimitives` when it's not empty in order to figure out the Option
  // type as non primitive type, e.g., Option[Int]. Fall back to `isPrimitive` when
  // `inputPrimitives` is empty for other cases, e.g., Java UDF, untyped Scala UDF
so untyped Scala UDF doesn't support Option?
Yea. We require the encoder to support Option but untyped Scala UDF can't provide the encoder.
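A hedged illustration of the distinction (the untyped variant shown commented out is an assumption about what is rejected, not the PR's own test):

```scala
import org.apache.spark.sql.functions.udf

// Typed API: the compiler derives an encoder for Option[Int], so NULL <-> None works.
val typedInc = udf((o: Option[Int]) => o.map(_ + 1))

// Untyped API: only a return DataType is given and no encoder is captured,
// so Option parameters cannot be supported here.
// val untypedInc = udf((o: Option[Int]) => o.map(_ + 1), IntegerType)
```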
s"Object $argTerm = ${eval.isNull} ? null : $convertersTerm[$i].apply(${eval.value});"
s"""
|Object $argTerm = null;
|// handle the top level Option type specifically
What's special for top-level Option?
For the top-level Option, e.g. Option[T], its internal data type is T. However, a udf always requires the external data type for its input values. So, when the ScalaUDF receives a null value of type T from its child, it needs to convert it to None instead of simply passing in the null value as it does for other nullable data types.
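A minimal sketch of that conversion (illustrative only, not the actual generated code):

```scala
// For a parameter declared as Option[T], the external value handed to the user
// function must itself be an Option: a NULL column value becomes None and a
// non-null value v becomes Some(v). Other nullable types receive null as-is.
def toExternalOption(value: Any): Option[Any] = Option(value) // null => None, v => Some(v)
```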
Test build #124212 has finished for PR 28645 at commit
Test build #124221 has finished for PR 28645 at commit
checkAnswer(df.select(myUdf(Column("col"))), Row(100) :: Row(null) :: Nil)
}

test("top level Option case class") {
This is already tested in `case class as generic type of Option`.
val df = spark.range(1)
  .select(lit(50).as("a"))
  .select(struct("a").as("col"))
val error = intercept[AnalysisException] (df.select(myUdf(Column("col"))))
nit: no space between `intercept[AnalysisException]` and `(df.select...`
Test build #124248 has finished for PR 28645 at commit
Test build #124260 has finished for PR 28645 at commit
Jenkins, retest this please.
Test build #124267 has finished for PR 28645 at commit
thanks, merging to master!
thanks all!!
…e rows in ScalaUDF as well

### What changes were proposed in this pull request?
This PR tries to address the comment: #28645 (comment)

It changes `canUpCast/canCast` to allow cast from sub UDT to base UDT, in order to allow UserDefinedType to use `ExpressionEncoder` to deserialize rows in ScalaUDF as well.

One thing that needs to be mentioned: even though we allow cast from sub UDT to base UDT, it doesn't really do the cast in `Cast`, because sub UDT and base UDT are still considered the same type (because of #16660), see:
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L81-L86
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L92-L95
Therefore, the optimizer rule `SimplifyCast` will eliminate the cast in the end.

### Why are the changes needed?
Reduce the special casing caused by `UserDefinedType` in `ResolveEncodersInUDF` and `ScalaUDF`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
It should be covered by the test of `SPARK-19311`, which is also updated a little in this PR.

Closes #28920 from Ngone51/fix-udf-udt.

Authored-by: yi.wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR fixes a correctness issue by moving the batch that resolves udf decoders to after the `UpdateNullability` batch. This means we now derive a decoder with the updated attributes, which fixes the correctness issue. I think the issue has existed since #28645, when udf support for case class arguments was added, so it should be present in all currently supported versions.

### Why are the changes needed?
Currently the following code

```
scala> val ds1 = Seq(1).toDS()
     | val ds2 = Seq[Int]().toDS()
     | val f = udf[Tuple1[Option[Int]],Tuple1[Option[Int]]](identity)
     | ds1.join(ds2, ds1("value") === ds2("value"), "left_outer").select(f(struct(ds2("value")))).collect()
val ds1: org.apache.spark.sql.Dataset[Int] = [value: int]
val ds2: org.apache.spark.sql.Dataset[Int] = [value: int]
val f: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2481/0x00007f7f50961f086b1a2c9f,StructType(StructField(_1,IntegerType,true)),List(Some(class[_1[0]: int])),Some(class[_1[0]: int]),None,true,true)
val res0: Array[org.apache.spark.sql.Row] = Array([[0]])
```

results in a row containing `0`; this is incorrect, as the value should be `null`. Removing the udf call

```
scala> ds1.join(ds2, ds1("value") === ds2("value"), "left_outer").select(struct(ds2("value"))).collect()
val res1: Array[org.apache.spark.sql.Row] = Array([[null]])
```

gives the correct value.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue when using ScalaUDFs.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46156 from eejbyfeldt/SPARK-47927.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR adds support for typed Scala UDFs to accept composed types of case classes, e.g. Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), as input parameter types.
Why are the changes needed?
After #27937, typed Scala UDF supports a case class as its input parameter type. However, it cannot accept composed types of case classes, such as Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), which causes confusion for users (e.g. #27937 (comment)).
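For illustration, a hedged sketch (with made-up names, assuming `spark.implicits._` is in scope) of the kind of UDF this change enables:

```scala
case class Person(name: String, age: Int)

// A typed Scala UDF whose input parameter is a Seq of a case class,
// which previously could not be resolved.
val oldest = udf((people: Seq[Person]) => people.maxBy(_.age).name)

Seq(Seq(Person("a", 30), Person("b", 40)))
  .toDF("people")
  .select(oldest(col("people")))
  .show()
```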
Does this PR introduce any user-facing change?
Yes.
Run the query:
Before:
After:
How was this patch tested?
Added tests.