[SPARK-47927][SQL]: Fix nullability attribute in UDF decoder #46156

eejbyfeldt · 2024-04-22T06:59:29Z

What changes were proposed in this pull request?

This PR fixes a correctness issue by moving the batch that resolves UDF decoders to after the UpdateNullability batch. The decoder is now derived from attributes whose nullability has already been updated.

I think the issue has existed since #28645, when UDF support for case class arguments was added, so it should be present in all currently supported versions.
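To illustrate why the batch order matters, here is a minimal toy model in plain Scala. It does not use Spark's analyzer; `Attr`, `Plan`, `updateNullability`, and `resolveDecoder` are hypothetical stand-ins invented for this sketch. It only shows that a decoder resolved before the nullability update keeps a stale `nullable = false` flag.

```scala
// Toy model only: these types are hypothetical and are not Spark's classes.
final case class Attr(name: String, nullable: Boolean)
final case class Plan(input: Attr, decoderSees: Option[Attr])

object BatchOrderSketch {
  type Rule = Plan => Plan

  // Stand-in for the nullability update: after a left outer join the
  // right-side column can be null, so the attribute is marked nullable.
  val updateNullability: Rule =
    p => p.copy(input = p.input.copy(nullable = true))

  // Stand-in for UDF decoder resolution: the decoder captures the attribute
  // exactly as it looks at the moment this rule runs.
  val resolveDecoder: Rule =
    p => p.copy(decoderSees = Some(p.input))

  // Apply the rules one after another, in the given order.
  def run(batches: Seq[Rule], plan: Plan): Plan =
    batches.foldLeft(plan)((p, rule) => rule(p))

  val start: Plan = Plan(Attr("value", nullable = false), None)

  // Old order: the decoder is resolved first, so it keeps the stale flag.
  val before: Plan = run(Seq(resolveDecoder, updateNullability), start)

  // New order: the decoder is resolved after the update.
  val after: Plan = run(Seq(updateNullability, resolveDecoder), start)

  def main(args: Array[String]): Unit = {
    println(s"old order, decoder sees nullable = ${before.decoderSees.map(_.nullable)}")
    println(s"new order, decoder sees nullable = ${after.decoderSees.map(_.nullable)}")
  }
}
```

In this model the only difference between `before` and `after` is the order of the two rules, which mirrors the change described above.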

Why are the changes needed?

Currently the following code

scala> val ds1 = Seq(1).toDS()
     | val ds2 = Seq[Int]().toDS()
     | val f = udf[Tuple1[Option[Int]],Tuple1[Option[Int]]](identity)
     | ds1.join(ds2, ds1("value") === ds2("value"), "left_outer").select(f(struct(ds2("value")))).collect()
val ds1: org.apache.spark.sql.Dataset[Int] = [value: int]
val ds2: org.apache.spark.sql.Dataset[Int] = [value: int]
val f: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2481/0x00007f7f50961f08@6b1a2c9f,StructType(StructField(_1,IntegerType,true)),List(Some(class[_1[0]: int])),Some(class[_1[0]: int]),None,true,true)
val res0: Array[org.apache.spark.sql.Row] = Array([[0]])

results in a row containing 0. This is incorrect, as the value should be null. Removing the UDF call

scala> ds1.join(ds2, ds1("value") === ds2("value"), "left_outer").select(struct(ds2("value"))).collect()
val res1: Array[org.apache.spark.sql.Row] = Array([[null]])

gives the correct value.
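One plausible way to read the `0` in the first result is sketched below as a plain-Scala toy model. It assumes, as a simplification that this PR does not spell out, that a null integer field is stored as a null flag plus a primitive slot holding the default value `0`, and that a decoder derived for a non-nullable field skips the null check. `Slot`, `decodeNullable`, and `decodeNonNullable` are hypothetical names, not Spark APIs.

```scala
// Toy model only: hypothetical types, not Spark's row or encoder classes.
final case class Slot(isNull: Boolean, raw: Int)

object StaleDecoderSketch {
  // Decoder derived for a nullable field: consult the null flag first.
  def decodeNullable(s: Slot): Option[Int] =
    if (s.isNull) None else Some(s.raw)

  // Decoder derived for a field believed to be non-nullable: the null check
  // is skipped, so whatever sits in the primitive slot is returned.
  def decodeNonNullable(s: Slot): Option[Int] = Some(s.raw)

  // An unmatched row from the outer join: null flag set, primitive slot
  // holding the default int value (assumed to be 0 in this model).
  val unmatched: Slot = Slot(isNull = true, raw = 0)

  def main(args: Array[String]): Unit = {
    println(s"stale decoder:   ${decodeNonNullable(unmatched)}")
    println(s"correct decoder: ${decodeNullable(unmatched)}")
  }
}
```

Under these assumptions the stale decoder produces `Some(0)` where the correct one produces `None`, which matches the `[[0]]` versus `[[null]]` results shown above.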

Does this PR introduce any user-facing change?

Yes, fixes a correctness issue when using ScalaUDFs.

How was this patch tested?

Existing and new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

eejbyfeldt · 2024-04-26T08:27:29Z

@cloud-fan Since you reviewed the original PR, maybe you could have a look?

cloud-fan · 2024-04-28T05:42:53Z

good catch!

cloud-fan · 2024-04-28T05:43:40Z

thanks, merging to master/3.5/3.4!

What changes were proposed in this pull request?

This is a followup of #46156, to fix the wrong nullability of ScalaUDF output.

Why are the changes needed?

Fix nullability.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47081 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

SPARK-47927: Correct nullability attribute in udf decoder

16d3d1a

github-actions bot added the SQL label Apr 22, 2024

eejbyfeldt marked this pull request as ready for review April 22, 2024 08:36

eejbyfeldt changed the title ~~[SPARK-47927][SQL]: Correct nullability attribute in udf decoder~~ [SPARK-47927][SQL]: Correct nullability attribute in UDF decoder Apr 22, 2024

eejbyfeldt changed the title ~~[SPARK-47927][SQL]: Correct nullability attribute in UDF decoder~~ [SPARK-47927][SQL]: Fix nullability attribute in UDF decoder Apr 22, 2024

cloud-fan approved these changes Apr 28, 2024

cloud-fan closed this in 8b8ea60 Apr 28, 2024

cloud-fan mentioned this pull request Jun 25, 2024

[SPARK-47927][SQL][FOLLOWUP] fix ScalaUDF output nullability #47081

Closed
