Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #46156 , to fix the wrong nullability of ScalaUDF output.

Why are the changes needed?

fix nullability

Does this PR introduce any user-facing change?

no

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jun 25, 2024
@cloud-fan
Copy link
Contributor Author

cc @eejbyfeldt @HyukjinKwon

Batch("Nondeterministic", Once,
PullOutNondeterministic),
Batch("ScalaUDF", Once,
HandleNullInputsForUDF),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule adds null handling for ScalaUDF, so even if the ScalaUDF's nullable is false, the final expression (if-else) can also return null, so we need to run UpdateAttributeNullability after it.

Copy link
Contributor

@eejbyfeldt eejbyfeldt Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we also need to run it before? Or how could we correctly handle null inputs to a udf if we do not have the correct nullability for the inputs?

Seems to me like one should be able to write a test case involving an outer join that shows that the proposed code will not handle such nullable input correctly.

@eejbyfeldt
Copy link
Contributor

Feels like this does not address the root issue. To me it seems problematic that nullable for ScalaUDF does not correctly reflect if the expression is nullable or not. Could that not cause bugs in other parts of the code?

Would it be better to address this issue by changing the nullable implementation? Something like #47085

@cloud-fan
Copy link
Contributor Author

@eejbyfeldt I did consider this approach before. However, the null check happens outside of ScalaUDF. If we do invoke the UDF, ScalaUDF.nullable = false means the UDF result won't be null.

@eejbyfeldt
Copy link
Contributor

eejbyfeldt commented Jun 25, 2024

@eejbyfeldt I did consider this approach before. However, the null check happens outside of ScalaUDF. If we do invoke the UDF, ScalaUDF.nullable = false means the UDF result won't be null.

Could you expand on how to trigger this error? I am not sure I am following. (But I do agree that it feels a bit off the the implementation of nullable does not match implementation of the execution)

Here is a code that still produces incorrect result with this branch (and master and prior to #46156 ):

scala> val f = udf[Int, Int](_ + 1)
scala> sc.parallelize(Seq(Some(1), None)).toDS.select(f($"value").as("t"), f($"t")).show()
+----+-----------------------------+
|   t|UDF(lateralAliasReference(t))|
+----+-----------------------------+
|   2|                            3|
|NULL|                            1|
+----+-----------------------------+

To solve (while keeping the ScalaUDF.nullable implementation) this we would probably need to make the update nullability batch be something like

    Batch("UpdateNullability", fixedPoint,
      HandleNullInputsForUDF,
      UpdateAttributeNullability),

@cloud-fan
Copy link
Contributor Author

@eejbyfeldt good catch! I've fixed it and added a test

@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jun 27, 2024

thanks for the reviews, merging to master/3.5/3.4!

@cloud-fan cloud-fan closed this in d89aad3 Jun 27, 2024
cloud-fan added a commit that referenced this pull request Jun 27, 2024
This is a followup of #46156 , to fix the wrong nullability of ScalaUDF output.

fix nullability

no

new test

no

Closes #47081 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d89aad3)
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Jun 27, 2024
This is a followup of #46156 , to fix the wrong nullability of ScalaUDF output.

fix nullability

no

new test

no

Closes #47081 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d89aad3)
Signed-off-by: Wenchen Fan <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
This is a followup of apache#46156 , to fix the wrong nullability of ScalaUDF output.

fix nullability

no

new test

no

Closes apache#47081 from cloud-fan/udf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d89aad3)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants