Conversation

@viirya
Member

@viirya viirya commented Oct 27, 2017

What changes were proposed in this pull request?

Under the current execution mode of Python UDFs, we don't properly support Python UDFs as branch values or the else value in a CaseWhen expression. Batch Python UDF execution evaluates the UDFs in an operator against all rows. This breaks the semantics of conditional expressions and causes failures in some cases:

from pyspark.sql import Row
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import IntegerType

df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
f = udf(lambda value: 10 // int(value), IntegerType())
whenExpr1 = when((col('x') > 0), f(col('x')))
df.select(whenExpr1).collect()  # Raises a division by zero error

Even from a performance perspective, evaluating all Python UDFs used in conditional expressions can waste computation if only a small portion of rows satisfies the conditions.

The patch fixes the issue by adding an extra argument to Python UDFs used with conditional expressions. The argument carries the evaluated value of the condition. On the Python side, we can then conditionally run the UDFs based on the condition value.
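A minimal sketch of the idea in plain Python (the wrapper and argument names here are illustrative assumptions, not the actual patch code): the wrapped function receives the evaluated condition as a trailing argument and only calls the user function when the condition holds.

```python
# Hypothetical sketch of the patch's idea, not the actual implementation:
# the JVM side appends the evaluated condition value as an extra argument,
# and the Python side skips the user function when the condition is false.
def wrap_conditional_udf(user_func):
    def wrapped(*args_and_cond):
        *args, cond = args_and_cond   # last argument is the condition value
        if cond:
            return user_func(*args)   # branch taken: evaluate the UDF
        return None                   # branch not taken: skip evaluation
    return wrapped

f = wrap_conditional_udf(lambda value: 10 // int(value))
print(f(5, True))    # 2
print(f(0, False))   # None -- the division never runs
```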

Question: How about vectorized Python UDFs?

It doesn't seem to make much sense to do something similar for vectorized UDFs. Vectorized UDFs process input as a batch of rows instead of a single row at a time, so we can't simply run them only on valid rows. But since a pandas Series is more resilient to such errors and evaluates to inf in the case shown above, the problem is less serious than with batch UDFs. As vectorized Python UDFs are not in any release yet, maybe we can consider disabling their use with conditional expressions without worrying about breaking compatibility.
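The resilience point can be seen with a plain pandas Series (assuming pandas is available): elementwise float division by zero yields inf rather than raising, unlike the ZeroDivisionError from the batch UDF example above.

```python
import pandas as pd

# Elementwise true division on a float Series: division by zero
# produces inf instead of raising an exception.
s = pd.Series([5.0, 0.0])
result = 10 / s
print(result.tolist())   # [2.0, inf]
```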

How was this patch tested?

Added Python tests.

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83139 has finished for PR 19592 at commit 0515435.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@viirya viirya force-pushed the SPARK-22347 branch 2 times, most recently from 1c523d1 to 9744e77 Compare October 28, 2017 00:22
@viirya
Member Author

viirya commented Oct 28, 2017

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83140 has finished for PR 19592 at commit 1c523d1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83141 has finished for PR 19592 at commit 9744e77.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@viirya
Member Author

viirya commented Oct 28, 2017

Seems it fails on Python 3.4; let me check it locally.

Note: The failure is due to using `/` under Python 3, which returns a float that doesn't match the schema.
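For reference, the difference between the two operators under Python 3: true division always produces a float, which violates an IntegerType return schema, while floor division keeps an int result for int operands.

```python
# Python 3 semantics: `/` is true division (float result),
# `//` is floor division (int result for int operands).
print(10 / 5, type(10 / 5).__name__)    # 2.0 float
print(10 // 5, type(10 // 5).__name__)  # 2 int
```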

@rxin
Contributor

rxin commented Oct 28, 2017

Is this complexity worth it? Can we just document it as a behavior and users need to be careful with it?

@viirya
Member Author

viirya commented Oct 28, 2017

Yeah, it is also an option. It is relatively easy to incorporate the conditional logic into Python UDFs on the user side.
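A sketch of that user-side workaround (function name hypothetical): guard the computation inside the UDF body before wrapping it with udf(..., IntegerType()), so the division never runs for rows that fail the condition.

```python
# Hypothetical user-side workaround: replicate the when(col('x') > 0, ...)
# condition inside the function itself before wrapping it as a UDF.
def safe_divide(value):
    if value is None or int(value) <= 0:
        return None              # rows failing the condition are skipped
    return 10 // int(value)

print(safe_divide(5))   # 2
print(safe_divide(0))   # None -- no ZeroDivisionError
```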

@viirya
Member Author

viirya commented Oct 28, 2017

One concern is that this behavior isn't very intuitive for end users without knowledge of Python UDF internals. It might seem weird that Python UDFs in conditional expressions are not really conditional. So I'm just not sure that leaving it as is, even with documentation, is the best option.

@viirya
Member Author

viirya commented Oct 28, 2017

A large part of the change is refactoring. IMHO, if possible, it is better to allow Python UDFs to run normally with conditional expressions. Thanks.

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83150 has finished for PR 19592 at commit 3a5c4c8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83152 has finished for PR 19592 at commit 138a366.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@viirya
Member Author

viirya commented Oct 28, 2017

retest this please.

@HyukjinKwon
Member

For now, I slightly prefer documenting this limitation, given the complexity vs. the gain. But I want to know what others think.

@viirya
Member Author

viirya commented Oct 28, 2017

@HyukjinKwon Thanks for the comment. Yeah, I'd like to know if we have consensus on just documenting it.

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83156 has finished for PR 19592 at commit 138a366.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@viirya
Member Author

viirya commented Oct 28, 2017

retest this please.

@SparkQA

SparkQA commented Oct 28, 2017

Test build #83167 has finished for PR 19592 at commit 138a366.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BatchEvalPythonExecBase(
  • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
  • case class BatchOptEvalPythonExec(

@viirya
Member Author

viirya commented Oct 29, 2017

ping @ueshin @BryanCutler @cloud-fan Would you mind providing some insights? In your opinion, should we just document this or fix it? Thanks.

Member

@BryanCutler BryanCutler left a comment


Given that this adds another Python UDF eval type just for this special case, in addition to the added complexity, I am also slightly leaning toward just documenting this for now.

else:
elif eval_type == PythonEvalType.SQL_BATCHED_UDF:
return arg_offsets, wrap_udf(row_func, return_type)
elif eval_type == PythonEvalType.SQL_BATCHED_OPT_UDF:
Member


Would it be possible to do this type of wrapping in BatchEvalPython and remove the need to add another eval_type? If so, you could just use the true/false result as is and not have to add anything in Python. I think that would reduce the scope of this and simplify things a bit.

Member Author


Because the Python functions are serialized and possibly broadcast further, I couldn't figure out a way to do this wrapping in BatchEvalPython on the Scala side.

Member Author


One possibility is to do the wrapping when creating UDFs on the Python side. Even for UDFs not used in conditional expressions, we would still add an extra boolean argument to the end of the argument list. With this approach we wouldn't need another eval_type.

But currently documenting it seems to be the more acceptable fix for others.

@viirya
Member Author

viirya commented Oct 31, 2017

Based on the opinions collected so far, the consensus is to just document this. I will close this for now and submit a simple PR to document it later.

@viirya viirya closed this Oct 31, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 1, 2017
…fs with conditional expressions

## What changes were proposed in this pull request?

Under the current execution mode of Python UDFs, we don't properly support Python UDFs as branch values or the else value in a CaseWhen expression.

Since fixing it might require a non-trivial change (e.g., apache#19592) and this issue has a simpler workaround, we should just note this for users in the documentation.

## How was this patch tested?

Only document change.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#19617 from viirya/SPARK-22347-3.
@viirya viirya deleted the SPARK-22347 branch December 27, 2023 18:21