[SPARK-22347][SQL][PySpark] Support optionally running PythonUDFs in conditional expressions #19592
Conversation
Test build #83139 has finished for PR 19592 at commit
Force-pushed from 1c523d1 to 9744e77.
Test build #83140 has finished for PR 19592 at commit
Test build #83141 has finished for PR 19592 at commit
Seems it fails on Python 3.4; let me check it locally. Note: the failure is due to using
Is this complexity worth it? Can we just document it as a behavior that users need to be careful with?
Yeah, it is also an option. It is relatively easy to incorporate the conditional logic into Python UDFs on the user side.
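For instance, a user-side workaround could look something like the sketch below (hypothetical names, not code from this PR): the guard lives inside the UDF body, so it holds no matter how eagerly Spark evaluates the UDF.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# The zero check is part of the UDF itself, so evaluating it on every row
# (including rows a CASE WHEN branch would skip) cannot raise an error.
safe_inverse = udf(lambda x: 1.0 / x if x != 0 else None, DoubleType())
```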
One concern is that this behavior isn't intuitive for end users without knowledge of Python UDF internals. It might be a bit weird to find that Python UDFs in conditional expressions are not really conditional. So I'm just not sure that keeping it as it is is the best option, even with documentation for it.
A large part of the change is refactoring. IMHO, if possible, it is better to allow Python UDFs to run with conditional expressions normally. Thanks.
Test build #83150 has finished for PR 19592 at commit
Test build #83152 has finished for PR 19592 at commit
retest this please.
To me, I slightly prefer documenting this limitation, given the complexity vs. gain for now. But I want to know what others think.
@HyukjinKwon Thanks for the comment. Yeah, I want to know if we have consensus to just document it.
Test build #83156 has finished for PR 19592 at commit
retest this please.
Test build #83167 has finished for PR 19592 at commit
ping @ueshin @BryanCutler @cloud-fan Would you mind providing some insights? In your opinion, should we just add documentation for this or fix it? Thanks.
BryanCutler left a comment:
Having another Python UDF eval type just for this special case, in addition to the added complexity, I am also slightly leaning toward just documenting this for now.
Diff excerpt under review:

```python
else:
elif eval_type == PythonEvalType.SQL_BATCHED_UDF:
    return arg_offsets, wrap_udf(row_func, return_type)
elif eval_type == PythonEvalType.SQL_BATCHED_OPT_UDF:
```
Would it be possible to do this type of wrapping in BatchEvalPython, and remove the need to add another eval_type? If so, then you could just use the true/false result as is and not have to add anything in Python. I think that would reduce the scope of this and simplify things a bit.
Because the Python functions are serialized and may be broadcast further, I didn't figure out a way to do this wrapping in BatchEvalPython on the Scala side.
One possibility is to do the wrapping when creating UDFs on the Python side. Even for UDFs not used in conditional expressions, we would still add an extra boolean argument to the end of the argument list, and we wouldn't need another eval_type with this fix.
But currently I think documenting it seems a more acceptable fix to others.
After collecting the opinions so far, the consensus is to just document it. I will close this for now and submit a simple PR to document it later.
…fs with conditional expressions

What changes were proposed in this pull request?
Under the current execution mode of Python UDFs, we don't support Python UDFs well as branch values or the else value in a CaseWhen expression. Since fixing it might require a non-trivial change (e.g., apache#19592) and this issue has a simpler workaround, we should just notify users about this in the documentation.

How was this patch tested?
Documentation change only.

Author: Liang-Chi Hsieh <[email protected]>
Closes apache#19617 from viirya/SPARK-22347-3.
What changes were proposed in this pull request?
Under the current execution mode of Python UDFs, we don't support Python UDFs well as branch values or the else value in a CaseWhen expression. Batch Python UDF execution evaluates the UDFs in an operator on all rows, which breaks the semantics of conditional expressions and causes failures in some cases:
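For example, a hypothetical reproduction along these lines (assumed names; not the exact snippet from the original description) can fail even though the condition is meant to guard the UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# A UDF that is only safe for non-zero input.
inverse = udf(lambda x: 1.0 / x, DoubleType())

df = spark.range(-2, 3)  # column "id" with values -2, -1, 0, 1, 2

# Batch Python UDF execution evaluates `inverse` on every row, including
# id == 0, so this can raise ZeroDivisionError even though the WHEN
# condition is meant to skip that row.
df.select(when(col("id") != 0, inverse(col("id"))).otherwise(0.0)).show()
```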
Even from a performance perspective, evaluating all Python UDFs used in conditional expressions can waste computation if only a small portion of rows satisfies the conditions.
The patch fixes the issue by adding an extra argument to Python UDFs used with conditional expressions. The argument carries the evaluated value of the condition. On the Python side, we can then optionally run the Python UDF based on that condition value.
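A minimal sketch of how the Python side could use that trailing condition value (the helper and variable names are assumptions, not necessarily the code in this PR):

```python
def wrap_conditional_udf(f):
    # Hypothetical sketch: the last argument is assumed to be the evaluated
    # condition for the current row, passed alongside the normal UDF inputs.
    def evaluate(*args):
        condition, udf_args = args[-1], args[:-1]
        # Only run the user function when the branch is actually taken;
        # otherwise the returned value is unused, so None is fine.
        return f(*udf_args) if condition else None
    return evaluate
```

With a wrapper like this, the JVM side would only need to append the evaluated condition as the last input of the UDF.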
Question: How about vectorized Python UDFs?
It doesn't seem to make much sense to do something similar for vectorized UDFs. Vectorized UDFs process input as a batch of rows instead of a single row at a time, so we can't simply run a vectorized UDF only on the valid rows. However, a pandas Series is more resistant to such errors and evaluates to inf for the case shown above, so the problem is less serious than for batch UDFs.
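As a small standalone illustration of that pandas behavior (plain pandas, not PR code):

```python
import pandas as pd

s = pd.Series([-2.0, 0.0, 2.0])
# Float division by zero on a pandas Series produces inf instead of raising,
# which is why a vectorized UDF evaluated on every row fails less readily.
print(1.0 / s)  # -0.5, inf, 0.5
```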
Since vectorized Python UDFs are not in any release yet, maybe we can consider disallowing their use with conditional expressions and not worry about breaking compatibility.
How was this patch tested?
Added python tests.