[SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. #17927

JoshRosen · 2017-05-09T22:22:07Z

What changes were proposed in this pull request?

There's a latent corner-case bug in PySpark UDF evaluation: executing a `BatchPythonEvaluation` that contains a single multi-argument UDF in which at least one argument value is repeated will crash at execution time with a confusing error.

This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row. The input row's schema does not necessarily match the UDF's inputs, because repeated arguments are de-duplicated in the JVM before the UDF inputs are sent to Python.
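To make the mismatch concrete, here is a minimal pure-Python sketch of the situation described above. It is a model, not Spark's actual code: `dedupe_args`, `fast_path`, and `general_path` are hypothetical names, and the de-duplication is only assumed to yield one input column per distinct argument plus a list of per-argument offsets.

```python
def dedupe_args(arg_exprs):
    """Model of the JVM side (assumption): keep each distinct argument once
    and record, for every UDF parameter, its offset into the reduced row."""
    unique, offsets = [], []
    for expr in arg_exprs:
        if expr not in unique:
            unique.append(expr)
        offsets.append(unique.index(expr))
    return unique, offsets


def fast_path(udf, input_row):
    # Buggy shortcut: assumes the row lines up one-to-one with the parameters.
    return udf(*input_row)


def general_path(udf, offsets, input_row):
    # Unpacks arguments through the offsets, so repeated arguments are restored.
    return udf(*[input_row[o] for o in offsets])


def add(x, y):
    return x + y


# add(col_a, col_a): two parameters, but only one distinct input column.
unique, offsets = dedupe_args(["col_a", "col_a"])
input_row = (1,)  # a single value for the single de-duplicated column

result = general_path(add, offsets, input_row)

try:
    fast_path(add, input_row)
    fast_path_error = None
except TypeError as exc:  # add() receives one argument instead of two
    fast_path_error = exc
```

In this model the shortcut fails with a `TypeError` about a missing positional argument, while the offset-based path returns the expected sum.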

The fix is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch also works for the single-UDF case, because Python treats `(x)` as equivalent to `x`, not as a single-element tuple.
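The parenthesization point can be checked with a small sketch. It assumes the "multiple UDFs" branch builds a mapper by joining one call expression per UDF inside a pair of parentheses. `build_mapper` and the `f0`/`a[...]` naming are illustrative, not a copy of the worker code.

```python
def build_mapper(udfs_with_offsets):
    """Build `lambda a: (f0(a[i], ...), f1(...), ...)` from (udf, offsets) pairs."""
    namespace = {}
    calls = []
    for i, (udf, offsets) in enumerate(udfs_with_offsets):
        namespace["f%d" % i] = udf
        args = ", ".join("a[%d]" % o for o in offsets)
        calls.append("f%d(%s)" % (i, args))
    source = "lambda a: (%s)" % ", ".join(calls)
    # The namespace serves as the lambda's globals, so f0, f1, ... resolve.
    return eval(source, namespace), source


def add(x, y):
    return x + y


def negate(x):
    return -x


# Single UDF with a repeated argument: the parentheses only group,
# so the mapper returns the bare result rather than a 1-tuple.
single, single_source = build_mapper([(add, [0, 0])])

# Two UDFs: the comma between the calls is what makes the result a tuple.
multi, multi_source = build_mapper([(add, [0, 0]), (negate, [0])])

single_result = single((1,))
multi_result = multi((1,))
```

Under these assumptions the single-UDF mapper source is `lambda a: (f0(a[0], a[0]))`, which yields a plain value, so no separate single-UDF branch is needed to get an unwrapped result.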

How was this patch tested?

Added a new regression test in the `pyspark.python.sql.tests` module (I confirmed that it fails before my fix).

SparkQA · 2017-05-09T22:58:45Z

Test build #76707 has finished for PR 17927 at commit 17e69b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

LGTM

gatorsmile · 2017-05-10T23:51:38Z

Thanks! Merging to master/2.2/2.1

Author: Josh Rosen

Closes #17927 from JoshRosen/SPARK-20685.

(cherry picked from commit 8ddbc43)

Signed-off-by: Xiao Li

JoshRosen added 2 commits May 9, 2017 14:53

Add (failing) regression test.

6912cb2

Fix SPARK-20685

17e69b5

gatorsmile approved these changes May 10, 2017


asfgit closed this in 8ddbc43 May 10, 2017

JoshRosen deleted the SPARK-20685 branch May 10, 2017 23:55
