Commit a511544
[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2.
## What changes were proposed in this pull request?
In Python 2, when `pandas_udf` tries to return string type value created in the udf with `".."`, the execution fails. E.g.,
```python
from pyspark.sql.functions import pandas_udf, col
import pandas as pd
df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()
```
raises the following exception:
```
...
java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
...
```
Seems like pyarrow ignores `type` parameter for `pa.Array.from_pandas()` and consider it as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2.
This pr adds a workaround for the case.
## How was this patch tested?
Added a test and existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes #20507 from ueshin/issues/SPARK-23334.
(cherry picked from commit 63c5bf1)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>1 parent 4493303 commit a511544
2 files changed
+13
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
230 | 230 | | |
231 | 231 | | |
232 | 232 | | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
233 | 237 | | |
234 | 238 | | |
235 | 239 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3896 | 3896 | | |
3897 | 3897 | | |
3898 | 3898 | | |
| 3899 | + | |
| 3900 | + | |
| 3901 | + | |
| 3902 | + | |
| 3903 | + | |
| 3904 | + | |
| 3905 | + | |
| 3906 | + | |
| 3907 | + | |
3899 | 3908 | | |
3900 | 3909 | | |
3901 | 3910 | | |
| |||
0 commit comments