[SPARK-23446][PYTHON] Explicitly check supported types in toPandas #20625
Conversation
Test build #87502 has finished for PR 20625 at commit.

Test build #87504 has finished for PR 20625 at commit.
```python
msg = (
    "Note: toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    "to disable this.")
```
Hmm, this says why it's trying Arrow and how to turn it off, but doesn't say why I have to turn it off. Perhaps say something like "pyarrow is not found" (if that is the cause and we know it)?
Oh, that should be part of the original message. For example, I don't have PyArrow in PyPy on my local machine; it shows the error like:

```
RuntimeError: PyArrow >= 0.8.0 must be installed; however, it was not found.
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```
LGTM. Thanks for the fast fix! We need to merge it into Spark 2.3.0 before RC4. Will merge it now. We can improve the fix later if anybody has better ideas. Thanks!

Merged to master/2.3. Happy Lunar New Year!
## What changes were proposed in this pull request?
This PR explicitly specifies and checks the types supported in `toPandas`. Previously there was a gap: for example, binary type support on the Python side is not finished yet, but the conversion was still allowed, as below:
```python
# Without Arrow, binary columns currently convert (as Python bytearrays)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()

# With Arrow enabled, the same conversion also goes through
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```
```
     _1
0  [97]
  _1
0  a
```
This should be disallowed. I think the same applies to nested timestamps too.

I also added a nicer note about `spark.sql.execution.arrow.enabled` to the error message.
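For illustration, here is a minimal standalone sketch of what such an explicit whitelist check could look like. The set of type names and the helper function are hypothetical simplifications, not Spark's actual internals:

```python
# Hypothetical whitelist of Arrow-convertible Spark SQL type names.
# Spark's real check operates on DataType objects; plain strings are
# used here only to keep the sketch self-contained.
SUPPORTED_ARROW_TYPES = {
    "boolean", "byte", "short", "integer", "long",
    "float", "double", "string", "date", "timestamp",
}

def check_supported_types(schema):
    """Raise TypeError if any field's type is not supported.

    `schema` is a list of (field_name, type_name) pairs.
    """
    for name, type_name in schema:
        if type_name not in SUPPORTED_ARROW_TYPES:
            raise TypeError(
                "Unsupported type in conversion to Pandas: "
                "column '%s' has type %s" % (name, type_name))

check_supported_types([("id", "long"), ("name", "string")])  # passes
try:
    check_supported_types([("payload", "binary")])  # now rejected up front
except TypeError as e:
    print(e)
```

Failing fast like this surfaces the unsupported column by name, instead of silently producing inconsistent results between the Arrow and non-Arrow paths.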
## How was this patch tested?
Manually tested and tests added in `python/pyspark/sql/tests.py`.
Author: hyukjinkwon <[email protected]>
Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.
(cherry picked from commit c5857e4)
Signed-off-by: gatorsmile <[email protected]>
BryanCutler left a comment
Thanks for doing this @HyukjinKwon! I just had a small question about the type of error raised; otherwise LGTM.
```python
        "Note: toPandas attempted Arrow optimization because "
        "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
        "to disable this.")
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
```
Should the same type of error be raised instead of RuntimeError?
Yup, please open a PR if you have a better idea.
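As a sketch of that idea, the original exception class could be preserved while appending the note, instead of always raising `RuntimeError`. The `with_arrow_note` helper below is hypothetical and simplified for illustration; it is not Spark's actual implementation:

```python
def with_arrow_note(func):
    """Run func(); on failure, re-raise the same exception class
    with the Arrow note appended to its message."""
    msg = ("Note: toPandas attempted Arrow optimization because "
           "'spark.sql.execution.arrow.enabled' is set to true. "
           "Please set it to false to disable this.")
    try:
        return func()
    except Exception as e:
        # Re-raise the original exception class so callers that catch
        # a specific type (e.g. ImportError) still work.
        raise type(e)("%s\n%s" % (e, msg))

def fail():
    # Simulates the missing-PyArrow failure path from the PR discussion.
    raise ImportError("PyArrow >= 0.8.0 must be installed; "
                      "however, it was not found.")

try:
    with_arrow_note(fail)
except ImportError as e:
    print(e)
```

One caveat with this approach: not every exception class accepts a single message argument in its constructor, so a production version would need to guard the re-construction or fall back to `RuntimeError`.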
This was my best attempt at as small and safe a fix as possible. Thanks for merging it @gatorsmile, sincerely. This was my last concern about PyArrow and Pandas. To be clear, I don't mind at all if anyone opens another PR with a better idea.
I think
Thank you @BryanCutler!