[SPARK-23446][PYTHON] Explicitly check supported types in toPandas #20625
Conversation
Test build #87502 has finished for PR 20625 at commit.

Test build #87504 has finished for PR 20625 at commit.
```python
msg = (
    "Note: toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    "to disable this.")
```
Hmm, this says why it's trying Arrow and how to turn it off, but doesn't say why I have to turn it off. Perhaps say something like "pyarrow is not found" (if that is the cause and we know it)?
Oh, that should be part of the original message. For example, I don't have PyArrow in PyPy on my local machine; it shows the error like:

```
RuntimeError: PyArrow >= 0.8.0 must be installed; however, it was not found.
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```
LGTM. Thanks for the fast fix! We need to merge it into Spark 2.3.0 before RC4. Will merge it now. We can improve the fix later if anybody has better ideas. Thanks!

Merged to master/2.3. Happy Lunar New Year!
## What changes were proposed in this pull request?
This PR explicitly specifies and checks the types supported in `toPandas`. Previously there was a gap: for example, binary type support on the Python side is not finished yet, but the conversion was still allowed, as below:
```python
# Without Arrow, binary columns currently convert (as Python bytearrays)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()

# With Arrow enabled, the same conversion also goes through
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```
```
     _1
0  [97]
  _1
0  a
```
This should be disallowed. I think the same applies to nested timestamps too.

I also added a nicer note about `spark.sql.execution.arrow.enabled` to the error message.
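For illustration, here is a minimal standalone sketch of what such an explicit whitelist check could look like. The set of type names and the helper function are hypothetical simplifications, not Spark's actual internals:

```python
# Hypothetical whitelist of Arrow-convertible Spark SQL type names.
# Spark's real check operates on DataType objects; plain strings are
# used here only to keep the sketch self-contained.
SUPPORTED_ARROW_TYPES = {
    "boolean", "byte", "short", "integer", "long",
    "float", "double", "string", "date", "timestamp",
}

def check_supported_types(schema):
    """Raise TypeError if any field's type is not supported.

    `schema` is a list of (field_name, type_name) pairs.
    """
    for name, type_name in schema:
        if type_name not in SUPPORTED_ARROW_TYPES:
            raise TypeError(
                "Unsupported type in conversion to Pandas: "
                "column '%s' has type %s" % (name, type_name))

check_supported_types([("id", "long"), ("name", "string")])  # passes
try:
    check_supported_types([("payload", "binary")])  # now rejected up front
except TypeError as e:
    print(e)
```

Failing fast like this surfaces the unsupported column by name, instead of silently producing inconsistent results between the Arrow and non-Arrow paths.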
## How was this patch tested?
Manually tested and tests added in `python/pyspark/sql/tests.py`.
Author: hyukjinkwon <[email protected]>
Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.
(cherry picked from commit c5857e4)
Signed-off-by: gatorsmile <[email protected]>
BryanCutler left a comment
Thanks for doing this @HyukjinKwon! I just had a small question about the type of error raised; otherwise LGTM.
```python
        "Note: toPandas attempted Arrow optimization because "
        "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
        "to disable this.")
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
```
Should the same type of error be raised instead of RuntimeError?
Yup, please open a PR if you have a better idea.
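As a sketch of that idea, the original exception class could be preserved while appending the note, instead of always raising `RuntimeError`. The `with_arrow_note` helper below is hypothetical and simplified for illustration; it is not Spark's actual implementation:

```python
def with_arrow_note(func):
    """Run func(); on failure, re-raise the same exception class
    with the Arrow note appended to its message."""
    msg = ("Note: toPandas attempted Arrow optimization because "
           "'spark.sql.execution.arrow.enabled' is set to true. "
           "Please set it to false to disable this.")
    try:
        return func()
    except Exception as e:
        # Re-raise the original exception class so callers that catch
        # a specific type (e.g. ImportError) still work.
        raise type(e)("%s\n%s" % (e, msg))

def fail():
    # Simulates the missing-PyArrow failure path from the PR discussion.
    raise ImportError("PyArrow >= 0.8.0 must be installed; "
                      "however, it was not found.")

try:
    with_arrow_note(fail)
except ImportError as e:
    print(e)
```

One caveat with this approach: not every exception class accepts a single message argument in its constructor, so a production version would need to guard the re-construction or fall back to `RuntimeError`.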
This was my best attempt at as small and safe a fix as possible. Thanks for merging it @gatorsmile, sincerely. This was my last concern about PyArrow and Pandas. To be clear, I don't mind at all if anyone opens another PR with a better idea.
I think
Thank you @BryanCutler!