@HyukjinKwon (Member) commented Feb 16, 2018

What changes were proposed in this pull request?

This PR explicitly specifies and checks the types we support in toPandas. This was a hole: for example, we haven't finished binary type support on the Python side yet, but the Arrow path currently allows it, as below:

```python
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```

```
     _1
0  [97]
  _1
0  a
```

This should be disallowed. I think the same applies to nested timestamps.

I also added a nicer note about `spark.sql.execution.arrow.enabled` to the error message.
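The fix described above amounts to an explicit whitelist of convertible types. As a rough sketch (the names below are illustrative, not Spark's actual internals), the check might look like:

```python
# Hypothetical sketch of an explicit type whitelist for the Arrow-based
# toPandas path; names here are illustrative, not Spark's real internals.

SUPPORTED_ARROW_TYPES = {
    "byte", "short", "integer", "long", "float", "double",
    "boolean", "string", "date", "timestamp",
}

def check_arrow_supported(type_names):
    """Raise early if any column type is outside the supported set."""
    for name in type_names:
        if name not in SUPPORTED_ARROW_TYPES:
            raise TypeError("Unsupported type in conversion to Arrow: %s" % name)

check_arrow_supported(["long", "string"])  # supported types pass silently
```

With a check like this, a binary column would fail fast with a clear error instead of silently producing wrong values.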

How was this patch tested?

Manually tested and tests added in `python/pyspark/sql/tests.py`.

@SparkQA commented Feb 16, 2018

Test build #87502 has finished for PR 20625 at commit c79c6df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 16, 2018

Test build #87504 has finished for PR 20625 at commit 4e5708c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```python
msg = (
    "Note: toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    "to disable this.")
```
Member:

Hmm, this says why it's trying Arrow and how to turn it off, but doesn't say why I have to turn it off. Perhaps say something like "PyArrow is not found" (if that is the cause and we know it)?

Member Author:

Oh, that is already part of the original message. For example, I don't have PyArrow installed for PyPy locally; it shows an error like:

```
RuntimeError: PyArrow >= 0.8.0 must be installed; however, it was not found.
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```
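The pattern behind that output can be sketched as follows; this is a minimal illustration of appending the note to whatever error the Arrow path raised (the wrapper's name is made up, not the exact Spark code):

```python
# Minimal sketch (assumed names, not the exact Spark code): wrap the
# Arrow conversion, and on failure re-raise with a note explaining why
# Arrow was attempted and how to disable it.

ARROW_NOTE = (
    "Note: toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true. "
    "Please set it to false to disable this.")

def run_with_arrow_note(arrow_convert):
    try:
        return arrow_convert()
    except Exception as e:
        # Keep the original cause (e.g. "PyArrow ... was not found")
        # and append the note about the config flag.
        raise RuntimeError("%s\n%s" % (str(e), ARROW_NOTE))
```

This way the user sees both the underlying cause and the config knob in one message.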

@gatorsmile (Member) commented Feb 16, 2018

LGTM

Thanks for the fast fix! We need to merge it into Spark 2.3.0 before RC4. Will merge it now. We can improve the fix later if anybody has better ideas.

Thanks! Merged to master/2.3

Happy Lunar New Year!

asfgit pushed a commit that referenced this pull request Feb 16, 2018
Author: hyukjinkwon <[email protected]>

Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.

(cherry picked from commit c5857e4)
Signed-off-by: gatorsmile <[email protected]>
@asfgit asfgit closed this in c5857e4 Feb 16, 2018
@BryanCutler (Member) left a comment

Thanks for doing this @HyukjinKwon ! I just had a small question about the type of error raised, otherwise LGTM

```python
msg = (
    "Note: toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
    "to disable this.")
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
```
Member:

Should the same type of error be raised instead of RuntimeError?

Member Author:

Yup, please open a PR if you have a better idea.
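The alternative being discussed could be sketched like this, with made-up helper names (this is not what the PR does; the merged change always raises RuntimeError):

```python
# Hypothetical sketch of re-raising the *same* exception type with the
# note appended, instead of always wrapping in RuntimeError.

def with_note(e, note):
    """Build a new exception of the same type as `e`, with `note` appended."""
    try:
        return type(e)("%s\n%s" % (str(e), note))
    except TypeError:
        # Some exception types need different constructor arguments;
        # fall back to RuntimeError for those.
        return RuntimeError("%s\n%s" % (str(e), note))

err = with_note(ImportError("PyArrow >= 0.8.0 must be installed"),
                "Note: set 'spark.sql.execution.arrow.enabled' to false.")
```

Preserving the original type lets callers keep catching, say, `ImportError` specifically, at the cost of the constructor-compatibility edge case handled above.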

@HyukjinKwon (Member Author):

This was my best attempt at as small and safe a fix as I could make. Thanks sincerely for merging it, @gatorsmile. This was my last concern about PyArrow and Pandas.

To be clear, I don't mind at all if anyone opens another PR with a better idea.

@BryanCutler (Member):

I think RuntimeError is fine for now and we can improve this later with logic to fallback too - best not to try and get too clever so close to the release :) Thanks for catching this and the quick fix @HyukjinKwon !

@HyukjinKwon (Member Author):

Thank you @BryanCutler!

@HyukjinKwon HyukjinKwon deleted the pandas_convertion_supported_type branch October 16, 2018 12:45