[SPARK-23380][PYTHON] Make toPandas fallback to non-Arrow optimization if possible #20567
Conversation
```python
dtype = {}
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
```
The actual diff here is just the `else:`. It was removed, which fixes the indentation.
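Roughly, the change being described is removing the `else:` and dedenting the non-Arrow branch so it becomes the shared fallback path - a simplified sketch, not the exact diff (`use_arrow` is just a stand-in for the conf check):

```python
# Before: the non-Arrow conversion only ran when Arrow was disabled.
if use_arrow:
    ...  # Arrow-based conversion
else:
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

# After: the Arrow branch returns early on success, so the else: is dropped
# and the dedented non-Arrow conversion doubles as the fallback path.
if use_arrow:
    ...  # Arrow-based conversion; returns on success
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
```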
python/pyspark/sql/dataframe.py
Outdated
```python
timezone = None

if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
    should_fall_back = False
```
Here is the main change.
|
Seems it happened to fix this case too:

```python
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```

(Before/After output screenshots.)
|
cc @ueshin, @BryanCutler and @icexelloss, could you take a look please when you have some time? |
|
Test build #87284 has finished for PR 20567 at commit
|
```python
    # Check if its schema is convertible in Arrow format.
    to_arrow_schema(self.schema)
except Exception as e:
    # Fallback to convert to Pandas DataFrame without arrow if raise some exception
```
Does this PR fall back to the original path if any exception occurs? E.g. when an ImportError happens, whereas the current code throws an exception with a message?
Would it be good to note this change in the description, too?
Yup. It does fall back for an unsupported schema, a PyArrow version mismatch and missing PyArrow. Will add a note in the PR description.
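For instance, a sketch of the unsupported-schema case (assuming `MapType` is one of the types the Arrow conversion rejects at this point, so `to_arrow_schema` raises and the non-Arrow path is taken instead):

```python
from pyspark.sql.types import StructType, StructField, MapType, StringType

# Assumed to be an Arrow-unsupported schema at the time of this PR, so with
# the fallback in place toPandas() should use the plain collect() path
# instead of raising.
schema = StructType([StructField("m", MapType(StringType(), StringType()))])
df = spark.createDataFrame([({"a": "b"},)], schema=schema)

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = df.toPandas()  # falls back to the non-Arrow conversion
```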
|
Since this PR is not a bug fix, we will not merge it into 2.3. How about submitting another PR to throw a better error message in the to-be-released 2.3?
|
The #20567 (comment) case is actually closer to a bug, as the outputs with and without Arrow are different and inconsistent. The problem is that we already allow the inconsistent conversion in `createDataFrame`. In addition, I believe it is good to match the behaviour between `createDataFrame` and `toPandas`.

The change is kind of safe. The actual change is basically from:

```python
if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    # return the one with Arrow
else:
    # return the one without Arrow
```

to:

```python
if # 'spark.sql.execution.arrow.enabled' true?
    should_fall_back = False
    try:
        require_minimum_pyarrow_version()
        to_arrow_schema(self.schema)
    except Exception as e:
        should_fall_back = True
    if not should_fall_back:
        # return the one with Arrow
# return the one without Arrow
```

The error message already looks okay for now. If you feel strongly about this, I am fine with going ahead with this only into master.
ueshin
left a comment
I'm wondering whether we can do "return the one with Arrow" in the try block? I mean:

```python
if # 'spark.sql.execution.arrow.enabled' true?
    try:
        require_minimum_pyarrow_version()
        # return the one with Arrow
    except Exception as e:
        # warn
        # return the one without Arrow
```

```python
else:
    import unittest

from pyspark.util import _exception_message
```
nit: add an empty line between this import and _pandas_requirement_message line.
|
@ueshin, yup, I initially thought so too, but realised that it might collect twice.
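A rough illustration of the concern (a hypothetical fragment of the `toPandas` body, not the code in this PR): if the Arrow collect itself sits inside the try block, a failure during or after it would make the fallback fetch the same data from the cluster a second time.

```python
try:
    require_minimum_pyarrow_version()
    tables = self._collectAsArrow()   # first round trip to the executors
    # ... convert the Arrow tables to pandas; an error raised here ...
except Exception:
    # ... lands in the fallback, which collects everything again.
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
```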
|
Test build #87324 has finished for PR 20567 at commit
|
|
retest this please |
|
Test build #87327 has finished for PR 20567 at commit
|
|
LGTM. I'd leave it to @HyukjinKwon and @gatorsmile whether we should merge this into branch-2.3 or not. |
|
retest this please |
|
Test build #87332 has finished for PR 20567 at commit
|
|
Sorry I am late to the party. #20567 (comment) does look like a bug to me. However, I am a bit concerned that such magic behavior would not be ideal for some users. At least among Python users at Two Sigma, most of them would prefer a "fail fast" exception rather than a fallback to the non-Arrow path, because the non-Arrow path can often take a long time to complete, or worse, "fail slow". Implementing this behavior could be problematic for users that transfer non-trivial amounts of data from Spark to Pandas.
|
This is kind of like what we did for whole-stage codegen; we have a conf there to control whether it falls back.
|
Regarding the error message, this is a good example of how to provide a user-friendly message. Most external end users do not care about the internal implementation. They might not be aware that Apache Arrow is being used; they might not even know what Apache Arrow is. The conf might be set by the system admin or others. Thus, this error message is confusing to them. Ideally, we could let users know how to bypass the issue, for example by telling them to disable the conf `spark.sql.execution.arrow.enabled`.
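For example, the message could name the conf and how to bypass the issue. An illustrative sketch of what such a message might look like inside the except handler (`e` being the caught exception; this is not wording from any actual PR):

```python
raise Exception(
    "toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true; however, it "
    "failed with: %s\n"
    "You can disable 'spark.sql.execution.arrow.enabled' to fall back to "
    "the non-optimized conversion." % str(e))
```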
|
A quick nit: "fallback" is a single word.
BryanCutler
left a comment
I agree that the behavior should match `createDataFrame` and also fall back, but a big +1 on adding a conf to allow disabling the fallback. I can see how some users might want this, and it would make development easier too: if something Arrow-related is failing, tests won't pass just because of the fallback.
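A minimal sketch of how such a conf could gate the fallback inside `toPandas` (the conf name `spark.sql.execution.arrow.fallback.enabled` and the surrounding structure are placeholders here, not anything merged):

```python
import warnings

fallback_enabled = self.sql_ctx.getConf(
    "spark.sql.execution.arrow.fallback.enabled", "true").lower() == "true"
try:
    require_minimum_pyarrow_version()
    to_arrow_schema(self.schema)  # is the schema Arrow-convertible?
    use_arrow = True
except Exception as e:
    if not fallback_enabled:
        raise  # fail fast when the user opted out of falling back
    warnings.warn("Falling back to non-Arrow conversion: %s" % str(e))
    use_arrow = False
```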
```python
from pyspark.sql.types import _check_dataframe_convert_date, \
    _check_dataframe_localize_timestamps

tables = self._collectAsArrow()
```
shouldn't this be in the try block?
Please see #20567 (comment). @ueshin raised a similar concern.
I see, we don't want to collect twice, and you manually run a schema conversion to decide whether to fall back in that case. I think there still might be some cases where the Arrow path could fail, like maybe incompatible Arrow versions (e.g. using a possible future version of PyArrow with Java still at 0.8), but this should cover the most common cases, so it seems fine to me.
|
Yup, I also agree with adding a configuration to control this. I will work on it for master only later.

For #20567 (comment), yup, I agree with that, but to do this we would do something like:

```python
if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    try:
        to_arrow_schema(self.schema)
        # return the one with Arrow
    except Exception as e:
        raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...")
else:
    # return the one without Arrow
```

The diff and complexity are pretty similar to the fallback one:

```python
if # 'spark.sql.execution.arrow.enabled' true?
    should_fall_back = False
    try:
        require_minimum_pyarrow_version()
        to_arrow_schema(self.schema)
    except Exception as e:
        should_fall_back = True
    if not should_fall_back:
        # return the one with Arrow
# return the one without Arrow
```

Note that, in the case of `createDataFrame`, we already fall back. I have been thinking this feature is in transition and am trying to fix and match the behaviour first before the release.
|
I mean, I get that a nicer error message is useful of course, but wouldn't it be better to match the behaviour between `createDataFrame` and `toPandas`?
|
My proposal is to merge the fix after the 2.3 release. We can still backport it to SPARK 2.3, but it will be available in SPARK 2.3.1. |
python/pyspark/sql/dataframe.py
Outdated
```python
    to_arrow_schema(self.schema)
except Exception as e:
    # Fallback to convert to Pandas DataFrame without arrow if raise some exception
    should_fall_back = True
```
nit: should_fall_back -> should_fallback other places below too
Yup.
Mind if I ask you to elaborate why? I want to know why this one should be specially excluded from 2.3.0 alone although it can be backported to branch-2.3. I thought it's good to add it into 2.3.0 because it is kind of safe, fixes an actual bug and matches the behaviour with `createDataFrame`.
|
This issue does not cause a regression, since the Arrow optimization is a new feature in 2.3.
It doesn't block the release, but we can still backport it because it fixes an actual bug with a minimal change, whether 2.3.0 is released or not.
I thought this is another step. We need to make them consistent first.
Is there any specific worry about this change that might shake the 2.3.0 release specifically? In this way, we can't backport anything. I am surprised that this PR is considered to be excluded specifically from 2.3.0.
|
The feedback is partially from @rxin. Maybe he can provide more input later.
|
Test build #87355 has finished for PR 20567 at commit
|
Based on the comments from @icexelloss, I do not think we should blindly switch back to the original version. At least, provide an option to the end users.
Yeah. This PR is not ready to merge yet. |
|
^ I am not saying that we should merge it now. I can also do the opposite, if that's preferred.
|
RC3 is out. Just to avoid new regressions that might be introduced in the new PR. |
|
RC3 is out. This change could be in 2.3.1 if the vote passes, or in 2.3.0 if the vote fails. It sounds like we can't backport or change anything in the main code until the 2.3.0 release, for the reason above. So, you are worried about delaying the release further because it has already been delayed quite a bit? I understand this, but I would like to ask to get this in (whether it throws an exception for both `createDataFrame` and `toPandas`, or falls back for both).

@rxin, do you think we should take this out of 2.3.0 too? Was this your opinion (#20567 (comment))?
|
@gatorsmile and @rxin, the problem here is that `toPandas` and `createDataFrame` currently behave inconsistently when Arrow is enabled. This is the last one left (for now) about PySpark/Pandas interoperability which I found while testing, and I was thinking about targeting 2.3.0. So, for clarification, would you be uncomfortable with one of:

1. falling back to the non-Arrow path (what this PR currently does),
2. throwing a better error message, or
3. adding a configuration to control the fallback,

to target 2.3.0 (or 2.3.1 if the vote passes)? FYI, the current one in this PR is 1. If so, let me have two PRs, one for the error message for now to target 2.3.0 (or 2.3.1 if the vote passes), and one for adding a configuration to control the fallback to target master (and maybe 2.3.1). Does that make sense to both of you? cc @cloud-fan too.
|
We are unable to include option 3 in Spark 2.3.0. It is too big to merge at the current stage. We can still do it in 2.3.1. If needed, I am fine with throwing a better error message if the PR size is very small; otherwise, keep it unchanged in 2.3.0. Also cc @liancheng @yhuai.
|
Just FYI, except for option 3, the complexity of the other options and the PR size will all be similar - #20567 (comment) and #20567 (comment).
|
Then, let us wait for the release of Spark 2.3.0. Thanks! |
|
I mean the actual change here is small. The diff may look larger here because of the removed `else:` and the resulting re-indentation.
|
Regarding the behavior inconsistency between `createDataFrame` and `toPandas`: at the current stage, we are unable to merge the fix for these new features into the Spark 2.3 branch. Let us wait for the release of Spark 2.3.0.
|
There is one more thing - #20567 (comment). We haven't completed binary type support on the Python side yet, but there is a hole here.
|
What is the root cause? Do we have a trivial fix to resolve/block it? |
|
The root cause is that the Arrow conversion on the Python side interprets binaries differently from the non-Arrow path. This is the most trivial fix. I made the fix as safe and small as possible here. I can fix only the error message, but the size of the change and the diff would be virtually the same - #20567 (comment).
|
The binary type bug sounds like a blocker; can we just fix it surgically by checking the supported data types before going into the Arrow optimization path? For now we can stick with what the current behavior is, i.e. throw an exception. The inconsistent behavior between `createDataFrame` and `toPandas` can be fixed later.
That's basically (#20567 (comment)):

```python
if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    try:
        to_arrow_schema(self.schema)
        # return the one with Arrow
    except Exception as e:
        raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...")
else:
    # return the one without Arrow
```

because `to_arrow_schema(self.schema)` already checks whether the schema is convertible to Arrow format.
|
^ this change LGTM. Can we make a PR for this change only and leave the fallback part for Spark 2.4? |
|
Sure. |
```python
    require_minimum_pyarrow_version()
    # Check if its schema is convertible in Arrow format.
    to_arrow_schema(self.schema)
except Exception as e:
```
Do we want to catch more specific exceptions here? i.e. TypeError and ImportError?
Hm, it might depend on which message we want to show. Will open another PR as discussed above.
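If we did want to narrow the catch, a sketch could look like the following, assuming `require_minimum_pyarrow_version` raises `ImportError` when PyArrow is missing or too old and `to_arrow_schema` raises `TypeError` for unsupported types:

```python
try:
    require_minimum_pyarrow_version()
    # Check if its schema is convertible in Arrow format.
    to_arrow_schema(self.schema)
except (TypeError, ImportError) as e:
    # Unsupported schema (TypeError) or missing/old PyArrow (ImportError):
    # fall back to the non-Arrow conversion instead of failing.
    should_fallback = True
```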
|
@HyukjinKwon Will you submit a fix for the binary type today? We are very close to RC4. This is kind of urgent if we still want to block it in the Spark 2.3.0 release. |
|
Yup, I will. Sorry for the delay. I was trying to make the fix as small as possible. Let me just open it in the simplest way.
|
I just opened #20625. I believe this is the smallest and simplest change. I will turn this PR into the one adding a configuration later for 2.4, as discussed.
|
Thanks! Happy Lunar New Year! |
|
I just opened another PR for adding a configuration - #20678. Let me close this one. |
What changes were proposed in this pull request?
This PR proposes to fall back to the non-Arrow optimization if possible - for an unsupported schema, a PyArrow version mismatch, or missing PyArrow.
For example, see the unsupported schema case below:
Before
After
Note that, in the case of `createDataFrame`, we already fall back to make this at least work even though the optimisation is disabled.

How was this patch tested?
Manually tested and unit tests were added in
python/pyspark/sql/tests.py.