
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Feb 10, 2018

What changes were proposed in this pull request?

This PR proposes to fall back to the non-Arrow optimization if possible - for an unsupported schema, a PyArrow version mismatch, or missing PyArrow.

For example, see the unsupported schema case below:

df = spark.createDataFrame([[{'a': 1}]])

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()

Before

...
py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
...
java.lang.UnsupportedOperationException: Unsupported data type: map<string,bigint>

After

...
          _1
0  {u'a': 1}

... UserWarning: Arrow will not be used in toPandas: Unsupported type in conversion to Arrow: MapType(StringType,LongType,true)
...
          _1
0  {u'a': 1}

Note that, in the case of createDataFrame, we already fall back so that this at least works, even though the optimization is disabled:

df = spark.createDataFrame([[{'a': 1}]])
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
spark.createDataFrame(pdf).show()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.createDataFrame(pdf).show()
...
... UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type ...
+--------+
|      _1|
+--------+
|[a -> 1]|
+--------+

How was this patch tested?

Manually tested and unit tests were added in python/pyspark/sql/tests.py.
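
For illustration, a test along these lines (a sketch only, not the exact test added to tests.py; it assumes a ReusedSQLTestCase-style fixture exposing self.spark) checks that toPandas falls back with a warning when the schema is not supported by Arrow:

import warnings

def test_toPandas_fallback_on_unsupported_schema(self):
    self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    df = self.spark.createDataFrame([({u'a': 1},)], ["m"])  # MapType column, unsupported by Arrow here
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        pdf = df.toPandas()  # should fall back to the non-Arrow path instead of failing
    self.assertTrue(any("Arrow will not be used" in str(w.message) for w in caught))
    self.assertEqual(pdf["m"][0], {u'a': 1})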

pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

dtype = {}
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
Member Author

The actual diff here is just the removed else:, which fixes the indentation.

timezone = None

if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
should_fall_back = False
Member Author

Here is the main change.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 10, 2018

Seems it happened to fix this case too:

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()

Before

     _1
0  [97]
  _1
0  a

After

     _1
0  [97]
... UserWarning...
     _1
0  [97]

@HyukjinKwon
Member Author

cc @ueshin, @BryanCutler and @icexelloss, could you take a look please when you have some time?

@SparkQA

SparkQA commented Feb 10, 2018

Test build #87284 has finished for PR 20567 at commit d87547c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Check if its schema is convertible in Arrow format.
to_arrow_schema(self.schema)
except Exception as e:
# Fallback to convert to Pandas DataFrame without arrow if raise some exception
Member

@kiszk kiszk Feb 11, 2018

Does this PR fall back to the original path if any exception occurs, e.g. when an ImportError happens, whereas the current code throws an exception with a message?
Would it be good to note this change in the description, too?

Member Author

Yup. It does fall back for an unsupported schema, a PyArrow version mismatch, and missing PyArrow. Will add a note in the PR description.

@HyukjinKwon HyukjinKwon changed the title [SPARK-23380][PYTHON] Make toPandas fall back to Arrow optimization disabled when schema is not supported in Arrow optimization [SPARK-23380][PYTHON] Make toPandas fall back to non-Arrow optimization if possible Feb 11, 2018
@gatorsmile
Member

Since this PR is not a bug fix, we will not merge it into 2.3. How about submitting another PR to throw a better error message in the to-be-released 2.3?

@HyukjinKwon
Member Author

The #20567 (comment) case is actually closer to a bug, as the outputs with and without Arrow are different and inconsistent. The problem is that we already allow an inconsistent conversion for BinaryType, which we don't allow in other paths like createDataFrame and pandas_udf.

In addition, I believe it is good to match the behaviour between toPandas and createDataFrame with a Pandas DataFrame as input in 2.3.0.

The change is kind of safe. The actual change is basically:

from

if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    # return the one with Arrow
else:
    # return the one without Arrow

to

if # 'spark.sql.execution.arrow.enabled' true?
    should_fall_back = False
    try:
        require_minimum_pyarrow_version()
        to_arrow_schema(self.schema)
    except Exception as e:
        should_fall_back = True

    if not should_fall_back:
        # return the one with Arrow
# return the one without Arrow
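
(For concreteness, a minimal sketch of the latter shape as a standalone helper; the helper import paths assume the Spark 2.3 code layout and the Arrow collection itself is elided, so this is illustrative rather than the exact patch:)

import warnings
import pandas as pd

def _collect_as_arrow_and_convert(df):
    # Placeholder for the Arrow path (internally df._collectAsArrow()); elided in this sketch.
    raise NotImplementedError

def to_pandas_with_fallback(df):
    if df.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
        should_fall_back = False
        try:
            from pyspark.sql.utils import require_minimum_pyarrow_version
            from pyspark.sql.types import to_arrow_schema
            require_minimum_pyarrow_version()
            to_arrow_schema(df.schema)  # raises for unsupported types such as MapType
        except Exception as e:
            warnings.warn("Arrow will not be used in toPandas: %s" % e)
            should_fall_back = True
        if not should_fall_back:
            return _collect_as_arrow_and_convert(df)
    # Non-Arrow path, also used as the fallback.
    return pd.DataFrame.from_records(df.collect(), columns=df.columns)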

The error message already looks okay for now. If you feel strongly about this, I am fine with going ahead and merging this only into master.

Member

@ueshin ueshin left a comment

I'm wondering whether we can do return the one with Arrow in the try block? I mean:

if # 'spark.sql.execution.arrow.enabled' true?
    try:
        require_minimum_pyarrow_version()
        # return the one with Arrow
    except Exception as e:
        # warn

# return the one without Arrow

else:
import unittest

from pyspark.util import _exception_message
Member

nit: add an empty line between this import and _pandas_requirement_message line.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 12, 2018

@ueshin, yup, I initially thought so too, but realised that it might collect twice (_collectAsArrow, then collect) and trigger two jobs when one fails at execution time. Also, it seems it could catch arbitrary exceptions (not from the Arrow conversion itself) at execution time.

For createDataFrame, I thought we were fine because at least it won't trigger multiple jobs.
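
(A small illustration of the concern, assuming df is a PySpark DataFrame; this is not the PR's code:)

import pandas as pd

try:
    tables = df._collectAsArrow()   # Spark job #1, the Arrow path
except Exception:
    # Falling back here would run a second full job over the same data,
    # and would also swallow execution-time errors unrelated to the Arrow conversion.
    pdf = pd.DataFrame.from_records(df.collect(), columns=df.columns)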

@SparkQA

SparkQA commented Feb 12, 2018

Test build #87324 has finished for PR 20567 at commit f46540e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 12, 2018

Test build #87327 has finished for PR 20567 at commit f46540e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Feb 12, 2018

LGTM. I'd leave it to @HyukjinKwon and @gatorsmile whether we should merge this into branch-2.3 or not.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Feb 12, 2018

Test build #87332 has finished for PR 20567 at commit f46540e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss
Contributor

icexelloss commented Feb 12, 2018

Sorry I am late to the party. #20567 (comment) does look like a bug to me. However, I am a bit concerned that such magic behavior would not be ideal for some users. At least among the Python users at Two Sigma, most would prefer a "fail fast" exception rather than falling back to the non-Arrow path, because the non-Arrow path can often take a long time to complete, or worse, "fail slow". Implementing this behavior could be problematic for users that transfer non-trivial amounts of data from Spark to Pandas.

@gatorsmile
Member

This is kind of like what we did for whole-stage codegen. We have a conf like spark.sql.codegen.fallback to decide whether we should fail fast or go back to the slow path. I would suggest introducing a similar conf.
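
(A sketch of what such a switch could look like; the decision logic is the point here, and check_arrow stands in for the version and schema checks discussed above:)

import warnings

def choose_conversion_path(arrow_enabled, fallback_enabled, check_arrow):
    # check_arrow is a callable, e.g. require_minimum_pyarrow_version() plus the schema check.
    if not arrow_enabled:
        return "non-arrow"
    try:
        check_arrow()
    except Exception as e:
        if not fallback_enabled:
            raise  # fail fast, as some users prefer
        warnings.warn("Arrow will not be used: %s" % e)
        return "non-arrow"
    return "arrow"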

@gatorsmile
Member

gatorsmile commented Feb 12, 2018

Regarding the error message, this is a good example of why we should provide a user-friendly message. Most external end users do not care about the internal implementation. They might not be aware that Apache Arrow is being used. They might not even know what Apache Arrow is. The conf might be set by the system admin or others. Thus, this error message is confusing to them.

...
py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
...
java.lang.UnsupportedOperationException: Unsupported data type: map<string,bigint>

Ideally, we could let users know how to work around the issue - for example, by disabling the conf spark.sql.execution.arrow.enabled.
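
(For example, a message along these lines - the wording is hypothetical, not what was eventually merged - names the switch instead of the internals:)

raise Exception(
    "toPandas attempted Arrow optimization because "
    "'spark.sql.execution.arrow.enabled' is set to true, but it could not be applied: "
    "Unsupported type in conversion to Arrow: MapType(StringType,LongType,true). "
    "Set 'spark.sql.execution.arrow.enabled' to false to use the non-Arrow conversion.")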

@rxin
Contributor

rxin commented Feb 12, 2018

A quick bit: fallback is a single word.

Member

@BryanCutler BryanCutler left a comment

I agree that the behavior should match createDataFrame and also fall back, but a big +1 on adding a conf to allow disabling the fallback. I can see how some users might want this, and it would make development easier too: if something Arrow-related is failing, it won't keep passing tests just because it fell back.

from pyspark.sql.types import _check_dataframe_convert_date, \
_check_dataframe_localize_timestamps

tables = self._collectAsArrow()
Member

shouldn't this be in the try block?

Member Author

Please see #20567 (comment). @ueshin raised a similar concern.

Member

I see, we don't want to collect twice, and you manually run a schema conversion to decide whether to fall back in that case. I think there still might be some cases where the Arrow path could fail, like maybe incompatible Arrow versions (e.g. using a possible future version of pyarrow with Java still at 0.8), but this should cover the most common cases, so it seems fine to me.

@HyukjinKwon
Member Author

Yup, I also agree with adding a configuration to control this. I will work on it for master only later.

For #20567 (comment), yup. I agree with that but to do this, we should do something like:

if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    try:
        to_arrow_schema(self.schema)
        # return the one with Arrow
    except Exception as e:
        raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...")
else:
    # return the one without Arrow

the diff and complexity are pretty similar to the fallback one:

if # 'spark.sql.execution.arrow.enabled' true?
    should_fall_back = False
    try:
        require_minimum_pyarrow_version()
        to_arrow_schema(self.schema)
    except Exception as e:
        should_fall_back = True

    if not should_fall_back:
        # return the one with Arrow
# return the one without Arrow

Note that, in the case of spark.sql.codegen.fallback, it's true by default, if I didn't misunderstand. Also, we can match the behaviour with createDataFrame with Pandas input for now in the latter way.

My thinking has been that this feature is in transition, and I am trying to fix and match the behaviour first before the release.

@HyukjinKwon
Member Author

I mean, I get that a nicer error message is useful, of course, but wouldn't it be better to match the behaviour between toPandas and createDataFrame before 2.3.0 if the complexity looks similar?

@HyukjinKwon HyukjinKwon changed the title [SPARK-23380][PYTHON] Make toPandas fall back to non-Arrow optimization if possible [SPARK-23380][PYTHON] Make toPandas fallback to non-Arrow optimization if possible Feb 12, 2018
@gatorsmile
Member

My proposal is to merge the fix after the 2.3 release. We can still backport it to SPARK 2.3, but it will be available in SPARK 2.3.1.

to_arrow_schema(self.schema)
except Exception as e:
# Fallback to convert to Pandas DataFrame without arrow if raise some exception
should_fall_back = True
Member

nit: should_fall_back -> should_fallback other places below too

Member Author

Yup.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 12, 2018

My proposal is to merge the fix after the 2.3 release. We can still backport it to SPARK 2.3, but it will be available in SPARK 2.3.1.

Mind if I ask you to elaborate why? I want to know why this one should be specifically excluded from 2.3.0 alone, although it can be backported to branch-2.3.

I thought it would be good to add it into 2.3.0 because this is kind of safe, fixes an actual bug, matches the behaviour with createDataFrame too, and it's a new feature in 2.3.0.

@gatorsmile
Member

gatorsmile commented Feb 12, 2018

This issue does not cause a regression since spark.sql.execution.arrow.enabled is off by default. We need to make it configurable before merging it. Merging this into 2.3.0 might cause a regression and impact the release date of Spark 2.3. Thus, we would suggest delaying merging it until 2.3.0 is out.

@HyukjinKwon
Member Author

This issue does not cause a regression since spark.sql.execution.arrow.enabled is off by default.

It doesn't block the release, but we can still backport it because it fixes an actual bug with a minimal change, whether 2.3.0 is released or not.

We need to make it configurable before merging it

I thought this is another step. We need to make them consistent first.

Merging this into 2.3.0 might cause a regression and impact the release date of Spark 2.3

Is there any specific worry from this change that might shake the 2.3.0 release specifically? In this way, we can't backport anything. I am surprised that this PR is being considered for exclusion specifically from 2.3.0.

@gatorsmile
Member

The feedback is partially from @rxin Maybe he can provide more inputs later.

@SparkQA

SparkQA commented Feb 12, 2018

Test build #87355 has finished for PR 20567 at commit 42dec46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

I thought this is another step. We need to make them consistent first.

Based on the comments from @icexelloss , I do not think we should blindly switch back to the original version. At least, provide an option to the end users.

@gatorsmile
Member

Is there any specific worry from this change that might shake the 2.3.0 release specifically? In this way, we can't backport anything. I am surprised that this PR is being considered for exclusion specifically from 2.3.0.

Yeah. This PR is not ready to merge yet.

@HyukjinKwon
Member Author

^ I am not saying that we should merge it now. I can do the opposite for createDataFrame given #20567 (comment). My point is why it should be excluded from 2.3.0 specifically while this can be considered as a backport - #20567 (comment)

@gatorsmile
Member

RC3 is out. Just to avoid new regressions that might be introduced in the new PR.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 13, 2018

RC3 is out. This change could be in 2.3.1 if the vote passes, or in 2.3.0 if the vote fails. It sounds like we can't backport or change anything in the main code until the 2.3.0 release, for the reason above.

So, you are worried about delaying the release further because it has already been delayed quite a bit? I understand that, but I would like to ask to get this in (whether it throws an exception for both toPandas and createDataFrame, or falls back for both) so that the new feature goes out with consistent behaviour, if this can be considered for branch-2.3.

@rxin, do you think we should take this out of 2.3.0 too? Was this your opinion (#20567 (comment))?

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 14, 2018

@gatorsmile and @rxin,

The problem here is that, with Arrow, toPandas just fails later on unsupported types and allows an inconsistent conversion for BinaryType (#20567 (comment)), whereas createDataFrame falls back in both cases.

This is the last issue left (for now) regarding PySpark/Pandas interoperability that I found while testing, and I was thinking of targeting 2.3.0.

So, for clarification, would you be uncomfortable with one of:

  1. matching both toPandas and createDataFrame to fallback with a warning
  2. matching both toPandas and createDataFrame to throw an exception
  3. adding a configuration to control the fallback for both

to target 2.3.0 (or 2.3.1 if the vote passes)? FYI, the current one in this PR is 1.

If so, let me have two PRs, one for the error message for now to target 2.3.0 (or 2.3.1 if the vote passes), and one for adding a configuration to control the fallback to target master (and maybe 2.3.1).

Does that make sense to both of you?

cc @cloud-fan too.

@gatorsmile
Member

We are unable to include option 3 in Spark 2.3.0. It is too big to merge at the current stage. We can still do it in 2.3.1.

If needed, I am fine with throwing a better error message if the PR size is very small; otherwise, keep it unchanged in 2.3.0.

Also cc @liancheng @yhuai

@HyukjinKwon
Member Author

Just FYI, except for option 3, the complexity of the other options and the PR size will all be similar - #20567 (comment) and #20567 (comment)

@gatorsmile
Member

Then, let us wait for the release of Spark 2.3.0. Thanks!

@HyukjinKwon
Member Author

I mean the actual change here is small. The diff may look larger because of the removed else:. Please check out the diff. It's quite a safe change.

@gatorsmile
Member

The behavior inconsistency between toPandas and createDataFrame looks confusing to end users, I have to admit.

At the current stage, we are unable to merge fixes for these new features into the Spark 2.3 branch. Let us wait for the release of Spark 2.3.0.

@HyukjinKwon
Member Author

There is one more thing - #20567 (comment). We haven't completed binary type support on the Python side yet, but there is a hole here ..

@gatorsmile
Member

What is the root cause? Do we have a trivial fix to resolve/block it?

@HyukjinKwon
Member Author

The root cause is that the Arrow conversion on the Python side interprets binaries as str, and here I avoided this by checking whether the type is supported or not.

This is the most trivial fix. I made the fix as safe and small as possible here. I could fix only the error message, but the size of the change and the diff would be virtually the same - #20567 (comment).

@cloud-fan
Contributor

The binary type bug sounds like a blocker; can we just fix it surgically by checking the supported data types before going down the Arrow optimization path? For now we can stick with what the current behavior is, i.e. throw an exception.

The inconsistent behavior between toPandas and createDataFrame is confusing but may not be a blocker. We can fix it in Spark 2.4 and add a note in the migration guide.

@HyukjinKwon
Member Author

The binary type bug sounds like a blocker; can we just fix it surgically by checking the supported data types before going down the Arrow optimization path? For now we can stick with what the current behavior is, i.e. throw an exception.

That's basically (#20567 (comment)):

if # 'spark.sql.execution.arrow.enabled' true?
    require_minimum_pyarrow_version()
    try:
        to_arrow_schema(self.schema)
        # return the one with Arrow
    except Exception as e:
        raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...")
else:
    # return the one without Arrow

because to_arrow_schema(self.schema) checks the supported types, like the other Pandas/Arrow functionalities do.
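
(As a quick aside, an illustration of that check; the import path assumes the Spark 2.3 module layout:)

from pyspark.sql.types import StructType, StructField, MapType, StringType, LongType, to_arrow_schema

schema = StructType([StructField("_1", MapType(StringType(), LongType()))])
try:
    to_arrow_schema(schema)   # raises for unsupported types, before any Spark job runs
except Exception as e:
    print("Unsupported in conversion to Arrow: %s" % e)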

@cloud-fan
Contributor

^ this change LGTM. Can we make a PR for this change only and leave the fallback part for Spark 2.4?

@HyukjinKwon
Member Author

Sure.

require_minimum_pyarrow_version()
# Check if its schema is convertible in Arrow format.
to_arrow_schema(self.schema)
except Exception as e:
Contributor

Do we want to catch more specific exceptions here? i.e. TypeError and ImportError?

Member Author

Hm, it might depend on which message we want to show. Will open another PR as discussed above.

@gatorsmile
Member

@HyukjinKwon Will you submit a fix for the binary type today? We are very close to RC4. This is kind of urgent if we still want to treat it as a blocker for the Spark 2.3.0 release.

@HyukjinKwon
Member Author

Yup, I will. Sorry for the delay. I was trying to make the fix as small as possible. Let me just open it in the simplest way.

@HyukjinKwon
Member Author

HyukjinKwon commented Feb 16, 2018

I just opened #20625. I believe this is the smallest and simplest change. I will repurpose this PR to add a configuration later for 2.4, as discussed.

@gatorsmile
Member

Thanks! Happy Lunar New Year!

@HyukjinKwon
Member Author

I just opened another PR for adding a configuration - #20678. Let me close this one.

@HyukjinKwon HyukjinKwon deleted the pandas_conversion_cleanup branch October 16, 2018 12:44