
Conversation

@BryanCutler (Member) commented Mar 15, 2018

What changes were proposed in this pull request?

When Arrow is used for createDataFrame or toPandas and an error is encountered with fallback disabled, this change raises the same type of error instead of a RuntimeError. It also retains the traceback of the original error and prevents the accidental chaining of exceptions with Python 3.
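Conceptually, the fix amounts to the pattern below: a minimal sketch with assumed helper names, not the actual Spark source. Warn with the conf context, then use a bare raise so the original exception type and traceback survive on both Python 2 and 3, instead of wrapping everything in a new RuntimeError.

```python
import warnings

def _try_arrow_then_maybe_fallback(create_fn, fallback_enabled):
    # Hypothetical wrapper around the Arrow path of createDataFrame/toPandas.
    try:
        return create_fn()
    except Exception as e:
        if fallback_enabled:
            warnings.warn("Arrow optimization failed, falling back: %s" % e)
            return None  # the caller would take the non-Arrow path here
        warnings.warn("Arrow optimization failed and fallback is disabled: %s" % e)
        raise  # bare raise re-raises `e` with its original type and traceback
```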

How was this patch tested?

Updated existing tests to verify the error type.
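As a hedged sketch (not the actual Spark test) of what that verification looks like: with fallback disabled, the assertion is that the original exception class surfaces rather than RuntimeError. The helper `_make_df_without_pyarrow` is hypothetical.

```python
import unittest

def _make_df_without_pyarrow():
    # Hypothetical stand-in for createDataFrame with Arrow enabled,
    # fallback disabled, and pyarrow absent.
    raise ImportError("PyArrow >= 0.8.0 must be installed; however, it was not found.")

class ArrowErrorTypeTest(unittest.TestCase):
    def test_raises_original_error_type(self):
        # Key assertion: the surfaced error is ImportError, not RuntimeError.
        with self.assertRaisesRegexp(ImportError, "PyArrow"):
            _make_df_without_pyarrow()
```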

@SparkQA commented Mar 15, 2018

Test build #88281 has finished for PR 20839 at commit 5caf63c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

@HyukjinKwon, @ueshin please take a look when you can.

@HyukjinKwon (Member) commented Mar 16, 2018 on the diff hunk below:

        warnings.warn(msg)
    else:
        msg = (
        e.message = (

@BryanCutler, I think the message attribute is only in Python 2. Also, are you sure that this wraps the exception message in the console too?

@BryanCutler (Member Author) replied:

Ah yes, you're right. The tests check the exception message, so I thought all was good. Let me try something else.
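For reference, a quick illustration of that Python 2/3 difference (my snippet, not from the PR):

```python
# `message` is populated automatically on Python 2 exceptions only, so code
# that reads or assigns `e.message` silently diverges across versions.
try:
    raise ImportError("no module named pyarrow")
except ImportError as e:
    print(hasattr(e, "message"))  # True on Python 2, False on Python 3
    print(e.args)                 # ('no module named pyarrow',) on both
```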

@BryanCutler (Member Author)

Before, with Python 2: the original traceback is missing and the error doesn't show as an ImportError.

In [4]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

/home/bryan/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    686                             "For fallback to non-optimization automatically, please set true to "
    687                             "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
--> 688                         raise RuntimeError(msg)
    689             data = self._convert_from_pandas(data, schema, timezone)
    690 

RuntimeError: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.

Before, with Python 3: each time another error is raised in the except block, it gets chained.

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    139     try:
--> 140         import pyarrow
    141     except ImportError:

ModuleNotFoundError: No module named 'pyarrow'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    142         raise ImportError("PyArrow >= %s must be installed; however, "
--> 143                           "it was not found." % minimum_pyarrow_version)
    144     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    686                             "For fallback to non-optimization automatically, please set true to "
    687                             "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
--> 688                         raise RuntimeError(msg)
    689             data = self._convert_from_pandas(data, schema, timezone)
    690 

RuntimeError: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.
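A standalone reproduction of that implicit chaining (my illustration, not PR code): in Python 3, raising inside an except block links the new exception to the one being handled, which is what produces the "During handling of the above exception, another exception occurred:" output.

```python
try:
    import pyarrow  # noqa: F401
except ImportError as e:
    # Raising here implicitly chains the RuntimeError to the ImportError.
    raise RuntimeError("Arrow optimization failed: %s" % e)
```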

After the change, with Python 2 and 3: a warning is printed, then the original error is re-raised.

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
/home/bryan/git/spark/python/pyspark/sql/session.py:686: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.
  warnings.warn(msg)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    665                     and len(data) > 0:
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:
    669                     from pyspark.util import _exception_message

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    508 
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 
    512         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    147     if not have_arrow:
    148         raise ImportError("PyArrow >= %s must be installed; however, "
--> 149                           "it was not found." % minimum_pyarrow_version)
    150     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
    151         raise ImportError("PyArrow >= %s must be installed; however, "

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

@SparkQA commented Mar 21, 2018

Test build #88487 has finished for PR 20839 at commit 17dd605.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2018

Test build #88488 has finished for PR 20839 at commit 55209b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

@HyukjinKwon and @ueshin, I came close to a solution that would wrap the exception message like before but still keep the exception type and traceback. The problem was that there are too many variations in Python exceptions, and it was getting too complicated to be foolproof. For instance, ImportError does not use args for the message, but has another attribute: message in Python 2 and msg in Python 3.

To keep things simple, I thought it would be fine, when fallback is disabled, to still print a warning message to indicate the related conf and then use a bare raise to re-raise the exception, which retains the original exception with its traceback. I manually ran tests of toPandas and createDataFrame with Python 2 and 3, and it looks cleaner in my opinion. What do you think?
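To illustrate the variation described above (my snippet; behavior as I understand it on CPython), the human-readable message lives in different attributes by version:

```python
e = ImportError("PyArrow >= 0.8.0 must be installed")
print(e.args)   # ('PyArrow >= 0.8.0 must be installed',) on both versions
# Python 2: e.message -> 'PyArrow >= 0.8.0 must be installed'
# Python 3: e.msg     -> 'PyArrow >= 0.8.0 must be installed'
```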

@HyukjinKwon (Member) commented Mar 22, 2018

Yeah, I agree that the warning approach is the minimal fix. I actually attempted to fix this locally before and failed to make a clean fix, so I understand the complexity here, in particular because of compatibility between Python 2 and Python 3.

I also took a look at six and python-future for wrapping and re-raising the exceptions in Python 2. The implementation in six was missing, and the implementation in python-future was incomplete (if I read them correctly).

Also, I looked into the complexity you mentioned when I initially proposed the changes in this logic. It didn't look easy to wrap it manually either.

My only main concern is whether it makes sense to call this a warning, because the message would contain, for example:

    ...
    "'spark.sql.execution.arrow.enabled' is set to true; however, "
    "failed unexpectedly:\n  %s\n"
    ...

@HyukjinKwon (Member) commented Mar 22, 2018 on the diff hunk below:

        have_pandas = True
    except ImportError:
        have_pandas = False
    if not have_pandas:

So, this is for making the error message clean when it's re-raised in Python 3? I think it's fine to leave it as it was. I believe the downside is losing the information about where exactly the error was thrown in Python 3.

@BryanCutler (Member Author) replied:

I think having the traceback point to the raise ImportError below gives all the information needed. If that happens, the only possible cause is that the import failed here. The problem with how it was before is that, in Python 3, it prints "During handling of the above exception, another exception occurred:", which makes it seem like the error is not being handled correctly, when it's really just a failed import.
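For context, a runnable sketch of that optional-import pattern (my reconstruction of the hunk above, not the exact test code):

```python
# Probe for the dependency once; raise a fresh ImportError at the use site
# so the traceback points there, which is the point made above.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

if not have_pandas:
    raise ImportError("pandas must be installed; however, it was not found.")
```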

@HyukjinKwon (Member)

As the best and minimal fix, can we rephrase the warning messages to make them read more like a warning than an error?

@BryanCutler (Member Author)

Thanks @HyukjinKwon for reviewing! I agree with what you said about rephrasing the warning message; I'll try to make it sound better.

@BryanCutler (Member Author)

After the reworded warnings:

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
/home/bryan/git/spark/python/pyspark/sql/session.py:688: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.fallback.enabled' has been set to false.
  PyArrow >= 0.8.0 must be installed; however, it was not found.
  warnings.warn(msg)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    665                     and len(data) > 0:
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:
    669                     from pyspark.util import _exception_message

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    508 
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 
    512         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    147     if not have_arrow:
    148         raise ImportError("PyArrow >= %s must be installed; however, "
--> 149                           "it was not found." % minimum_pyarrow_version)
    150     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
    151         raise ImportError("PyArrow >= %s must be installed; however, "

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

@SparkQA commented Mar 23, 2018

Test build #88552 has finished for PR 20839 at commit 1a6be1d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2018

Test build #88554 has finished for PR 20839 at commit 5a43edf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a review

LGTM. I don't have a better idea for now.

@BryanCutler (Member Author)

Thanks @HyukjinKwon! I'll merge later today if there are no more comments.

@BryanCutler (Member Author)

Merged to master.

@asfgit asfgit closed this in ed72bad Mar 28, 2018
@BryanCutler BryanCutler deleted the arrow-raise-same-error-SPARK-23699 branch March 28, 2018 03:10
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
… enabled

## What changes were proposed in this pull request?

When using Arrow for createDataFrame or toPandas and an error is encountered with fallback disabled, this will raise the same type of error instead of a RuntimeError.  This change also allows for the traceback of the error to be retained and prevents the accidental chaining of exceptions with Python 3.

## How was this patch tested?

Updated existing tests to verify error type.

Author: Bryan Cutler <[email protected]>

Closes apache#20839 from BryanCutler/arrow-raise-same-error-SPARK-23699.