
Conversation

@BryanCutler (Member) commented Mar 15, 2018

What changes were proposed in this pull request?

When Arrow is used for createDataFrame or toPandas and an error is encountered with fallback disabled, this change raises the same type of error instead of a RuntimeError. It also retains the traceback of the original error and prevents the accidental chaining of exceptions with Python 3.
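Conceptually, the fix amounts to the pattern below: a minimal sketch with assumed helper names, not the actual Spark source. Warn with the conf context, then use a bare raise so the original exception type and traceback survive on both Python 2 and 3, instead of wrapping everything in a new RuntimeError.

```python
import warnings

def _try_arrow_then_maybe_fallback(create_fn, fallback_enabled):
    # Hypothetical wrapper around the Arrow path of createDataFrame/toPandas.
    try:
        return create_fn()
    except Exception as e:
        if fallback_enabled:
            warnings.warn("Arrow optimization failed, falling back: %s" % e)
            return None  # the caller would take the non-Arrow path here
        warnings.warn("Arrow optimization failed and fallback is disabled: %s" % e)
        raise  # bare raise re-raises `e` with its original type and traceback
```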

How was this patch tested?

Updated existing tests to verify the error type.
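As a hedged sketch (not the actual Spark test) of what that verification looks like: with fallback disabled, the assertion is that the original exception class surfaces rather than RuntimeError. The helper `_make_df_without_pyarrow` is hypothetical.

```python
import unittest

def _make_df_without_pyarrow():
    # Hypothetical stand-in for createDataFrame with Arrow enabled,
    # fallback disabled, and pyarrow absent.
    raise ImportError("PyArrow >= 0.8.0 must be installed; however, it was not found.")

class ArrowErrorTypeTest(unittest.TestCase):
    def test_raises_original_error_type(self):
        # Key assertion: the surfaced error is ImportError, not RuntimeError.
        with self.assertRaisesRegexp(ImportError, "PyArrow"):
            _make_df_without_pyarrow()
```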

@SparkQA commented Mar 15, 2018

Test build #88281 has finished for PR 20839 at commit 5caf63c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

@HyukjinKwon, @ueshin please take a look when you can.

@HyukjinKwon (Member) commented Mar 16, 2018 on the diff hunk below:

        warnings.warn(msg)
    else:
        msg = (
        e.message = (

@BryanCutler, I think the message attribute is only in Python 2. Also, are you sure that this wraps the exception message in the console too?

@BryanCutler (Member Author) replied:

Ah yes, you're right. The tests check the exception message, so I thought all was good. Let me try something else.
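For reference, a quick illustration of that Python 2/3 difference (my snippet, not from the PR):

```python
# `message` is populated automatically on Python 2 exceptions only, so code
# that reads or assigns `e.message` silently diverges across versions.
try:
    raise ImportError("no module named pyarrow")
except ImportError as e:
    print(hasattr(e, "message"))  # True on Python 2, False on Python 3
    print(e.args)                 # ('no module named pyarrow',) on both
```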

@BryanCutler (Member Author)

Before, with Python 2: the original traceback is missing and the error doesn't show as an ImportError.

In [4]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

/home/bryan/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    686                             "For fallback to non-optimization automatically, please set true to "
    687                             "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
--> 688                         raise RuntimeError(msg)
    689             data = self._convert_from_pandas(data, schema, timezone)
    690 

RuntimeError: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.

Before, with Python 3: each time another error is raised in the except block, it gets chained.

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    139     try:
--> 140         import pyarrow
    141     except ImportError:

ModuleNotFoundError: No module named 'pyarrow'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    142         raise ImportError("PyArrow >= %s must be installed; however, "
--> 143                           "it was not found." % minimum_pyarrow_version)
    144     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    686                             "For fallback to non-optimization automatically, please set true to "
    687                             "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
--> 688                         raise RuntimeError(msg)
    689             data = self._convert_from_pandas(data, schema, timezone)
    690 

RuntimeError: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.
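A standalone reproduction of that implicit chaining (my illustration, not PR code): in Python 3, raising inside an except block links the new exception to the one being handled, which is what produces the "During handling of the above exception, another exception occurred:" output.

```python
try:
    import pyarrow  # noqa: F401
except ImportError as e:
    # Raising here implicitly chains the RuntimeError to the ImportError.
    raise RuntimeError("Arrow optimization failed: %s" % e)
```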

After the change, with Python 2 and 3: a warning is printed, then the original error is re-raised.

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
/home/bryan/git/spark/python/pyspark/sql/session.py:686: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  PyArrow >= 0.8.0 must be installed; however, it was not found.
For fallback to non-optimization automatically, please set true to 'spark.sql.execution.arrow.fallback.enabled'.
  warnings.warn(msg)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    665                     and len(data) > 0:
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:
    669                     from pyspark.util import _exception_message

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    508 
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 
    512         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    147     if not have_arrow:
    148         raise ImportError("PyArrow >= %s must be installed; however, "
--> 149                           "it was not found." % minimum_pyarrow_version)
    150     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
    151         raise ImportError("PyArrow >= %s must be installed; however, "

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

@SparkQA commented Mar 21, 2018

Test build #88487 has finished for PR 20839 at commit 17dd605.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2018

Test build #88488 has finished for PR 20839 at commit 55209b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler (Member Author)

@HyukjinKwon and @ueshin, I came close to a solution that would wrap the exception message like before but still keep the exception type and traceback. The problem was that there are too many variations in Python exceptions, and it was getting too complicated to be foolproof. For instance, ImportError does not use args for the message, but has another attribute: message in Python 2 and msg in Python 3.

To keep things simple, I thought it would be fine, when fallback is disabled, to still print a warning message to indicate the related conf and then use a bare raise to re-raise the exception, which retains the original exception with its traceback. I manually ran tests of toPandas and createDataFrame with Python 2 and 3, and it looks cleaner in my opinion. What do you think?
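To illustrate the variation described above (my snippet; behavior as I understand it on CPython), the human-readable message lives in different attributes by version:

```python
e = ImportError("PyArrow >= 0.8.0 must be installed")
print(e.args)   # ('PyArrow >= 0.8.0 must be installed',) on both versions
# Python 2: e.message -> 'PyArrow >= 0.8.0 must be installed'
# Python 3: e.msg     -> 'PyArrow >= 0.8.0 must be installed'
```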

@HyukjinKwon (Member) commented Mar 22, 2018

Yeah, I agree that the warning approach is the minimal fix. I actually attempted to fix this locally before and failed to make a clean fix, so I understand the complexity here, in particular because of compatibility between Python 2 and Python 3.

I also took a look at six and python-future for wrapping and re-raising the exceptions in Python 2. The implementation in six was missing, and the implementation in python-future was incomplete (if I read them correctly).

Also, I looked into the complexity you mentioned when I initially proposed the changes in this logic. It didn't look easy to wrap it manually either.

My only main concern is whether it makes sense to call this a warning, because the message would contain, for example:

    ...
    "'spark.sql.execution.arrow.enabled' is set to true; however, "
    "failed unexpectedly:\n  %s\n"
    ...

@HyukjinKwon (Member) commented Mar 22, 2018 on the diff hunk below:

        have_pandas = True
    except ImportError:
        have_pandas = False
    if not have_pandas:

So, this is for making the error message clean when it's re-raised in Python 3? I think it's fine to leave it as it was. I believe the downside is losing the information about where exactly the error was thrown in Python 3.

@BryanCutler (Member Author) replied:

I think having the traceback point to the raise ImportError below gives all the information needed. If that happens, the only possible cause is that the import failed here. The problem with how it was before is that, in Python 3, it prints "During handling of the above exception, another exception occurred:", which makes it seem like the error is not being handled correctly, when it's really just a failed import.
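For context, a runnable sketch of that optional-import pattern (my reconstruction of the hunk above, not the exact test code):

```python
# Probe for the dependency once; raise a fresh ImportError at the use site
# so the traceback points there, which is the point made above.
try:
    import pandas  # noqa: F401
    have_pandas = True
except ImportError:
    have_pandas = False

if not have_pandas:
    raise ImportError("pandas must be installed; however, it was not found.")
```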

@HyukjinKwon (Member)

As the best and minimal fix, can we rephrase the warning messages to make them read more like a warning than an error?

@BryanCutler (Member Author)

Thanks @HyukjinKwon for reviewing! I agree with what you said about rephrasing the warning message; I'll try to make it sound better.

@BryanCutler (Member Author)

After the reworded warnings:

In [2]: spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")
/home/bryan/git/spark/python/pyspark/sql/session.py:688: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and will not continue because automatic fallback with 'spark.sql.execution.arrow.fallback.enabled' has been set to false.
  PyArrow >= 0.8.0 must be installed; however, it was not found.
  warnings.warn(msg)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-ecc28a9b5e18> in <module>()
----> 1 spark.createDataFrame(pd.DataFrame([[{u'a': 1}]]), "a: map<string, int>")

~/git/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    665                     and len(data) > 0:
    666                 try:
--> 667                     return self._create_from_pandas_with_arrow(data, schema, timezone)
    668                 except Exception as e:
    669                     from pyspark.util import _exception_message

~/git/spark/python/pyspark/sql/session.py in _create_from_pandas_with_arrow(self, pdf, schema, timezone)
    508 
    509         require_minimum_pandas_version()
--> 510         require_minimum_pyarrow_version()
    511 
    512         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype

~/git/spark/python/pyspark/sql/utils.py in require_minimum_pyarrow_version()
    147     if not have_arrow:
    148         raise ImportError("PyArrow >= %s must be installed; however, "
--> 149                           "it was not found." % minimum_pyarrow_version)
    150     if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):
    151         raise ImportError("PyArrow >= %s must be installed; however, "

ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

@SparkQA commented Mar 23, 2018

Test build #88552 has finished for PR 20839 at commit 1a6be1d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2018

Test build #88554 has finished for PR 20839 at commit 5a43edf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a review

LGTM. I don't have a better idea for now.

@BryanCutler (Member Author)

Thanks @HyukjinKwon! I'll merge later today if there are no more comments.

@BryanCutler (Member Author)

Merged to master.

@asfgit asfgit closed this in ed72bad Mar 28, 2018
@BryanCutler BryanCutler deleted the arrow-raise-same-error-SPARK-23699 branch March 28, 2018 03:10
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
… enabled

## What changes were proposed in this pull request?

When using Arrow for createDataFrame or toPandas and an error is encountered with fallback disabled, this will raise the same type of error instead of a RuntimeError.  This change also allows for the traceback of the error to be retained and prevents the accidental chaining of exceptions with Python 3.

## How was this patch tested?

Updated existing tests to verify error type.

Author: Bryan Cutler <[email protected]>

Closes apache#20839 from BryanCutler/arrow-raise-same-error-SPARK-23699.