[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly #17267
Conversation
Test build #74403 has finished for PR 17267 at commit

Thanks for working on this. LGTM.

What's the difference between the two, briefly? I don't know enough to evaluate it, though the effect looks positive. Is this the only place this should change?

IMHO, yes. And @viirya is the original author.
python/pyspark/sql/utils.py (outdated)

```diff
     def __str__(self):
-        return repr(self.desc)
+        return str(self.desc)
```
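A quick illustration (not from the PR) of what `repr` adds even for plain ASCII text:

```python
desc = "cannot resolve 'a' given input columns: [id]"
print(repr(desc))  # "cannot resolve 'a' given input columns: [id]" -- extra quotes
print(str(desc))   # cannot resolve 'a' given input columns: [id]
```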
Hm.. does this work for unicode in Python 2, for example, spark.range(1).select("아")? Up to my knowledge, converting it to ASCII directly throws an exception:

```python
>>> str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)
>>> repr(u"아")
"u'\\uc544'"
```

Maybe we should check if this is unicode and do .encode.
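A minimal sketch of that check for Python 2 only (hypothetical code modeled on `CapturedException` in `pyspark/sql/utils.py`, not the PR's actual diff):

```python
# -*- coding: utf-8 -*-
# Python 2 sketch: encode a unicode desc to UTF-8 before str() sees it.
class CapturedException(Exception):
    def __init__(self, desc, stackTrace):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        desc = self.desc
        if isinstance(desc, unicode):  # unicode exists only in Python 2
            desc = desc.encode('utf-8')
        return str(desc)
```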
I just tested with this change as below to help:

- before

```python
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
...
u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
```

- after

```python
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File ".../spark/python/pyspark/sql/utils.py", line 27, in __str__
    return str(self.desc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 17: ordinal not in range(128)
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException
```
@uncleGen, could you double check if I did something wrong maybe?
We can add a check under Python2. If it is unicode, just encode it with utf-8.
@HyukjinKwon Good catch!
Ah, thank you for confirmation. I thought I was mistaken :).
Maybe another benefit of this change: before it, you would see the error log in your example like:

```
u"cannot resolve '\uc544' given input columns: [id];;\n'Project ['\uc544]
```

repr will show the unicode escape characters \uc544. Even if you encode it, you will see its binary representation. str can show the correct "아" if encoded with UTF-8, if I tested it correctly.
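For reference, a small Python 2 snippet illustrating that difference (assuming a UTF-8 terminal):

```python
# -*- coding: utf-8 -*-
# Python 2, in a UTF-8 terminal
s = u"아"
print repr(s)            # u'\uc544'  -- escape sequence, hard to read
print s.encode('utf-8')  # 아         -- readable once encoded to UTF-8
```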
Yea, I support this change and tested some more cases with that encode.
Based on the latest commit:

```python
>>> df.select("아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException
: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json
```
Thanks @HyukjinKwon, good catch! I missed that case. Thanks @viirya for your suggestion.

Test build #74487 has finished for PR 17267 at commit

Test build #74490 has finished for PR 17267 at commit

Test build #74491 has finished for PR 17267 at commit
python/pyspark/sql/utils.py (outdated)

```python
import py4j
import sys

if sys.version > '3':
```
I think it should be >=.
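As an aside, both comparisons behave the same on real interpreters, since `sys.version` is a longer string such as `'3.6.1 (default, ...)'`; a small sketch (not part of this PR) showing the string check next to a more explicit tuple check:

```python
import sys

# Lexicographic string comparison: '3.6.1 ...' > '3' is True on Python 3,
# '2.7.13 ...' > '3' is False on Python 2; '>=' differs from '>' only for
# the exact string '3', which never occurs in practice.
print(sys.version > '3')
print(sys.version >= '3')

# More explicit alternative using the structured version tuple:
print(sys.version_info[0] >= 3)
```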
Test build #74503 has finished for PR 17267 at commit

ping @viirya and @HyukjinKwon

@srowen Could you please take a look and help to merge?

I'm not reviewing this patch. People who know better should merge it.

I'll take a look at reviewing this later this week, @uncleGen. Two minor things we can do in the meantime: first, make the JIRA description a bit clearer about what the proposed change is; second, this change isn't really tested by Jenkins - there are no tests that look at the formatting of the error strings - so maybe consider adding a test or updating the description on the PR.

cc @ueshin

LGTM too, but I hope there will be a test if possible.
Correct me if I'm wrong, but I got the following message after this patch in Python 3.6:

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/dataframe.py", line 1049, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/ueshin/workspace/pyspark/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/utils.py", line 77, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: b"cannot resolve '`\xec\x95\x84`' given input columns: [id];;\n'Project ['\xec\x95\x84]\n+- Range (0, 1, step=1, splits=Some(8))\n"
```

I guess this message is not desirable?

+1 for adding a test.
python/pyspark/sql/utils.py

```diff
-        return repr(self.desc)
+        desc = self.desc
+        if isinstance(desc, unicode):
+            return str(desc.encode('utf-8'))
```
@ueshin, you are right and I misread the code. We need to:

- unicode in Python 2 => `u.encode("utf-8")`
- others in Python 2 => return `str(s)`
- others in Python 3 => return `str(s)`

The root cause for #17267 (comment) looks to be that encode on a string in Python 3 (same as on unicode in Python 2) produces 8-bit bytes, b"..." (in Python 2 this is the same as a normal string "...", where the b prefix is ignored). And the str function works differently, as below:

Python 2

```python
>>> str(b"aa")
'aa'
>>> b"aa"
'aa'
```

Python 3

```python
>>> str(b"aa")
"b'aa'"
>>> "aa"
'aa'
```
Good catch! I previously thought str worked like Python 2's.
+1. We should add a test for this.

Hey @uncleGen, any time to add a test for this?

@dataknocker, do you want to take over this one? Then we can continue with #18324.
[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
### What changes were proposed in this pull request?
The str of `CapturedException` is now produced by `str(self.desc)` rather than `repr(self.desc)`, which is more user-friendly. It also handles unicode specially under Python 2.
### Why are the changes needed?
This is an improvement that makes exceptions more human-readable on the Python side.
### Does this PR introduce any user-facing change?
Before this PR, selecting `中文字段` threw an exception like the one below:
```
Traceback (most recent call last):
File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"
```
After this PR:
```
Traceback (most recent call last):
File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation
```
### How was this patch tested?
Added a new test to verify that unicode is correctly converted, plus manual checks of thrown exceptions.
This PR's credit should go to uncleGen; it is based on #17267.
Closes #25814 from advancedxy/python_exception_19926_and_21045.
Authored-by: Xianjin YE <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
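Schematically, the kind of test that commit describes might look like this (a hypothetical sketch assuming a `spark` SparkSession fixture; the real test lives in python/pyspark/sql/tests/test_utils.py per the tracebacks above):

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch; assumes a `spark` SparkSession fixture is provided.
from pyspark.sql.utils import AnalysisException

def test_capture_user_friendly_exception(spark):
    try:
        spark.sql("select `中文字段`")
    except AnalysisException as e:
        # After the fix, str(e) shows the raw characters, not \u escapes.
        assert "中文字段" in str(e)
    else:
        raise AssertionError("expected AnalysisException")
```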
## What changes were proposed in this pull request?

This PR proposes to close stale PRs, mostly the same instances as apache#18017.

Closes apache#14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
Closes apache#14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
Closes apache#14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
Closes apache#14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
Closes apache#14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
Closes apache#14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
Closes apache#14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
Closes apache#15227 - [SPARK-17655][SQL] Remove unused variables declarations and definations in a WholeStageCodeGened stage
Closes apache#15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
Closes apache#15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
Closes apache#16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
Closes apache#16445 - [SPARK-19043][SQL] Make SparkSQLSessionManager more configurable
Closes apache#16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
Closes apache#16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
Closes apache#16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
Closes apache#17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
Closes apache#17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
Closes apache#17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
Closes apache#17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
Closes apache#17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
Closes apache#17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
Closes apache#17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
Closes apache#17979 - [SPARK-19320][MESOS][WIP] allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
Closes apache#18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
Closes apache#18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
Closes apache#18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
Closes apache#18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
Closes apache#18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
Closes apache#18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
Closes apache#18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
Closes apache#18432 - resolve com.esotericsoftware.kryo.KryoException
Closes apache#18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
Closes apache#18585 - SPARK-21359
Closes apache#18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala

Added:
Closes apache#18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
Closes apache#18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
Closes apache#18619 - [SPARK-21397][BUILD] Maven shade plugin adding dependency-reduced-pom.xml to …
Closes apache#18667 - Fix the simpleString used in error messages
Closes apache#18782 - Branch 2.1

Added:
Closes apache#17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

Added:
Closes apache#16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
Closes apache#18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
Closes apache#18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server

Added:
Closes apache#18827 - Merge pull request 1 from apache/master

## How was this patch tested?

N/A

Author: hyukjinkwon <[email protected]>

Closes apache#18780 from HyukjinKwon/close-prs.
## What changes were proposed in this pull request?

Exceptions in PySpark are a little difficult to read.

Before the PR, like:

After the PR:

IMHO, the root cause is that `repr` is not user-friendly. This PR changes `repr` to `str`.

## How was this patch tested?

Jenkins