[SQL] Python JsonRDD UTF8 Encoding Fix #1914
------------------ Original Message ------------------
Subject: [spark] [SQL] Python JsonRDD UTF8 Encoding Fix (#1914)
Only encode unicode objects to UTF-8, and not strings
You can merge this Pull Request by running
Commit Summary
Encoding Fix
File Changes
M python/pyspark/sql.py (4)
Patch Links:
https://github.com/apache/spark/pull/1914.patch
https://github.com/apache/spark/pull/1914.diff
QA tests have started for PR 1914. This patch merges cleanly.
lgtm
Why do we need the isinstance here? In saveAsTextFile, we just unconditionally encode strings as UTF-8:

    def func(split, iterator):
        for x in iterator:
            if not isinstance(x, basestring):
                x = unicode(x)
            yield x.encode("utf-8")
    keyed = self.mapPartitionsWithIndex(func)
    keyed._bypass_serializer = True

Is there a bug in this saveAsTextFile code, too?
Yes. If x is a str holding non-ASCII bytes (for example, GBK-encoded text), it will fail, because in Python 2 calling x.encode("utf-8") on a str implicitly runs x.decode("ascii") first.
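To make the failure mode concrete, here is a minimal sketch written for Python 3, where `bytes` stands in for the Python 2 `str` and `str` stands in for Python 2 `unicode`. The helper names `py2_style_encode` and `encode_only_text` are hypothetical and are not taken from the patch; the first only emulates the implicit ASCII decode described above, and the second shows the kind of isinstance guard under discussion.

```python
# Python 3 emulation (an assumption for illustration):
#   bytes ~ Python 2 str, str ~ Python 2 unicode.

def py2_style_encode(raw: bytes) -> bytes:
    """Emulate Python 2 str.encode("utf-8"): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode("utf-8")


def encode_only_text(x):
    """Hypothetical guard: encode text to UTF-8, pass byte strings through unchanged."""
    if isinstance(x, str):
        return x.encode("utf-8")
    return x


# Non-ASCII bytes, analogous to a GBK-encoded Python 2 str.
gbk_bytes = "中文".encode("gbk")

try:
    py2_style_encode(gbk_bytes)
    failed = False
except UnicodeDecodeError:
    # The implicit ASCII decode rejects bytes >= 0x80.
    failed = True

print(failed)
print(encode_only_text(gbk_bytes) == gbk_bytes)
print(encode_only_text("中文") == "中文".encode("utf-8"))
```

With the guard, already-encoded byte strings are left alone, so their original encoding is preserved rather than being forced through ASCII.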
Ah, okay. Let's fix this in saveAsTextFile, too. Should we open a JIRA for this, since it's also a bug in existing code that users might encounter?
test this please
Jenkins, test this please.
QA tests have started for PR 1914. This patch merges cleanly.
QA results for PR 1914:
Jenkins, test this please.
QA tests have started for PR 1914. This patch merges cleanly.
QA results for PR 1914:
I've merged this to master and 1.1. Thanks! Have we created the followup JIRA issue for |
Only encode unicode objects to UTF-8, and not strings
Author: Ahir Reddy <[email protected]>
Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
ca4e9ba [Ahir Reddy] Encoding Fix
(cherry picked from commit fde692b)
Signed-off-by: Michael Armbrust <[email protected]>
I created a followup JIRA here: https://issues.apache.org/jira/browse/SPARK-3103
Only encode unicode objects to UTF-8, and not strings
Author: Ahir Reddy <[email protected]>
Closes apache#1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
ca4e9ba [Ahir Reddy] Encoding Fix
Co-authored-by: Russell Spitzer <[email protected]>