@ahirreddy
Contributor

Only encode unicode objects to UTF-8, and not strings

@loveconan1988

------------------ Original Message ------------------
From: "Ahir Reddy";[email protected];
Sent: Wednesday, August 13, 2014, 9:18 AM
To: "apache/spark"[email protected];

Subject: [spark] [SQL] Python JsonRDD UTF8 Encoding Fix (#1914)

Only encode unicode objects to UTF-8, and not strings

You can merge this Pull Request by running
git pull https://github.com/ahirreddy/spark json-rdd-unicode-fix1
Or view, comment on, or merge it at:

#1914

Commit Summary

Encoding Fix

File Changes

M python/pyspark/sql.py (4)

Patch Links:

https://github.com/apache/spark/pull/1914.patch

https://github.com/apache/spark/pull/1914.diff


Reply to this email directly or view it on GitHub.

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18402/consoleFull

@davies
Contributor

davies commented Aug 13, 2014

lgtm

Contributor

Why do we need the isinstance here? In saveAsTextFile, we just unconditionally encode strings as UTF-8:

        def func(split, iterator):
            for x in iterator:
                if not isinstance(x, basestring):
                    x = unicode(x)
                yield x.encode("utf-8")
        keyed = self.mapPartitionsWithIndex(func)
        keyed._bypass_serializer = True

Is there a bug in this saveAsTextFile code, too?

Contributor

Yes. If x is a str holding GBK-encoded bytes, it will fail, because in Python 2 x.encode("utf-8") on a str first tries x.decode("ascii").
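
For readers on Python 3, where the implicit decode no longer exists, the failure can be reproduced by writing the hidden step out explicitly. This is only a sketch of the behavior described in this comment: `py2_style_encode` is a hypothetical helper, not PySpark code, and the sample text is arbitrary.

```python
# Sketch only: mimics what Python 2 did when .encode("utf-8") was called
# on a byte string (str). The helper name is hypothetical.

def py2_style_encode(raw_bytes):
    """ASCII-decode first (the implicit Python 2 step), then encode to UTF-8."""
    return raw_bytes.decode("ascii").encode("utf-8")

# Two Chinese characters encoded as GBK: every byte is outside the ASCII range.
gbk_bytes = "\u4e2d\u6587".encode("gbk")
ascii_bytes = b"plain"

# Pure-ASCII bytes survive the round trip unchanged.
ascii_ok = py2_style_encode(ascii_bytes)

# GBK bytes fail at the ASCII decode, before any UTF-8 encoding happens.
try:
    py2_style_encode(gbk_bytes)
    gbk_failed = False
except UnicodeDecodeError:
    gbk_failed = True
```

The error comes from the decode step, which is why skipping the encode for byte strings avoids it.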

Contributor

Ah, okay. Let's fix this in saveAsTextFile, too. Should we open a JIRA for this, since it's also a bug in existing code that users might encounter?
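
One possible shape for such a fix, sketched in Python 3 terms (`bytes` and `str` standing in for Python 2's `str` and `unicode`). This is not the patch from this PR or from the follow-up issue; `to_utf8_lines` is a hypothetical name, and the assumption is that byte strings should be written out untouched.

```python
# Sketch only: a per-partition function that encodes text objects to UTF-8
# and leaves byte strings alone. Names here are hypothetical.

def to_utf8_lines(iterator):
    """Yield each element as bytes, encoding only text (unicode) objects."""
    for x in iterator:
        if isinstance(x, bytes):
            # Already encoded (possibly GBK or another codec): pass through.
            yield x
        else:
            if not isinstance(x, str):
                # Non-string values are converted to text first.
                x = str(x)
            yield x.encode("utf-8")

gbk_bytes = "\u4e2d\u6587".encode("gbk")
result = list(to_utf8_lines([gbk_bytes, "\u4e2d\u6587", 42]))
```

With this shape, GBK bytes are never decoded as ASCII, at the cost of the output file possibly mixing encodings if callers pass non-UTF-8 byte strings.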

@marmbrus
Contributor

test this please

@pwendell
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 13, 2014

QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull

@SparkQA

SparkQA commented Aug 13, 2014

QA results for PR 1914:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18479/consoleFull

@marmbrus
Contributor

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 14, 2014

QA tests have started for PR 1914. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull

@SparkQA

SparkQA commented Aug 14, 2014

QA results for PR 1914:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18501/consoleFull

@marmbrus
Contributor

I've merged this to master and 1.1. Thanks!

Have we created the followup JIRA issue for saveAsTextFile?

@asfgit asfgit closed this in fde692b Aug 14, 2014
asfgit pushed a commit that referenced this pull request Aug 14, 2014
Only encode unicode objects to UTF-8, and not strings

Author: Ahir Reddy <[email protected]>

Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:

ca4e9ba [Ahir Reddy] Encoding Fix

(cherry picked from commit fde692b)
Signed-off-by: Michael Armbrust <[email protected]>
@JoshRosen
Contributor

I created a followup JIRA here: https://issues.apache.org/jira/browse/SPARK-3103

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024