Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion python/pyspark/sql.py
Original file line number Diff line number Diff line change
Expand Up @@ -1267,7 +1267,9 @@ def func(iterator):
for x in iterator:
if not isinstance(x, basestring):
x = unicode(x)
yield x.encode("utf-8")
if isinstance(x, unicode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the isinstance here? In saveAsTextFile, we just unconditionally encode strings as UTF-8:

        def func(split, iterator):
            for x in iterator:
                if not isinstance(x, basestring):
                    x = unicode(x)
                yield x.encode("utf-8")
        keyed = self.mapPartitionsWithIndex(func)
        keyed._bypass_serializer = True

Is there a bug in this saveAsTextFile code, too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if x is str with encoding "GBK", it will fail, because x.encode("utf-8") means it will try to x.decode("ascii") first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. Let's fix this in saveAsTextFile, too. Should we open a JIRA for this, since it's also a bug in existing code that users might encounter?

x = x.encode("utf-8")
yield x
keyed = rdd.mapPartitions(func)
keyed._bypass_serializer = True
jrdd = keyed._jrdd.map(self._jvm.BytesToString())
Expand Down