Conversation

@viirya (Member) commented Aug 30, 2017

What changes were proposed in this pull request?

A PickleException is thrown when creating a DataFrame from a Python row containing an empty bytearray:

spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})).show()

net.razorvine.pickle.PickleException: invalid pickle data for bytearray; expected 1 or 2 args, got 0
	at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java
    ...

ByteArrayConstructor doesn't handle an empty byte array pickled by Python 3.
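For illustration (not code from this PR), a minimal Python 3 sketch of why this happens: an empty bytearray pickles as a zero-argument constructor call, which Python's own unpickler handles fine; the exception above comes from the JVM-side net.razorvine.pickle unpickler receiving zero args.

```python
import pickle

# In Python 3, bytearray() pickles as a REDUCE of builtins.bytearray
# applied to an empty tuple, i.e. a zero-argument constructor call.
data = pickle.dumps(bytearray())

# Python's own unpickler reconstructs it without trouble; only the
# JVM-side unpickler rejected the zero-argument form.
restored = pickle.loads(data)
print(restored == bytearray())  # True
```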

How was this patch tested?

Added test.

@viirya (Member Author) commented Aug 30, 2017

cc @HyukjinKwon

SparkQA commented Aug 30, 2017

Test build #81253 has finished for PR 19085 at commit a362d6b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ByteArrayConstructor extends net.razorvine.pickle.objects.ByteArrayConstructor

@HyukjinKwon (Member)
It will take me a while to double-check this. I'll try to help double-check within this week.

@viirya (Member Author) commented Aug 30, 2017

Thanks @HyukjinKwon

class ByteArrayConstructor extends net.razorvine.pickle.objects.ByteArrayConstructor {
  override def construct(args: Array[Object]): Object = {
    // Deal with an empty byte array pickled by Python 3.
    if (args.length == 0) {
Member
I see. It looks quite straightforward. I checked in Python 3:

>>> import pickle
>>> import pickletools
>>> print(pickletools.dis(pickle.dumps(bytearray())))
    0: \x80 PROTO      3
    2: c    GLOBAL     'builtins bytearray'
   22: q    BINPUT     0
   24: )    EMPTY_TUPLE
   25: R    REDUCE
   26: q    BINPUT     1
   28: .    STOP

which, to my knowledge, gives new Object[0] for args.

Member

I also checked pickle.dumps(...) with protocol=0 through protocol=4, just in case.
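The protocol sweep mentioned above can be sketched like this (a hypothetical check, not code from the PR):

```python
import pickle

# Round-trip an empty bytearray under pickle protocols 0 through 4.
# On the Python side every protocol restores an empty bytearray.
for proto in range(5):
    data = pickle.dumps(bytearray(), protocol=proto)
    assert pickle.loads(data) == bytearray(), proto
print("all protocols ok")
```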

@HyukjinKwon (Member)

LGTM

@ueshin, could you double-check this one when you have some time?

@ueshin (Member) left a comment

LGTM except for some minor comments.

override def construct(args: Array[Object]): Object = {
  // Deal with an empty byte array pickled by Python 3.
  if (args.length == 0) {
    Array.empty[Byte]
Member
nit: use Array.emptyByteArray?

Member Author
Ok.

# test for SPARK-21534
def test_empty_bytearray(self):
    rdd = self.spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})
    self.spark.createDataFrame(rdd).collect()
Member

How about reusing the existing test SQLTests.test_BinaryType_serialization by adding bytearray() to it?

Member Author

Yes. Added.

SparkQA commented Aug 31, 2017

Test build #81269 has finished for PR 19085 at commit c7dda23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in ecf437a Aug 31, 2017
@HyukjinKwon (Member)

Merged to master.

@viirya viirya deleted the SPARK-21534 branch December 27, 2023 18:34