[SPARK-21534][SQL][PySpark] PickleException when creating dataframe from python row with empty bytearray #19085
cc @HyukjinKwon
Test build #81253 has finished for PR 19085 at commit
It will take me a while to double check this. I will try to help double check it within this week.
Thanks @HyukjinKwon
class ByteArrayConstructor extends net.razorvine.pickle.objects.ByteArrayConstructor {
  override def construct(args: Array[Object]): Object = {
    // Deal with an empty byte array pickled by Python 3.
    if (args.length == 0) {
I see. It looks quite straightforward. I checked in Python 3:
>>> import pickle
>>> import pickletools
>>> print(pickletools.dis(pickle.dumps(bytearray())))
0: \x80 PROTO 3
2: c GLOBAL 'builtins bytearray'
22: q BINPUT 0
24: ) EMPTY_TUPLE
25: R REDUCE
26: q BINPUT 1
28: . STOP
which, to my knowledge, gives new Object[0] for args.
I also checked pickle.dumps(..., protocol=N) for N from 0 to 4, just in case.
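A minimal sketch of how the check above could be scripted, assuming only the standard-library pickle and pickletools modules. The helper names (opcode_names, has_empty_reduce_args) are hypothetical and are not part of Spark or Pyrolite. It reports whether the opcode right before the first REDUCE is EMPTY_TUPLE, which is the case where the JVM-side constructor would receive zero arguments:

```python
import pickle
import pickletools


def opcode_names(obj, protocol):
    """Return the opcode names found in the pickle of obj (hypothetical helper)."""
    data = pickle.dumps(obj, protocol=protocol)
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(data)]


def has_empty_reduce_args(obj, protocol):
    """True if the first REDUCE in the pickle is applied to an empty tuple.

    This only inspects the opcode directly before REDUCE; it is a rough
    check for illustration, not a full pickle interpreter.
    """
    names = opcode_names(obj, protocol)
    if "REDUCE" not in names:
        return False
    idx = names.index("REDUCE")
    return idx > 0 and names[idx - 1] == "EMPTY_TUPLE"


if __name__ == "__main__":
    # Protocol 3 matches the disassembly shown in the comment above.
    print(opcode_names(bytearray(), 3))
    print(has_empty_reduce_args(bytearray(), 3))
```

Only protocol 3 is relied on here, since that is what the disassembly above shows. The opcodes emitted for other protocols can differ between Python versions, so it seems safer to print opcode_names(bytearray(), p) for each protocol than to assume the result.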
LGTM. @ueshin, could you double check this one when you have some time?
ueshin left a comment
LGTM except for some minor comments.
  override def construct(args: Array[Object]): Object = {
    // Deal with an empty byte array pickled by Python 3.
    if (args.length == 0) {
      Array.empty[Byte]
nit: use Array.emptyByteArray?
Ok.
python/pyspark/sql/tests.py
Outdated
# test for SPARK-21534
def test_empty_bytearray(self):
    rdd = self.spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})
    self.spark.createDataFrame(rdd).collect()
How about reusing the existing test SQLTests.test_BinaryType_serialization by adding bytearray() to it?
Yes. Added.
Test build #81269 has finished for PR 19085 at commit
Merged to master.
What changes were proposed in this pull request?
PickleException is thrown when creating a DataFrame from a Python row with an empty bytearray.
ByteArrayConstructor doesn't deal with an empty byte array pickled by Python 3.
How was this patch tested?
Added test.