[SPARK-14739] [PySpark] Fix Vectors parser bugs #12516
|
Jenkins test this please |
|
AFAIK that's correct, because similar calls work as described here in Scala. LGTM pending tests |
|
Test build #56350 has finished for PR 12516 at commit
|
Spark 14739
… the parsing test cases
|
Actually we should probably close this in favor of #12513 which came first and has additional fixes. |
|
@vishnu667 Thanks, merged in your PR. |
|
@srowen It would be great if this PR were considered the bug fix (instead of #12513), since I discovered the bug and also provided the first PR on the Jira ticket. I made a mistake by closing and opening a new PR when I wanted to update the code, instead of adding more commits. With @vishnu667's PR merged in, this one should now have all the updated tests. |
|
@arashpa there was actually a fourth as well, one before you (WTH?), but the first two were closed. Generally, people shouldn't open a new PR when there are others in progress, unless that one has been abandoned or there's a clear need to try a different approach. I see you began work on the JIRA first, at virtually the same time as zero323, but both of those were closed for some reason: #12510 and #12511. Then came #12513, which was correct, but I see it built on Maciej's comment and maybe your first PR. Why didn't you just update the original one instead of closing it? It kind of signaled you weren't working on it. I am OK merging this one for the reasons above. Maybe a little more communication would have avoided 3-4x duplicated effort on this one. |
|
@arashpa You'll need to merge https://github.com/arashpa/spark/pull/2. Your test cases are still not updated: the previous commit got merged into your master instead of the current branch. @srowen which PR are you going to merge, so that we can close the other one? |
Test cases fix
|
@vishnu667 just merged the second PR. |
|
@srowen I wanted to add a comment regarding fairness of credit. @arashpa did indeed find the bug, since we were looking at this yesterday. Maciej reported the issue based on @arashpa's Stack Overflow question about the bug ( http://stackoverflow.com/questions/36730727/parsing-all-zero-sparse-vectors-with-pyspark-sparsevectors ). |
|
OK that all sounds good. With different aliases on different sites, I didn't see the connection. |
|
Jenkins retest this please |
|
Test build #56523 has finished for PR 12516 at commit
|
## What changes were proposed in this pull request?

The PySpark deserialization has a bug that shows while deserializing all zero sparse vectors. This fix filters out empty string tokens before casting, hence properly stringified SparseVectors successfully get parsed.

## How was this patch tested?

Standard unit-tests similar to other methods.

Author: Arash Parsa <[email protected]>
Author: Arash Parsa <[email protected]>
Author: Vishnu Prasad <[email protected]>
Author: Vishnu Prasad S <[email protected]>

Closes #12516 from arashpa/SPARK-14739.

(cherry picked from commit 2b8906c)
Signed-off-by: Sean Owen <[email protected]>
|
Merged to master/1.6 |
What changes were proposed in this pull request?
The PySpark deserialization has a bug that shows up when deserializing all-zero sparse vectors. This fix filters out empty string tokens before casting, so that properly stringified SparseVectors are parsed successfully.
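To illustrate the idea, here is a minimal standalone sketch, not the actual Spark code. The function name `parse_sparse_vector` and its tuple return type are hypothetical, and the input is assumed to look like `(size,[indices],[values])`. The point is that splitting an empty bracket body on `,` yields `['']`, and casting that empty token with `int('')` or `float('')` raises `ValueError`; dropping empty tokens before casting avoids this.

```python
def parse_sparse_vector(s):
    """Parse '(size,[i0,i1,...],[v0,v1,...])' into (size, indices, values).

    Hypothetical helper for illustration only; it is not PySpark's parser.
    """
    s = s.strip()
    if not (s.startswith("(") and s.endswith(")")):
        raise ValueError("expected the string to be wrapped in '(' and ')'")

    # Split off the size, leaving "[indices],[values]".
    size_str, sep, rest = s[1:-1].partition(",")
    if not sep:
        raise ValueError("expected '(size,[indices],[values])'")
    size = int(size_str)

    # Locate the two bracketed lists.
    ind_start = rest.find("[")
    ind_end = rest.find("]")
    if ind_start == -1 or ind_end == -1:
        raise ValueError("indices list is missing")
    val_start = rest.find("[", ind_end)
    val_end = rest.find("]", val_start)
    if val_start == -1 or val_end == -1:
        raise ValueError("values list is missing")

    # "".split(",") returns [""], so empty tokens must be filtered out
    # before casting; otherwise int("") / float("") raises ValueError.
    index_tokens = rest[ind_start + 1:ind_end].split(",")
    value_tokens = rest[val_start + 1:val_end].split(",")
    indices = [int(tok) for tok in index_tokens if tok.strip()]
    values = [float(tok) for tok in value_tokens if tok.strip()]

    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return size, indices, values


print(parse_sparse_vector("(3,[],[])"))            # (3, [], [])
print(parse_sparse_vector("(4,[1,3],[1.0,5.5])"))  # (4, [1, 3], [1.0, 5.5])
```

Without the `if tok.strip()` filter, the first call would fail on the all-zero vector, which matches the symptom described in this pull request.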
How was this patch tested?
Standard unit-tests similar to other methods.