[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle registry #18823
Conversation
added regression test for multithreaded broadcast pickle
|
backport for 2.2 |
|
LGTM pending Jenkins tests. Thanks @BryanCutler, I just wanted to be sure that it passes the tests, and to be careful with my very first proper merge :). |
|
@BryanCutler mind adding something like |
|
Sure no prob. I can add that to the PR title too, but I don't think I've done that in past backports. |
|
Yea, that's not in the guide and not required IIRC but just a little suggestion by me. |
|
Test build #80185 has finished for PR 18823 at commit
|
… registry

## What changes were proposed in this pull request?

When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used.

## How was this patch tested?

Added a unit test that causes this race condition using another thread.

Author: Bryan Cutler <[email protected]>

Closes #18823 from BryanCutler/branch-2.2.
|
Merged into branch-2.2. |
|
Thanks @HyukjinKwon!
On Aug 2, 2017 6:30 PM, "Hyukjin Kwon" <[email protected]> wrote:
Merged into branch-2.2.
|
|
@BryanCutler, I know you know this but a reminder to close this as merged into branches. |
What changes were proposed in this pull request?
When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables that are pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread local storage so that when python command is pickled (causing the broadcast variable to be pickled and added to the registry) each thread will have their own view of the pickle registry to retrieve and clear the broadcast variables used.

How was this patch tested?
Added a unit test that causes this race condition using another thread.
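
For readers unfamiliar with thread local storage, the idea described above can be illustrated with a minimal, standalone sketch. It does not use PySpark and is not the code from this patch: the class name `ThreadLocalRegistry` and the string placeholders standing in for broadcast variables are assumptions made for illustration only. It relies on the standard-library behavior that a `threading.local` subclass has its `__init__` re-run in each thread that touches the object, so every thread gets its own empty set.

```python
import threading


class ThreadLocalRegistry(threading.local):
    """Hypothetical per-thread registry (illustrative, not the patch's class)."""

    def __init__(self):
        # threading.local calls __init__ again in each new thread on first
        # access, so every thread starts with its own empty set.
        self._items = set()

    def __iter__(self):
        # Iterate over a snapshot so callers may clear() while looping.
        return iter(list(self._items))

    def add(self, item):
        self._items.add(item)

    def clear(self):
        self._items.clear()


registry = ThreadLocalRegistry()
seen_by_worker = []


def worker():
    # Entries added here are visible only to this thread.
    registry.add("worker-broadcast")
    seen_by_worker.extend(sorted(registry))
    registry.clear()


registry.add("main-broadcast")
t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread's view is unaffected by the worker's add() and clear().
seen_by_main = sorted(registry)
```

With a plain shared `set` in place of the thread-local object, the worker would also see `"main-broadcast"` and its `clear()` would wipe the main thread's entry, which is the kind of cross-thread interference the description attributes to the shared `_pickled_broadcast_vars`.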