[SPARK-13631] [CORE] Thread-safe getLocationsWithLargestOutputs #11505
Conversation
Could you clarify in which situation status will be null?
Sure. Let me add a comment.
ok to test
nice catch
I have to admit I don't completely understand the design, and that this might just be treating a symptom of something else -- when
Test build #52427 has finished for PR 11505 at commit
Test build #52425 has finished for PR 11505 at commit
rebasing to pick up flaky test fix
If a job is being scheduled in one thread which has a dependency on an RDD currently executing a shuffle in another thread, Spark would throw a NullPointerException.
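A minimal Scala sketch of this failure mode (the names `MapStatus` and `sizeForReducer` are illustrative stand-ins, not Spark's actual internals): a map task that is still running leaves a null entry in the shared statuses array, and dereferencing it without a null check throws the NullPointerException.

```scala
// Illustrative sketch, not Spark code: a null entry models a shuffle task
// that is still in progress when another thread reads the statuses.
object NpeSketch {
  final class MapStatus(val sizeForReducer: Long)

  // Unsafe: assumes every entry is populated; NPEs on in-progress tasks.
  def totalSizeUnsafe(statuses: Array[MapStatus]): Long =
    statuses.map(_.sizeForReducer).sum

  // Safe variant matching the patch's approach: skip null entries.
  def totalSizeSafe(statuses: Array[MapStatus]): Long =
    statuses.filter(_ != null).map(_.sizeForReducer).sum

  def main(args: Array[String]): Unit = {
    // The middle map task has not finished, so its status is still null.
    val statuses = Array(new MapStatus(80L), null, new MapStatus(20L))
    try {
      totalSizeUnsafe(statuses)
      assert(false, "expected a NullPointerException")
    } catch {
      case _: NullPointerException => () // the crash this PR fixes
    }
    assert(totalSizeSafe(statuses) == 100L)
    println("safe total: " + totalSizeSafe(statuses))
  }
}
```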
Test build #52444 has finished for PR 11505 at commit
Jenkins, retest this please
Test build #52462 has finished for PR 11505 at commit
@zsxwing looks good to me, sounds like you're OK with it too
 * @param fractionThreshold fraction of total map output size that a location must have
 *   for it to be considered large.
 *
 * This method is not thread-safe.
Are we sure this makes it thread-safe? I agree it seems to resolve a problem that could arise when calling this from multiple threads, but I am not as clear it solves all of them. I'm OK with removing this if we have no particular reason to believe it's not thread-safe at this point.
I don't see any reason to believe it's not thread-safe; it's kind of neither explicitly thread-safe nor explicitly unsafe IMO. It's possible that individual status entries might change their size from underneath this function, but I don't see any problem with that. I am not necessarily the best person to ask, though.
OK, well unless there's a moderately strong objection, I think we can go ahead and merge this, even for 1.6. It's a cheap defensive measure and I don't see a downside.
LGTM |
Merged to master/1.6 |
## What changes were proposed in this pull request?

If a job is being scheduled in one thread which has a dependency on an
RDD currently executing a shuffle in another thread, Spark would throw a
NullPointerException. This patch synchronizes access to
`mapStatuses` and skips null status entries (which are in-progress shuffle tasks).
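A minimal sketch of the pattern the patch describes, holding a lock while iterating the statuses and skipping null entries, under the simplifying assumption that a "status" is just an array of output sizes indexed by reducer (Spark's real `MapStatus` objects are richer):

```scala
// Sketch of the thread-safety pattern described above: iterate over the
// shared statuses under a lock and skip null entries, which represent
// shuffle tasks still in progress. Names and shapes are simplified.
object LocationsSketch {
  def largestOutputs(statuses: Array[Array[Long]],
                     reducerId: Int,
                     fractionThreshold: Double): Seq[Int] = {
    statuses.synchronized { // guard against concurrent status registration
      // Skip null entries instead of dereferencing them (the NPE this PR fixes).
      val present = statuses.zipWithIndex.collect {
        case (s, i) if s != null => (i, s(reducerId))
      }
      val total = present.map(_._2).sum
      // Keep only locations holding at least fractionThreshold of the
      // total map output for this reducer.
      present.collect {
        case (i, size) if total > 0 && size.toDouble / total >= fractionThreshold => i
      }.toSeq
    }
  }

  def main(args: Array[String]): Unit = {
    // Map task 1 is still running, so its status is null.
    val statuses: Array[Array[Long]] = Array(Array(80L), null, Array(20L))
    val large = largestOutputs(statuses, reducerId = 0, fractionThreshold = 0.5)
    assert(large == Seq(0)) // only location 0 holds at least 50% of the output
    println(large)
  }
}
```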
## How was this patch tested?
Our client code unit test suite, which was reliably reproducing the race
condition with 10 threads, shows that this fixes it. I have not found a minimal
test case to add to Spark, but I will attempt to do so if desired.
The same test case was tripping up on SPARK-4454, which was fixed by
making other DAGScheduler code thread-safe.
@shivaram @srowen