Conversation

@a1k0n commented Mar 4, 2016

What changes were proposed in this pull request?

If a job scheduled in one thread depends on an RDD that is currently executing a shuffle in another thread, Spark throws a NullPointerException. This patch synchronizes access to mapStatuses and skips null status entries (which correspond to in-progress shuffle tasks).
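To illustrate the fix pattern described above (a minimal sketch, not the actual patch): iterate over the map statuses under synchronization and skip null entries, which belong to map tasks that have not finished yet. The types below are hypothetical stand-ins for Spark's internals:

```scala
// Hypothetical stand-ins for Spark's internal types, for illustration only.
case class BlockManagerId(host: String)
case class MapStatus(location: BlockManagerId, totalSize: Long)

// Sum output sizes per host while tolerating concurrent registration of
// new map outputs: synchronize on the statuses array and skip null
// entries, which belong to shuffle map tasks still in progress.
def outputSizesByHost(statuses: Array[MapStatus]): Map[String, Long] = {
  val sizeByHost = scala.collection.mutable.HashMap.empty[String, Long]
  statuses.synchronized {
    for (status <- statuses if status != null) {
      val host = status.location.host
      sizeByHost(host) = sizeByHost.getOrElse(host, 0L) + status.totalSize
    }
  }
  sizeByHost.toMap
}
```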

How was this patch tested?

Our client code's unit test suite, which reliably reproduced the race
condition with 10 threads, shows that this fixes it. I have not found a minimal
test case to add to Spark, but I will attempt to write one if desired.

The same test case was tripping up on SPARK-4454, which was fixed by
making other DAGScheduler code thread-safe.

@shivaram @srowen

Member

Could you clarify in which situation status will be null?

Author

Sure. Let me add a comment.

@zsxwing commented Mar 4, 2016

ok to test

Member

nice catch

@a1k0n commented Mar 4, 2016

I have to admit I don't completely understand the design, and this might just be treating a symptom of something else -- when getPreferredLocations walks the dependency graph, should those map output statuses really contain null entries? Can they be populated sooner? How does this work?

@SparkQA commented Mar 4, 2016

Test build #52427 has finished for PR 11505 at commit 8add38a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 4, 2016

Test build #52425 has finished for PR 11505 at commit 7fad8fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@a1k0n commented Mar 4, 2016

rebasing to pick up flaky test fix

Andy Sloane added 2 commits March 3, 2016 20:22
If a job is being scheduled in one thread which has a dependency on an
RDD currently executing a shuffle in another thread, Spark would throw a
NullPointerException.
@SparkQA commented Mar 4, 2016

Test build #52444 has finished for PR 11505 at commit 4f78803.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@a1k0n commented Mar 4, 2016

Jenkins, retest this please

@srowen commented Mar 4, 2016

Jenkins, retest this please

@SparkQA commented Mar 4, 2016

Test build #52462 has finished for PR 11505 at commit 4f78803.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen commented Mar 6, 2016

@zsxwing looks good to me, sounds like you're OK with it too

* @param fractionThreshold fraction of total map output size that a location must have
* for it to be considered large.
*
* This method is not thread-safe.
Member

Are we sure this makes it thread-safe? I agree it seems to resolve a problem that could arise when calling this from multiple threads, but I am not as clear it solves all of them. I'm OK with removing this if we have no particular reason to believe it's not thread-safe at this point.

Author

I don't see any reason to believe it's not thread-safe; it's kind of neither explicitly thread-safe nor explicitly unsafe IMO. It's possible that individual status entries might change their size from underneath this function, but I don't see any problem with that. I am not necessarily the best person to ask, though.
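For context on the method under review, here is a hedged sketch of what a fraction-threshold filter over per-host output sizes could look like; fractionThreshold mirrors the parameter in the doc comment quoted above, while the function name and input map are hypothetical:

```scala
// Illustrative only: keep hosts whose share of the total map output size
// is at least fractionThreshold, per the doc comment quoted above.
def largeLocations(sizeByHost: Map[String, Long],
                   fractionThreshold: Double): Seq[String] = {
  val total = sizeByHost.values.sum.toDouble
  if (total == 0) Seq.empty
  else sizeByHost.collect {
    case (host, size) if size / total >= fractionThreshold => host
  }.toSeq
}
```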

@srowen commented Mar 8, 2016

OK, well unless there's a moderately strong objection, I think we can go ahead and merge this, even for 1.6. It's a cheap defensive measure and I don't see a downside.

@zsxwing commented Mar 8, 2016

LGTM

@srowen commented Mar 9, 2016

Merged to master/1.6

asfgit pushed a commit that referenced this pull request Mar 9, 2016

Author: Andy Sloane <[email protected]>

Closes #11505 from a1k0n/SPARK-13631.

(cherry picked from commit cbff280)
Signed-off-by: Sean Owen <[email protected]>
@asfgit asfgit closed this in cbff280 Mar 9, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Andy Sloane <[email protected]>

Closes apache#11505 from a1k0n/SPARK-13631.
@a1k0n deleted the SPARK-13631 branch on February 15, 2017