Skip to content

Commit ac013ea

Browse files
committed
[SPARK-18846][SCHEDULER] Fix flakiness in SchedulerIntegrationSuite
There is a small race in SchedulerIntegrationSuite. The test assumes that the taskscheduler thread processing that last task will finish before the DAGScheduler processes the task event and notifies the job waiter, but that is not 100% guaranteed. ran the test locally a bunch of times, never failed, though admittedly it never failed locally for me before either. However I am nearly 100% certain this is what caused the failure of one jenkins build https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68694/consoleFull (which is long gone now, sorry -- I fixed it as part of apache#14079 initially) Author: Imran Rashid <irashid@cloudera.com> Closes apache#16270 from squito/sched_integ_flakiness.
1 parent cccd643 commit ac013ea

File tree

1 file changed

+12
-2
lines changed

1 file changed

+12
-2
lines changed

core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,8 @@ import scala.reflect.ClassTag
2828

2929
import org.scalactic.TripleEquals
3030
import org.scalatest.Assertions.AssertionsHelper
31+
import org.scalatest.concurrent.Eventually._
32+
import org.scalatest.time.SpanSugar._
3133

3234
import org.apache.spark._
3335
import org.apache.spark.TaskState._
@@ -157,8 +159,16 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: ClassTag] extends Spa
157159
}
158160
// When a job fails, we terminate before waiting for all the task end events to come in,
159161
// so there might still be a running task set. So we only check these conditions
160-
// when the job succeeds
161-
assert(taskScheduler.runningTaskSets.isEmpty)
162+
// when the job succeeds.
163+
// When the final task of a taskset completes, we post
164+
// the event to the DAGScheduler event loop before we finish processing in the taskscheduler
165+
// thread. It's possible the DAGScheduler thread processes the event, finishes the job,
166+
// and notifies the job waiter before our original thread in the task scheduler finishes
167+
// handling the event and marks the taskset as complete. So its ok if we need to wait a
168+
// *little* bit longer for the original taskscheduler thread to finish up to deal w/ the race.
169+
eventually(timeout(1 second), interval(10 millis)) {
170+
assert(taskScheduler.runningTaskSets.isEmpty)
171+
}
162172
assert(!backend.hasTasks)
163173
} else {
164174
assert(failure != null)

0 commit comments

Comments
 (0)