[SPARK-24795][Core][FOLLOWUP] Kill all running tasks when a task in a barrier stage fail #21943
|
@jiangxb1987, thanks! I am a bot who has found some folks who might be able to help with the review: @gatorsmile, @mateiz and @kayousterhout |
      interruptThread: Boolean,
      reason: String): Unit = synchronized {
    logInfo(s"Killing all running tasks in stage $stageId: $reason")
    taskSetsByStageIdAndAttempt.get(stageId).foreach { attempts =>
This duplicates code from cancelTasks and drops its useful comments. It would be great to move the common code here, together with the comments, and have cancelTasks call this method.
|
@jiangxb1987 Could you add a test for the new method? |
|
Test build #93882 has finished for PR 21943 at commit
|
|
Test build #93893 has finished for PR 21943 at commit
|
|
@mengxr Updated and added test cases, PTAL! |
  def submitTasks(taskSet: TaskSet): Unit

  // Cancel a stage.
  // Kill all the tasks in a stage and fail the stage and all the jobs that depend on the stage.
Is it guaranteed to work for any backend, such as YARN, Mesos, or K8s?
Updated the comment to note that if the backend doesn't support killing a task, the method shall throw UnsupportedOperationException.
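For readers following the backend question, here is a minimal sketch of that contract. It is a toy model, not Spark's scheduler backend: `ToyBackend`, `KillableBackend`, `NoKillBackend` and `ToyKiller` are hypothetical names made up for illustration. It only shows the idea that a backend without kill support rejects the request with UnsupportedOperationException, which a caller can then handle.

```scala
// Hypothetical trait for illustration only; not Spark's SchedulerBackend.
trait ToyBackend {
  // Default behaviour: a backend that cannot kill a task rejects the request.
  def killTask(taskId: Long, reason: String): Unit =
    throw new UnsupportedOperationException("killTask is not supported by this backend")
}

// A backend that supports killing and records which tasks it was asked to kill.
final class KillableBackend extends ToyBackend {
  private var killed = Vector.empty[Long]
  override def killTask(taskId: Long, reason: String): Unit = killed :+= taskId
  def killedTasks: Vector[Long] = killed
}

// A backend that inherits the default and therefore cannot kill tasks.
final class NoKillBackend extends ToyBackend

object ToyKiller {
  /** Try to kill every task; Left(message) if the backend cannot kill tasks. */
  def killAll(backend: ToyBackend, taskIds: Seq[Long], reason: String): Either[String, Int] =
    try {
      taskIds.foreach(id => backend.killTask(id, reason))
      Right(taskIds.size)
    } catch {
      case e: UnsupportedOperationException => Left(e.getMessage)
    }
}
```

In this sketch the caller turns the exception into a `Left`; whether a real caller should catch it or let it propagate is a design choice this example does not settle.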
|
LGTM |
|
Test build #93951 has finished for PR 21943 at commit
|
|
Test build #93945 has finished for PR 21943 at commit
|
|
retest this please |
|
LGTM |
|
Test build #93969 has finished for PR 21943 at commit
|
|
retest this please |
|
Test build #93968 has finished for PR 21943 at commit
|
|
thanks, merging to master! |
|
Test build #93996 has finished for PR 21943 at commit
|
What changes were proposed in this pull request?
Kill all running tasks when a task in a barrier stage fails in the middle.
TaskScheduler.cancelTasks() will also fail the job, so we implemented a new method, killAllTaskAttempts(), that just kills all running tasks of a stage without cancelling the stage/job.
How was this patch tested?
Added new test cases in TaskSchedulerImplSuite.
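The difference between the two methods can be illustrated with a small, self-contained toy model. This is a sketch under stated assumptions, not the code in this patch: `ToyScheduler`, `launch`, `runningTasks` and `isAborted` are hypothetical, and the real methods take more parameters (for example `interruptThread`). It shows `killAllTaskAttempts` killing the running tasks while leaving the stage alive, and `cancelTasks` reusing it before also aborting the stage, along the lines suggested in the review.

```scala
import scala.collection.mutable

// Toy model, NOT Spark's TaskSchedulerImpl: every name here is hypothetical.
final class ToyScheduler {
  // stageId -> ids of the tasks currently running in that stage
  private val running = mutable.Map.empty[Int, mutable.Set[Long]]
  // stages that have been aborted (in this model, aborting fails the job)
  private val aborted = mutable.Set.empty[Int]

  def launch(stageId: Int, taskId: Long): Unit = synchronized {
    running.getOrElseUpdate(stageId, mutable.Set.empty[Long]) += taskId
  }

  /** Kill every running task of the stage; the stage itself is not aborted. */
  def killAllTaskAttempts(stageId: Int, reason: String): Int = synchronized {
    val killed = running.remove(stageId).map(_.size).getOrElse(0)
    println(s"Killed $killed task(s) in stage $stageId: $reason")
    killed
  }

  /** Kill the running tasks via the shared method, then also abort the stage. */
  def cancelTasks(stageId: Int): Unit = synchronized {
    killAllTaskAttempts(stageId, "Stage cancelled")
    aborted += stageId
  }

  def runningTasks(stageId: Int): Int = synchronized {
    running.get(stageId).map(_.size).getOrElse(0)
  }

  def isAborted(stageId: Int): Boolean = synchronized {
    aborted.contains(stageId)
  }
}
```

With this model, a failed barrier task would trigger `killAllTaskAttempts` so the stage can be retried as a whole, whereas `cancelTasks` remains the path that gives up on the stage.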