[SPARK-25250][CORE] : Late zombie task completions handled correctly even before new taskset launched #22806
Changes from 1 commit
Refactor the method name to completeTasks; also call the same method on task completion in a ShuffleMapStage, but without killing the tasks there.
@@ -288,20 +288,22 @@ private[spark] class TaskSchedulerImpl(
   /**
    * SPARK-25250: Whenever any Result Task gets successfully completed, we simply mark the
-   * corresponding partition id as completed in all attempts for that particular stage. As a
-   * result, we do not see any Killed tasks due to TaskCommitDenied Exceptions showing up
-   * in the UI.
+   * corresponding partition id as completed in all attempts for that particular stage and
+   * additionally, for a Result Stage, we also kill the remaining task attempts running on the
+   * same partition. As a result, we do not see any Killed tasks due to
+   * TaskCommitDenied Exceptions showing up in the UI.
    */
-  override def markPartitionIdAsCompletedAndKillCorrespondingTaskAttempts(
-      partitionId: Int, stageId: Int): Unit = {
+  override def completeTasks(partitionId: Int, stageId: Int, killTasks: Boolean): Unit = {
     taskSetsByStageIdAndAttempt.getOrElse(stageId, Map()).values.foreach { tsm =>
       tsm.partitionToIndex.get(partitionId) match {
         case Some(index) =>
           tsm.markPartitionIdAsCompletedForTaskAttempt(index)
-          val taskInfoList = tsm.taskAttempts(index)
-          taskInfoList.filter(_.running).foreach { taskInfo =>
-            killTaskAttempt(taskInfo.taskId, false,
-              s"Corresponding Partition ID $partitionId has been marked as Completed")
+          if (killTasks) {
+            val taskInfoList = tsm.taskAttempts(index)
+            taskInfoList.filter(_.running).foreach { taskInfo =>
+              killTaskAttempt(taskInfo.taskId, false,
+                s"Corresponding Partition ID $partitionId has been marked as Completed")
+            }
           }
         case None =>
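To make the intent of the new killTasks flag concrete, here is a minimal sketch of the call pattern described in the commit message. It is not the actual caller code from this PR: the helper name onSuccessfulTaskCompletion and the way the scheduler, task, and stage id reach the call site are assumptions for illustration only.

```scala
import org.apache.spark.scheduler.{ResultTask, ShuffleMapTask, Task, TaskSchedulerImpl}

// Hypothetical helper (not part of this PR) showing how completeTasks is meant
// to be invoked from the task-completion path.
object CompletionSketch {
  def onSuccessfulTaskCompletion(
      scheduler: TaskSchedulerImpl,
      task: Task[_],
      stageId: Int): Unit = {
    task match {
      case _: ResultTask[_, _] =>
        // Result stage: mark the partition as completed in every attempt of
        // the stage and also kill the remaining attempts still running on it.
        scheduler.completeTasks(task.partitionId, stageId, killTasks = true)
      case _: ShuffleMapTask =>
        // Shuffle map stage: only mark the partition as completed; leave the
        // other running attempts alone (see the review discussion below).
        scheduler.completeTasks(task.partitionId, stageId, killTasks = false)
      case _ =>
        // Other task types are out of scope for this sketch.
    }
  }
}
```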
Why not kill the shuffle map tasks as well?
+1
As per @squito's earlier comment, it was agreed that in this PR we kill tasks only for the result stage (as an extension of SPARK-25773). I am not sure whether we can kill ShuffleMapTasks here as well, but I do not see any harm in doing so. @squito, WDYT?
Sorry to be late to respond here, I have been traveling. This question has come up a lot, and while there are reasons to do it, there are some complications as well, and I don't think we should roll that change into this PR, which is trying to solve a different bug. In short, it has been argued in the past that a zombie shuffle map task may still make useful progress on other partitions. There are also complications with handling tasks that don't respond well to killing (I think Hadoop input readers?). To be honest, I feel there is now a stronger argument in favor of doing the killing, though we'd probably want it behind a conf. So I'd be +1 for the change, just that it should be a separate one. (And I'm probably not recalling all of the gotchas with killing tasks at the moment, so maybe a dedicated discussion can dredge up all the cases we need to think through.)
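For reference, a minimal sketch of what putting the kill behavior behind a conf could look like, using Spark's internal ConfigBuilder. The config key, doc string, and default value below are invented for illustration and are not part of this PR or of Spark.

```scala
import org.apache.spark.internal.config.ConfigBuilder

object KillBehindConfSketch {
  // Hypothetical config entry (key, doc, and default invented for illustration):
  // gate whether completing a partition also kills the remaining running
  // attempts on shuffle map stages.
  val KILL_REDUNDANT_SHUFFLE_MAP_ATTEMPTS =
    ConfigBuilder("spark.scheduler.killRedundantShuffleMapAttempts")
      .doc("When true, kill task attempts that are still running on a partition " +
        "that has already been marked as completed in a shuffle map stage.")
      .booleanConf
      .createWithDefault(false)

  // The scheduler could then consult it at the call site, e.g.:
  //   completeTasks(partitionId, stageId,
  //     killTasks = conf.get(KILL_REDUNDANT_SHUFFLE_MAP_ATTEMPTS))
}
```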
If that's the case, maybe we should not kill the result tasks either, to be super safe.