[SPARK-17451] [CORE] CoarseGrainedExecutorBackend should inform driver before self-kill #15013
…r before self-kill
cc @zsxwing for review

Test build #65104 has finished for PR 15013 at commit

Test build #65111 has finished for PR 15013 at commit

    if (notifyDriver) {
      logInfo(s"Notifying the driver before exiting the executor")
      rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>

There is no need to connect to the driver if `driver` is None; that usually means we could not connect to the driver in the first place. The code then becomes very simple:

    if (notifyDriver && driver.nonEmpty) {
      driver.get.ask[Boolean](RemoveExecutor(executorId, new ExecutorLossReason(reason))).onFailure { case e =>
        logWarning(s"Unable to notify the driver due to " + e.getMessage, e)
      }(ThreadUtils.sameThread)
    }

Done, made this change.
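
For readers without the Spark sources at hand, the pattern discussed in this thread (notify only when a driver reference already exists, and merely log if the notification fails) can be sketched in plain Scala. This is an illustration under stated assumptions, not the code that was merged: `DriverRef`, `RemoveExecutorMsg`, and `ExitNotifier` are hypothetical stand-ins for Spark's RPC types, and the function returns a status string instead of calling `logWarning` so the outcome is easy to inspect.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the message sent to the driver (not Spark's class).
final case class RemoveExecutorMsg(executorId: String, reason: String)

// Hypothetical stand-in for an RPC endpoint reference with an async ask.
trait DriverRef {
  def ask(msg: RemoveExecutorMsg): Future[Boolean]
}

object ExitNotifier {
  /**
   * Best-effort notification before the executor exits.
   * - No driver reference (or notifyDriver = false): skip, do not try to connect.
   * - Ask fails: recover into a warning-style message instead of propagating.
   */
  def notifyBeforeExit(
      driver: Option[DriverRef],
      executorId: String,
      reason: String,
      notifyDriver: Boolean = true)(implicit ec: ExecutionContext): Future[String] =
    driver match {
      case Some(ref) if notifyDriver =>
        ref.ask(RemoveExecutorMsg(executorId, reason))
          .map(ack => s"driver acknowledged: $ack")
          .recover { case e => s"unable to notify the driver due to ${e.getMessage}" }
      case _ =>
        Future.successful("skipped notification")
    }
}
```

In a real executor the caller would presumably not block on the returned future for long before exiting the JVM; how long to wait, if at all, is a design choice this sketch leaves open.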
Test build #65113 has finished for PR 15013 at commit

Test build #65124 has finished for PR 15013 at commit

Done with all changes. Ready for review.

@zsxwing: ping

Retest this please. LGTM. Let's run the test again!

Test build #3264 has finished for PR 15013 at commit

Test build #3265 has finished for PR 15013 at commit

LGTM. Thanks! Merging to master.
… before self-kill

## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-17451

In some failure cases, `CoarseGrainedExecutorBackend` exits the JVM. That is not a problem in itself, but the driver UI captures no specific reason for the exit. This PR adds functionality to `exitExecutor` to notify the driver that the executor is exiting.

## How was this patch tested?

Ran the change in a test environment and took down the shuffle service before the executor could register with it. In the driver logs, where the job failure reason is mentioned (i.e. `Job aborted due to stage ...`), it now gives the correct reason:

Before:

`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.`

After:

`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Unable to create executor due to java.util.concurrent.TimeoutException: Timeout waiting for task.`

Author: Tejas Patil <[email protected]>

Closes apache#15013 from tejasapatil/SPARK-17451_inform_driver.