
Conversation

@tejasapatil
Contributor

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17451

CoarseGrainedExecutorBackend exits the JVM in some failure cases. While this in itself is not a problem, the driver UI does not capture any specific reason for the exit. In this PR, I am adding functionality to exitExecutor to notify the driver that the executor is exiting.

How was this patch tested?

Ran the change in a test environment and took down the shuffle service before the executor could register with it. In the driver logs, where the job failure reason is mentioned (i.e. Job aborted due to stage ...), it now gives the correct reason:

Before:
ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

After:
ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Unable to create executor due to java.util.concurrent.TimeoutException: Timeout waiting for task.

@tejasapatil
Contributor Author

cc @zsxwing for review

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65104 has finished for PR 15013 at commit 5bd5534.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65111 has finished for PR 15013 at commit 0242f1e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


if (notifyDriver) {
  logInfo(s"Notifying the driver before exiting the executor")
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
Member

@zsxwing zsxwing Sep 8, 2016


No need to connect to the driver if driver is None — that usually means we could not connect to the driver in the first place. Then the code becomes very simple:

if (notifyDriver && driver.nonEmpty) {
  driver.get
    .ask[Boolean](RemoveExecutor(executorId, new ExecutorLossReason(reason)))
    .onFailure { case e =>
      logWarning(s"Unable to notify the driver due to " + e.getMessage, e)
    }(ThreadUtils.sameThread)
}

Contributor Author


Done, made this change.

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65113 has finished for PR 15013 at commit ea76f6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65124 has finished for PR 15013 at commit 71fa2e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

Done with all changes. Ready for review.

@tejasapatil
Contributor Author

@zsxwing : ping

@zsxwing
Member

zsxwing commented Sep 14, 2016

retest this please.

LGTM. Let's run the test again!

@SparkQA

SparkQA commented Sep 14, 2016

Test build #3264 has finished for PR 15013 at commit 71fa2e3.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2016

Test build #3265 has finished for PR 15013 at commit 71fa2e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Sep 15, 2016

LGTM. Thanks! Merging to master.

@asfgit asfgit closed this in b479278 Sep 15, 2016
wgtmac pushed a commit to wgtmac/spark that referenced this pull request Sep 19, 2016
… before self-kill

Author: Tejas Patil <[email protected]>

Closes apache#15013 from tejasapatil/SPARK-17451_inform_driver.
@tejasapatil tejasapatil deleted the SPARK-17451_inform_driver branch September 20, 2016 00:07
zzcclp added a commit to zzcclp/spark that referenced this pull request Sep 20, 2016