@@ -22,6 +22,4 @@ private[master] object ApplicationState extends Enumeration {
   type ApplicationState = Value
 
   val WAITING, RUNNING, FINISHED, FAILED, KILLED, UNKNOWN = Value
-
-  val MAX_NUM_RETRY = 10
 }
@@ -58,6 +58,7 @@ private[deploy] class Master(
   private val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
   private val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
   private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
+  private val MAX_EXECUTOR_RETRIES = conf.getInt("spark.deploy.maxExecutorRetries", 10)
 
   val workers = new HashSet[WorkerInfo]
   val idToApp = new HashMap[String, ApplicationInfo]
@@ -265,7 +266,11 @@ private[deploy] class Master(
 
       val normalExit = exitStatus == Some(0)
       // Only retry certain number of times so we don't go into an infinite loop.
-      if (!normalExit && appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY) {
+      // Important note: this code path is not exercised by tests, so be very careful when
+      // changing this `if` condition.
+      if (!normalExit
+          && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
+          && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
        val execs = appInfo.executors.values
        if (!execs.exists(_.state == ExecutorState.RUNNING)) {
          logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
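The rewritten condition relies on `&&` short-circuiting, so the clause order matters: on a normal exit (status 0) the first clause is false and `incrementRetryCount()` is never invoked at all, while every abnormal exit bumps the counter before the `MAX_EXECUTOR_RETRIES >= 0` guard is consulted. A minimal standalone sketch, using a hypothetical `Counter` class rather than Spark's `ApplicationInfo`, illustrates this:

```scala
// Toy model of the `if` condition above. `Counter` and `shouldKill` are
// hypothetical names for illustration; they are not part of the patch.
class Counter {
  private var n = 0
  def incrementRetryCount(): Int = { n += 1; n }  // side effect, like ApplicationInfo's
  def count: Int = n
}

object ShortCircuitDemo {
  // Mirrors the clause order of the patched condition.
  def shouldKill(normalExit: Boolean, c: Counter, maxRetries: Int): Boolean =
    !normalExit && c.incrementRetryCount() >= maxRetries && maxRetries >= 0

  def main(args: Array[String]): Unit = {
    val c = new Counter
    shouldKill(normalExit = true, c, maxRetries = 10)
    println(c.count)  // 0: a normal exit short-circuits before the increment
    shouldKill(normalExit = false, c, maxRetries = 10)
    println(c.count)  // 1: each abnormal exit increments the counter
  }
}
```

Note that with this ordering the counter is still incremented even when `maxRetries` is negative; only the final kill decision is disabled.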
docs/spark-standalone.md (15 additions, 0 deletions)
@@ -195,6 +195,21 @@ SPARK_MASTER_OPTS supports the following system properties:
     the whole cluster by default. <br/>
   </td>
 </tr>
+<tr>
+  <td><code>spark.deploy.maxExecutorRetries</code></td>
+  <td>10</td>
+  <td>
+    Limit on the maximum number of back-to-back executor failures that can occur before the
+    standalone cluster manager removes a faulty application. An application will never be removed
+    if it has any running executors. If an application experiences more than
+    <code>spark.deploy.maxExecutorRetries</code> failures in a row, no executors
+    successfully start running in between those failures, and the application has no running
+    executors, then the standalone cluster manager will remove the application and mark it as
+    failed. To disable this automatic removal, set <code>spark.deploy.maxExecutorRetries</code>
+    to <code>-1</code>.
+    <br/>
+  </td>
+</tr>
 <tr>
   <td><code>spark.worker.timeout</code></td>
   <td>60</td>

Review thread on the "failures in a row" wording:

Contributor: Does "in a row" mean anything here?

Author: Yes: if you have a sequence of executor events like FAIL RUNNING FAIL RUNNING ... then this resets the retry count, whereas FAIL FAIL FAIL FAIL ... increments it.
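The reset semantics the author describes can be sketched as a small toy model: a successfully running executor resets the count, a failed executor increments it, and the application is removed once the count reaches the configured maximum, unless that maximum is negative. `RetryModel`, its `Event` type, and `wouldRemove` are hypothetical names for illustration only, not Spark's actual classes:

```scala
// Toy model (hypothetical names, not Spark code) of the documented behavior:
// RUNNING resets the retry count, FAILED increments it, and a negative
// maximum disables application removal entirely.
object RetryModel {
  sealed trait Event
  case object Running extends Event
  case object Failed extends Event

  /** Returns true if the application would be removed after these events. */
  def wouldRemove(events: Seq[Event], maxRetries: Int = 10): Boolean = {
    var retryCount = 0
    for (e <- events) {
      e match {
        case Running => retryCount = 0   // a successful start resets the count
        case Failed  => retryCount += 1  // consecutive failures accumulate
      }
      if (maxRetries >= 0 && retryCount >= maxRetries) return true
    }
    false
  }

  def main(args: Array[String]): Unit = {
    val flapping = Seq.fill(20)(Seq(Failed, Running)).flatten  // FAIL RUNNING FAIL RUNNING ...
    val crashing = Seq.fill(10)(Failed)                        // FAIL FAIL FAIL ...
    println(wouldRemove(flapping))                   // never hits the limit
    println(wouldRemove(crashing))                   // ten straight failures do
    println(wouldRemove(crashing, maxRetries = -1))  // -1 disables removal
  }
}
```

Under this model, forty alternating FAIL/RUNNING events never trigger removal, while ten consecutive failures do, matching the author's reply.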