Conversation

@piaozhexiu

Make sure numExecutorsPending in ExecutorAllocationManager is non-negative.

@sryza
Contributor

sryza commented Apr 16, 2015

It would be good to come up with a test that can reproduce the issue. I believe that it is actually supposed to be acceptable for numExecutorsPending to sit below 0, in situations where we have more executors than we need. My suspicion is that simply capping numExecutorsPending at 0 may turn out to be a duct tape solution, and that we're slowly leaking downwards. My further suspicion is that the right solution involves removing executorsPendingToRemove from the targetNumExecutors calculation - when we have too many executors we're probably double counting.

This is sort of handwavy, but I'm happy to look deeper if you'd like.

@piaozhexiu
Author

Thank you for your thoughts, Sandy! Why don't I try to write a test first? I'll ask for help if I can't come up with one.

@andrewor14
Contributor

ok to test

@sryza
Contributor

sryza commented Apr 16, 2015

Awesome, sounds good Cheolsoo.

@SparkQA

SparkQA commented Apr 16, 2015

Test build #30428 has finished for PR 5536 at commit 42b5b0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@jerryshao
Contributor

I've also run into this issue: start spark-shell with ./bin/spark-shell --master yarn --deploy-mode client and run sc.parallelize(1 to 100).map(i => (i, i)).collect(). After the job finishes, numExecutorsPending will be -1. This might be useful for reproducing the issue :).

That said, this negative value seems to do no harm to the rest of the system.

@andrewor14
Contributor

I believe this is the right fix, but it's actually fairly non-trivial to determine the correctness of this change.

When we request a new total that is less than the current total, YarnAllocator will lower the number of pending executor requests but will not kill existing executors to get to that total. Thus, the total expected by the ExecutorAllocationManager diverges from the YarnAllocator's, and the ExecutorAllocationManager uses this new, incorrect total to compute the number of pending executors, leading to a negative value.

A simple example: We have 10 executors with nothing pending to be added or removed. Then suddenly we finish running all tasks, such that the number of executors needed is now 0. Then, when we try to cancel, we will request a total of 0 executors. If we follow the math in updateNumPendingExecutors, we will get -10:

// 0 - 10 + 0 = -10
numExecutorsPending =
  newTotalExecutors - executorIds.size + executorsPendingToRemove.size

which doesn't make sense because we can't have a negative number of pending executors.

Why is this a correct fix then? Well, the only place where we ever request a total less than the current one is when we cancel outstanding requests, and when we cancel requests the YarnAllocator will at most cancel all pending requests, but it will not kill existing executors. Limiting numExecutorsPending to be at least 0 essentially causes the same behavior on the ExecutorAllocationManager side.
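
A minimal sketch of the cap, reusing the names from the calculation above (the actual code in the patch may differ slightly):

// Cap the pending count so it can never go below zero
numExecutorsPending = math.max(0,
  newTotalExecutors - executorIds.size + executorsPendingToRemove.size)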

@andrewor14
Contributor

By the way, I plan to rewrite a large part of this logic in a way that is more intuitive, after which any test you write here may not apply anymore, so it might make sense to hold off on that for now. This code has grown so unmanageable that even I, the original author of this feature, need to spend a significant chunk of time following the logic to understand the root cause of the issue.

For this reason I'm going to merge this as is into master and 1.3. Thanks @piaozhexiu @sryza @jerryshao.

@sryza
Contributor

sryza commented Apr 20, 2015

Mind if I take a look before you merge? I think I'm the last one to have touched this code.


@andrewor14
Contributor

Ok, let me know if you have any other comments.

@sryza
Contributor

sryza commented Apr 21, 2015

I've thought about this more and can now articulate why I believe the present fix is incorrect.

Aside from the weirdness of the idea that there can be a negative number of pending executors, I think the behavior that you (Andrew) describe in your example is not a problem. numPendingExecutors ends up as -10. targetNumExecutors is then calculated as numPendingExecutors + executorIds.size - executorsPendingRemoval.size, which comes out to 0. This is the correct number that we'd like to pass on to YarnAllocator, which tells YarnAllocator that if executors die naturally or are preempted, it shouldn't try to replace them because there's no need. If we simply force numExecutorsPending to be 0 in that situation, targetNumExecutors would be calculated as 10, which is incorrect.

The problem is that some sort of double counting is going on. I think executors that we don't want can be counted twice: once in the form of a negative numPendingExecutors and again as part of executorsPendingRemoval.
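
To make the arithmetic concrete, here are both cases from Andrew's 10-executor example (illustrative values only, not code from the patch):

// 10 registered executors, none pending removal
val numRegistered = 10
val numPendingRemoval = 0
val withoutCap = -10 + numRegistered - numPendingRemoval  // 0: matches the real need
val withCap = 0 + numRegistered - numPendingRemoval       // 10: overstates the need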

@sryza
Contributor

sryza commented Apr 21, 2015

More broadly:

I agree with you that the current code is really, really difficult to reason about. When I added the ability to cancel executor requests, I spent a while thinking about how we could rewrite everything to make the logic more intuitive and wasn't able to come up with anything great. I definitely wouldn't rule out the possibility that you can come up with something better. But even if a rewrite is possible, it's likely not going to be an easy or quick task, so I don't think we should let the prospect of one hold us back from writing tests. As long as we're stuck with the current code, they're an important way to help us manage its complexity.

@piaozhexiu
Author

Hi @sryza , @andrewor14 ,

I attached two diagrams that show how the following 4 variables change over time w/ and w/o my fix:

  • numExecutorsPending
  • executorIds.size
  • executorsPendingToRemove.size
  • targetNumExecutors

https://issues.apache.org/jira/secure/attachment/12726785/with_fix.png
https://issues.apache.org/jira/secure/attachment/12726786/without_fix.png

I collected them at every call to targetNumExecutors(). Hope this helps you understand what's going on.

@piaozhexiu
Author

> My suspicion is that simply capping numExecutorsPending at 0 may turn out to be a duct tape solution, and that we're slowly leaking downwards.

@sryza, I cannot prove whether my fix is right or wrong, so I'll leave that to you guys. But at least I don't see the side effects you're worried about with my fix. It seems to work nicely.

@srowen
Member

srowen commented Apr 21, 2015

Trying to synthesize the several comments here: obviously we want to avoid the exception reported in the JIRA. Sandy, you say that it's OK for pending executors to be -10, but does that mean that whatever is making the request with value "-10" now needs logic to never request anything < 0 instead?

That is, are you suggesting that the capping should not actually change the value from -10? Does the value stay at -10 indefinitely then? That's what the new charts on the JIRA suggest. I suppose there is no allocation to "fulfill" that request for -10 pending executors and decrease it by -10.

Is the issue that the shutdown of the executors should update the pending count too? -10 pending executors would mean "waiting to see 10 executors stop".

@sryza
Contributor

sryza commented Apr 21, 2015

Thanks for putting together those graphs Cheolsoo. I think they're really helpful for thinking about this.

The side effect that I'm worried about is targetNumExecutors being higher than it needs to be. To know what it "needs to be", we would also need lines on the graph showing the number of running + pending tasks (not that I think you should waste time on that - I think my deductions a couple of comments above should be proof that this can happen). The without_fix graph confirms my suspicion that the trouble starts when we begin issuing kill requests. Before that, targetNumExecutors never dives below 0. As soon as there are executors pending removal, it does, which triggers the exception.

@andrewor14
Contributor

Let's take a step back and summarize the problem and the proposed solution to make sure that we're all on the same page. This issue is actually fairly complicated, so I think it's worth exploring in some detail. I would be happy to hear your thoughts about this. @srowen @piaozhexiu @sryza


Symptom.

There are instances where the ExecutorAllocationManager (EAM) requests a negative total number of executors from the YarnAllocator, and this results in an exception because the code correctly does not allow this. I think we all agree on this part. Now let's trace through an example of how we can run into this exception.

Problem illustrated with an example.

(0) Let's say we start with 10 executors with nothing pending to be added or removed
(1) Some tasks have finished such that the scheduler tells us we only need 5.
(2) Next time we call schedule we attempt to cancel 5 pending requests (but there are none)
(3) We set numExecutorsPending = newTarget - executorIds.size + [...] = 5 - 10 + 0 = -5
(4) Next time we add executors our new target will be

// Let's say `numExecutorsToAdd == 1`
val currentTarget = numExecutorsPending + executorIds.size - [...] = -5 + 10 - 0 = 5
val newTarget = math.min(currentTarget + numExecutorsToAdd, [...]) = 5 + 1 = 6

(5) Thus, we only request 6 executors when we're trying to add, even though we already have 10.

Now imagine that, between steps (3) and (4), a number of executors fail (or are removed; the distinction is not important). For instance, if all of the existing executors fail, then we will end up with a negative target total (-4 in this case, computed using the same equation above), resulting in the exception. This is consistent with the description in the JIRA that this only happens if we remove executors too quickly.
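
For concreteness, here is that worst case worked out with illustrative values (same quantities as in the formula above, not code from the patch):

// All 10 executors have failed between steps (3) and (4)
val numExecutorsPending = -5   // from step (3)
val numRegistered = 0          // executorIds.size after the failures
val numPendingRemoval = 0
val numExecutorsToAdd = 1
val currentTarget = numExecutorsPending + numRegistered - numPendingRemoval  // -5
val newTarget = currentTarget + numExecutorsToAdd                            // -4, triggers the exception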

Underlying cause.

After a cancel, the EAM unconditionally adjusts the new target downwards whether or not there are pending requests to be canceled, and then computes numExecutorsPending using this new target. If there are no pending executors, then the allocator actually ignores the cancel request because there is nothing to cancel. Meanwhile, the EAM expects the new total number of executors to go down as well, but this never happens.

In other words, when we cancel, we should adjust the new target downwards by at most numExecutorsPending, since the allocator will only cancel pending requests, not kill existing executors.
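
A rough sketch of that idea with hypothetical names (this is not what the patch does, just one way to express the bound):

// Only cancel what is actually outstanding; never "cancel" running executors
val currentTarget = 10
val desiredTarget = 0
val numExecutorsPending = 0   // nothing outstanding to cancel
val decrease = math.min(currentTarget - desiredTarget, numExecutorsPending)
val newTarget = currentTarget - decrease  // stays at 10, matching what the allocator actually does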

Proposed solution.

This patch ensures that numExecutorsPending never goes below 0. This means the next time EAM tries to add executors, it will use a target total that is at least the number of currently registered executors, and this must be >= 0.

This behavior also matches the state on the allocator side in most cases. When it receives a cancel request, the allocator will cancel at most the number of pending requests that are outstanding (clearly this cannot be negative).

Note that even this solution is not perfect. There is a race condition in which a few of the pending requests we are trying to cancel may already have been granted by the time the cancel request reaches the allocator. In this case, the EAM will get more executors than it needs, but that is fine because they will eventually be removed anyway. It's worth noting that this whole cancel feature is best-effort in the first place, so it's OK for the cancel to not always take effect, but it's not OK for the cancel to cause the add to fail.
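
To illustrate the effect on the add path, here is the earlier worst case again with the cap in place (illustrative values only):

// All executors have failed, but numExecutorsPending is capped at 0 instead of -5
val numExecutorsPending = 0
val numRegistered = 0
val numPendingRemoval = 0
val numExecutorsToAdd = 1
val newTarget = numExecutorsPending + numRegistered - numPendingRemoval + numExecutorsToAdd  // 1, never negative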

@sryza
Contributor

sryza commented Apr 21, 2015

Thanks Andrew, I think you've nailed the part of the problem that I was missing: when executors die or are killed, executorIds.size decreases. If numPendingExecutors is negative at this point, this can lead targetNumExecutors to go below 0.

My opinion is that it's best not to think in terms of adding and canceling requests, but to think in terms of expressing a preference about the total number of executors. I believe that it's much easier to reason about invariants around this number than it is to reason about sequences of add, cancel, and remove events. Also, canceling isn't entirely accurate because the YarnAllocator doesn't ignore requests for fewer total executors than the current number allocated. It registers the new number as its target number of executors, which affects the behavior when an executor dies or is killed.

What we ultimately care about is that targetNumExecutors should never go below 0. We only care about numExecutorsPending going below 0 to the extent that it causes the former to occur.

I think we can fully specify the behavior of targetNumExecutors in a reasonable way:

  • targetNumExecutors should reflect the number of executors YarnAllocator should be willing to submit requests for. I.e. if all our executors were to suddenly die, how many requests would YarnAllocator submit to YARN.
  • targetNumExecutors can change in two ways:
  1. When the maximum number of executors needed (based on current running + pending tasks) is below the current target, we decrease the current target to equal it.

  2. When the maximum number of executors needed is above the current target, we increase the current target according to the exponential add policy.

With this view, we can separate killing executors from the targetNumExecutors calculation entirely. When targetNumExecutors is less than executorIds.size for a given period of time, we issue kill requests. These kill requests do not affect targetNumExecutors, neither while the kill is pending nor after the kill has gone through.
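
Just to sketch the bookkeeping I have in mind (illustrative names only, not actual code):

// Maintain targetNumExecutors directly; no separate numExecutorsPending
var targetNumExecutors = 1

def updateTarget(maxNeeded: Int, numExecutorsToAdd: Int): Unit = {
  if (maxNeeded < targetNumExecutors) {
    // Shrink straight down to what the running + pending tasks require
    targetNumExecutors = maxNeeded
  } else if (maxNeeded > targetNumExecutors) {
    // Grow according to the exponential add policy, never past what's needed
    targetNumExecutors = math.min(targetNumExecutors + numExecutorsToAdd, maxNeeded)
  }
  // Kill requests are decided separately (when registered executors exceed
  // targetNumExecutors for long enough) and do not feed back into this value.
}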

I'm pretty convinced that this approach is the most straightforward and easiest to reason about. It requires slightly modifying the logic for removing executors, but I think in a way that's in line with its original goals. We also get rid of numExecutorsPending and track targetNumExecutors directly. If this seems reasonable to you Andrew, I'll put up a patch that makes the required changes.

@sryza
Contributor

sryza commented Apr 26, 2015

#5704 demonstrates what I outlined in my above comment, and should supersede this PR.

@AmplabJenkins

Can one of the admins verify this patch?

@andrewor14
Contributor

@piaozhexiu we ended up merging #5704 instead. Would you mind closing this one?

@piaozhexiu
Author

Thank you guys for fixing it. I am closing it now.

@piaozhexiu piaozhexiu closed this May 2, 2015