[SPARK-6954] [YARN] Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative #5536
Conversation
It would be good to come up with a test that can reproduce the issue. I believe that it is actually supposed to be acceptable for numExecutorsPending to sit below 0 in situations where we have more executors than we need. My suspicion is that simply capping numExecutorsPending at 0 may turn out to be a duct-tape solution, and that we're slowly leaking downwards. My further suspicion is that the right solution involves removing executorsPendingToRemove from the targetNumExecutors calculation - when we have too many executors we're probably double counting. This is sort of handwavy, but I'm happy to look deeper if you'd like.
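For readers following along, here is a minimal illustrative sketch of the target calculation this thread keeps referring to. The parameter names mirror the fields discussed above (numExecutorsPending, executorIds, executorsPendingToRemove), but the helper itself is a simplification, not the actual ExecutorAllocationManager code.

```scala
// Illustrative only: the target the EAM derives from its bookkeeping,
// as described in this thread (not the actual Spark implementation).
object TargetCalc {
  def targetNumExecutors(
      numExecutorsPending: Int,       // requests sent but not yet granted
      numExistingExecutors: Int,      // executorIds.size
      numPendingToRemove: Int): Int = // executorsPendingToRemove.size
    numExecutorsPending + numExistingExecutors - numPendingToRemove
}
```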
Thank you for your thoughts, Sandy! Why don't I try to write a test first? I'll ask for help if I can't.
ok to test
Awesome, sounds good Cheolsoo.
Test build #30428 has finished for PR 5536 at commit
I've also detected this issue by running spark-shell with dynamic allocation enabled, but it seems this negative value does no harm to the whole system.
I believe this is the right fix, but it's actually fairly non-trivial to determine the correctness of this change. When we request a new total that is less than the current total, numExecutorsPending can become negative.

A simple example: we have 10 executors with nothing pending to be added or removed. Then suddenly we finish running all tasks, such that the number of executors needed is now 0. Then, when we try to cancel, we will request a total of 0 executors. If we follow the math, numExecutorsPending ends up at -10, which doesn't make sense because we can't have a negative number of pending executors.

Why is this a correct fix then? Well, the only place where we ever request a total less than the current one is when we cancel outstanding requests, and when we cancel requests the intent is only to bring the pending count back toward 0, never below it.
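To make the arithmetic in this example concrete, here is a small hedged sketch, assuming the cancel path simply sets the pending count to the requested total minus the executors we already have (a simplification of the real code path):

```scala
// Hypothetical walk-through of the example above; not Spark code.
object CancelExample extends App {
  val executorIds = (1 to 10).map(i => s"exec-$i").toSet // 10 running executors
  var numExecutorsPending = 0                            // nothing pending to add

  // All tasks finish, so the desired total drops to 0 and we "cancel"
  // outstanding requests by asking for that lower total.
  val requestedTotal = 0
  numExecutorsPending = requestedTotal - executorIds.size

  println(numExecutorsPending) // -10: a negative number of "pending" executors
}
```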
By the way, I plan to rewrite a large part of this logic in a way that is more intuitive, after which any test you write here may not apply anymore, so it might make sense to hold off on that for now. This code has grown to be quite unmanageable, such that even I, the original author of this feature, need to spend a significant chunk of time trying to follow the logic and understand the root cause of the issue. For this reason I'm going to merge this as is into master and 1.3. Thanks @piaozhexiu @sryza @jerryshao.
Mind if I take a look before you merge? I think I'm the last one to have touched this code.
Ok, let me know if you have any other comments.
I've thought about this more and think I'm able to articulate why I think the present fix is incorrect. Aside from the weirdness of the idea that there can be a negative number of pending executors, I think the behavior that you (Andrew) describe in your example is not a problem. numPendingExecutors ends up as -10. targetNumExecutors is then calculated as numPendingExecutors + executorIds.size - executorsPendingRemoval.size, which comes out to 0. This is the correct number that we'd like to pass on to YarnAllocator: it tells YarnAllocator that if executors die naturally or are preempted, it shouldn't try to replace them because there's no need.

If we simply force numExecutorsPending to be 0 in that situation, targetNumExecutors would be calculated as 10, which is incorrect. The problem is that some sort of double counting is going on: I think executors that we don't want can be counted both in the form of a negative numPendingExecutors and as part of executorsPendingRemoval.
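To put numbers on this, a hedged sketch of the two calculations being contrasted, using the same formula quoted above; the specific values follow Andrew's 10-executor example:

```scala
// Illustrative numbers only, following the comment above.
object CapEffectExample extends App {
  val numExisting = 10      // executorIds.size: the executors we no longer need
  val numPendingRemoval = 0 // executorsPendingRemoval.size (no kills issued yet)

  def target(numPendingExecutors: Int): Int =
    numPendingExecutors + numExisting - numPendingRemoval

  println(target(-10)) // 0  -- the total we actually want to report to YarnAllocator
  println(target(0))   // 10 -- what forcing numPendingExecutors to 0 would report
}
```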
More broadly: I agree with you that the current code is really, really difficult to reason about. When I added the ability to cancel executor requests, I spent a while thinking about how we could rewrite everything to make the logic more intuitive and wasn't able to come up with anything great. I definitely wouldn't rule out the possibility that you can come up with something better. But even if that's possible, it's likely not going to be an easy or quick task, so I don't think we should allow the prospect of a rewrite to hold us back from writing tests. As long as we're stuck with the current code, they're an important way to help us manage its complexity.
Hi @sryza, @andrewor14, I attached two diagrams that show how four of the relevant variables change over time w/ and w/o my fix:
https://issues.apache.org/jira/secure/attachment/12726785/with_fix.png
I collected the values on every call into the allocation logic.

@sryza, I cannot prove whether my fix is right or wrong, and I'll leave that to you guys. But at least I don't see any of the side effects that you're worried about w/ my fix. It seems to work nicely.
Trying to synthesize the several comments here: obviously we want to avoid the exception reported in the JIRA. Sandy, you say that it's OK for pending executors to be -10, but then does that mean that whatever is making the request with value "-10" now needs logic to never request anything < 0 instead? That is, are you suggesting that the capping should not actually change the value from -10? Does the value stay at -10 indefinitely then? That's what the new charts on the JIRA suggest. I suppose there is no allocation that will "fulfill" that request for -10 pending executors and bring the count back to 0. Is the issue that the shutdown of the executors should update the pending count too? -10 pending executors means "waiting to see 10 executors stop".
Thanks for putting together those graphs, Cheolsoo. I think they're really helpful for thinking about this. The side effect that I'm worried about is targetNumExecutors being higher than it needs to be. To know what it "needs to be", we would also need lines on the graph showing the number of running + pending tasks (not that I think you should waste time on that - I think my deductions a couple of comments above should be proof that this can happen). The without_fix graph confirms my suspicion that the trouble starts arising when we start issuing kill requests. Before that, targetNumExecutors never dives below 0. As soon as there are executors pending removal, it does, which triggers the exception.
Let's take a step back and summarize the problem and the proposed solution to make sure that we're all on the same page. This issue is actually fairly complicated, so I think it's worth exploring in some detail. I would be happy to hear your thoughts about this. @srowen @piaozhexiu @sryza

Symptom. There are instances where numExecutorsPending becomes negative, which can eventually cause the EAM to request a negative total number of executors and hit the exception reported in the JIRA.

Problem illustrated with an example.
(0) Let's say we start with 10 executors with nothing pending to be added or removed.
(5) Thus, we only request 6 executors when we're trying to add, even though we already have 10.

Now imagine that between steps (3) and (4), a number of executors fail (or are removed, the distinction is not important). For instance, if all of the existing executors fail, then we will end up with a negative target total (-4 in this case, computed using the same equation above), resulting in the exception. This is consistent with the description in the JIRA that this only happens if we remove executors too quickly.

Underlying cause. After a cancel, the EAM unconditionally adjusts the new target downwards, whether or not there are pending requests to be canceled, and then computes the new pending count from that adjusted target. In other words, when we cancel, we should only adjust the new target downwards by at most the number of requests that are actually outstanding.

Proposed solution. This patch ensures that numExecutorsPending never becomes negative. This behavior also matches the state on the allocator side in most cases: when it receives a cancel request, the allocator will cancel at most the number of pending requests that are outstanding (clearly this cannot be negative).

Note that even this solution is not perfect. There exists a race condition in which a few of the pending requests we are trying to cancel may already be granted by the time the cancel request reaches the allocator. In this case, the EAM will get more executors than it needs, but that is fine because they will eventually be removed anyway. It's worth noting that this whole cancel feature is best-effort in the first place, so it's OK for the cancel to not always take effect, but it's not OK for the cancel to cause the add to fail.
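A hedged sketch of the cancellation behaviour described under "Underlying cause" and "Proposed solution": the downward adjustment is bounded by the number of requests actually outstanding, so the pending count can never go negative. The class and method names here are illustrative, not the actual patch.

```scala
// Illustrative sketch of the proposed behaviour; not the actual patch.
class PendingRequestTracker {
  private var numExecutorsPending = 0

  /** Record `n` newly issued executor requests. */
  def addRequests(n: Int): Unit = {
    require(n >= 0, "cannot add a negative number of requests")
    numExecutorsPending += n
  }

  /** Cancel up to `n` outstanding requests; never drive the count below zero. */
  def cancelRequests(n: Int): Int = {
    val cancelled = math.min(n, numExecutorsPending)
    numExecutorsPending -= cancelled
    cancelled
  }

  def pending: Int = numExecutorsPending
}

object PendingRequestTrackerDemo extends App {
  val tracker = new PendingRequestTracker
  tracker.addRequests(4)
  println(tracker.cancelRequests(10)) // 4: only what was actually outstanding
  println(tracker.pending)            // 0: never negative
}
```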
Thanks Andrew, I think you've nailed the part of the problem that I was missing: when executors die or are killed, executorIds.size decreases. If numPendingExecutors is negative at this point, this can lead targetNumExecutors to go below 0.

My opinion is that it's best not to think in terms of adding and canceling requests, but to think in terms of expressing a preference about the total number of executors. I believe that it's much easier to reason about invariants around this number than it is to reason about sequences of add, cancel, and remove events. Also, "canceling" isn't entirely accurate, because the YarnAllocator doesn't ignore requests for fewer total executors than the current number allocated. It registers the new number as its target number of executors, which affects the behavior when an executor dies or is killed.

What we ultimately care about is that targetNumExecutors should never go below 0. We only care about numExecutorsPending going below 0 to the extent that it causes the former to occur. I think we can fully specify the behavior of targetNumExecutors in a reasonable way.

With this view, we can separate killing executors from the targetNumExecutors calculation entirely. When targetNumExecutors is less than executorIds.size for a given period of time, we issue kill requests. These kill requests do not affect targetNumExecutors, neither while the kill is pending nor after the kill has gone through. I'm pretty convinced that this approach is the most straightforward and easiest to reason about. It requires slightly modifying the logic for removing executors, but I think in a way that's in line with its original goals. We also get rid of numExecutorsPending and track targetNumExecutors directly. If this seems reasonable to you, Andrew, I'll put up a patch that makes the required changes.
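A minimal sketch of the alternative being outlined here: track the desired total directly, keep it within non-negative bounds, and issue kills separately without feeding them back into the target. All names are illustrative; the actual change was made in #5704.

```scala
// Illustrative sketch of tracking the target total directly; not the code in #5704.
class TargetTracker(minExecutors: Int, maxExecutors: Int) {
  private var targetNumExecutors = minExecutors

  /** Express a preference for the total number of executors, clamped to valid bounds. */
  def updateTarget(neededExecutors: Int): Int = {
    targetNumExecutors = math.min(maxExecutors, math.max(minExecutors, neededExecutors))
    targetNumExecutors
  }

  /** Pick executors to kill when we have more than the target; killing never changes the target. */
  def executorsToKill(runningExecutorIds: Seq[String]): Seq[String] =
    runningExecutorIds.drop(targetNumExecutors)

  def target: Int = targetNumExecutors
}

object TargetTrackerDemo extends App {
  val tracker = new TargetTracker(minExecutors = 0, maxExecutors = 20)
  tracker.updateTarget(0)                        // all tasks have finished
  val running = (1 to 10).map(i => s"exec-$i")
  println(tracker.executorsToKill(running).size) // 10: more running than the target
  println(tracker.target)                        // 0: the target never goes below 0
}
```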
#5704 demonstrates what I outlined in my above comment, and should supersede this PR.
Can one of the admins verify this patch?
@piaozhexiu we ended up merging #5704 instead. Would you mind closing this one?
Thank you guys for fixing it. I am closing it now.
Make sure numExecutorsPending in ExecutorAllocationManager is non-negative.