[SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused ... #3393
…ed offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
|
Test build #23696 has started for PR 3393 at commit
|
|
When launching from spark-submit with a Mesos master, the first time resourceOffers is called there are no tasks yet, because the taskScheduler has not submitted a job. All offers should therefore be declined. In branch-1.1, calling d.launchTasks with an empty task list declined the offer implicitly, but in current master d.launchTasks is not called for offers that have no tasks. As a result, unused offers among acceptedOffers are never declined. I fixed that and edited the test case. |
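The situation described above can be sketched with a simplified model. This is only an illustration under stated assumptions: `Offer`, `FakeDriver`, `minCpus`, and `tasksBySlave` are hypothetical stand-ins, not the real Mesos or Spark classes, and the real `resourceOffers` has a different signature.

```scala
// Simplified model; every name here is a hypothetical stand-in.
object DeclineUnusedOffersSketch {
  final case class Offer(id: String, slaveId: String, cpus: Int)

  // Records what the scheduler did with each offer.
  final class FakeDriver {
    val launched = scala.collection.mutable.Map.empty[String, Seq[String]]
    val declined = scala.collection.mutable.Set.empty[String]
    def launchTasks(offerId: String, tasks: Seq[String]): Unit = {
      launched.update(offerId, tasks)
    }
    def declineOffer(offerId: String): Unit = {
      declined.add(offerId)
    }
  }

  // tasksBySlave plays the role of the task scheduler's assignment (slaveId -> tasks).
  def resourceOffers(
      driver: FakeDriver,
      offers: Seq[Offer],
      minCpus: Int,
      tasksBySlave: Map[String, Seq[String]]): Unit = {
    val (accepted, rejected) = offers.partition(_.cpus >= minCpus)
    val (used, unused) =
      accepted.partition(o => tasksBySlave.get(o.slaveId).exists(_.nonEmpty))
    used.foreach(o => driver.launchTasks(o.id, tasksBySlave(o.slaveId)))
    // The step this PR is about: accepted offers that received no tasks
    // must be declined explicitly, otherwise nothing ever answers them.
    unused.foreach(o => driver.declineOffer(o.id))
    rejected.foreach(o => driver.declineOffer(o.id))
  }

  def main(args: Array[String]): Unit = {
    val driver = new FakeDriver
    val offers = Seq(Offer("o1", "s1", 4), Offer("o2", "s2", 4), Offer("o3", "s3", 0))
    resourceOffers(driver, offers, minCpus = 1, tasksBySlave = Map("s1" -> Seq("task-0")))
    println(s"launched: ${driver.launched.keys.toSeq.sorted}")
    println(s"declined: ${driver.declined.toSeq.sorted}")
  }
}
```

With an empty `tasksBySlave`, which corresponds to the first call before any job is submitted, every offer in this model ends up declined.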
|
Good catch. I think I didn't completely understand how TaskSchedulerImpl uses the offers, and forgot that not all acceptable offers are eventually used. Your PR LGTM, +1 |
|
Test build #23696 has finished for PR 3393 at commit
|
|
Test PASSed. |
|
@tnachen +1, Thanks. |
|
@pwendell Please merge this PR. :-) |
|
I went ahead and reviewed this overall function (@andrewor14 merged some recent changes which @tnachen authored) and there seem to be multiple issues. Can you guys comment on the following?
As for this fix specifically. Right now it uses the absence of a node in I'm not an expert in this area of the code but a cursory glance at this functionality reveals some potential issues. |
|
Hi @pwendell, 1 is a great suggestion; there aren't enough comments for sure, and I can add more in another PR. For 2, I don't have enough context to infer the best choice of cores per task, so I believe I just refactored the existing behavior. I would like to see why these values were chosen, since this is more Spark-specific than Mesos-specific. I'll try changing these numbers and see whether there is any impact when running spark perf. |
|
For 3, you are right: if the offer is not used, it is not acked. |
|
Basically, I thought that a minimal code change is a good way to fix a bug, and that refactoring is a separate issue. @pwendell I agree with 1. As @tnachen mentioned above, he thought all acceptedOffers are eventually used. If I refactor this code, that would change.
I can fix and refactor those things and more once I understand that Impl more deeply. Please correct me if I have misunderstood anything. |
|
Hi @jongyoul I think let's work on the cpu value in another PR. Can you also add a test where the offer's slave id is already added but the offer wasn't used? Thanks! |
|
Thanks @tnachen and @jongyoul for answering questions. It seems like some of this behavior is just porting code that was in the old version of the Mesos binding before @tnachen refactored it. So we can hold off on clean-ups and do it separately. We should fix the specific issue at hand though for 1.2. Looking back, I'm curious how this works in Spark 1.1. From what I can tell we never call Where |
|
@pwendell Underneath the scheduler driver, declineOffer is actually just calling launchTasks with the offerId and an empty task vector as well, so the behavior is identical. I changed the api to call declineOffer since it's more semantically correct, and I believe more future-proof as the scheduler driver can change. |
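The equivalence described in the comment above can be sketched with a toy driver. This is an illustrative model only: `ToyDriver` and its string-based signatures are hypothetical and do not mirror the real Mesos `SchedulerDriver` API.

```scala
// Toy model of "declineOffer is launchTasks with an empty task list".
// ToyDriver and its signatures are hypothetical, not the Mesos API.
object DeclineViaLaunchSketch {
  final class ToyDriver {
    // Log of every (offerId, tasks) pair that reached launchTasks.
    val calls = scala.collection.mutable.ArrayBuffer.empty[(String, Seq[String])]

    def launchTasks(offerId: String, tasks: Seq[String]): Unit = {
      calls.append((offerId, tasks))
    }

    // Declining is modeled as launching nothing against the offer.
    def declineOffer(offerId: String): Unit = {
      launchTasks(offerId, Seq.empty)
    }
  }

  def main(args: Array[String]): Unit = {
    val viaDecline = new ToyDriver
    viaDecline.declineOffer("o1")

    val viaEmptyLaunch = new ToyDriver
    viaEmptyLaunch.launchTasks("o1", Seq.empty)

    // Both paths leave the same trace in this model.
    println(viaDecline.calls.toList == viaEmptyLaunch.calls.toList)
  }
}
```

In this model the two calls are indistinguishable, which matches the point that calling `declineOffer` is a semantic choice rather than a behavioral change.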
|
Yes - let's just fix the bug with this patch and we can punt improving this until later. The bug with this patch is that if there is an offer that is not used for a node that is already in This assumes that we don't get multiple offers for a given host. However, I'm pretty sure the old code assumed that as well. |
|
@jongyoul if you could fix that bug and then add a unit test for that case, that would be great. |
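A rough sketch of the case such a unit test would need to cover follows. It assumes, based on the later commit message in this thread, that the problematic check keyed on whether the slave was already tracked in `slaveIdsWithExecutors`. The helper names `declinedByMembership` and `declinedByUse`, and the `Offer` class, are hypothetical and are not taken from the patch.

```scala
// Hypothetical model of the edge case; not the code from the patch.
object AlreadyKnownSlaveSketch {
  final case class Offer(id: String, slaveId: String)

  // Assumed shape of the earlier check: decline an accepted offer only
  // when its slave has no executor recorded yet.
  def declinedByMembership(
      accepted: Seq[Offer],
      slaveIdsWithExecutors: Set[String]): Set[String] =
    accepted.filterNot(o => slaveIdsWithExecutors.contains(o.slaveId)).map(_.id).toSet

  // Check keyed on actual use: decline every accepted offer whose slave
  // received no tasks in this round of offers.
  def declinedByUse(
      accepted: Seq[Offer],
      slavesGivenTasks: Set[String]): Set[String] =
    accepted.filterNot(o => slavesGivenTasks.contains(o.slaveId)).map(_.id).toSet

  def main(args: Array[String]): Unit = {
    val accepted = Seq(Offer("o1", "s1"))
    val slaveIdsWithExecutors = Set("s1")    // executor started in an earlier round
    val slavesGivenTasks = Set.empty[String] // nothing scheduled on s1 this round

    // Membership check: o1 is neither used nor declined.
    println(declinedByMembership(accepted, slaveIdsWithExecutors))
    // Use-based check: o1 is declined.
    println(declinedByUse(accepted, slavesGivenTasks))
  }
}
```

Both helpers key on the slave ID, so they rely on receiving at most one offer per slave within a single round of offers.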
|
Just to be totally clear, here is the case:
|
|
Yes, you will only get one offer per slave in the same resourceOffers call. This might change once dynamic reservations are introduced, but it shouldn't affect Spark unless Spark uses this feature. |
|
@tnachen do you agree that the issue I identified is a potential problem with the current proposed fix? Just want to make sure I'm not misunderstanding something. |
|
@pwendell yes it is a problem, we'll need to fix this before we can merge this. |
|
@tnachen Note that it's blocking me from cutting a 1.2 RC, though. I'm working on a small extension to this with new unit tests. I can post it in a few minutes. |
|
@pwendell sounds good, thx for identifying and fixing it! I'll take a look when you post it. |
|
Thanks @tnachen a review would be really appreciated |
|
@tnachen I can fix this bug in a few days. |
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly. I've also done some minor renaming/clean-up of variables in this class and tests. Author: Patrick Wendell <[email protected]> Author: Jongyoul Lee <[email protected]> Closes #3436 from pwendell/mesos-issue and squashes the following commits: 58c35b5 [Patrick Wendell] Adding unit test for this situation c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly. I've also done some minor renaming/clean-up of variables in this class and tests. Author: Patrick Wendell <[email protected]> Author: Jongyoul Lee <[email protected]> Closes #3436 from pwendell/mesos-issue and squashes the following commits: 58c35b5 [Patrick Wendell] Adding unit test for this situation c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers (cherry picked from commit b043c27) Signed-off-by: Patrick Wendell <[email protected]>
|
@jongyoul can you close this issue now? I pulled in your commits already. |
…ed offers from acceptedOffers - Fix a case that unused node cannot be declined when slaveIdsWithExecutors has already that node.
|
@pwendell Oh, I'm a little late. I patched that code in a way similar to yours. I'll close this issue. |
|
Test build #23814 has started for PR 3393 at commit
|
|
@pwendell SparkQA triggered a test build for this PR. Please check it and close this PR again. I did check your PR. |
|
Test build #23815 has started for PR 3393 at commit
|
|
You should close this PR. I already merged my PR and gave you author credit: |
|
Test build #23814 has finished for PR 3393 at commit
|
|
Test PASSed. |
|
@pwendell Yes, I reopened it because Jenkins was triggered. I'll close it again. |
|
Test build #23815 has finished for PR 3393 at commit
|
|
Test PASSed. |
...offers from acceptedOffers