
Conversation

@jongyoul
Member

...offers from acceptedOffers

  • Added code for declining unused offers among acceptedOffers
  • Edited testCase for checking declining unused offers

@SparkQA

SparkQA commented Nov 21, 2014

Test build #23696 has started for PR 3393 at commit f20f1b3.

  • This patch merges cleanly.

@jongyoul
Member Author

When launching via spark-submit against a Mesos master, resourceOffers is called for the first time before the TaskScheduler has submitted any job, so there are no tasks at that point and all offers should be declined. In branch-1.1, d.launchTasks with an empty task list declined offers implicitly, but in current master, if an offer has no tasks, d.launchTasks is not called at all. As a result, unused offers from acceptedOffers are never declined. I fixed that situation and edited the test case.
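The leak can be sketched with a toy model (hypothetical names and data, not the actual MesosSchedulerBackend code): offers that pass the pre-filter but receive no tasks need an explicit decline, or Mesos never hears back about them.

```scala
// Toy model of the leak (hypothetical names, not the real code).
case class Offer(id: String, slaveId: String)

// Offers that passed the pre-filter this round.
val acceptedOffers = List(Offer("o1", "s1"), Offer("o2", "s2"))

// Suppose the task scheduler produced tasks only for slave s1;
// before any job is submitted, this set would even be empty.
val slavesWithTasks = Set("s1")

// Offers that received no tasks; unless these are declined
// explicitly, Mesos is never told anything about them.
def unusedOffers(accepted: List[Offer], used: Set[String]): List[Offer] =
  accepted.filterNot(o => used.contains(o.slaveId))
```

With no job submitted yet, `unusedOffers(acceptedOffers, Set.empty)` returns every accepted offer, which is the first-launch case described above.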

@tnachen
Contributor

tnachen commented Nov 21, 2014

Good catch, I think I didn't completely understand how TaskSchedulerImpl uses the offers and forgot that not all acceptable offers are eventually used. Your PR LGTM, +1

@SparkQA

SparkQA commented Nov 21, 2014

Test build #23696 has finished for PR 3393 at commit f20f1b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23696/

@jongyoul
Member Author

@tnachen +1, Thanks.

@jongyoul
Member Author

@pwendell Please merge this PR. :-)

@pwendell
Contributor

I went ahead and reviewed this overall function (@andrewor14 merged some recent changes which @tnachen authored) and there seem to be multiple issues. Can you guys comment on the following?

  1. The variable named acceptedOffers: why is it named this way if it doesn't indicate that the offers were accepted? Should it be acceptableOffers? Or usableOffers? When defining that variable it would be nice to have a comment that explains what is going on. Basically, you are excluding ahead of time nodes which we know don't have enough memory to launch an executor or enough cores to launch even a single task. That should be in a comment.
  2. Why does mesos need 2 * scheduler.CPUS_PER_TASK cores for a node to be considered and not 1 + scheduler.CPUS_PER_TASK? The executor needs a single core, it says so right in the comments.
  3. Likewise when it subtracts a CPU for the executor, why does it subtract scheduler.CPUS_PER_TASK? Shouldn't it just subtract one?

As for this fix specifically: right now it uses the absence of a node in slaveIdsWithExecutors to infer that the offer was not accepted. What happens if an unused offer is adding cores to an existing executor? Won't that offer be lost here and never replied to? It seems like, to do this correctly, we need to associate the returned task descriptions with resource offers.

I'm not an expert in this area of the code but a cursory glance at this functionality reveals some potential issues.
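The pre-filter being questioned can be paraphrased like this; the predicate shape and the constants here are assumptions reconstructed from the questions above, not the exact Spark source.

```scala
// Paraphrase of the acceptedOffers pre-filter under discussion
// (assumed shape; names and constants are illustrative).
val CPUS_PER_TASK = 1
val MEM_REQUIRED = 512.0 // hypothetical per-executor memory floor

def isUsable(mem: Double, cpus: Int, slaveId: String,
             slavesWithExecutors: Set[String]): Boolean =
  // New executor: enough memory plus 2 * CPUS_PER_TASK cores
  // (question 2 asks why this is not 1 + CPUS_PER_TASK).
  (mem >= MEM_REQUIRED && cpus >= 2 * CPUS_PER_TASK) ||
  // Slave already runs an executor: only task cores are needed.
  (slavesWithExecutors.contains(slaveId) && cpus >= CPUS_PER_TASK)
```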

@tnachen
Contributor

tnachen commented Nov 22, 2014

Hi @pwendell, 1 is a great suggestion; there aren't enough comments for sure. I can add more in another PR.

For 2, I don't have enough context to infer the best choice of cores per task, so I believe I just carried over the existing behavior. I would like to see why these values were chosen, since this is more Spark-specific than Mesos. I'll try changing these numbers and see if I can see any impact running spark-perf.

@tnachen
Contributor

tnachen commented Nov 22, 2014

For 3, you are right: if the offer is not used, it is not acked.
My assumption was that all offers we accepted based on the conditions, including the ones expanding existing executors, are used.
If that is not correct, then we need to do what you suggested and keep a local map of which offers are used by the TaskSchedulerImpl.
I need to dig a bit deeper into the Spark scheduler.
@jongyoul I think if @pwendell's condition can be true, your fix needs to account for that as well. Sorry for the confusion.

@jongyoul
Member Author

Basically, I thought minimal code changes were a good way to fix the bug, and refactoring is another issue.

@pwendell I agree with 1. As @tnachen mentioned above, he thought all acceptedOffers were eventually used. If I can refactor this code, it would be changed; usableOffers and unusableOffers are better names, and then unusableOffers would be declined explicitly.

For 2 and 3, MesosSchedulerBackend's CPUS_PER_TASK and slaveIdsWithExecutors are ambiguous. At first glance, I think 1 + scheduler.CPUS_PER_TASK is a better way than 2 * scheduler.CPUS_PER_TASK. Assuming we need one more CPU is acceptable because we should account for the Mesos executor being launched when that slave has not yet been added to slaveIdsWithExecutors.

I can fix and refactor these things and more once I understand that Impl deeply. Please check whether I've misunderstood anything.

@tnachen
Contributor

tnachen commented Nov 23, 2014

Hi @jongyoul I think let's work on the cpu value in another PR.

Can you also add a test where the offer's slave id is already added but the offer wasn't used? Thanks!

@pwendell
Contributor

Thanks @tnachen and @jongyoul for answering questions. It seems like some of this behavior is just porting code that was in the old version of the Mesos binding before @tnachen refactored it. So we can hold off on clean-ups and do it separately.

We should fix the specific issue at hand though for 1.2. Looking back, I'm curious how this works in Spark 1.1. From what I can tell we never call declineOffer in Spark 1.1. However, we will call something like this:

d.launchTasks(Collections.singleton(offers(i).getId), mesosTasks(i), filters)

Where mesosTasks(i) is an empty list. @tnachen for mesos is that equivalent to declining the offer? The relevant code is here:

https://github.com/apache/spark/blob/branch-1.1/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L263

@tnachen
Contributor

tnachen commented Nov 23, 2014

@pwendell Underneath the scheduler driver, declineOffer is actually just calling launchTasks with the offerId and an empty task vector as well, so the behavior is identical. I changed the api to call declineOffer since it's more semantically correct, and I believe more future-proof as the scheduler driver can change.
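The equivalence tnachen describes can be illustrated with a toy stand-in (this is not the real Mesos SchedulerDriver API; the names here are illustrative): declining an offer is just replying with an empty task list for that offer.

```scala
// Toy stand-in for the scheduler driver, illustrating that
// declineOffer(id) == launchTasks(id, <empty task list>) underneath.
object ToyDriver {
  def launchTasks(offerId: String, tasks: Seq[String]): String =
    s"reply($offerId, tasks=${tasks.size})"

  // Semantically the same reply as launching zero tasks,
  // but clearer at the call site and more future-proof.
  def declineOffer(offerId: String): String =
    launchTasks(offerId, Seq.empty)
}
```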

@jongyoul
Member Author

@pwendell Hi, @tnachen suggests splitting the CPU-count issue into another PR. Is that a better idea?

@tnachen
Contributor

tnachen commented Nov 23, 2014

@jongyoul I believe @pwendell is referring to the same issue that I was referring to, which is to handle the case that @pwendell has brought up.

@pwendell
Contributor

Yes - let's just fix the bug with this patch and we can punt improving this until later.

The bug with this patch is that if there is an offer that is not used for a node that is already in slavesWithExecutors, then this patch will not decline it. One simple way to fix this is to keep track of nodes that were assigned an offer for that round, then decline offers that were not included in that set.

This assumes that we don't get multiple offers for a given host. However, I'm pretty sure the old code assumed that as well.

@pwendell
Contributor

@jongyoul if you could fix that bug and then add a unit test for that case, that would be great.

@pwendell
Contributor

Just to be totally clear, here is the case:

  1. Spark already has an executor on node A. I.e. A is in slavesWithExecutors.
  2. An additional core is offered on node A, but that offer is not used.
  3. The offer is never declined.
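These three steps, and the fix outlined earlier, can be reproduced in a toy model (hypothetical names, not the actual patch): declining must be driven by "which slaves got an offer used this round", not by membership in slavesWithExecutors.

```scala
// Toy reproduction of the three steps above.
case class Offer(id: String, slaveId: String)

val offers = Seq(Offer("offerA", "A"))
val slavesWithExecutors = Set("A")              // step 1: A has an executor
val slavesAssignedThisRound = Set.empty[String] // step 2: offer goes unused

// Buggy inference: "not in slavesWithExecutors" means "unused",
// so nothing is declined and offerA leaks (step 3).
val buggyDeclines = offers.filter(o => !slavesWithExecutors(o.slaveId))

// Fix: track slaves actually assigned an offer this round and
// decline everything else, regardless of existing executors.
val fixedDeclines = offers.filter(o => !slavesAssignedThisRound(o.slaveId))
```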

@tnachen
Contributor

tnachen commented Nov 24, 2014

Yes, you will only get one offer per slave in the same resourceOffers call. This might change once dynamic reservations are introduced, but that shouldn't affect Spark unless Spark uses the feature.

@pwendell
Contributor

@tnachen do you agree that the issue I identified is a potential problem with the current proposed fix? Just want to make sure I'm not misunderstanding something.

@tnachen
Contributor

tnachen commented Nov 25, 2014

@pwendell yes it is a problem, we'll need to fix this before we can merge this.

@tnachen
Contributor

tnachen commented Nov 25, 2014

@pwendell Is this blocking 1.1.1? If @jongyoul can't get to it soon enough, I can get a fix in too.

@pwendell
Contributor

@tnachen No, but it's blocking me from cutting a 1.2 RC. I'm working on a small extension to this with new unit tests. I can post it in a few minutes.

@tnachen
Contributor

tnachen commented Nov 25, 2014

@pwendell sounds good, thx for identifying and fixing it! I'll take a look when you post it.

@pwendell
Contributor

Thanks @tnachen, a review would be really appreciated.

@jongyoul
Member Author

@tnachen I can fix this bug in a few days.

asfgit pushed a commit that referenced this pull request Nov 25, 2014
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.

I've also done some minor renaming/clean-up of variables in this class and tests.

Author: Patrick Wendell <[email protected]>
Author: Jongyoul Lee <[email protected]>

Closes #3436 from pwendell/mesos-issue and squashes the following commits:

58c35b5 [Patrick Wendell] Adding unit test for this situation
c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
asfgit pushed a commit that referenced this pull request Nov 25, 2014

(cherry picked from commit b043c27)
Signed-off-by: Patrick Wendell <[email protected]>
@pwendell
Contributor

@jongyoul can you close this issue now? I pulled in your commits already.

…ed offers from acceptedOffers

- Fix a case where an unused offer cannot be declined when slaveIdsWithExecutors already contains that node.
@jongyoul
Member Author

@pwendell Oh, I'm a little bit late. I patched the code similarly to yours. I'll close this issue.

@jongyoul jongyoul closed this Nov 25, 2014
@SparkQA

SparkQA commented Nov 25, 2014

Test build #23814 has started for PR 3393 at commit 63855bf.

  • This patch does not merge cleanly.

@jongyoul
Member Author

@pwendell SparkQA triggered this issue for testing. Please check it, and I'll close this PR again. I did check your PR.

@jongyoul jongyoul reopened this Nov 25, 2014
@SparkQA

SparkQA commented Nov 25, 2014

Test build #23815 has started for PR 3393 at commit 63855bf.

  • This patch does not merge cleanly.

@pwendell
Contributor

You should close this PR. I already merged my PR and gave you author credit:

f0afb62

@SparkQA

SparkQA commented Nov 25, 2014

Test build #23814 has finished for PR 3393 at commit 63855bf.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23814/

@jongyoul
Member Author

@pwendell Yes, I reopened because Jenkins triggered a build. I'll close it again.

@jongyoul jongyoul closed this Nov 25, 2014
@SparkQA

SparkQA commented Nov 25, 2014

Test build #23815 has finished for PR 3393 at commit 63855bf.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23815/
