@sryza sryza commented Dec 22, 2014

SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator

The goal of this PR is to simplify YarnAllocator as much as possible and get it up to the level of code quality we see in the rest of Spark.

In service of this, it does a few things:

  • Uses AMRMClient APIs for matching containers to requests.
  • Adds calls to AMRMClient.removeContainerRequest so that, when we use a container, we don't end up requesting it again.
  • Removes YarnAllocator's host->rack cache. YARN's RackResolver already does this caching, so this is redundant.
  • Adds tests for basic YarnAllocator functionality.
  • Breaks up the allocateResources method, which was previously nearly 300 lines.
  • A little bit of stylistic cleanup.
  • Fixes a bug that caused three times as many container requests as needed to be filed when preferred host locations are given.
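
The removeContainerRequest bookkeeping above can be sketched in Python (hypothetical names, a toy stand-in for AMRMClient, not Spark's actual code): once a received container is matched to an outstanding request, that request is removed so the next heartbeat does not re-file it.

```python
class RequestLedger:
    """Toy stand-in for AMRMClient request bookkeeping (not the real API)."""

    def __init__(self):
        self.outstanding = {}  # resource name (e.g. host) -> pending request count

    def add_container_request(self, resource_name):
        self.outstanding[resource_name] = self.outstanding.get(resource_name, 0) + 1

    def remove_container_request(self, resource_name):
        # Mirrors the role of AMRMClient.removeContainerRequest: called once a
        # received container has been matched, so the request is not filed again.
        count = self.outstanding.get(resource_name, 0)
        if count <= 1:
            self.outstanding.pop(resource_name, None)
        else:
            self.outstanding[resource_name] = count - 1

    def pending(self, resource_name):
        return self.outstanding.get(resource_name, 0)


ledger = RequestLedger()
ledger.add_container_request("host1")
ledger.add_container_request("host1")
ledger.remove_container_request("host1")  # one container arrived on host1
print(ledger.pending("host1"))  # -> 1
```

Without the removal step, the ledger would still show 2 pending requests and the next heartbeat would ask for a container that is already in use.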

The patch is lossy. In particular, it loses the logic for trying to avoid containers bunching up on nodes. As I understand it, the logic that's gone is:

  • If, in a single response from the RM, we receive a set of containers on a node, and prefer some number of containers on that node greater than 0 but less than the number we received, give back the delta between what we preferred and what we received.

This seems like a weird way to avoid bunching. For example, it does nothing to avoid bunching when we don't request containers on particular nodes.
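
The removed give-back behavior, as described above, amounts to something like this toy calculation (hypothetical, reconstructed from the description, not the deleted code):

```python
def containers_to_give_back(received_on_node, preferred_on_node):
    # Removed logic: if we preferred some containers on this node (more than
    # zero) but received more than we preferred, return the surplus to the RM.
    if 0 < preferred_on_node < received_on_node:
        return received_on_node - preferred_on_node
    return 0


print(containers_to_give_back(5, 2))  # -> 3
print(containers_to_give_back(5, 0))  # -> 0 (no preference: nothing given back)
```

The second call illustrates the objection: with no per-node preference, the surplus check never fires, so bunching is not avoided at all.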


SparkQA commented Dec 22, 2014

Test build #24715 has started for PR 3765 at commit 1becc37.

  • This patch merges cleanly.


SparkQA commented Dec 22, 2014

Test build #24715 has finished for PR 3765 at commit 1becc37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24715/
Test FAILed.


SparkQA commented Dec 24, 2014

Test build #24794 has started for PR 3765 at commit 85c9e5f.

  • This patch merges cleanly.


SparkQA commented Dec 24, 2014

Test build #24794 has finished for PR 3765 at commit 85c9e5f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24794/
Test FAILed.


sryza commented Dec 24, 2014

Updated patch fixes the broken test, adds more comments, and simplifies even further.

It also removes support for requesting containers based on locality, as this has been both inaccessible and internally broken for a while (the last time it worked was 0.9). I think it will be advantageous to start from a clean baseline and then revisit the approach when working on SPARK-4352.


SparkQA commented Dec 25, 2014

Test build #24798 has started for PR 3765 at commit 7980f3f.

  • This patch merges cleanly.


SparkQA commented Dec 25, 2014

Test build #24798 has finished for PR 3765 at commit 7980f3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24798/
Test PASSed.


SparkQA commented Dec 28, 2014

Test build #24847 has started for PR 3765 at commit 86896b0.

  • This patch merges cleanly.


SparkQA commented Dec 28, 2014

Test build #24847 has finished for PR 3765 at commit 86896b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24847/
Test PASSed.

Contributor

Is this value changed in the tests at all, or is it that for all tests we don't really want to launch containers? If the latter, you could just use Utils.isTesting.

Contributor Author

The value differs between tests. For example, in YarnClusterSuite we do want to launch containers, but in YarnAllocatorSuite we don't.
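
The design choice being discussed can be sketched like this (hypothetical names, not the actual Spark classes): a per-instance flag lets each suite choose its own behavior, which a global check like Utils.isTesting could not.

```python
class FakeAllocator:
    # Hypothetical sketch: the flag is passed per instance, so a cluster-style
    # suite can run with launch_containers=True while a unit-style suite
    # passes False and never starts real executors.
    def __init__(self, launch_containers=True):
        self.launch_containers = launch_containers
        self.launched = 0

    def handle_allocated(self, containers):
        if self.launch_containers:
            self.launched += len(containers)  # would start executors here
        # request bookkeeping still happens either way
        return len(containers)


cluster_style = FakeAllocator(launch_containers=True)
unit_style = FakeAllocator(launch_containers=False)
cluster_style.handle_allocated(["c1", "c2"])
unit_style.handle_allocated(["c1", "c2"])
print(cluster_style.launched, unit_style.launched)  # -> 2 0
```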


sryza commented Jan 14, 2015

@tgravescs are you able to take a look at this?

Contributor

I think this may need to be maxExecutors += requestedTotal, because maxExecutors is the sum of current executors, including both running and pending ones.
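
The arithmetic the reviewer is pointing at can be illustrated with a toy calculation (hypothetical, not the patch's actual code): the number of new container requests to file must account for executors already running and those already requested.

```python
def missing_requests(requested_total, running, pending):
    # New container requests to file: the target minus everything already
    # running or already asked for. A negative result means we are
    # over-provisioned and should file no new requests.
    return requested_total - (running + pending)


print(missing_requests(10, 4, 3))  # -> 3
print(missing_requests(5, 4, 3))   # -> -2 (no new requests needed)
```

Ignoring either the running or the pending count in this subtraction is exactly what leads to duplicate requests.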

@tgravescs
Contributor

@sryza did you run this through a bunch of manual tests as well? Try cases like a container dying or being killed before all of the initial containers have been allocated.

Contributor

This isn't used anymore; you can remove it.


SparkQA commented Jan 20, 2015

Test build #25850 has started for PR 3765 at commit 74f56dd.

  • This patch merges cleanly.


SparkQA commented Jan 20, 2015

Test build #25850 has finished for PR 3765 at commit 74f56dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25850/
Test PASSed.


sryza commented Jan 21, 2015

@tgravescs, uploaded a new patch that addresses your review comments. I just ran a bunch of manual tests on a 6-node cluster, including:

  • request more resources than the cluster has available
  • kill executors while a job is running
  • kill executors before we've received the full set of initial executors

I noticed that YARN's RackResolver, which we now rely on directly, is really noisy, so I added a line in log4j-defaults.properties to muffle those logs.
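
The muffling is a one-line logger override; in log4j-defaults.properties it amounts to something along these lines (the exact level chosen is an assumption):

```
# Quiet noisy INFO logging from YARN's RackResolver
log4j.logger.org.apache.hadoop.yarn.util.RackResolver=WARN
```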


SparkQA commented Jan 21, 2015

Test build #25857 has started for PR 3765 at commit 32a5942.

  • This patch merges cleanly.


SparkQA commented Jan 21, 2015

Test build #25857 has finished for PR 3765 at commit 32a5942.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25857/
Test PASSed.

@tgravescs
Contributor

looks good. Thanks @sryza

@asfgit asfgit closed this in 2eeada3 Jan 21, 2015
bomeng pushed a commit to Huawei-Spark/spark that referenced this pull request Jan 22, 2015
SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator


Author: Sandy Ryza <[email protected]>

Closes apache#3765 from sryza/sandy-spark-1714 and squashes the following commits:

32a5942 [Sandy Ryza] Muffle RackResolver logs
74f56dd [Sandy Ryza] Fix a couple comments and simplify requestTotalExecutors
60ea4bd [Sandy Ryza] Fix scalastyle
ca35b53 [Sandy Ryza] Simplify further
e9cf8a6 [Sandy Ryza] Fix YarnClusterSuite
257acf3 [Sandy Ryza] Remove locality stuff and more cleanup
59a3c5e [Sandy Ryza] Take out rack stuff
5f72fd5 [Sandy Ryza] Further documentation and cleanup
89edd68 [Sandy Ryza] SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator
