@sryza sryza commented Dec 22, 2014

SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator

The goal of this PR is to simplify YarnAllocator as much as possible and get it up to the level of code quality we see in the rest of Spark.

In service of this, it does a few things:

  • Uses AMRMClient APIs for matching containers to requests.
  • Adds calls to AMRMClient.removeContainerRequest so that, when we use a container, we don't end up requesting it again.
  • Removes YarnAllocator's host->rack cache. YARN's RackResolver already does this caching, so this is redundant.
  • Adds tests for basic YarnAllocator functionality.
  • Breaks up the allocateResources method, which was previously nearly 300 lines.
  • A little bit of stylistic cleanup.
  • Fixes a bug that caused three times as many container requests as needed to be filed when preferred host locations are given.
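
The removeContainerRequest bookkeeping above can be sketched in Python (hypothetical names, a toy stand-in for AMRMClient, not Spark's actual code): once a received container is matched to an outstanding request, that request is removed so the next heartbeat does not re-file it.

```python
class RequestLedger:
    """Toy stand-in for AMRMClient request bookkeeping (not the real API)."""

    def __init__(self):
        self.outstanding = {}  # resource name (e.g. host) -> pending request count

    def add_container_request(self, resource_name):
        self.outstanding[resource_name] = self.outstanding.get(resource_name, 0) + 1

    def remove_container_request(self, resource_name):
        # Mirrors the role of AMRMClient.removeContainerRequest: called once a
        # received container has been matched, so the request is not filed again.
        count = self.outstanding.get(resource_name, 0)
        if count <= 1:
            self.outstanding.pop(resource_name, None)
        else:
            self.outstanding[resource_name] = count - 1

    def pending(self, resource_name):
        return self.outstanding.get(resource_name, 0)


ledger = RequestLedger()
ledger.add_container_request("host1")
ledger.add_container_request("host1")
ledger.remove_container_request("host1")  # one container arrived on host1
print(ledger.pending("host1"))  # -> 1
```

Without the removal step, the ledger would still show 2 pending requests and the next heartbeat would ask for a container that is already in use.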

The patch is lossy. In particular, it loses the logic for trying to avoid containers bunching up on nodes. As I understand it, the logic that's gone is:

  • If, in a single response from the RM, we receive a set of containers on a node, and prefer some number of containers on that node greater than 0 but less than the number we received, give back the delta between what we preferred and what we received.

This seems like a weird way to avoid bunching. For example, it does nothing to avoid bunching when we don't request containers on particular nodes.
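
The removed give-back behavior, as described above, amounts to something like this toy calculation (hypothetical, reconstructed from the description, not the deleted code):

```python
def containers_to_give_back(received_on_node, preferred_on_node):
    # Removed logic: if we preferred some containers on this node (more than
    # zero) but received more than we preferred, return the surplus to the RM.
    if 0 < preferred_on_node < received_on_node:
        return received_on_node - preferred_on_node
    return 0


print(containers_to_give_back(5, 2))  # -> 3
print(containers_to_give_back(5, 0))  # -> 0 (no preference: nothing given back)
```

The second call illustrates the objection: with no per-node preference, the surplus check never fires, so bunching is not avoided at all.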


SparkQA commented Dec 22, 2014

Test build #24715 has started for PR 3765 at commit 1becc37.

  • This patch merges cleanly.


SparkQA commented Dec 22, 2014

Test build #24715 has finished for PR 3765 at commit 1becc37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24715/
Test FAILed.


SparkQA commented Dec 24, 2014

Test build #24794 has started for PR 3765 at commit 85c9e5f.

  • This patch merges cleanly.


SparkQA commented Dec 24, 2014

Test build #24794 has finished for PR 3765 at commit 85c9e5f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24794/
Test FAILed.


sryza commented Dec 24, 2014

Updated patch fixes the broken test, adds more comments, and simplifies even further.

It also removes support for requesting containers based on locality, as this has been both inaccessible and internally broken for a while (the last time it worked was 0.9). I think it will be advantageous to start from a clean baseline and then revisit the approach when working on SPARK-4352.


SparkQA commented Dec 25, 2014

Test build #24798 has started for PR 3765 at commit 7980f3f.

  • This patch merges cleanly.


SparkQA commented Dec 25, 2014

Test build #24798 has finished for PR 3765 at commit 7980f3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24798/
Test PASSed.


SparkQA commented Dec 28, 2014

Test build #24847 has started for PR 3765 at commit 86896b0.

  • This patch merges cleanly.


SparkQA commented Dec 28, 2014

Test build #24847 has finished for PR 3765 at commit 86896b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24847/
Test PASSed.

Contributor

Is this value changed in the tests at all, or is it that for all tests we don't really want to launch containers? If the latter, you could just use Utils.isTesting.

Contributor Author

The value differs between tests. For example, in YarnClusterSuite we do want to launch containers, but in YarnAllocatorSuite we don't.
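
The design choice being discussed can be sketched like this (hypothetical names, not the actual Spark classes): a per-instance flag lets each suite choose its own behavior, which a global check like Utils.isTesting could not.

```python
class FakeAllocator:
    # Hypothetical sketch: the flag is passed per instance, so a cluster-style
    # suite can run with launch_containers=True while a unit-style suite
    # passes False and never starts real executors.
    def __init__(self, launch_containers=True):
        self.launch_containers = launch_containers
        self.launched = 0

    def handle_allocated(self, containers):
        if self.launch_containers:
            self.launched += len(containers)  # would start executors here
        # request bookkeeping still happens either way
        return len(containers)


cluster_style = FakeAllocator(launch_containers=True)
unit_style = FakeAllocator(launch_containers=False)
cluster_style.handle_allocated(["c1", "c2"])
unit_style.handle_allocated(["c1", "c2"])
print(cluster_style.launched, unit_style.launched)  # -> 2 0
```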


sryza commented Jan 14, 2015

@tgravescs are you able to take a look at this?

Contributor

I think this may need to be maxExecutors += requestedTotal, because maxExecutors is the sum of current executors, including both running and pending ones.
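
The arithmetic the reviewer is pointing at can be illustrated with a toy calculation (hypothetical, not the patch's actual code): the number of new container requests to file must account for executors already running and those already requested.

```python
def missing_requests(requested_total, running, pending):
    # New container requests to file: the target minus everything already
    # running or already asked for. A negative result means we are
    # over-provisioned and should file no new requests.
    return requested_total - (running + pending)


print(missing_requests(10, 4, 3))  # -> 3
print(missing_requests(5, 4, 3))   # -> -2 (no new requests needed)
```

Ignoring either the running or the pending count in this subtraction is exactly what leads to duplicate requests.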

@tgravescs
Contributor

@sryza did you run this through a bunch of manual tests as well? Try cases like a container dying or being killed before all of the initial containers have been allocated.

Contributor

This isn't used anymore; you can remove it.


SparkQA commented Jan 20, 2015

Test build #25850 has started for PR 3765 at commit 74f56dd.

  • This patch merges cleanly.


SparkQA commented Jan 20, 2015

Test build #25850 has finished for PR 3765 at commit 74f56dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25850/
Test PASSed.


sryza commented Jan 21, 2015

@tgravescs, uploaded a new patch that addresses your review comments. I just ran a bunch of manual tests on a 6-node cluster, including:

  • request more resources than the cluster has available
  • kill executors while a job is running
  • kill executors before we've received the full set of initial executors

I noticed that YARN's RackResolver, which we now rely on directly, is really noisy, so I added a line in log4j-defaults.properties to muffle those logs.
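
The muffling is a one-line logger override; in log4j-defaults.properties it amounts to something along these lines (the exact level chosen is an assumption):

```
# Quiet noisy INFO logging from YARN's RackResolver
log4j.logger.org.apache.hadoop.yarn.util.RackResolver=WARN
```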


SparkQA commented Jan 21, 2015

Test build #25857 has started for PR 3765 at commit 32a5942.

  • This patch merges cleanly.


SparkQA commented Jan 21, 2015

Test build #25857 has finished for PR 3765 at commit 32a5942.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25857/
Test PASSed.

@tgravescs
Contributor

looks good. Thanks @sryza

@asfgit asfgit closed this in 2eeada3 Jan 21, 2015
bomeng pushed a commit to Huawei-Spark/spark that referenced this pull request Jan 22, 2015
SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator


Author: Sandy Ryza <[email protected]>

Closes apache#3765 from sryza/sandy-spark-1714 and squashes the following commits:

32a5942 [Sandy Ryza] Muffle RackResolver logs
74f56dd [Sandy Ryza] Fix a couple comments and simplify requestTotalExecutors
60ea4bd [Sandy Ryza] Fix scalastyle
ca35b53 [Sandy Ryza] Simplify further
e9cf8a6 [Sandy Ryza] Fix YarnClusterSuite
257acf3 [Sandy Ryza] Remove locality stuff and more cleanup
59a3c5e [Sandy Ryza] Take out rack stuff
5f72fd5 [Sandy Ryza] Further documentation and cleanup
89edd68 [Sandy Ryza] SPARK-1714. Take advantage of AMRMClient APIs to simplify logic in YarnAllocator
