@sryza
Contributor

@sryza sryza commented Apr 26, 2015

...umber of executors

@sryza sryza changed the title from "SPARK-6954. ExecutorAllocationManager can end up requesting a negative n..." to "SPARK-6954. [YARN] ExecutorAllocationManager can end up requesting a negative n..." on Apr 26, 2015
@SparkQA

SparkQA commented Apr 26, 2015

Test build #30956 timed out for PR 5704 at commit 9eea5fe after a configured wait of 150m.

@srowen
Member

srowen commented Apr 26, 2015

Jenkins, retest this please

@piaozhexiu

@sryza, thank you for the patch. I tried it with my queries, and it works very well. I look forward to getting this issue fixed in the 1.3 branch.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #713 has finished for PR 5704 at commit 9eea5fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@srowen
Member

srowen commented Apr 27, 2015

CC @andrewor14 just to make sure he can also have a look

@vanzin
Contributor

vanzin commented Apr 27, 2015

LGTM. Much easier to reason about.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30971 has finished for PR 5704 at commit 9eea5fe.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

Thanks for submitting the patch @sryza. I haven't had the time to respond to your comment on #5536 until recently, so let's move the discussion here. Echoing my concern about this approach (and the existing approach) from there:


Cancel in my mind should only deal with pending requests. This behavior is reflected on the YarnAllocator side, where a cancel request from the EAM does not remove existing executors, but only cancels pending executor requests from the RM. Thus, I would think that cancel should only be a best-effort feature and should not affect the add behavior beyond the pending requests.

If you look at the example I outlined in detail on #5536, the old EAM's cancel behavior does affect the add behavior beyond pending requests. In particular, in step (4) our new target is 6 (which is less than what we currently have, 10) even though we intended to add 1 executor, so really the new target should be 11. The reason is that cancel and add are so intertwined that it's impossible to think about them independently.

This patch also maintains this behavior. If an add immediately follows a cancel, then the add may not actually add any executors because the cancel exerts its influence beyond the pending executors. For instance, if after cancel our new target is 0, then the first few adds may use a target that is actually below the number of executors we currently have, which is essentially a no-op.

Further note that this is not an uncommon case. Since we attempt to cancel essentially unconditionally every interval, a stage may have just finished running, in which case no executors are needed and we request a target of 0. Then when a new stage comes in we have to ramp up from the beginning again.

Summary. My main point is that cancel on the EAM side should only deal with pending requests, because it only deals with pending requests on the allocator side as well. Otherwise, its influence will be felt disproportionately when we do try to add again.
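
To make the concern concrete, here is a minimal Python sketch (not Spark code) that counts how many add rounds are spent at or below the number of executors already running after a cancel has lowered the target. It assumes the add step starts at 1 and doubles each round; the function name and its parameters are hypothetical and chosen only for illustration.

```python
def wasted_add_rounds(current_executors, target_after_cancel):
    """Count add rounds whose resulting target does not exceed what is already running.

    Assumption (not taken from the patch): the add step starts at 1 and doubles
    after every round.
    """
    target = target_after_cancel
    step = 1
    wasted = 0
    # A round is "wasted" when the new target still asks for no more executors
    # than we already have, so the request is effectively a no-op.
    while target + step <= current_executors:
        target += step
        step *= 2
        wasted += 1
    return wasted


# With 10 executors running and the target reset to 0, the targets 1, 3, and 7
# are all no-ops before the fourth round finally asks for more.
print(wasted_add_rounds(10, 0))  # 3
```

Under these assumptions the number of wasted rounds grows with the number of executors that are already running, which is the ramp-up cost described above.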

@andrewor14
Contributor

By the way, I do find tracking the target number easier to reason about than tracking the pending number. Since this behavior already existed in 1.3 (although it was broken there), I am OK merging this patch for 1.4. I just thought I should point out my reservations so we can address them in the future after the upcoming release.

@andrewor14
Contributor

In other words, there seem to be two separate issues here:

(1) SPARK-6954 - we sometimes throw an exception for requesting a negative target, and
(2) (No JIRA yet) - cancel is too closely intertwined with add

I believe there are still disagreements out there about whether (2) is actually an issue. This patch fixes (1) and cleans up some code. After the 1.4 release I would like to at least continue the discussion on (2), if not address it in a patch.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31149 has finished for PR 5704 at commit 9eea5fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@sryza
Contributor Author

sryza commented Apr 28, 2015

I synced up with @andrewor14 offline about this. To summarize my understanding of his concern, the current patch behaves suboptimally in the following (common) situation:

  1. We run a job and ramp up to 100 executors.
  2. Then there's a quiet period, but our executors stick around.
  3. We then run another job, which could saturate 200 executors.

Instead of immediately increasing our target number of executors in response to the load, we end up needing to wait the same amount of time that it took to ramp up to 100 before requesting any additional executors.

We settled on the following as a solution:
In addExecutors, before increasing numExecutorsTarget with numExecutorsToAdd, if numExecutorsTarget is less than executorIds.size, set numExecutorsTarget to executorIds.size.

This basically means that we never spend any time "ramping up" to where we already are.
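
A minimal sketch of that rule in Python, for illustration only. The parameter names mirror numExecutorsTarget, numExecutorsToAdd, and executorIds.size from the comment above, but the function itself is hypothetical and leaves out any other logic the real addExecutors method contains.

```python
def add_executors(num_executors_target, num_executors_to_add, num_running):
    """Return the new target after one add round (illustrative model only).

    num_running stands in for executorIds.size.
    """
    # If the target has fallen below what is already running, start from the
    # running count so no round is spent "ramping up" to where we already are.
    if num_executors_target < num_running:
        num_executors_target = num_running
    return num_executors_target + num_executors_to_add


# 100 executors are still around after a quiet period and the target has
# dropped to 0: the first add round now asks for 101 rather than 1.
new_target = add_executors(num_executors_target=0, num_executors_to_add=1, num_running=100)
print(new_target)  # 101
```

When the target is already at or above the running count, the clamp has no effect and the add proceeds as before.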

@sryza
Contributor Author

sryza commented Apr 28, 2015

I'll update the current patch to include this.

@sryza sryza force-pushed the sandy-spark-6954 branch from 9eea5fe to b7890fb on April 29, 2015 20:35
Contributor

nit: insert a space before 'and'

@SparkQA

SparkQA commented Apr 29, 2015

Test build #31322 has finished for PR 5704 at commit b7890fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@srowen
Member

srowen commented Apr 30, 2015

@sryza @andrewor14 do you think this is in a pretty good state now (excepting the minor string change above)? worth including in 1.4? I'm cognizant that the merge deadline is soon.

@andrewor14
Contributor

Yes, I will take a look at this later today and hopefully merge it.

Contributor

nit: with the number to add (executors are countable!)

@andrewor14
Contributor

LGTM. I'm merging this into master after fixing the small comments myself. With the ramp-up fix, I believe the behavior here is ultimately the same as #5536, but this one refactors the manager in a way that makes it easier to follow. Thanks for adding the inline documentation where appropriate, @sryza.

@andrewor14
Contributor

@sryza would you mind opening a branch against 1.3 as well?

@asfgit asfgit closed this in 099327d May 2, 2015
sryza added a commit to sryza/spark that referenced this pull request May 2, 2015
… negative n...

...umber of executors

Author: Sandy Ryza

Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits:

b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors
6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
asfgit pushed a commit that referenced this pull request May 2, 2015
… ne...

...gative n...

...umber of executors

Author: Sandy Ryza <sandycloudera.com>

Closes #5704 from sryza/sandy-spark-6954 and squashes the following commits:

b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors
6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors

Author: Sandy Ryza

Closes #5856 from sryza/sandy-backport-6954 and squashes the following commits:

1cb517a [Sandy Ryza] [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative n...
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
mingyukim pushed a commit to palantir/spark that referenced this pull request Jul 10, 2015