-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-6954. [YARN] ExecutorAllocationManager can end up requesting a negative n... #5704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #30956 timed out for PR 5704 at commit |
|
Jenkins, retest this please |
|
@sryza , thank you for the patch. I tried it with my queries, and it works very well. I look forward to getting this issue fixed in 1.3 branch. |
|
Test build #713 has finished for PR 5704 at commit
|
|
CC @andrewor14 just to make sure he can also have a look |
|
LGTM. Much easier to reason about. |
|
Test build #30971 has finished for PR 5704 at commit
|
|
retest this please |
|
Thanks for submitting the patch @sryza. I haven't had the time to respond to your comment on #5536 until recently, so let's move the discussion here. Echoing my concern about this approach (and the existing approach) from there: Cancel in my mind should only deal with pending requests. This behavior is reflected on the If you look at the example I outlined in detail on #5536, the old EAM's cancel behavior does affect the add behavior beyond pending requests. In particular, in step (4) our new target is 6 (which is less than what we currently have, 10) even though we intended to add 1 executor, so really the new target should be 11. The reason is because cancel and add are so intertwined that it's impossible to think about them independently. This patch also maintains this behavior. If an add immediately follows a cancel, then the add may not actually add any executors because the cancel exerts its influence beyond the pending executors. For instance, if after cancel our new target is 0, then the first few adds may use a target that is actually below the number of executors we currently have, which is essentially a no-op. Further note that this is not that uncommon of a case. Since we basically unconditionally attempt to cancel every interval, we can have a stage that just finished running, in which case no executors will be needed and we will request a target of 0. Then when a new stage comes in we will have to ramp up from the beginning again. Summary. My main point is that cancel on the EAM side should only deal with pending requests, because it only deals with pending requests in the allocator side as well. Otherwise, its influence will be felt disproportionately when we do try to add again. |
|
By the way, I do find tracking the target number easier to reason about than tracking the pending number. Since this behavior already existed in 1.3 (although was broken there) I am OK merging this patch for 1.4. I just thought I should point out my reservations so we can address them in the future after the upcoming release. |
|
In other words, there seems to be two separate issues here: (1) SPARK-6954 - we throw an exception for requesting negative target sometimes, and I believe there are still disagreements out there about whether (2) is actually an issue. This patch fixes (1) and cleans up some code. After the 1.4 release I would like to at least continue the discussion on (2), if not address it in a patch. |
|
Test build #31149 has finished for PR 5704 at commit
|
|
I sync'd up with @andrewor14 offline about this. To summarize my understanding of his concern, the current patch has non-optimal behavior in the following (common) situation:
Instead of immediately increasing our target number of executors in response to the load, we end up needing to wait the same amount of time that it took to ramp up to 100 before requesting any additional executors. We settled on the following as a solution: This basically means that we never spend any time "ramping up" to where we already are. |
|
I'll update the current patch to include this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: insert a space before 'and'
|
Test build #31322 has finished for PR 5704 at commit
|
|
@sryza @andrewor14 do you think this is in a pretty good state now (excepting the minor string change above)? worth including in 1.4? I'm cognizant that the merge deadline is soon. |
|
Yes, I will take a look at this later today and hopefully merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: with the number to add (executors are countable!)
|
LGTM I'm merging this into master after fixing the small comments myself. After fixing the ramping up thing I believe the behavior here is ultimately the same as #5536, but this one refactors the manager in such a way that makes it easier to follow. Thanks for adding the inline documentation where appropriate. @sryza |
|
@sryza would you mind opening a branch against 1.3 as well? |
… negative n... ...umber of executors Author: Sandy Ryza <[email protected]> Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
… ne... ...gative n... ...umber of executors Author: Sandy Ryza <sandycloudera.com> Closes #5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors Author: Sandy Ryza <[email protected]> Closes #5856 from sryza/sandy-backport-6954 and squashes the following commits: 1cb517a [Sandy Ryza] [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative n...
… negative n... ...umber of executors Author: Sandy Ryza <[email protected]> Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
… negative n... ...umber of executors Author: Sandy Ryza <[email protected]> Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
… negative n... ...umber of executors Author: Sandy Ryza <[email protected]> Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
… ne... ...gative n... ...umber of executors Author: Sandy Ryza <sandycloudera.com> Closes apache#5704 from sryza/sandy-spark-6954 and squashes the following commits: b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors 6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors Author: Sandy Ryza <[email protected]> Closes apache#5856 from sryza/sandy-backport-6954 and squashes the following commits: 1cb517a [Sandy Ryza] [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative n...
...umber of executors