Bug 2103786: drain controller: don't skip the MCC pod drain #3212
Conversation
FWIW this does make the process much faster in conjunction with John's #3185, since we would no longer need to wait for the node to reboot to restart the controller. The question I guess is more along the lines of how we can make this cleaner. An option would be to use our graceful shutdown path to "drain" ourselves? But if we already released the lease, I think we wouldn't be able to request the drain?
Isn't this just a very roundabout way to:
- Release the lease
- exit(0)
Hmm except if we do this aren't we losing out on the completion annotation below?
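For reference, a minimal sketch of what "release the lease and exit(0)" could look like with client-go's leader election, assuming a Lease-based lock with `ReleaseOnCancel` enabled. The namespace, lease name, and timing values below are illustrative assumptions, not the MCO's actual wiring:

```go
// Illustrative sketch only: release the leader-election lease, then exit(0).
// The lock name/namespace and durations are hypothetical, not MCO's real config.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-machine-config-operator", // hypothetical namespace
		"machine-config-controller",         // hypothetical lease name
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // give up the lease when ctx is cancelled
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderElectionCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// ... run the controllers; when told to go away, cancel():
				cancel() // releases the lease because ReleaseOnCancel is true
			},
			OnStoppedLeading: func() {
				// Lease released (or lost); exit so the pod can come back elsewhere.
				os.Exit(0)
			},
		},
	})
}
```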
The next controller will write the completion annotation immediately. If this one did, the daemon may shut down the node too fast for the controller to drain itself, since the daemon doesn't otherwise "wait" for the controller. Although the controller now has context handling so... it might wait for this sync to exit?
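For readers following along, the "completion annotation" handshake is roughly: the daemon requests a drain via a desired-drain annotation on the node, and the controller acknowledges completion by copying it into a last-applied annotation. Below is a minimal sketch of that acknowledgement patch; the annotation keys are assumptions for illustration and may not match the exact constants in the repo:

```go
// Sketch: mark a drain request as completed by copying the desired-drain
// annotation into the last-applied-drain annotation on the node object.
// Annotation keys are assumptions for illustration, not verified constants.
package drainsketch

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	desiredDrainAnnotation     = "machineconfiguration.openshift.io/desiredDrain"     // assumed key
	lastAppliedDrainAnnotation = "machineconfiguration.openshift.io/lastAppliedDrain" // assumed key
)

func markDrainComplete(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	desired := node.Annotations[desiredDrainAnnotation]

	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				lastAppliedDrainAnnotation: desired,
			},
		},
	})
	if err != nil {
		return err
	}
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("failed to ack drain on node %s: %w", nodeName, err)
	}
	return nil
}
```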
> Isn't this just a very roundabout way to:
> - Release the lease
> - exit(0)
I don't think so. The difference is that if we exit instead of draining, the controller pod will simply stay dead until the master node it was scheduled on reboots. So the controller never switches nodes, and the upgrade process will be missing the controller for however long that node takes to reboot.
Which, now that I think about it, is very bad for slow-booting master nodes on e.g. metal, since the controller will be dead for a long time?
More concretely, the behaviour I've observed without this PR:
1. the master node this controller is running on is marked for update
2. the daemon asks the controller to drain the node; the controller does, and the daemon then completes the update and reboots the node
3. the controller pod is still listed as "running", but it is supposed to be running on the master node that just shut down. The pod doesn't get scheduled anywhere else, so now we wait for however long it takes for the master to reboot before there is an MCC running at all
With this PR:
Steps 1 and 2 are the same, but instead of completing the drain and then having the daemon do its job, the controller drains itself off to another node and starts there immediately. The new controller picks up, marks the completion, and the daemon on the old node continues while the new controller chugs on and does other work (such as draining workers).
> if we exit instead of drain, the controller pod will simply stay dead until the master node it was supposed to schedule on reboots.
Won't kube reschedule the pod when it exits? And it should happen on another node because the current one isn't schedulable.
Although officially draining would avoid a potential race: if we just exit and the node then reboots, the kubelet may go down before the pod status is updated.
Hmm. I think the graceful node shutdown is really the fix for these races.
With John's changes in https://github.com/openshift/machine-config-operator/pull/3185/files#diff-98a2adaa91130abb8fb7321e0ed6de823657c91b48c10e67c141a12d0a0197feR108 we should already be exiting, right? The pod never gets rescheduled. I don't know what the expected behaviour is though; I should check that.
Side note: I think this may also work with context-ifying all the controllers and properly handling their shutdowns so we get rescheduled, but directly calling the drain is "easier" in a sense here
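On the "context-ifying" idea, here is a minimal, generic sketch of a signal-aware shutdown path, assuming each controller exposes a `Run(ctx)` that returns once the context is cancelled. This is not the MCO's actual start-up code, just an illustration of the pattern:

```go
// Generic sketch of context-driven controller shutdown. A termination signal
// (e.g. when the pod is drained) cancels the shared context, every controller
// returns, and the process exits cleanly so kube can reschedule the pod.
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
)

type controller interface {
	Run(ctx context.Context)
}

func runAll(ctx context.Context, controllers ...controller) {
	var wg sync.WaitGroup
	for _, c := range controllers {
		wg.Add(1)
		go func(c controller) {
			defer wg.Done()
			c.Run(ctx) // expected to return once ctx is done
		}(c)
	}
	wg.Wait()
}

func main() {
	// Cancel the shared context on SIGTERM/SIGINT.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	runAll(ctx /*, controllers go here */)
	log.Println("all controllers stopped, exiting cleanly")
}
```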
I realized that this is the equivalent of removing the filter to not drain the MCC pod, so let's see if this works
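For reference, the drain helper in k8s.io/kubectl/pkg/drain supports custom pod filters, and the sketch below shows the kind of filter that would skip the MCC pod; the idea here is essentially to stop applying such a filter so the MCC pod gets evicted like any other. The "k8s-app" label value is an assumption for illustration, not necessarily the exact filter the drain controller used:

```go
// Sketch of a drain pod filter that skips the machine-config-controller pod.
// Not registering this filter means the MCC pod gets drained like any other.
// The "k8s-app" label value is an assumption, not a verified constant.
package drainsketch

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

func skipMCCFilter(pod corev1.Pod) drain.PodDeleteStatus {
	if pod.Labels["k8s-app"] == "machine-config-controller" { // assumed label
		return drain.MakePodDeleteStatusSkip()
	}
	return drain.MakePodDeleteStatusOkay()
}

func newDrainHelper(client kubernetes.Interface, skipMCC bool) *drain.Helper {
	h := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Timeout:             90 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	if skipMCC {
		// The old behaviour: never evict the controller's own pod.
		h.AdditionalFilters = []drain.PodFilter{skipMCCFilter}
	}
	return h
}
```

A drain with this helper would then be roughly `drain.RunCordonOrUncordon(h, node, true)` followed by `drain.RunNodeDrain(h, node.Name)`; without the skip filter, the controller's own pod is evicted and rescheduled onto another master.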
Force-pushed from e7fdf67 to 36f8f38 (compare)
In 4.11 we moved the drain operation to a centralized controller. This drain controller doesn't drain the MCC today. In theory this means the pod can finish and then gracefully terminate. In practice this is problematic since the pod never gets scheduled off the node, meaning it's still considered running (according to the API) when the master node shuts down, and won't be reachable until the master node reboots. For some metal setups the reboot could take up to 30 minutes, which means we won't have an MCC for 30 minutes. That would be very problematic, as other pools waiting for drains/updates are stalled as well. We should instead just drain the controller pod. The next one will immediately pick up where this one left off, so there shouldn't be any conflicts. In testing, the drained pod restarts and resumes existing operations in ~10 seconds, which is much faster.
Force-pushed from 36f8f38 to b25e570 (compare)
Tested, rebased, updated the commit message/description, and opened a BZ for it. I don't think this is necessarily the best path, but it should work.
@yuqi-zhang: An error was encountered querying GitHub for users with public email ([email protected]) for bug 2103786 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected; please see the full error message for details:
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh.
@yuqi-zhang: This pull request references Bugzilla bug 2103786, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Overall, this looks good and I have no comments or suggestions. However, I just want to be sure I understand what's happening:
1. We were originally trying to be clever about not draining the MCC pod and instead relying on the reboot to cause the MCC pod to terminate.
2. However, there is a lag between when the node comes back up, rejoins the cluster, and the MCC pod is restarted. In some setups this lag can be minimal, whereas on others it can be much larger.
3. This PR remedies the above by being more indiscriminate about which pods it drains, in order to cause the MCC pod to be stopped and rescheduled on another node.
Does that sound correct?
1. Yes, and this also came with the added benefit that the corresponding master node finishes its upgrade, and once it has rebooted we can just move on to the next node with the new controller, instead of being interrupted halfway through.
2. Yes, the consequence of the above is that the pod never gets drained and its status never gets updated, so the API says it's still running and we're just waiting for it.
3. Correct, the new pod starts immediately and picks up draining the (currently half-drained) master node, so we change where the "disruption" happens but shorten the downtime considerably.
Sounds reasonable to me, thanks for clarifying things! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cheesesashimi, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@yuqi-zhang: all tests passed! Full PR test history. Your PR dashboard.
@yuqi-zhang: All pull requests linked via external trackers have merged: Bugzilla bug 2103786 has been moved to the MODIFIED state.
/cherry-pick release-4.11
@yuqi-zhang: new pull request created: #3234