Bug 2103786: drain controller: don't skip the MCC pod drain #3212
Conversation
FWIW this does make the process much faster in conjunction with John's #3185, since we would no longer need to wait for the node to reboot to restart the controller. The question I guess is more along the lines of how we can make this cleaner. An option would be to use our graceful shutdown path to "drain" ourselves? But if we already released the lease, I think we wouldn't be able to request the drain?
Isn't this just a very roundabout way to:
- Release the lease
- exit(0)
Hmm except if we do this aren't we losing out on the completion annotation below?
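For reference, a minimal sketch of what "release the lease and exit(0)" could look like with client-go's leader election, assuming a Lease-based lock with `ReleaseOnCancel` enabled. The namespace, lease name, and timing values below are illustrative assumptions, not the MCO's actual wiring:

```go
// Illustrative sketch only: release the leader-election lease, then exit(0).
// The lock name/namespace and durations are hypothetical, not MCO's real config.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-machine-config-operator", // hypothetical namespace
		"machine-config-controller",         // hypothetical lease name
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // give up the lease when ctx is cancelled
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderElectionCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// ... run the controllers; when told to go away, cancel():
				cancel() // releases the lease because ReleaseOnCancel is true
			},
			OnStoppedLeading: func() {
				// Lease released (or lost); exit so the pod can come back elsewhere.
				os.Exit(0)
			},
		},
	})
}
```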
The next controller will write the completion annotation immediately. If this one did, the daemon may shut down the node too fast for the controller to drain itself, since the daemon doesn't otherwise "wait" for the controller. Although the controller now has context handling so... it might wait for this sync to exit?
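For readers following along, the "completion annotation" handshake is roughly: the daemon requests a drain via a desired-drain annotation on the node, and the controller acknowledges completion by copying it into a last-applied annotation. Below is a minimal sketch of that acknowledgement patch; the annotation keys are assumptions for illustration and may not match the exact constants in the repo:

```go
// Sketch: mark a drain request as completed by copying the desired-drain
// annotation into the last-applied-drain annotation on the node object.
// Annotation keys are assumptions for illustration, not verified constants.
package drainsketch

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	desiredDrainAnnotation     = "machineconfiguration.openshift.io/desiredDrain"     // assumed key
	lastAppliedDrainAnnotation = "machineconfiguration.openshift.io/lastAppliedDrain" // assumed key
)

func markDrainComplete(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	desired := node.Annotations[desiredDrainAnnotation]

	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				lastAppliedDrainAnnotation: desired,
			},
		},
	})
	if err != nil {
		return err
	}
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("failed to ack drain on node %s: %w", nodeName, err)
	}
	return nil
}
```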
> Isn't this just a very roundabout way to:
> - Release the lease
> - exit(0)
I don't think so. The difference is that if we exit instead of draining, the controller pod will simply stay dead until the master node it was scheduled on reboots. So the controller never switches nodes, and the upgrade process will be missing the controller for however long that node takes to reboot.
Which, now that I think about it, is very bad for slow-booting master nodes on e.g. metal, since the controller will be dead for a long time?
More concretely, the behaviour I've observed without this PR:
1. the master node this controller is running on is marked for update
2. the daemon asks the controller to drain the node; the controller does, and the daemon then completes the update and reboots the node
3. the controller pod is still listed as "running", but it is supposed to be running on the master node that just shut down. The pod doesn't get scheduled anywhere else, so now we wait for however long it takes for the master to reboot before there is an MCC running at all
With this PR:
Steps 1 and 2 are the same, but instead of completing the drain and then having the daemon do its job, the controller drains itself off to another node and starts there immediately. The new controller picks up, marks the completion, and the daemon on the old node continues while the new controller chugs on and does other work (such as draining workers).
> if we exit instead of drain, the controller pod will simply stay dead until the master node it was supposed to schedule on reboots.
Won't kube reschedule the pod when it exits? And it should happen on another node because the current one isn't schedulable.
Although officially draining would avoid a potential race: if we just exit and the node then reboots, the kubelet may go down before the pod status is updated.
Hmm. I think the graceful node shutdown is really the fix for these races.
With John's changes in https://github.com/openshift/machine-config-operator/pull/3185/files#diff-98a2adaa91130abb8fb7321e0ed6de823657c91b48c10e67c141a12d0a0197feR108 we should already be exiting, right? The pod never gets rescheduled. I don't know what the expected behaviour is though; I should check that.
Side note: I think this may also work with context-ifying all the controllers and properly handling their shutdowns so we get rescheduled, but directly calling the drain is "easier" in a sense here
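On the "context-ifying" idea, here is a minimal, generic sketch of a signal-aware shutdown path, assuming each controller exposes a `Run(ctx)` that returns once the context is cancelled. This is not the MCO's actual start-up code, just an illustration of the pattern:

```go
// Generic sketch of context-driven controller shutdown. A termination signal
// (e.g. when the pod is drained) cancels the shared context, every controller
// returns, and the process exits cleanly so kube can reschedule the pod.
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
)

type controller interface {
	Run(ctx context.Context)
}

func runAll(ctx context.Context, controllers ...controller) {
	var wg sync.WaitGroup
	for _, c := range controllers {
		wg.Add(1)
		go func(c controller) {
			defer wg.Done()
			c.Run(ctx) // expected to return once ctx is done
		}(c)
	}
	wg.Wait()
}

func main() {
	// Cancel the shared context on SIGTERM/SIGINT.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	runAll(ctx /*, controllers go here */)
	log.Println("all controllers stopped, exiting cleanly")
}
```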
I realized that this is the equivalent of removing the filter to not drain the MCC pod, so let's see if this works
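For reference, the drain helper in k8s.io/kubectl/pkg/drain supports custom pod filters, and the sketch below shows the kind of filter that would skip the MCC pod; the idea here is essentially to stop applying such a filter so the MCC pod gets evicted like any other. The "k8s-app" label value is an assumption for illustration, not necessarily the exact filter the drain controller used:

```go
// Sketch of a drain pod filter that skips the machine-config-controller pod.
// Not registering this filter means the MCC pod gets drained like any other.
// The "k8s-app" label value is an assumption, not a verified constant.
package drainsketch

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

func skipMCCFilter(pod corev1.Pod) drain.PodDeleteStatus {
	if pod.Labels["k8s-app"] == "machine-config-controller" { // assumed label
		return drain.MakePodDeleteStatusSkip()
	}
	return drain.MakePodDeleteStatusOkay()
}

func newDrainHelper(client kubernetes.Interface, skipMCC bool) *drain.Helper {
	h := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Timeout:             90 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	if skipMCC {
		// The old behaviour: never evict the controller's own pod.
		h.AdditionalFilters = []drain.PodFilter{skipMCCFilter}
	}
	return h
}
```

A drain with this helper would then be roughly `drain.RunCordonOrUncordon(h, node, true)` followed by `drain.RunNodeDrain(h, node.Name)`; without the skip filter, the controller's own pod is evicted and rescheduled onto another master.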
Force-pushed from e7fdf67 to 36f8f38 (compare)
In 4.11 we moved the drain operation to a centralized controller. This drain controller doesn't drain the MCC today. In theory this means the pod can finish and then gracefully terminate. In practice this is problematic since the pod never gets scheduled off the node, meaning it's still considered running (according to the API) when the master node shuts down, and won't be reachable until the master node reboots. For some metal setups the reboot could take up to 30 minutes, which means we won't have an MCC for 30 minutes. That would be very problematic, as other pools waiting for drains/updates are stalled as well. We should instead just drain the controller pod. The next one will immediately pick up where this one left off, so there shouldn't be any conflicts. In testing, the drained pod restarts and resumes existing operations in ~10 seconds, which is much faster.
Force-pushed from 36f8f38 to b25e570 (compare)
Tested, rebased, updated the commit message/description, and opened a BZ for it. I don't think this is necessarily the best path, but it should work.
@yuqi-zhang: An error was encountered querying GitHub for users with public email ([email protected]) for bug 2103786 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected; please see the full error message for details:
non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"
Please contact an administrator to resolve this issue, then request a bug refresh.
@yuqi-zhang: This pull request references Bugzilla bug 2103786, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Overall, this looks good and I have no comments or suggestions. However, I just want to be sure I understand what's happening:
1. We were originally trying to be clever about not draining the MCC pod and instead relying on the reboot to cause the MCC pod to terminate.
2. However, there is a lag between when the node comes back up, rejoins the cluster, and the MCC pod is restarted. In some setups this lag can be minimal, whereas on others it can be much larger.
3. This PR remedies the above by being more indiscriminate about which pods it drains, in order to cause the MCC pod to be stopped and rescheduled on another node.
Does that sound correct?
1. Yes, and this also came with the added benefit that the corresponding master node finishes its upgrade, and once it has rebooted we can just move on to the next node with the new controller, instead of being interrupted halfway through.
2. Yes, the consequence of the above is that the pod never gets drained and its status never gets updated, so the API says it's still running and we're just waiting for it.
3. Correct, the new pod starts immediately and picks up draining the (currently half-drained) master node, so we change where the "disruption" happens but shorten the downtime considerably.
Sounds reasonable to me, thanks for clarifying things! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cheesesashimi, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@yuqi-zhang: all tests passed! Full PR test history. Your PR dashboard.
@yuqi-zhang: All pull requests linked via external trackers have merged: Bugzilla bug 2103786 has been moved to the MODIFIED state.
/cherry-pick release-4.11
@yuqi-zhang: new pull request created: #3234