Conversation

@yuqi-zhang
Contributor

@yuqi-zhang yuqi-zhang commented Jun 28, 2022

In 4.11 we moved the drain operation to a centralized controller.
This drain controller doesn't drain the MCC pod itself today.

In theory this means the pod can finish its work and then gracefully
terminate. In practice this is problematic, since the pod never gets
scheduled off the node: the API still reports it as running when the
master node shuts down, and it won't be reachable again until the
master node finishes rebooting.

For some metal setups the reboot can take up to 30 minutes, which
means we would have no MCC for 30 minutes. That would be very
problematic, as other pools waiting for drains/updates are stalled as
well. We should instead just drain the controller pod. The next
controller will immediately pick up where this one left off, so there
shouldn't be any conflicts. In testing, the drained pod's replacement
resumes existing operations in ~10 seconds, which is much faster.
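
For reference, the self-drain boils down to the same cordon-and-evict flow the upstream drain helpers provide. A minimal sketch in terms of k8s.io/kubectl/pkg/drain follows; the helper settings are illustrative assumptions, not this PR's exact code:

```go
package selfdrain

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons the node and evicts its pods, including the MCC pod
// itself, so the controller is rescheduled onto another master right away
// instead of sitting "running" on a node that is about to reboot.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // honour each pod's own grace period
		Timeout:             90 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```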

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2022
@yuqi-zhang
Contributor Author

FWIW this does make the process much faster in conjunction with John's #3185, since we would no longer need to wait for the node to reboot to restart the controller. The question I guess is more along the lines of how we can make this cleaner.

An option would be to use our graceful shutdown path to "drain" ourselves? But if we already released the lease, I think we wouldn't be able to request the drain?
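
For context, "releasing the lease" on the graceful-shutdown path is essentially leader election with ReleaseOnCancel. A minimal client-go sketch, where the lock name, namespace, and timings are assumptions rather than the MCC's actual values:

```go
package leaderlease

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection runs the controller under a Lease lock and gives the
// lease back when ctx is cancelled, so the next controller can take over
// without waiting for the lease to expire.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "openshift-machine-config-operator", // assumed namespace
			Name:      "machine-config-controller",         // assumed lock name
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // release the lease on graceful shutdown
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				klog.Info("leadership lost or released; exiting")
				os.Exit(0)
			},
		},
	})
}
```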

Member

Isn't this just a very roundabout way to:

  • Release the lease
  • exit(0)

Hmm except if we do this aren't we losing out on the completion annotation below?

Contributor Author

@yuqi-zhang yuqi-zhang Jun 28, 2022

The next controller will write the completion annotation immediately. If this one did, the daemon may shut down the node too fast for the controller to drain itself, since the daemon doesn't otherwise "wait" for the controller. Although the controller now has context handling so... it might wait for this sync to exit?

Isn't this just a very roundabout way to:
Release the lease
exit(0)

I don't think so. The difference is that if we exit instead of drain, the controller pod simply stays dead until the master node it was scheduled on reboots. So the controller never switches nodes, and the upgrade process is missing the controller for however long the node takes to reboot.

Which, now I think about it, is very bad for some slow-booting master nodes on e.g. metal, since the controller will be dead for a long time?
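
(As an aside, the "completion annotation" handoff mentioned above is just a node patch. A rough sketch, under the assumption that the drain controller's contract uses desiredDrain/lastAppliedDrain-style annotation keys; the keys and flow here are assumptions, not verified against this PR:)

```go
package drainstate

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markDrainComplete copies the desired-drain annotation value into the
// last-applied one on the node, which is the signal the daemon waits for
// before it proceeds with the update and reboot.
func markDrainComplete(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Annotation keys are assumptions about the drain controller's contract.
	desired := node.Annotations["machineconfiguration.openshift.io/desiredDrain"]
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"machineconfiguration.openshift.io/lastAppliedDrain":%q}}}`,
		desired))
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```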

Contributor Author

@yuqi-zhang yuqi-zhang Jun 28, 2022

More concretely, the behaviour I've observed without this PR:

  1. the master node this controller is running on is marked for update
  2. the daemon tells the controller to drain, the controller does, the daemon then completes the update and reboots the node
  3. the controller pod is still listed as "running", but it is supposed to be running on the master node that just shut down. The pod doesn't get scheduled anywhere else, so now we wait for however long it takes the master to reboot before there is an MCC running at all

With this PR:
Steps 1 and 2 are the same, but instead of completing the drain and then having the daemon do its job, the controller drains itself onto another node and starts there immediately. The new controller picks up, marks the completion, and the daemon on the old node continues while the new controller carries on with other activities (such as draining workers).

Member

if we exit instead of drain, the controller pod will simply stay dead until the master node it was supposed to schedule on reboots.

Won't kube reschedule the pod when it exits? And it should happen on another node because the current one isn't schedulable.

Although officially draining would avoid a potential race: if we just exit and then reboot, the kubelet may go down before the pod status is updated.

Hmm. I think the graceful node shutdown is really the fix for these races.
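
For reference, graceful node shutdown is the kubelet-side knob for this. A minimal sketch in terms of the upstream KubeletConfiguration fields; the durations are arbitrary example values:

```go
package gracefulshutdown

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// gracefulShutdownConfig enables kubelet's graceful node shutdown: the node
// delays its shutdown and terminates pods first, so pod status is updated
// through the API before the node goes away.
func gracefulShutdownConfig() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		// total window the shutdown is delayed for pod termination
		ShutdownGracePeriod: metav1.Duration{Duration: 2 * time.Minute},
		// portion of that window reserved for critical pods
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 30 * time.Second},
	}
}
```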

Contributor Author

With John's changes in https://github.com/openshift/machine-config-operator/pull/3185/files#diff-98a2adaa91130abb8fb7321e0ed6de823657c91b48c10e67c141a12d0a0197feR108 we should already be exiting, right? The pod never gets rescheduled. I don't know what the expected behaviour is, though; I should check that.

Contributor Author

Side note: I think this may also work by context-ifying all the controllers and properly handling their shutdowns so we get rescheduled, but directly calling the drain is "easier" in a sense here.
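
A rough sketch of that alternative, with illustrative names rather than the MCO's actual wiring: cancel a shared context on SIGTERM and let each controller finish its current sync before the process exits and kube reschedules the pod:

```go
package shutdownctx

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"k8s.io/klog/v2"
)

// run cancels a shared context on SIGTERM/SIGINT so that, when the pod is
// evicted, each controller's Run loop can finish its current sync and return
// cleanly before the process exits and kube reschedules the pod elsewhere.
func run(startControllers func(ctx context.Context)) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	startControllers(ctx) // controllers block here until ctx is cancelled

	klog.Info("shutdown signal received and controllers stopped; exiting")
}
```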

Contributor Author

I realized that this is the equivalent of removing the filter to not drain the MCC pod, so let's see if this works
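
To illustrate, the old behaviour is equivalent to keeping a pod filter like the hypothetical one below in the drain path, and this change amounts to dropping it. The label selector is an assumption for illustration, not the MCO's actual filter:

```go
package drainfilters

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/kubectl/pkg/drain"
)

// skipMCCPodFilter illustrates the old behaviour: skip any pod that looks
// like the machine-config-controller and evict everything else. Dropping a
// filter like this is what allows the MCC pod itself to be evicted and
// rescheduled onto another master.
func skipMCCPodFilter(pod corev1.Pod) drain.PodDeleteStatus {
	if pod.Labels["k8s-app"] == "machine-config-controller" {
		return drain.MakePodDeleteStatusSkip()
	}
	return drain.MakePodDeleteStatusOkay()
}

// Wired into the drain helper via:
//   helper.AdditionalFilters = []drain.PodFilter{skipMCCPodFilter}
```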

@yuqi-zhang yuqi-zhang force-pushed the drain-controller-pod branch from e7fdf67 to 36f8f38 Compare June 29, 2022 19:31
@yuqi-zhang yuqi-zhang force-pushed the drain-controller-pod branch from 36f8f38 to b25e570 Compare July 4, 2022 22:03
@yuqi-zhang yuqi-zhang changed the title WIP: Test draining the controller pod Bug 2103786: drain controller: don't skip the MCC pod drain Jul 4, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 4, 2022
@yuqi-zhang
Contributor Author

Tested, rebased, updated commit message/description and opened a BZ for it.

I don't think this is necessarily the best path, but it should work

@openshift-ci
Contributor

openshift-ci bot commented Jul 4, 2022

@yuqi-zhang: An error was encountered querying GitHub for users with public email ([email protected]) for bug 2103786 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message. non-200 OK status code: 403 Forbidden body: "{\n \"documentation_url\": \"https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits\",\n \"message\": \"You have exceeded a secondary rate limit. Please wait a few minutes before you try again.\"\n}\n"

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

Bug 2103786: drain controller: don't skip the MCC pod drain

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jul 4, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 4, 2022

@yuqi-zhang: This pull request references Bugzilla bug 2103786, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.12.0) matches configured target release for branch (4.12.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @rioliu-rh

In response to this:

Bug 2103786: drain controller: don't skip the MCC pod drain

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from rioliu-rh July 4, 2022 22:06
@cheesesashimi
Member

Overall, this looks good and I have no comments or suggestions.

However, I just want to be sure I understand what's happening: We were originally trying to be clever about not draining the MCC pod and instead relying on the reboot to cause the MCC pod to terminate. However, there is a lag between when the node comes back up, rejoins the cluster, and the MCC pod is restarted. In some setups, this lag can be minimal whereas it can be much larger on others.

This PR remedies the above by being more indiscriminate about which pods it's draining, in order to cause the MCC pod to be stopped and rescheduled on another node. Does that sound correct?

@yuqi-zhang
Contributor Author

However, I just want to be sure I understand what's happening: We were originally trying to be clever about not draining the MCC pod and instead relying on the reboot to cause the MCC pod to terminate.

Yes, this also came with the added benefit that the corresponding master node would finish the upgrade, and once rebooted, we can just go to the next node with the new controller, instead of being interrupted halfway through.

However, there is a lag between when the node comes back up, rejoins the cluster, and the MCC pod is restarted. In some setups, this lag can be minimal whereas it can be much larger on others.

Yes, the consequence of the above is that the pod never gets drained and the status never gets updated, so the API says it's still running and we're just waiting for it.

This PR remedies the above by being more indiscriminate about what pods its draining in order to cause the MCC pod to be stopped and rescheduled on another node. Does that sound correct?

Correct, the new pod starts immediately and picks up draining the (currently half-drained) master node, so we change where the "disruption" happens but shorten the downtime considerably.

@cheesesashimi
Member

Sounds reasonable to me, thanks for clarifying things!

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 6, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 6, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Jul 6, 2022

@yuqi-zhang: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 134558d into openshift:master Jul 6, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jul 6, 2022

@yuqi-zhang: All pull requests linked via external trackers have merged:

Bugzilla bug 2103786 has been moved to the MODIFIED state.

In response to this:

Bug 2103786: drain controller: don't skip the MCC pod drain

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuqi-zhang
Contributor Author

/cherry-pick release-4.11

@openshift-cherrypick-robot

@yuqi-zhang: new pull request created: #3234

In response to this:

/cherry-pick release-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
