Conversation

@jkyros (Member) commented Jun 13, 2022

TL;DR: SIGTERM/SIGINT --> cancel runContext --> shutdown --> cancel leaderContext --> release leader lease --> terminate

- What I did

  • added two contexts to both the controller and operator run functions (see the sketch after this list):
    • a runContext (for cancelling our running state) and
    • a leaderContext (for cancelling our leader lease)
  • added a signal handler that catches SIGINT/SIGTERM to the controller and operator
  • updated the run functions such that:
    • the signal handler cancels runContext when SIGTERM/SIGINT is received
    • the operator/controller shuts down when runContext is cancelled, and then cancels leaderContext
  • updated the leaderElection config in the controller and operator to release the leader lease when leaderContext is cancelled
  • shut down the metrics listener properly
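
To make that flow concrete, here's a minimal sketch of the wiring described above (not the PR's actual code: the helper name and lease timings are placeholders, and the resource lock is passed in rather than built here):

    package start

    import (
        "context"
        "os"
        "os/signal"
        "syscall"
        "time"

        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    // runWithGracefulRelease runs the controller/operator work until a
    // shutdown signal arrives, then cancels the leader context so client-go
    // releases the lease before we terminate.
    func runWithGracefulRelease(lock resourcelock.Interface, run func(ctx context.Context)) {
        runContext, runCancel := context.WithCancel(context.Background())
        leaderContext, leaderCancel := context.WithCancel(context.Background())
        defer leaderCancel()

        // Signal handler: SIGTERM/SIGINT cancels runContext first.
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-sigs
            runCancel()
        }()

        leaderelection.RunOrDie(leaderContext, leaderelection.LeaderElectionConfig{
            Lock:            lock,
            ReleaseOnCancel: true,              // release the lease when leaderContext is cancelled
            LeaseDuration:   137 * time.Second, // placeholder timings, not the real config
            RenewDeadline:   107 * time.Second,
            RetryPeriod:     26 * time.Second,
            Callbacks: leaderelection.LeaderElectionCallbacks{
                OnStartedLeading: func(_ context.Context) {
                    run(runContext) // blocks until the signal handler cancels runContext
                    leaderCancel()  // ...then drop the leader lease
                },
                OnStoppedLeading: func() {
                    // No longer the leader (lease released or lost); Run() returns and we terminate.
                },
            },
        })
    }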

- How to verify it

  • delete a controller or an operator pod
  • observe logging that shows it shutting down/releasing the lease
  • observe that the pod that replaces it has a short (< 1 minute) leader election wait

- Description for the changelog
The controller and operator now shut down cleanly on normal termination and release their leader leases, resulting in faster leader elections, since we no longer have to wait for the previous lease to time out.

Fixes: BZ1817075

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 13, 2022
@openshift-ci bot commented Jun 13, 2022

@jkyros: This pull request references Bugzilla bug 1817075, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 103 to 104
Member:

Perhaps worth factoring this into a helper function to share with controller?

Or at least just a cookie for future readers to know there's code duplication, like
// Note, if you're changing this you probably also want to change machine-config-controller/start.go
or so?

Comment on lines 99 to 101
Member:

Hmm. But in practice aren't we opening ourselves to major problems here if:

  • we drop the lease
  • new controller starts and starts making changes
  • old controller also makes changes

Now in most cases, both controllers would be making the same decisions, but not necessarily. For example, imagine that we change the logic for which node to pick to upgrade; if the old and new controllers race here and make different choices, we could violate maxUnavailable.

I think we need to do something like set a variable when we get SIGTERM and have all of our control loop handlers like syncMachineConfigPool quietly no-op if it's set.
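
For illustration, a minimal sketch of that guard with hypothetical names (not the MCO's actual wiring; it assumes Go 1.19's sync/atomic.Bool):

    package controller

    import (
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
    )

    // shuttingDown is flipped by the signal handler once SIGTERM/SIGINT arrives.
    var shuttingDown atomic.Bool

    // installShutdownGuard registers the signal handler that sets the flag.
    func installShutdownGuard() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-sigs
            shuttingDown.Store(true)
        }()
    }

    // syncMachineConfigPool quietly no-ops once shutdown has started, so a
    // lame-duck controller can't race the new leader on decisions like which
    // node to update next.
    func syncMachineConfigPool(key string) error {
        if shuttingDown.Load() {
            return nil
        }
        // ... normal sync logic would go here ...
        return nil
    }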

Member Author:

Yeah, this was my concern too. I talked about it with Jerry a little last week.

My thought here was:

  • if there are no more events coming in, and no workers working (because the context cancelled them all), syncMachineConfigPool can't run again because there's no event to trigger it
  • so the only thing we should have to worry about is the one syncMachineConfigPool that is already in progress

But yes, it would make me feel safer if we had something to stop us if some race-condition-y thing happened and something snuck through the queue while everything was reacting to the context cancellation.

@jkyros (Member Author) Jun 13, 2022:

@yuqi-zhang and I talked some more about this. As for that variable, checking the context should be just as good (if cancelled, quietly no-op), but say I add a context check at the beginning of syncMachineConfigPool (sketched below):

  • that doesn't save us from a "half-done" sync (like if we were already in the middle of the function body)
  • if we wait for that "half-done" sync, our syncs could be even slower than they are today?
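
A tiny sketch of that context check, with a hypothetical signature (the real syncMachineConfigPool doesn't take a context today), just to show where the "half-done" gap sits:

    package controller

    import "context"

    // syncMachineConfigPool quietly no-ops if ctx was already cancelled, but a
    // sync that is already past this check still runs to completion; that is
    // the "half-done" case described above.
    func syncMachineConfigPool(ctx context.Context, key string) error {
        if ctx.Err() != nil {
            return nil // shutting down: don't start new work
        }
        // ... normal sync logic would go here ...
        return nil
    }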

Contributor:

Some other options we discussed:

  • have the next controller, after acquiring the lease, add a small wait to ensure the previous pod is (most likely) terminated
  • somehow track all the subcontrollers and routines and terminate them synchronously
  • check what other controllers are doing for this

Member:

I agree we should check what other operators are doing here.

That said, to avoid races I think the general pattern is:

  • Use a "done" channel we can poll in each controller sync loop
  • Use a sync.WaitGroup to wait for the completion of the polling goroutines
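
A minimal sketch of that pattern with hypothetical names (not the MCO's actual controller wiring): each loop selects on a shared done channel, and a sync.WaitGroup lets shutdown block until every loop has drained before the lease is released.

    package controller

    import (
        "sync"
        "time"
    )

    // runSyncLoops runs each sync function on a ticker until done is closed,
    // then waits for every loop to finish before returning, so the caller can
    // safely release the leader lease afterwards.
    func runSyncLoops(done <-chan struct{}, syncs []func()) {
        var wg sync.WaitGroup
        for _, syncFn := range syncs {
            syncFn := syncFn // capture for the goroutine (pre-Go 1.22 loop semantics)
            wg.Add(1)
            go func() {
                defer wg.Done()
                ticker := time.NewTicker(time.Second)
                defer ticker.Stop()
                for {
                    select {
                    case <-done:
                        return // shutdown requested: stop picking up new work
                    case <-ticker.C:
                        syncFn()
                    }
                }
            }()
        }
        wg.Wait() // every loop has exited; now it's safe to drop the lease
    }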

Member:

For correctness, everything that's making a blocking call should take a context and be cancelled when that happens. So given your example of syncRequiredMachineConfigPools()...we should be passing a context to that, which then gets passed down into the API calls, like:

diff --git a/pkg/operator/sync.go b/pkg/operator/sync.go
index 69e4aa742..dd131f751 100644
--- a/pkg/operator/sync.go
+++ b/pkg/operator/sync.go
@@ -684,7 +684,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(_ *renderConfig) error {
 				return false, nil
 			}
 			optr.setOperatorStatusExtension(&co.Status, lastErr)
-			_, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(context.TODO(), co, metav1.UpdateOptions{})
+			_, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
 			if err != nil {
 				lastErr = errors.Wrapf(lastErr, "failed to update clusteroperator: %v", err)
 				return false, nil

Member:

And there is also wait.PollWithContext, which we should use instead of plain wait.Poll.
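
For reference, a hedged sketch of what that substitution might look like (hypothetical helper, not the MCO code): wait.PollWithContext takes a context and stops polling as soon as that context is cancelled, instead of running out the full timeout like plain wait.Poll.

    package operator

    import (
        "context"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForCondition polls check every second for up to five minutes, but
    // returns early once ctx is cancelled (e.g. on shutdown), so a lame-duck
    // operator stops making API calls.
    func waitForCondition(ctx context.Context, check wait.ConditionWithContextFunc) error {
        return wait.PollWithContext(ctx, time.Second, 5*time.Minute, check)
    }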

Member:

How about this as a short-term plan:

As soon as we lose the lease, call os.Exit(0) - do not pass Go¹, do not collect $200.

This means we could interrupt some of the controllers mid-change, but they already need to be robust to that in theory.

¹ Get it? We're not "passing" any more Go code, the process is exiting...
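
A minimal sketch of that short-term plan, assuming the standard client-go callbacks (the helper and runControllers are hypothetical, not the PR's exact code):

    package controller

    import (
        "context"
        "os"

        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/klog/v2"
    )

    // newLeaderCallbacks wires runControllers into leader election: the moment
    // we stop leading (lease released or lost), exit immediately rather than
    // let any in-flight work race the new leader.
    func newLeaderCallbacks(runControllers func(ctx context.Context)) leaderelection.LeaderElectionCallbacks {
        return leaderelection.LeaderElectionCallbacks{
            OnStartedLeading: runControllers,
            OnStoppedLeading: func() {
                klog.Info("stopped leading, exiting immediately")
                os.Exit(0)
            },
        }
    }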

Contributor:

Just curious, if we don't explicitly call os.Exit(0), wouldn't that essentially happen immediately (if we also remove <-time.After(5 * time.Second)) in this scenario?

Member:

I can't say for sure myself...possibly/probably? Immediately calling os.Exit(0) would reinforce that that's what we expect, and be robust to someone coming along later and adding something after the leader election loss (as unlikely as that probably is).

(I don't have a really strong opinion on this bit)

But I do think we should not call time.Sleep() - we should exit as fast as possible.

@jianzhangbjz (Member):

/bugzilla refresh

@openshift-ci bot commented Jun 15, 2022

@jianzhangbjz: This pull request references Bugzilla bug 1817075, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jkyros added 4 commits June 17, 2022 17:32

In order to know when we need to shut down properly, we need a signal
handler. And everything is context-driven, so we need to make sure that
signal handler can cancel the context to signal everything else to shut
down.

This adds a signal handler function to the command helpers that the
operator and the controller can share, since they both perform leader
elections and they will both need to shut down cleanly to release their
leases.

Previously we did not cleanly release the leader lease on controller
shutdown, which resulted in slow leader elections as the existing lease
was forced to time out when a pod got rescheduled.

This adds a signal handler, a context, and sets the leaderelection
settings such that when the controller receives a shutdown signal, it
will release its leader lease and terminate so the new leader can more
quickly take over.

Much like the controller, the operator does not relinquish its leader
lease on shutdown. This results in additional delays when
updating/redeploying, especially because the controller depends on
controllerconfig/renderconfig being updated, and that has to wait behind
the operator leader election.

This makes sure the operator shuts down properly and releases its leader
lease when its context is cancelled on shutdown.

The metrics handler previously always emitted an error when being shut
down, even if the error was nil. This was okay before because we never
actually shut the metrics handler down properly, but now that we're
going to, this needs to be clearer.

This makes the metrics handler only emit an error when there is an
error, and emit an info message when it successfully shuts down.
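
A hypothetical sketch of the shutdown behaviour the last commit describes (not the actual MCO metrics code): only log an error if the listener actually failed, and log an info message on a clean shutdown.

    package metrics

    import (
        "context"
        "net/http"

        "k8s.io/klog/v2"
    )

    // startMetricsListener serves until ctx is cancelled, then shuts the
    // server down; an error is only reported when there really is one.
    func startMetricsListener(ctx context.Context, addr string) {
        server := &http.Server{Addr: addr}

        go func() {
            <-ctx.Done()
            _ = server.Shutdown(context.Background()) // causes ListenAndServe to return ErrServerClosed
        }()

        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            klog.Errorf("metrics listener exited with error: %v", err)
            return
        }
        klog.Info("metrics listener shut down cleanly")
    }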
@jkyros (Member Author) commented Jun 17, 2022

Alright, so along the lines of the conversation in #3185 (comment), this simplifies things back to a single context and gets rid of the delays so everything immediately exits.

RetryPeriod: leaderElectionCfg.RetryPeriod.Duration,
leaderelection.RunOrDie(runContext, leaderelection.LeaderElectionConfig{
Lock: common.CreateResourceLock(cb, startOpts.resourceLockNamespace, componentName),
ReleaseOnCancel: true,
Contributor:

Am I correct in understanding that, essentially, this release is the crux of the PR here? Previously, we took the leader election (lock?) and never released it until the timeout happened?

Member Author:

You are correct.

@yuqi-zhang (Contributor) left a comment:

Generally looks fine. With this new change, do you foresee any potentially dangerous scenarios we could fall into?

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2022
@jkyros (Member Author) commented Jun 21, 2022

Generally looks fine. With this new change, do you foresee any potentially dangerous scenarios we could fall into?

I don't see any "concrete" danger given how things actually operate. I do see a "hypothetical" danger in that a verrrrry small amount of time could potentially elapse between when we cancel the context and when we terminate with os.Exit(). In that verrrrry small amount of time:

  • whatever is in progress is still progressing (e.g. an in-progress syncMachineConfigPool), but since we've also cancelled the controller context (which stops the informers, etc.) we aren't going to process any new events at that point.

To expand on that:

  • A new pod can't take over and do work until we release the lease
  • the same context that cancels the lease also cancels the controller context, so it can't receive any new informer events/take on any new work once we cancel
  • we die immediately after we release the lease
  • the only gap where anything can technically happen is between
    • when the lease is cancelled and
    • when we die to os.Exit(0)
  • And that's in client-go and the defer fires the moment we leave the leaderelection Run() loop, which we do immediately when the context is cancelled:

        if !le.acquire(ctx) {
            return // ctx signalled done
        }
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        go le.config.Callbacks.OnStartedLeading(ctx)
        le.renew(ctx)
    }

So it's:

  • "can another pod grab the lease and start working on something we're already working on in the time it takes for a defer to call our function that calls os.Exit(0) ?".

And while technically I can say yes (because a very small amount of time could elapse during those go statements due to scheduling), realistically I think I can say no. Also, I'm pretty sure it will always take longer than that (the time between the context cancellation and os.Exit()) for the next pod to acquire the leader lease.

(The dual context/5-second wait I had in there before was "try to get that in-progress work done before we exit" vs. this now, which is "try to exit before we get any more work done".)

@yuqi-zhang (Contributor):

OK, I think we should be good to merge this. To be safe, I'm going to go ahead and

/payload 4.11 ci blocking

just to run a set of additional tests

@openshift-ci bot commented Jun 22, 2022

@yuqi-zhang: trigger 5 job(s) of type blocking for the ci release of OCP 4.11

  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1106da80-f275-11ec-99c4-540d38102930-0

@yuqi-zhang (Contributor) left a comment:

/lgtm

All green on the payload side. I think this is of high value to upgrade timings to get in for 4.11, so let's try to merge this.

Thanks for the great work John!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2022
@openshift-ci bot commented Jun 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkyros, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot commented Jun 23, 2022

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 4aa21da into openshift:master Jun 23, 2022
@openshift-ci bot commented Jun 23, 2022

@jkyros: All pull requests linked via external trackers have merged:

Bugzilla bug 1817075 has been moved to the MODIFIED state.

In response to this:

Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
