Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts #3185
Conversation
@jkyros: This pull request references Bugzilla bug 1817075, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
cmd/machine-config-operator/start.go
Perhaps worth factoring this into a helper function to share with controller?
Or at least just a cookie for future readers to know there's code duplication, like
// Note, if you're changing this you probably also want to change machine-config-controller/start.go
or so?
Hmm. But in practice aren't we opening ourselves to major problems here if:
- we drop the lease
- new controller starts and starts making changes
- old controller also makes changes
Now in most cases, both controllers would be making the same decisions, but not necessarily. For example, imagine that we change the logic for which node to pick to upgrade; if the old and new controllers race here and make different choices, we could violate maxUnavailable.
I think we need to do something like set a variable when we get SIGTERM and have all of our control loop handlers like syncMachineConfigPool quietly no-op if it's set.
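A minimal sketch of that idea, assuming Go 1.19+ for atomic.Bool; the variable, handler, and simplified syncMachineConfigPool signature here are illustrative stand-ins, not the real controller code:

```go
package controller

import (
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// shuttingDown is flipped by the signal handler; sync handlers check it.
var shuttingDown atomic.Bool

// watchForShutdown sets the flag as soon as SIGTERM/SIGINT arrives.
func watchForShutdown() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	go func() {
		<-sigs
		shuttingDown.Store(true)
	}()
}

// syncMachineConfigPool (simplified) quietly no-ops once shutdown has started.
func syncMachineConfigPool(key string) error {
	if shuttingDown.Load() {
		return nil
	}
	// ... normal sync work would go here ...
	return nil
}
```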
Yeah, this was my concern too. I talked about it with Jerry a little last week.
My thought here was:
- if there are no more events coming in, and no workers working (because the context cancelled them all), syncMachineConfigPool can't run again because there's no event to trigger it - so the only thing we should have to worry about is the one syncMachineConfigPool that is already in progress
but yes, it would make me feel safer if we had something to stop us if some race-condition-y thing happened and something snuck through the queue while everything was reacting to the context cancellation.
@yuqi-zhang and I talked some more about this. As for that variable, checking the context should be just as good (if cancelled, quietly no-op), but say I add a context check at the beginning of syncMachineConfigPool (see the sketch after this list):
- that doesn't save us from a "half-done" sync (like if we were already in the middle of the function body)
- if we wait for that "half-done" sync, our syncs could be even slower than they are today?
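A sketch of that context-check variant (simplified signature, not the real controller method); note it only helps before the function body starts, which is exactly the "half-done" sync limitation above:

```go
package controller

import "context"

// syncMachineConfigPool (simplified) bails out quietly if the shutdown
// context has already been cancelled before we start new work.
func syncMachineConfigPool(ctx context.Context, key string) error {
	if err := ctx.Err(); err != nil {
		// Shutting down: quietly no-op rather than starting a new sync.
		return nil
	}
	// ... the rest of the sync; long-running steps inside should also take
	// ctx so they can be interrupted mid-flight ...
	return nil
}
```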
Some other options we discussed:
- have the next controller, after acquiring the lease, add a small wait to ensure the previous pod is (most likely) terminated
- somehow track all the subcontrollers and routines and terminate them synchronously
- check what other controllers are doing for this
I agree we should check what other operators are doing here.
That said, to avoid races I think the general pattern (sketched below) is:
- Use a "done" channel we can poll in each controller sync loop
- Use a sync.WaitGroup to wait for the completion of the polling goroutines
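A minimal sketch of that pattern with hypothetical names (each real sub-controller would have its own loop body):

```go
package controller

import (
	"sync"
	"time"
)

// runSyncLoops runs each sync function on a ticker until the done channel is
// closed, then waits for every goroutine to finish before returning.
func runSyncLoops(done <-chan struct{}, loops ...func()) {
	var wg sync.WaitGroup
	for _, loop := range loops {
		wg.Add(1)
		go func(loop func()) {
			defer wg.Done()
			for {
				select {
				case <-done:
					// Shutdown signalled: stop quietly.
					return
				case <-time.After(time.Second):
					loop() // one sync pass
				}
			}
		}(loop)
	}
	// Block until every loop has observed the done channel, so the caller
	// knows it is safe to release the leader lease.
	wg.Wait()
}
```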
For correctness, everything that's making a blocking call should take a context and be cancelled when that happens. So given your example of syncRequiredMachineConfigPools()...we should be passing a context to that, and then it gets passed down into the API calls, like:
diff --git a/pkg/operator/sync.go b/pkg/operator/sync.go
index 69e4aa742..dd131f751 100644
--- a/pkg/operator/sync.go
+++ b/pkg/operator/sync.go
@@ -684,7 +684,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(_ *renderConfig) error {
return false, nil
}
optr.setOperatorStatusExtension(&co.Status, lastErr)
- _, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(context.TODO(), co, metav1.UpdateOptions{})
+ _, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
if err != nil {
lastErr = errors.Wrapf(lastErr, "failed to update clusteroperator: %v", err)
return false, nil
And there is also wait.PollWithContext that we should use instead of plain wait.Poll.
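For example, a hedged sketch of that switch (the function name and condition body are placeholders, not the operator's real check): the poll gives up as soon as the shutdown context is cancelled instead of running out the full timeout.

```go
package operator

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPools polls until the condition reports done, but also stops as soon
// as ctx is cancelled.
func waitForPools(ctx context.Context) error {
	return wait.PollWithContext(ctx, 10*time.Second, 10*time.Minute,
		func(ctx context.Context) (bool, error) {
			// Placeholder: report done immediately. The real check would
			// inspect MachineConfigPool status here.
			return true, nil
		})
}
```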
How about this as a short term plan:
As soon as we lose the lease, call os.Exit(0) - do not pass Go¹, do not collect $200.
This means we could interrupt some of the controllers mid-change, but they already need to be robust to that in theory.
¹ Get it? We're not "passing" any more Go code, the process is exiting...
Just curious, if we don't explicitly call os.Exit(0), wouldn't that essentially happen immediately (if we also remove <-time.After(5 * time.Second)) in this scenario?
I can't say for sure myself...possibly/probably? Immediately calling os.Exit(0) would reinforce that that's what we expect, and be robust to someone coming along later and adding something after the leader election loss (as unlikely as that probably is).
(I don't have a really strong opinion on this bit)
But I do think we should not call time.Sleep() - we should exit as fast as possible.
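Putting this plan together, here is a hedged sketch (not the PR's exact code; the helper name and durations are illustrative) of ReleaseOnCancel combined with an OnStoppedLeading callback that exits immediately, with no sleep:

```go
package controller

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

// runWithLeaderElection blocks while we hold the lease. Cancelling ctx
// releases the lease (ReleaseOnCancel), and losing the lease exits the
// process right away.
func runWithLeaderElection(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // free the lease when ctx is cancelled
		LeaseDuration:   90 * time.Second,
		RenewDeadline:   60 * time.Second,
		RetryPeriod:     30 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				klog.Info("leader election lost, exiting")
				os.Exit(0) // no time.Sleep: exit as fast as possible
			},
		},
	})
}
```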
/bugzilla refresh
@jianzhangbjz: This pull request references Bugzilla bug 1817075, which is valid. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.
In order to know when we need to shut down, we need a signal handler. And everything is context-driven, so we need to make sure that signal handler can cancel the context to signal everything else to shut down. This adds a signal handler function to the command helpers that the operator and the controller can share, since they both perform leader elections and they will both need to shut down cleanly to release their leases.
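A minimal sketch of such a shared helper, assuming Go 1.16+'s signal.NotifyContext (the helper name is made up, not necessarily what the PR adds):

```go
package common

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// ContextWithSignalHandler returns a context that is cancelled when the
// process receives SIGTERM or SIGINT, so both the operator and the controller
// can hang their shutdown (and lease release) off the same cancellation.
func ContextWithSignalHandler(parent context.Context) (context.Context, context.CancelFunc) {
	return signal.NotifyContext(parent, syscall.SIGTERM, os.Interrupt)
}
```

The operator and controller start functions would then derive their run context from this helper and pass it to leaderelection.RunOrDie.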
Previously we did not cleanly release the leader lease on controller shutdown, which resulted in slow leader elections as the existing lease was forced to time out when a pod got rescheduled. This adds a signal handler, a context, and sets the leaderelection settings such that when the controller receives a shutdown signal, it will release its leader lease and terminate so the new leader can more quickly take over.
Much like the controller, the operator does not relinquish its leader lease on shutdown. This results in additional delays when updating/redeploying, especially because the controller depends on controllerconfig/renderconfig being updated, and that has to wait behind the operator leader election. This makes sure the operator shuts down properly and releases its leader lease when its context is cancelled on shutdown.
The metrics handler previously always emitted an error when being shut down, even if the error was nil. This was okay before because we never actually shut the metrics handler down properly, but now that we're going to, this needs to be clearer. This makes the metrics handler only emit an error when there is an error, and emit an info message when it shuts down successfully.
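A hedged sketch of that "only log a real error" behavior (hypothetical names, not the actual metrics code): http.ErrServerClosed from a graceful Shutdown is treated as a clean stop rather than an error.

```go
package metrics

import (
	"errors"
	"net/http"

	"k8s.io/klog/v2"
)

// serve runs the metrics endpoint and distinguishes a clean shutdown from a
// real failure instead of always logging an error.
func serve(srv *http.Server) {
	err := srv.ListenAndServe()
	if err != nil && !errors.Is(err, http.ErrServerClosed) {
		klog.Errorf("metrics handler failed: %v", err)
		return
	}
	klog.Info("metrics handler shut down cleanly")
}
```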
(force-pushed from 99c28f8 to 61a9005)
Alright, so along the lines of the conversation in #3185 (comment), this simplifies things back to one single context and gets rid of the delays so everything immediately exits.
leaderelection.RunOrDie(runContext, leaderelection.LeaderElectionConfig{
	Lock:            common.CreateResourceLock(cb, startOpts.resourceLockNamespace, componentName),
	ReleaseOnCancel: true,
	RetryPeriod:     leaderElectionCfg.RetryPeriod.Duration,
Am I correct in understanding that, essentially, this release is the crux of the PR here? Previously, we took the leader election (lock?) and never released it until the timeout happened?
You are correct.
yuqi-zhang left a comment
Generally looks fine. With this new change, do you foresee any potentially dangerous scenarios we could fall into?
I don't see any "concrete" danger given how things actually operate. I do see a "hypothetical" danger in that a verrrrry small amount of time could potentially elapse between when we cancel the context and when we terminate with os.Exit(), and in that verrrrry small amount of time something could theoretically still sneak through.
And while technically I can say yes (because a very small amount of time could elapse during those go statements due to scheduling), realistically I think I can say no - I'm pretty sure the in-flight work will always take longer than the time between the context cancellation and the exit. (The dual context/5 second wait I had in there before was "try to get that in-progress work done before we exit", vs this now, which is "try to exit before we get any more work done".)
OK, I think we should be good to merge this. To be safe, I'm going to go ahead and /payload 4.11 ci blocking just to run a set of additional tests.
@yuqi-zhang: trigger 5 job(s) of type blocking for the ci release of OCP 4.11
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1106da80-f275-11ec-99c4-540d38102930-0
yuqi-zhang left a comment
/lgtm
All green on the payload side. I think this is of high value for upgrade timings to get into 4.11, so let's try to merge this.
Thanks for the great work John!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jkyros, yuqi-zhang.
@jkyros: all tests passed!
@jkyros: All pull requests linked via external trackers have merged.
Bugzilla bug 1817075 has been moved to the MODIFIED state.
TL;DR: SIGTERM/SIGINT --> cancel runContext --> shutdown --> cancel leaderContext --> release leader lease --> terminate
- What I did
  - Added runContext (for cancelling our running state) and leaderContext (for cancelling our leader lease) to both the controller and the operator run functions
  - Added a signal handler that cancels runContext when SIGTERM/SIGINT is received
  - Shutdown runs once runContext is cancelled, and then sequentially cancels leaderContext
  - The leader lease is released when leaderContext is cancelled
- How to verify it
- Description for the changelog
Controller and operator shut down cleanly on normal termination and release their leader lease cleanly, resulting in faster leader election times, since we don't have to wait for the previous lease to time out anymore.
Fixes: BZ1817075