Conversation

@jkyros (Member) commented Jun 13, 2022

TL;DR: SIGTERM/SIGINT --> cancel runContext --> shutdown --> cancel leaderContext --> release leader lease --> terminate

- What I did

  • added two contexts to both the controller and operator run functions (see the sketch after this list):
    • a runContext (for cancelling our running state) and
    • a leaderContext (for cancelling our leader lease)
  • added a signal handler that catches SIGINT/SIGTERM to the controller and operator
  • updated the run functions such that:
    • the signal handler cancels runContext when SIGTERM/SIGINT is received
    • the operator/controller shuts down when runContext is cancelled, and then cancels leaderContext
  • updated the leaderElection config in the controller and operator to release the leader lease when leaderContext is cancelled
  • shut down the metrics listener properly
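
To make that flow concrete, here's a minimal sketch of the wiring described above (not the PR's actual code: the helper name and lease timings are placeholders, and the resource lock is passed in rather than built here):

    package start

    import (
        "context"
        "os"
        "os/signal"
        "syscall"
        "time"

        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    // runWithGracefulRelease runs the controller/operator work until a
    // shutdown signal arrives, then cancels the leader context so client-go
    // releases the lease before we terminate.
    func runWithGracefulRelease(lock resourcelock.Interface, run func(ctx context.Context)) {
        runContext, runCancel := context.WithCancel(context.Background())
        leaderContext, leaderCancel := context.WithCancel(context.Background())
        defer leaderCancel()

        // Signal handler: SIGTERM/SIGINT cancels runContext first.
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-sigs
            runCancel()
        }()

        leaderelection.RunOrDie(leaderContext, leaderelection.LeaderElectionConfig{
            Lock:            lock,
            ReleaseOnCancel: true,              // release the lease when leaderContext is cancelled
            LeaseDuration:   137 * time.Second, // placeholder timings, not the real config
            RenewDeadline:   107 * time.Second,
            RetryPeriod:     26 * time.Second,
            Callbacks: leaderelection.LeaderElectionCallbacks{
                OnStartedLeading: func(_ context.Context) {
                    run(runContext) // blocks until the signal handler cancels runContext
                    leaderCancel()  // ...then drop the leader lease
                },
                OnStoppedLeading: func() {
                    // No longer the leader (lease released or lost); Run() returns and we terminate.
                },
            },
        })
    }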

- How to verify it

  • delete a controller or an operator pod
  • observe logging that shows it shutting down/releasing the lease
  • observe that the pod that replaces it has a short (< 1 minute) leader election wait

- Description for the changelog
The controller and operator now shut down cleanly on normal termination and release their leader leases, resulting in faster leader elections, since we no longer have to wait for the previous lease to time out.

Fixes: BZ1817075

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 13, 2022
@openshift-ci bot commented Jun 13, 2022

@jkyros: This pull request references Bugzilla bug 1817075, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 103 to 104
Member:

Perhaps worth factoring this into a helper function to share with controller?

Or at least just a cookie for future readers to know there's code duplication, like
// Note, if you're changing this you probably also want to change machine-config-controller/start.go
or so?

Comment on lines 99 to 101
Member:

Hmm. But in practice aren't we opening ourselves to major problems here if:

  • we drop the lease
  • new controller starts and starts making changes
  • old controller also makes changes

Now in most cases, both controllers would be making the same decisions, but not necessarily. For example, imagine that we change the logic for which node to pick to upgrade; if the old and new controllers race here and make different choices, we could violate maxUnavailable.

I think we need to do something like set a variable when we get SIGTERM and have all of our control loop handlers like syncMachineConfigPool quietly no-op if it's set.
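
For illustration, a minimal sketch of that guard with hypothetical names (not the MCO's actual wiring; it assumes Go 1.19's sync/atomic.Bool):

    package controller

    import (
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
    )

    // shuttingDown is flipped by the signal handler once SIGTERM/SIGINT arrives.
    var shuttingDown atomic.Bool

    // installShutdownGuard registers the signal handler that sets the flag.
    func installShutdownGuard() {
        sigs := make(chan os.Signal, 1)
        signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-sigs
            shuttingDown.Store(true)
        }()
    }

    // syncMachineConfigPool quietly no-ops once shutdown has started, so a
    // lame-duck controller can't race the new leader on decisions like which
    // node to update next.
    func syncMachineConfigPool(key string) error {
        if shuttingDown.Load() {
            return nil
        }
        // ... normal sync logic would go here ...
        return nil
    }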

Member Author:

Yeah, this was my concern too. I talked about it with Jerry a little last week.

My thought here was:

  • if there are no more events coming in, and no workers working (because the context cancelled them all), syncMachineConfigPool can't run again because there's no event to trigger it
  • so the only thing we should have to worry about is the one syncMachineConfigPool that is already in progress

But yes, it would make me feel safer if we had something to stop us if some race-condition-y thing happened and something snuck through the queue while everything was reacting to the context cancellation.

@jkyros (Member Author) Jun 13, 2022:

@yuqi-zhang and I talked some more about this. As for that variable, checking the context should be just as good (if cancelled, quietly no-op), but say I add a context check at the beginning of syncMachineConfigPool (sketched below):

  • that doesn't save us from a "half-done" sync (like if we were already in the middle of the function body)
  • if we wait for that "half-done" sync, our syncs could be even slower than they are today?
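
A tiny sketch of that context check, with a hypothetical signature (the real syncMachineConfigPool doesn't take a context today), just to show where the "half-done" gap sits:

    package controller

    import "context"

    // syncMachineConfigPool quietly no-ops if ctx was already cancelled, but a
    // sync that is already past this check still runs to completion; that is
    // the "half-done" case described above.
    func syncMachineConfigPool(ctx context.Context, key string) error {
        if ctx.Err() != nil {
            return nil // shutting down: don't start new work
        }
        // ... normal sync logic would go here ...
        return nil
    }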

Contributor:

Some other options we discussed:

  • have the next controller, after acquiring the lease, add a small wait to ensure the previous pod is (most likely) terminated
  • somehow track all the subcontrollers and routines and terminate them synchronously
  • check what other controllers are doing for this

Member:

I agree we should check what other operators are doing here.

That said, to avoid races I think the general pattern is:

  • Use a "done" channel we can poll in each controller sync loop
  • Use a sync.WaitGroup to wait for the completion of the polling goroutines
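
A minimal sketch of that pattern with hypothetical names (not the MCO's actual controller wiring): each loop selects on a shared done channel, and a sync.WaitGroup lets shutdown block until every loop has drained before the lease is released.

    package controller

    import (
        "sync"
        "time"
    )

    // runSyncLoops runs each sync function on a ticker until done is closed,
    // then waits for every loop to finish before returning, so the caller can
    // safely release the leader lease afterwards.
    func runSyncLoops(done <-chan struct{}, syncs []func()) {
        var wg sync.WaitGroup
        for _, syncFn := range syncs {
            syncFn := syncFn // capture for the goroutine (pre-Go 1.22 loop semantics)
            wg.Add(1)
            go func() {
                defer wg.Done()
                ticker := time.NewTicker(time.Second)
                defer ticker.Stop()
                for {
                    select {
                    case <-done:
                        return // shutdown requested: stop picking up new work
                    case <-ticker.C:
                        syncFn()
                    }
                }
            }()
        }
        wg.Wait() // every loop has exited; now it's safe to drop the lease
    }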

Member:

For correctness, everything that's making a blocking call should take a context and be cancelled when that happens. So given your example of syncRequiredMachineConfigPools()...we should be passing a context to that, which then gets passed down into the API calls, like:

diff --git a/pkg/operator/sync.go b/pkg/operator/sync.go
index 69e4aa742..dd131f751 100644
--- a/pkg/operator/sync.go
+++ b/pkg/operator/sync.go
@@ -684,7 +684,7 @@ func (optr *Operator) syncRequiredMachineConfigPools(_ *renderConfig) error {
 				return false, nil
 			}
 			optr.setOperatorStatusExtension(&co.Status, lastErr)
-			_, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(context.TODO(), co, metav1.UpdateOptions{})
+			_, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(ctx, co, metav1.UpdateOptions{})
 			if err != nil {
 				lastErr = errors.Wrapf(lastErr, "failed to update clusteroperator: %v", err)
 				return false, nil

Member:

And there is also wait.PollWithContext, which we should use instead of plain wait.Poll.
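
For reference, a hedged sketch of what that substitution might look like (hypothetical helper, not the MCO code): wait.PollWithContext takes a context and stops polling as soon as that context is cancelled, instead of running out the full timeout like plain wait.Poll.

    package operator

    import (
        "context"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForCondition polls check every second for up to five minutes, but
    // returns early once ctx is cancelled (e.g. on shutdown), so a lame-duck
    // operator stops making API calls.
    func waitForCondition(ctx context.Context, check wait.ConditionWithContextFunc) error {
        return wait.PollWithContext(ctx, time.Second, 5*time.Minute, check)
    }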

Member:

How about this as a short-term plan:

As soon as we lose the lease, call os.Exit(0) - do not pass Go¹, do not collect $200.

This means we could interrupt some of the controllers mid-change, but they already need to be robust to that in theory.

¹ Get it? We're not "passing" any more Go code, the process is exiting...
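
A minimal sketch of that short-term plan, assuming the standard client-go callbacks (the helper and runControllers are hypothetical, not the PR's exact code):

    package controller

    import (
        "context"
        "os"

        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/klog/v2"
    )

    // newLeaderCallbacks wires runControllers into leader election: the moment
    // we stop leading (lease released or lost), exit immediately rather than
    // let any in-flight work race the new leader.
    func newLeaderCallbacks(runControllers func(ctx context.Context)) leaderelection.LeaderElectionCallbacks {
        return leaderelection.LeaderElectionCallbacks{
            OnStartedLeading: runControllers,
            OnStoppedLeading: func() {
                klog.Info("stopped leading, exiting immediately")
                os.Exit(0)
            },
        }
    }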

Contributor:

Just curious, if we don't explicitly call os.Exit(0), wouldn't that essentially happen immediately (if we also remove <-time.After(5 * time.Second)) in this scenario?

Member:

I can't say for sure myself...possibly/probably? Immediately calling os.Exit(0) would reinforce that that's what we expect, and be robust to someone coming along later and adding something after the leader election loss (as unlikely as that probably is).

(I don't have a really strong opinion on this bit)

But I do think we should not call time.Sleep() - we should exit as fast as possible.

@jianzhangbjz (Member):

/bugzilla refresh

@openshift-ci bot commented Jun 15, 2022

@jianzhangbjz: This pull request references Bugzilla bug 1817075, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jkyros added 4 commits June 17, 2022 17:32

In order to know when we need to shut down properly, we need a signal
handler. And everything is context-driven, so we need to make sure that
signal handler can cancel the context to signal everything else to shut
down.

This adds a signal handler function to the command helpers that the
operator and the controller can share, since they both perform leader
elections and they will both need to shut down cleanly to release their
leases.

Previously we did not cleanly release the leader lease on controller
shutdown, which resulted in slow leader elections as the existing lease
was forced to time out when a pod got rescheduled.

This adds a signal handler, a context, and sets the leaderelection
settings such that when the controller receives a shutdown signal, it
will release its leader lease and terminate so the new leader can more
quickly take over.

Much like the controller, the operator does not relinquish its leader
lease on shutdown. This results in additional delays when
updating/redeploying, especially because the controller depends on
controllerconfig/renderconfig being updated, and that has to wait behind
the operator leader election.

This makes sure the operator shuts down properly and releases its leader
lease when its context is cancelled on shutdown.

The metrics handler previously always emitted an error when being shut
down, even if the error was nil. This was okay before because we never
actually shut the metrics handler down properly, but now that we're
going to, this needs to be clearer.

This makes the metrics handler only emit an error when there is an
error, and emit an info message when it successfully shuts down.
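
A hypothetical sketch of the shutdown behaviour the last commit describes (not the actual MCO metrics code): only log an error if the listener actually failed, and log an info message on a clean shutdown.

    package metrics

    import (
        "context"
        "net/http"

        "k8s.io/klog/v2"
    )

    // startMetricsListener serves until ctx is cancelled, then shuts the
    // server down; an error is only reported when there really is one.
    func startMetricsListener(ctx context.Context, addr string) {
        server := &http.Server{Addr: addr}

        go func() {
            <-ctx.Done()
            _ = server.Shutdown(context.Background()) // causes ListenAndServe to return ErrServerClosed
        }()

        if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            klog.Errorf("metrics listener exited with error: %v", err)
            return
        }
        klog.Info("metrics listener shut down cleanly")
    }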
@jkyros (Member Author) commented Jun 17, 2022

Alright, so along the lines of the conversation in #3185 (comment), this simplifies things back to a single context and gets rid of the delays so everything immediately exits.

RetryPeriod: leaderElectionCfg.RetryPeriod.Duration,
leaderelection.RunOrDie(runContext, leaderelection.LeaderElectionConfig{
Lock: common.CreateResourceLock(cb, startOpts.resourceLockNamespace, componentName),
ReleaseOnCancel: true,
Contributor:

Am I correct in understanding that, essentially, this release is the crux of the PR here? Previously, we took the leader election (lock?) and never released it until the timeout happened?

Member Author:

You are correct.

@yuqi-zhang (Contributor) left a comment:

Generally looks fine. With this new change, do you foresee any potentially dangerous scenarios we could fall into?

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 20, 2022
@jkyros (Member Author) commented Jun 21, 2022

Generally looks fine. With this new change, do you foresee any potentially dangerous scenarios we could fall into?

I don't see any "concrete" danger given how things actually operate. I do see a "hypothetical" danger in that a verrrrry small amount of time could potentially elapse between when we cancel the context and when we terminate with os.Exit(). In that verrrrry small amount of time:

  • whatever is in progress is still progressing (e.g. an in-progress syncMachineConfigPool), but since we've also cancelled the controller context (which stops the informers, etc.) we aren't going to process any new events at that point.

To expand on that:

  • A new pod can't take over and do work until we release the lease
  • the same context that cancels the lease also cancels the controller context, so it can't receive any new informer events/take on any new work once we cancel
  • we die immediately after we release the lease
  • the only gap where anything can technically happen is between
    • when the lease is cancelled and
    • when we die to os.Exit(0)
  • And that's in client-go and the defer fires the moment we leave the leaderelection Run() loop, which we do immediately when the context is cancelled:

        if !le.acquire(ctx) {
            return // ctx signalled done
        }
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        go le.config.Callbacks.OnStartedLeading(ctx)
        le.renew(ctx)
    }

So it's:

  • "can another pod grab the lease and start working on something we're already working on in the time it takes for a defer to call our function that calls os.Exit(0) ?".

And while technically I can say yes (because a very small amount of time could elapse during those go statements due to scheduling), realistically I think I can say no. Also, I'm pretty sure it will always take longer than that (the time between the context cancellation and os.Exit()) for the next pod to acquire the leader lease.

(The dual context/5-second wait I had in there before was "try to get that in-progress work done before we exit" vs. this now, which is "try to exit before we get any more work done".)

@yuqi-zhang (Contributor):

OK, I think we should be good to merge this. To be safe, I'm going to go ahead and

/payload 4.11 ci blocking

just to run a set of additional tests

@openshift-ci bot commented Jun 22, 2022

@yuqi-zhang: trigger 5 job(s) of type blocking for the ci release of OCP 4.11

  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade
  • periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1106da80-f275-11ec-99c4-540d38102930-0

@yuqi-zhang (Contributor) left a comment:

/lgtm

All green on the payload side. I think this is of high value to upgrade timings to get in for 4.11, so let's try to merge this.

Thanks for the great work John!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2022
@openshift-ci bot commented Jun 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkyros, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot commented Jun 23, 2022

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 4aa21da into openshift:master Jun 23, 2022
@openshift-ci bot commented Jun 23, 2022

@jkyros: All pull requests linked via external trackers have merged:

Bugzilla bug 1817075 has been moved to the MODIFIED state.

In response to this:

Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
