[ WIP ] DO NOT MERGE Fully context-ifying the controller and operator for graceful shutdown #3194
Conversation
Both machine-config-controller and machine-config-operator need a signal handler to prompt them to gracefully shut down. This adds one to our cmd helpers to avoid code duplication.
The UpdateNodeRetry function has a Patch call that needs to be cancellable. This adds a context to UpdateNodeRetry so that the underlying Patch API call can be cancelled.
The resource application functions did not previously support context cancellation, but their underlying API calls did. This makes the resource apply functions support context cancellation all the way through, so we can properly cancel them in cases like when we need to shut down our controllers/operators.
When looping through controllers, like we do in our controller start functions in start.go, it is nice to know which controller is being started. Rather than change the familiar slice/append layout we've grown so used to (and keep the controllers in, say, a map indexed by name), I figured it would be better to have each controller know its own name so we can retrieve it.
The GetManagedKey helper function does a bunch of gets/creates that need to be cancellable. This modifies GetManagedKey to receive a context argument so those gets/creates can be cancelled via that context.
Previously we did not shut down the metrics listener properly, so this didn't matter, but it would always report an error (even if the error was nil). This adjusts it so the logs make the status of the shutdown clear, and it no longer emits an error on successful shutdown.
Not that we necessarily need it, but this adds contexts to all of the functions in the bootstrap controller that need to be cancelled to shut down properly.
This adds context cancellation to the functions in the container runtime controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the drain controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also makes the existing context that was part of the controller struct "live" rather than a context.TODO() for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the kubelet config controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the node controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the render controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
This adds context cancellation to the functions in the template controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
This adds context cancellation to the functions in the operator. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
We changed the signature of UpdateNodeRetry over in internals because of what we're doing in controller/operator land; this just adjusts the call over here to take that into account.
A bunch of the method signatures changed as a result of the 'contextification' of the MCO's controllers and operator; this just updates the bootstrap test to match those new signatures (currently with context.TODO()).
This:
- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful. Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
This:
- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful. Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jkyros. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Force-pushed 85928c8 to 4509154 (compare).
mtrmac left a comment:
(Note to self: I have only “marked as viewed” the files that were only changed to pass context values around; I didn’t review any of the actual changes.)
```diff
 t.Run(fmt.Sprintf("test#%d", idx), func(t *testing.T) {
 	client := fake.NewSimpleClientset(test.existing...)
-	_, actualModified, err := ApplyMachineConfig(client.MachineconfigurationV1(), test.input)
+	_, actualModified, err := ApplyMachineConfig(context.TODO(), client.MachineconfigurationV1(), test.input)
```
https://pkg.go.dev/context#Background suggests that tests should use context.Background — unless you plan to actually exercise the context cancellation in a future test enhancement.
(Applies throughout.)
Thanks for taking a look!
Since we were light on context usage we were light on context usage testing. 😄
Those context.TODO()s were intentional -- I hadn't decided if/how we wanted to exercise those yet, but ideally we would find some way to test this.
The right place might not be in the existing test, but I can't say at this point that it definitely won't be, so I wasn't comfortable setting those to context.Background() yet.
Oh, getting tests would be even better, of course.
@jkyros: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle stale
@jkyros: The following test failed, say
Full PR test history. Your PR dashboard.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting. /close
@openshift-bot: Closed this PR.
This is not a serious final attempt, please don't spend review cycles on it
What this does:
Follows the approach of Bug 1843505: pkg/start: Release leader lease on graceful shutdown (cluster-version-operator#424), which also seemed to work well for them.
Why:
We haven't really been using contexts in the MCO at all for cancellation and proper shutdown, but if we wanted to, this is more or less the amount of work I think we'd have to do.
This was prompted by a conversation in Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts #3185 (comment) talking about shutting down "correctly".
I kind of wanted to see if it would break anything, and so far it doesn't look like it did, but I also want to see how it does in CI.