[ WIP ] DO NOT MERGE Fully context-ifying the controller and operator for graceful shutdown #3194
Conversation
Both machine-config-controller and machine-config-operator need a signal handler to prompt them to gracefully shut down. This adds one to our cmd helpers to avoid code duplication.
The UpdateNodeRetry function has a Patch call that needs to be cancellable. This adds a context to UpdateNodeRetry so that the underlying Patch API call can be cancelled.
The resource application functions did not previously support context cancellation, but their underlying API calls did. This makes the resource apply functions support context cancellation all the way through, so we can properly cancel them in cases like when we need to shut down our controllers/operators.
When looping through controllers, like we do in our controller start functions in start.go, it is nice to know which controller is being started. Rather than change the familiar slice/append layout we've grown so used to (and keep the controllers in, say, a map indexed by name), I figured it would be better to have each controller know its own name so we can retrieve it.
The GetManagedKey helper function does a bunch of gets/creates that need to be cancellable. This modifies GetManagedKey to receive a context argument so those gets/creates can be cancelled via that context.
Previously we did not shut down the metrics listener properly, so this didn't matter, but it would always report an error (even if the error was nil). This adjusts it so the logs make the status of the shutdown clear, and it no longer emits an error on successful shutdown.
Not that we necessarily need it, but this adds contexts to all of the functions in the bootstrap controller that need to be cancelled to shut down properly.
This adds context cancellation to the functions in the container runtime controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the drain controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also makes the existing context that was part of the controller struct "live" rather than a context.TODO() for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the kubelet config controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the node controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable. This also adds a private context to the controller struct for cases where the function signature could not change due to interface constraints (like Informer event handlers), but that usage makes it less clear what is cancellable and what isn't, so I tried to avoid it unless absolutely necessary.
This adds context cancellation to the functions in the render controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
This adds context cancellation to the functions in the template controller. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
This adds context cancellation to the functions in the operator. This favors explicitly passing the context as the first parameter so it is apparent that the function is cancellable.
We changed the signature of UpdateNodeRetry over in internals because of what we're doing in controller/operator land; this just adjusts the call over here to take that into account.
A bunch of the method signatures changed as a result of the 'contextification' of the MCO's controllers and operator; this just updates the bootstrap test to match those new signatures (currently with context.TODO()).
This:
- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful. Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
This:
- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how many', so the channels are more useful. Cribbed off of what the CVO did here: openshift/cluster-version-operator#424
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jkyros. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Force-pushed 85928c8 to 4509154 (compare).
mtrmac left a comment:
(Note to self: I have only “marked as viewed” the files that were only changed to pass context values around; I didn’t review any of the actual changes.)
```diff
 t.Run(fmt.Sprintf("test#%d", idx), func(t *testing.T) {
 	client := fake.NewSimpleClientset(test.existing...)
-	_, actualModified, err := ApplyMachineConfig(client.MachineconfigurationV1(), test.input)
+	_, actualModified, err := ApplyMachineConfig(context.TODO(), client.MachineconfigurationV1(), test.input)
```
https://pkg.go.dev/context#Background suggests that tests should use context.Background — unless you plan to actually exercise the context cancellation in a future test enhancement.
(Applies throughout.)
Thanks for taking a look!
Since we were light on context usage we were light on context usage testing. 😄
Those context.TODO()s were intentional -- I hadn't decided if/how we wanted to exercise those yet, but ideally we would find some way to test this.
The right place might not be in the existing test, but I can't say at this point that it definitely won't be, so I wasn't comfortable setting those to context.Background() yet.
Oh, getting tests would be even better, of course.
@jkyros: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle stale
@jkyros: The following test failed, say
Full PR test history. Your PR dashboard.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting. /close
@openshift-bot: Closed this PR.
This is not a serious final attempt, please don't spend review cycles on it
What this does:
Follows the approach of Bug 1843505: pkg/start: Release leader lease on graceful shutdown (cluster-version-operator#424), which also seemed to work well for them.
Why:
We haven't really been using contexts in the MCO at all for cancellation and proper shutdown, but if we wanted to, this is more or less the amount of work I think we'd have to do.
This was prompted by a conversation in Bug 1817075: MCC & MCO don't free leader leases during shut down -> 10 minutes of leader election timeouts #3185 (comment) talking about shutting down "correctly".
I kind of wanted to see if it would break anything, and so far it doesn't look like it did, but I also want to see how it does in CI.