Skip to content

Conversation

@jkyros
Copy link
Member

@jkyros jkyros commented Jun 17, 2022

This is not a serious final attempt, please don't spend review cycles on it

What this does:

  • Adds contexts throughout the operator and controller for any functions/api calls that can be cancelled.
  • Replaces our bespoke stop channels with contexts
  • Replaces polling functions with the context-aware equivalents
  • Favors explicitly passing the context as the first argument in the method signature vs using a context from the struct the method is defined on
  • Includes a context in some of the controller structs for things like Informer event handler functions whose signatures couldn't change without breaking the interface requirements
  • Uses those contexts I added to cancel and properly shut down when SIGINT/SIGTERM is received
  • The shutdown process was more or less a flavor of what the CVO did in
    Bug 1843505: pkg/start: Release leader lease on graceful shutdown cluster-version-operator#424, which also seemed to work well for them.

Why:

jkyros added 18 commits June 15, 2022 13:42
Both machine-config-controller and machine-config-operator have the need
of a signal handler to prompt them to gracefully shut down.

This adds one to our cmd helpers to avoid code duplication.
The UpdateNodeRetry function has a Patch call that needs to be
cancellable.

This adds a context to UpdateNodeRetry so that underling Patch API call
call can be cancelled
The resource application functions did not previously support context
cancellation, but their underlying API calls did.

This makes the resource apply functions support context cancellation all
the way through, so we can properly cancel them in cases like when we
need to shut down our controllers/operators.
When looping through controllers, like we do in our controller start
functions in start.go, it is nice to know which controller is being
started.

Rather than change the familiar slice/append layout we've grown so used
to (and keep the controllers in say, a map or something indexed by name,
I figured it would be better to have each operator know it's own name
so we can retrieve it.
The GetManagedKey helper function does a bunch of gets/creates that need
to be cancellable.

This modifies GetManagedKey to receive a context argument so those
gets/creates can be cancelled via that context.
Previously we did not shutdown the metric listener properly, so this
didn't matter, but it would always report an error (even if the error
was 'nil').

This adjusts it so it makes it clear in the logs what the status of the
shutdown is, and it no longer emits an error on successful shutdown
Not that we necessarily need it, but this adds contexts to all of the
functions in the bootstrap controller that need to be cancelled to shut
down properly.
This adds context cancellation to the functions in the container runtime
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.

This also adds a private context to the controller struct for cases
where the function signature could not change due to interface
constraints (like Informer event handlers), but that usage makes it less
clear what is cancellable and what isn't, so I tried to avoid it unless
absolutely necessary.
This adds context cancellation to the functions in the drain
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.

This also makes the existing context that was part of the controller
struct "live" rather than a context.TODO() for cases where the function
signature could not change due to interface constraints (like Informer
event handlers), but that usage makes it less clear what is
cancellable and what isn't, so I tried to avoid it unless absolutely
necessary.
This adds context cancellation to the functions in the kubelet config
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.

This also adds a private context to the controller struct for cases
where the function signature could not change due to interface
constraints (like Informer event handlers), but that usage makes it less
clear what is cancellable and what isn't, so I tried to avoid it unless
absolutely necessary.
This adds context cancellation to the functions in the node
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.

This also adds a private context to the controller struct for cases
where the function signature could not change due to interface
constraints (like Informer event handlers), but that usage makes it less
clear what is cancellable and what isn't, so I tried to avoid it unless
absolutely necessary.
This adds context cancellation to the functions in the render
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.
This adds context cancellation to the functions in the template
controller.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.
This adds context cancellation to the functions in the operator.

This favors explicitly passing the context as the first parameter so it
is apparent that the function is cancellable.
We changed the signature of UpdateNodeRetry over in internals because of
what we're doing in controller/operator land, this just adjusts the call
over here to take that into account.
A bunch of the method signatures changed as a result of the
'contextification' of the MCO's controllers and operator, this just
updates the bootstrap test to match those new signatures (currently with
context.TODO() )
This:
- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being
  terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
This:
- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being
  terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
@openshift-ci openshift-ci bot requested review from cheesesashimi and mtrmac June 17, 2022 23:24
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 17, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
To complete the pull request process, please assign sinnykumari after the PR has been reviewed.
You can assign the PR to them by writing /assign @sinnykumari in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note to self: I have only “marked as viewed” the files that were only changed to pass context values around; I didn’t review any of the actual changes.)

t.Run(fmt.Sprintf("test#%d", idx), func(t *testing.T) {
client := fake.NewSimpleClientset(test.existing...)
_, actualModified, err := ApplyMachineConfig(client.MachineconfigurationV1(), test.input)
_, actualModified, err := ApplyMachineConfig(context.TODO(), client.MachineconfigurationV1(), test.input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pkg.go.dev/context#Background suggests that tests should use context.Background — unless you plan to actually exercise the context cancellation in a future test enhancement.

(Applies throughout.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look!

Since we were light on context usage we were light on context usage testing. 😄

Those context.TODO()s were intentional -- I hadn't decided if/how we wanted to exercise those yet, but ideally we would find some way to test this.

It very well might not be right there in the existing test, but I don't know that I can say at this point that it 100% will not, so I wasn't comfortable setting those to context.Background() yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, getting tests would be even better, of course.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 25, 2022

@jkyros: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 6, 2022

@jkyros: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-images 4509154 link true /test okd-scos-images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 6, 2022
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Dec 6, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 6, 2022

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants