
Conversation

@wking
Member

@wking wking commented Jan 28, 2021

Builds on #501; possibly review that first.

In rhbz#1921413, we had a CVO fail to gracefully release the leader lock, with:

$ grep start.go: namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
2021-01-26T19:02:49.273117163Z I0126 19:02:49.273074       1 start.go:260] Waiting on 2 outstanding goroutines.
2021-01-26T19:02:49.273512006Z E0126 19:02:49.273474       1 start.go:271] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273551       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273569       1 start.go:264] Run context completed; beginning two-minute graceful shutdown period.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273577       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:04:49.273938652Z I0126 19:04:49.273877       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated

So the bind failed, which should be rare, but can happen if the new CVO is scheduled before the kernel gets around to reaping the port from a previous CVO pod. The start.go handling notices the metrics server dying, and immediately cancels runContext, and we go into the two-minute shutdown grace period. That happened so early that the context was already done by the time we got into Operator.Run, with its:

go wait.UntilWithContext(runContext, func(runContext context.Context) {
 defer close(workerStopCh)

From the godocs for UntilWithContext:

$ git --no-pager grep -n '// UntilWithContext is syntactic sugar\|f may not be invoked if context\|func JitterUntilWithContext' vendor/k8s.io/apimachinery/pkg/util/wait
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:95:// UntilWithContext is syntactic sugar on top of JitterUntilWithContext
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:183:// Cancel context to stop. f may not be invoked if context is already expired.
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:184:func JitterUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration, jitterFactor float64, sliding bool) {

So when runContext is already done, we never close workerStopCh, and we hang forever when we block on it before returning from Operator.Run.
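
For anyone who wants to see the failure mode in isolation, here is a minimal standalone reproduction (illustrative only, not CVO code) of how a pre-canceled context keeps UntilWithContext from ever invoking f, so the deferred close never happens:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate runContext already being done before we reach Operator.Run

	workerStopCh := make(chan struct{})
	go wait.UntilWithContext(ctx, func(ctx context.Context) {
		defer close(workerStopCh) // never runs: f is skipped when ctx is already expired
		fmt.Println("worker running")
	}, time.Second)

	select {
	case <-workerStopCh:
		fmt.Println("worker stopped")
	case <-time.After(3 * time.Second):
		fmt.Println("hung: workerStopCh was never closed") // the deadlock we hit in the CVO
	}
}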

This commit addresses that issue by bringing in a resultChannel flow like the one we already use in pkg/start/start.go and pkg/cvo/metrics.go. It gives us logging for the number of outstanding goroutines and tells us as each of them is reaped, which will make it easier to diagnose any other issues with hanging children, while also using shutdownContext to bail out so a stuck child cannot cause major problems.
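
To make the shape of that flow concrete, here is a rough, self-contained sketch of the resultChannel pattern (the function and child names are illustrative, not the actual CVO wiring):

package main

import (
	"context"
	"time"

	"k8s.io/klog/v2"
)

type asyncResult struct {
	name  string
	error error
}

// runChildren launches each child goroutine, then collects their results,
// logging progress so hangs are easy to spot.  shutdownCtx bounds how long
// we are willing to wait for stuck children.
func runChildren(runCtx, shutdownCtx context.Context, children map[string]func(context.Context) error) error {
	resultChannel := make(chan asyncResult, len(children))
	resultChannelCount := 0
	for name, child := range children {
		resultChannelCount++
		go func(name string, child func(context.Context) error) {
			resultChannel <- asyncResult{name: name, error: child(runCtx)}
		}(name, child)
	}

	var firstError error
	for resultChannelCount > 0 {
		klog.Infof("Waiting on %d outstanding goroutines.", resultChannelCount)
		select {
		case result := <-resultChannel:
			resultChannelCount--
			if result.error == nil {
				klog.Infof("Collected %s goroutine.", result.name)
				continue
			}
			klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
			if firstError == nil {
				firstError = result.error
			}
		case <-shutdownCtx.Done():
			// Safety valve: abandon stuck children rather than hanging forever.
			klog.Errorf("Abandoning %d uncollected goroutines: %v", resultChannelCount, shutdownCtx.Err())
			return shutdownCtx.Err()
		}
	}
	return firstError
}

func main() {
	runCtx, runCancel := context.WithCancel(context.Background())
	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer shutdownCancel()

	go func() {
		time.Sleep(time.Second)
		runCancel() // ask the children to exit
	}()

	err := runChildren(runCtx, shutdownCtx, map[string]func(context.Context) error{
		"cluster version sync": func(ctx context.Context) error { <-ctx.Done(); return nil },
		"status notifier":      func(ctx context.Context) error { <-ctx.Done(); return nil },
	})
	klog.Infof("All children collected, first error: %v", err)
}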

There is a risk with these cluster-touching children: if we hit the shutdownContext-timeout safety valve, we might exit Operator.Run and release the leader lock, and then have our abandoned child fighting over in-cluster state with a subsequent CVO that has just acquired the leader lock. We might want to write an event or something to make it more obvious when that happens, but for now, I'm more interested in closing the context-already-canceled bug.

@openshift-ci-robot openshift-ci-robot added the approved label Jan 28, 2021
@wking wking changed the title from "pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels" to "Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels" Jan 28, 2021
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-unspecified label Jan 28, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1921413, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug label Jan 28, 2021
@wking
Member Author

wking commented Jan 28, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium and bugzilla/valid-bug labels and removed the bugzilla/severity-unspecified and bugzilla/invalid-bug labels Jan 28, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1921413, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from c27f12d to 70a99d5 on January 28, 2021 21:34
@wking
Member Author

wking commented Jan 29, 2021

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1354905902246793216/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\)\|FAIL'
...
--- FAIL: TestIntegrationCVO_gracefulStepDown (11.07s)
I0128 22:12:10.009539    9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:10.009563    9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:10.009690    9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:10.009699    9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:10.010016    9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:10.010021    9056 cvo.go:371] Waiting on 2 outstanding goroutines.
I0128 22:12:13.380240    9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:13.380281    9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:13.380332    9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:13.380338    9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:13.380711    9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:13.380720    9056 cvo.go:371] Waiting on 2 outstanding goroutines.
FAIL
...

So something's going on in those tests to bail us out before we actually reap all the children.

@wking
Member Author

wking commented Jan 29, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold label Jan 29, 2021
@wking
Member Author

wking commented Feb 4, 2021

In case this was some sort of e2e-operator race (which has bitten us in the past, but not with this exact presentation):

/retest

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from 70a99d5 to ffe5f5e on February 4, 2021 05:11
@wking
Member Author

wking commented Feb 4, 2021

Comparing the above logs (the second attempt looks similar), the two children not getting reaped are the available-updates and upgradeable goroutines. I've pushed 70a99d5 -> ffe5f5e, which rebases on master now that #501 has landed, and adds some debug logging for the upgradeable goroutine so we can get a better feel for where things are getting stuck.

@wking
Member Author

wking commented Feb 4, 2021

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1357195059073650688/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\|goroutine\)\|FAIL'
...
I0204 05:50:45.897564    9560 cvo.go:342] Launched upgradeable goroutine
I0204 05:50:45.897575    9560 cvo.go:344] Processing work for upgradeable goroutine
I0204 05:50:45.897530    9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907356    9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907371    9560 cvo.go:384] Collected sync worker goroutine.
I0204 05:50:45.907377    9560 cvo.go:378] Waiting on 4 outstanding goroutines.
I0204 05:50:45.907402    9560 cvo.go:384] Collected status notifier goroutine.
I0204 05:50:45.907425    9560 cvo.go:378] Waiting on 3 outstanding goroutines.
I0204 05:50:45.907430    9560 cvo.go:384] Collected cluster version sync goroutine.
I0204 05:50:45.907433    9560 cvo.go:378] Waiting on 2 outstanding goroutines.
I0204 05:50:46.087418    9560 cvo.go:399] Collected status notifier goroutine.
I0204 05:50:46.087451    9560 cvo.go:378] Waiting on 4 outstanding goroutines.
...

So we're stuck in processing, although without timestamps it's hard to know for how long. I'm guessing we are getting stuck in:

func processNextWorkItem(ctx context.Context, queue workqueue.RateLimitingInterface, syncHandler func(context.Context, string) error, syncFailingStatus syncFailingStatusFunc) bool {
        key, quit := queue.Get()

Looking to see if there's a Context-aware Get...

In [1], we had a CVO fail to gracefully release the leader lock, with:

  $ grep start.go: namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
  2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
  2021-01-26T19:02:49.273117163Z I0126 19:02:49.273074       1 start.go:260] Waiting on 2 outstanding goroutines.
  2021-01-26T19:02:49.273512006Z E0126 19:02:49.273474       1 start.go:271] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273551       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273569       1 start.go:264] Run context completed; beginning two-minute graceful shutdown period.
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273577       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:04:49.273938652Z I0126 19:04:49.273877       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated

So the bind failed, which should be rare, but can happen if the new
CVO is scheduled before the kernel gets around to reaping the port
from a previous CVO pod.  The start.go handling notices the metrics
server dying, and immediately cancels runContext, and we go into the
two-minute shutdown grace period.  That happened so early that the
context was already done by the time we got into Operator.Run, with
its:

  go wait.UntilWithContext(runContext, func(runContext context.Context) {
    defer close(workerStopCh)

From the godocs for UntilWithContext:

  $ git --no-pager grep -n '// UntilWithContext is syntactic sugar\|f may not be invoked if context\|func JitterUntilWithContext' vendor/k8s.io/apimachinery/pkg/util/wait
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:95:// UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:183:// Cancel context to stop. f may not be invoked if context is already expired.
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:184:func JitterUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration, jitterFactor float64, sliding bool) {

So when runContext is already done, we never close workerStopCh, and
we hang forever when we block on it before returning from
Operator.Run.

This commit addresses that issue by bringing in a resultChannel flow
like we already use in pkg/start/start.go and pkg/cvo/metrics.go.  It
will give us logging for the number of outstanding goroutines, and
tell us as each of them are reaped, which will make it easier to
diagnose any other issues with hanging children, while also using
shutdownContext to bail out to avoid a stuck child causing major
problems.

This commit also adds ShutDown calls for availableUpdatesQueue and
upgradeableQueue, so we don't hang those two goroutines in
processNextWorkItem's queue.Get() call.  The shutdowns should have
been added when the queues landed in 90e9881 (cvo: Change the core
CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45) and
0452814 (cvo: long running upgradeable sync worker, 2019-09-05), but
the omission went unnoticed until this commit tightened down on the
goroutine collection.

There is a risk with these cluster-touching children, that if we hit
the shutdownContext-timeout safety valve, we might exit Operator.Run
and release the leader lock, and then have our abandoned child
fighting over in-cluster state with some subsequent CVO which just
acquired the leader lock.  We might want to write an event or
something to make it more obvious when that happens, but for now, I'm
more interested in closing the context-already-canceled bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1921413
@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from ffe5f5e to 3d5b0e0 on February 4, 2021 19:02
@wking
Member Author

wking commented Feb 4, 2021

I've pushed ffe5f5e -> 3d5b0e0, adding ShutDown calls for availableUpdatesQueue and upgradeableQueue, which I expect was why I wasn't reaping those goroutines in the e2e-operator tests. More details in one paragraph of the a9e075a commit message for the curious. If we see these goroutines successfully released in the current e2e-operator run, I'll drop the debugging commit.
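
For reference, the property the fix relies on is that workqueue's blocking Get() only returns quit=true once ShutDown() has been called; the sketch below (illustrative only, not the actual CVO change) shows how wiring ShutDown to context cancellation unblocks a worker stuck in Get():

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "upgradeable")
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	// Worker: drains the queue, like processNextWorkItem's queue.Get() loop.
	go func() {
		defer close(done)
		for {
			key, quit := queue.Get() // blocks; there is no context-aware Get
			if quit {
				return // only reachable once queue.ShutDown() has been called
			}
			fmt.Println("processing", key)
			queue.Done(key)
		}
	}()

	// Without this, cancelling ctx would leave the worker blocked in Get() forever.
	go func() {
		<-ctx.Done()
		queue.ShutDown()
	}()

	cancel() // simulate runContext being cancelled
	<-done
	fmt.Println("worker goroutine collected")
}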

@wking
Member Author

wking commented Feb 4, 2021

e2e-operator exploded on install (missing compute nodes, and possibly more). e2e passed install and died in e2e test-cases, so the e2e-operator failure must have been an unrelated flake.

/test e2e-agnostic-operator

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from 3d5b0e0 to a9e075a on February 4, 2021 21:54
@wking
Member Author

wking commented Feb 4, 2021

Success. I've dropped the debugging commit with 3d5b0e0 -> a9e075a.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold label Feb 4, 2021
@jottofar
Contributor

jottofar commented Feb 4, 2021

/test unit

@jottofar
Contributor

jottofar commented Feb 4, 2021

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm label Feb 4, 2021
@wking
Member Author

wking commented Feb 5, 2021

/test unit

@openshift-merge-robot openshift-merge-robot merged commit 9e25ac8 into openshift:master Feb 5, 2021
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1921413 has been moved to the MODIFIED state.


In response to this:

Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the avoid-early-cancel-cvo-deadlock branch February 9, 2021 05:51
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
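
For readers trying to follow the control flow, here is a heavily simplified sketch of the launchedMain decision described above (the names come from the commit message, but the function shape and parameters are illustrative, and the real code reads launchedMain as a plain bool rather than through a closure):

package sketch

import (
	"context"
	"time"
)

// collectMain decides when it is safe to cancel postMainContext (metrics,
// informers, leader election).  mainDone is closed when the main
// (Operator.Run) goroutine returns; launchedMain reports whether the
// leader-election callback ever started that goroutine.
func collectMain(runContext context.Context, postMainCancel context.CancelFunc,
	launchedMain func() bool, mainDone <-chan struct{}, shutdownTimer <-chan time.Time) {
	<-runContext.Done()
	if !launchedMain() {
		// We never acquired the leader lock, so there is no main goroutine
		// to reap; tear down the post-main machinery immediately.
		postMainCancel()
		return
	}
	select {
	case <-mainDone:
		postMainCancel() // old handling: the main goroutine was collected first
	case <-shutdownTimer:
		postMainCancel() // safety valve against a stuck main goroutine
	}
}
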
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 16, 2021