
Conversation

@wking
Member

@wking wking commented Jan 28, 2021

Builds on #501; possibly review that first.

In rhbz#1921413, we had a CVO fail to gracefully release the leader lock, with:

$ grep start.go: namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
2021-01-26T19:02:49.273117163Z I0126 19:02:49.273074       1 start.go:260] Waiting on 2 outstanding goroutines.
2021-01-26T19:02:49.273512006Z E0126 19:02:49.273474       1 start.go:271] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273551       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273569       1 start.go:264] Run context completed; beginning two-minute graceful shutdown period.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273577       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:04:49.273938652Z I0126 19:04:49.273877       1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated

So the bind failed, which should be rare, but can happen if the new CVO is scheduled before the kernel gets around to reaping the port from a previous CVO pod. The start.go handling notices the metrics server dying, and immediately cancels runContext, and we go into the two-minute shutdown grace period. That happened so early that the context was already done by the time we got into Operator.Run, with its:

go wait.UntilWithContext(runContext, func(runContext context.Context) {
 defer close(workerStopCh)

From the godocs for UntilWithContext:

$ git --no-pager grep -n '// UntilWithContext is syntactic sugar\|f may not be invoked if context\|func JitterUntilWithContext' vendor/k8s.io/apimachinery/pkg/util/wait
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:95:// UntilWithContext is syntactic sugar on top of JitterUntilWithContext
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:183:// Cancel context to stop. f may not be invoked if context is already expired.
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:184:func JitterUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration, jitterFactor float64, sliding bool) {

So when runContext is already done, we never close workerStopCh, and we hang forever when we block on it before returning from Operator.Run.
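
For anyone who wants to see the failure mode in isolation, here is a minimal standalone reproduction (illustrative only, not CVO code) of how a pre-canceled context keeps UntilWithContext from ever invoking f, so the deferred close never happens:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate runContext already being done before we reach Operator.Run

	workerStopCh := make(chan struct{})
	go wait.UntilWithContext(ctx, func(ctx context.Context) {
		defer close(workerStopCh) // never runs: f is skipped when ctx is already expired
		fmt.Println("worker running")
	}, time.Second)

	select {
	case <-workerStopCh:
		fmt.Println("worker stopped")
	case <-time.After(3 * time.Second):
		fmt.Println("hung: workerStopCh was never closed") // the deadlock we hit in the CVO
	}
}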

This commit addresses that issue by bringing in a resultChannel flow like the one we already use in pkg/start/start.go and pkg/cvo/metrics.go. It gives us logging for the number of outstanding goroutines and tells us as each of them is reaped, which will make it easier to diagnose any other issues with hanging children, while also using shutdownContext to bail out so a stuck child cannot cause major problems.
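
To make the shape of that flow concrete, here is a rough, self-contained sketch of the resultChannel pattern (the function and child names are illustrative, not the actual CVO wiring):

package main

import (
	"context"
	"time"

	"k8s.io/klog/v2"
)

type asyncResult struct {
	name  string
	error error
}

// runChildren launches each child goroutine, then collects their results,
// logging progress so hangs are easy to spot.  shutdownCtx bounds how long
// we are willing to wait for stuck children.
func runChildren(runCtx, shutdownCtx context.Context, children map[string]func(context.Context) error) error {
	resultChannel := make(chan asyncResult, len(children))
	resultChannelCount := 0
	for name, child := range children {
		resultChannelCount++
		go func(name string, child func(context.Context) error) {
			resultChannel <- asyncResult{name: name, error: child(runCtx)}
		}(name, child)
	}

	var firstError error
	for resultChannelCount > 0 {
		klog.Infof("Waiting on %d outstanding goroutines.", resultChannelCount)
		select {
		case result := <-resultChannel:
			resultChannelCount--
			if result.error == nil {
				klog.Infof("Collected %s goroutine.", result.name)
				continue
			}
			klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
			if firstError == nil {
				firstError = result.error
			}
		case <-shutdownCtx.Done():
			// Safety valve: abandon stuck children rather than hanging forever.
			klog.Errorf("Abandoning %d uncollected goroutines: %v", resultChannelCount, shutdownCtx.Err())
			return shutdownCtx.Err()
		}
	}
	return firstError
}

func main() {
	runCtx, runCancel := context.WithCancel(context.Background())
	shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer shutdownCancel()

	go func() {
		time.Sleep(time.Second)
		runCancel() // ask the children to exit
	}()

	err := runChildren(runCtx, shutdownCtx, map[string]func(context.Context) error{
		"cluster version sync": func(ctx context.Context) error { <-ctx.Done(); return nil },
		"status notifier":      func(ctx context.Context) error { <-ctx.Done(); return nil },
	})
	klog.Infof("All children collected, first error: %v", err)
}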

There is a risk with these cluster-touching children: if we hit the shutdownContext-timeout safety valve, we might exit Operator.Run and release the leader lock, and then have our abandoned child fighting over in-cluster state with a subsequent CVO that has just acquired the leader lock. We might want to write an event or something to make it more obvious when that happens, but for now, I'm more interested in closing the context-already-canceled bug.

@openshift-ci-robot openshift-ci-robot added the approved label Jan 28, 2021
@wking wking changed the title from "pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels" to "Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels" Jan 28, 2021
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-unspecified label Jan 28, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1921413, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug label Jan 28, 2021
@wking
Member Author

wking commented Jan 28, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-medium and bugzilla/valid-bug labels and removed the bugzilla/severity-unspecified and bugzilla/invalid-bug labels Jan 28, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1921413, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from c27f12d to 70a99d5 on January 28, 2021 21:34
@wking
Member Author

wking commented Jan 29, 2021

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1354905902246793216/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\)\|FAIL'
...
--- FAIL: TestIntegrationCVO_gracefulStepDown (11.07s)
I0128 22:12:10.009539    9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:10.009563    9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:10.009690    9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:10.009699    9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:10.010016    9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:10.010021    9056 cvo.go:371] Waiting on 2 outstanding goroutines.
I0128 22:12:13.380240    9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:13.380281    9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:13.380332    9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:13.380338    9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:13.380711    9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:13.380720    9056 cvo.go:371] Waiting on 2 outstanding goroutines.
FAIL
...

So something's going on in those tests to bail us out before we actually reap all the children.

@wking
Member Author

wking commented Jan 29, 2021

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold label Jan 29, 2021
@wking
Member Author

wking commented Feb 4, 2021

In case this was some sort of e2e-operator race (which has bitten us in the past, but not with this exact presentation):

/retest

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from 70a99d5 to ffe5f5e on February 4, 2021 05:11
@wking
Member Author

wking commented Feb 4, 2021

Comparing the above logs (the second attempt looks similar), the two children not getting reaped are the available-updates and upgradeable goroutines. I've pushed 70a99d5 -> ffe5f5e, which rebases on master now that #501 has landed, and adds some debug logging for the upgradeable goroutine so we can get a better feel for where things are getting stuck.

@wking
Member Author

wking commented Feb 4, 2021

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1357195059073650688/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\|goroutine\)\|FAIL'
...
I0204 05:50:45.897564    9560 cvo.go:342] Launched upgradeable goroutine
I0204 05:50:45.897575    9560 cvo.go:344] Processing work for upgradeable goroutine
I0204 05:50:45.897530    9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907356    9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907371    9560 cvo.go:384] Collected sync worker goroutine.
I0204 05:50:45.907377    9560 cvo.go:378] Waiting on 4 outstanding goroutines.
I0204 05:50:45.907402    9560 cvo.go:384] Collected status notifier goroutine.
I0204 05:50:45.907425    9560 cvo.go:378] Waiting on 3 outstanding goroutines.
I0204 05:50:45.907430    9560 cvo.go:384] Collected cluster version sync goroutine.
I0204 05:50:45.907433    9560 cvo.go:378] Waiting on 2 outstanding goroutines.
I0204 05:50:46.087418    9560 cvo.go:399] Collected status notifier goroutine.
I0204 05:50:46.087451    9560 cvo.go:378] Waiting on 4 outstanding goroutines.
...

So we're stuck in processing, although without timestamps it's hard to know for how long. I'm guessing we are getting stuck in:

func processNextWorkItem(ctx context.Context, queue workqueue.RateLimitingInterface, syncHandler func(context.Context, string) error, syncFailingStatus syncFailingStatusFunc) bool {
        key, quit := queue.Get()

Looking to see if there's a Context-aware Get...

In [1], we had a CVO fail to gracefully release the leader lock, with:

  $ grep start.go: namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
  2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120       1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
  2021-01-26T19:02:49.273117163Z I0126 19:02:49.273074       1 start.go:260] Waiting on 2 outstanding goroutines.
  2021-01-26T19:02:49.273512006Z E0126 19:02:49.273474       1 start.go:271] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273551       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273569       1 start.go:264] Run context completed; beginning two-minute graceful shutdown period.
  2021-01-26T19:02:49.273587622Z I0126 19:02:49.273577       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:04:49.273938652Z I0126 19:04:49.273877       1 start.go:260] Waiting on 1 outstanding goroutines.
  2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690       1 start.go:190] Shutting down due to terminated

So the bind failed, which should be rare, but can happen if the new
CVO is scheduled before the kernel gets around to reaping the port
from a previous CVO pod.  The start.go handling notices the metrics
server dying, and immediately cancels runContext, and we go into the
two-minute shutdown grace period.  That happened so early that the
context was already done by the time we got into Operator.Run, with
its:

  go wait.UntilWithContext(runContext, func(runContext context.Context) {
    defer close(workerStopCh)

From the godocs for UntilWithContext:

  $ git --no-pager grep -n '// UntilWithContext is syntactic sugar\|f may not be invoked if context\|func JitterUntilWithContext' vendor/k8s.io/apimachinery/pkg/util/wait
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:95:// UntilWithContext is syntactic sugar on top of JitterUntilWithContext
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:183:// Cancel context to stop. f may not be invoked if context is already expired.
  vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:184:func JitterUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration, jitterFactor float64, sliding bool) {

So when runContext is already done, we never close workerStopCh, and
we hang forever when we block on it before returning from
Operator.Run.

This commit addresses that issue by bringing in a resultChannel flow
like we already use in pkg/start/start.go and pkg/cvo/metrics.go.  It
will give us logging for the number of outstanding goroutines, and
tell us as each of them are reaped, which will make it easier to
diagnose any other issues with hanging children, while also using
shutdownContext to bail out to avoid a stuck child causing major
problems.

This commit also adds ShutDown calls for availableUpdatesQueue and
upgradeableQueue, so we don't hang those two goroutines in
processNextWorkItem's queue.Get() call.  The shutdowns should have
been added when the queues landed in 90e9881 (cvo: Change the core
CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45) and
0452814 (cvo: long running upgradeable sync worker, 2019-09-05), but
the omission went unnoticed until this commit tightened down on the
goroutine collection.

There is a risk with these cluster-touching children, that if we hit
the shutdownContext-timeout safety valve, we might exit Operator.Run
and release the leader lock, and then have our abandoned child
fighting over in-cluster state with some subsequent CVO which just
acquired the leader lock.  We might want to write an event or
something to make it more obvious when that happens, but for now, I'm
more interested in closing the context-already-canceled bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1921413
@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from ffe5f5e to 3d5b0e0 on February 4, 2021 19:02
@wking
Member Author

wking commented Feb 4, 2021

I've pushed ffe5f5e -> 3d5b0e0, adding ShutDown calls for availableUpdatesQueue and upgradeableQueue, which I expect was why I wasn't reaping those goroutines in the e2e-operator tests. More details in one paragraph of the a9e075a commit message for the curious. If we see these goroutines successfully released in the current e2e-operator run, I'll drop the debugging commit.
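
For reference, the property the fix relies on is that workqueue's blocking Get() only returns quit=true once ShutDown() has been called; the sketch below (illustrative only, not the actual CVO change) shows how wiring ShutDown to context cancellation unblocks a worker stuck in Get():

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "upgradeable")
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	// Worker: drains the queue, like processNextWorkItem's queue.Get() loop.
	go func() {
		defer close(done)
		for {
			key, quit := queue.Get() // blocks; there is no context-aware Get
			if quit {
				return // only reachable once queue.ShutDown() has been called
			}
			fmt.Println("processing", key)
			queue.Done(key)
		}
	}()

	// Without this, cancelling ctx would leave the worker blocked in Get() forever.
	go func() {
		<-ctx.Done()
		queue.ShutDown()
	}()

	cancel() // simulate runContext being cancelled
	<-done
	fmt.Println("worker goroutine collected")
}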

@wking
Member Author

wking commented Feb 4, 2021

e2e-operator exploded on install (missing compute nodes, and possibly more). e2e passed install and died in e2e test-cases, so the e2e-operator failure must have been an unrelated flake.

/test e2e-agnostic-operator

@wking wking force-pushed the avoid-early-cancel-cvo-deadlock branch from 3d5b0e0 to a9e075a on February 4, 2021 21:54
@wking
Member Author

wking commented Feb 4, 2021

Success. I've dropped the debugging commit with 3d5b0e0 -> a9e075a.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold label Feb 4, 2021
@jottofar
Contributor

jottofar commented Feb 4, 2021

/test unit

@jottofar
Contributor

jottofar commented Feb 4, 2021

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm label Feb 4, 2021
@wking
Member Author

wking commented Feb 5, 2021

/test unit

@openshift-merge-robot openshift-merge-robot merged commit 9e25ac8 into openshift:master Feb 5, 2021
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1921413 has been moved to the MODIFIED state.


In response to this:

Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the avoid-early-cancel-cvo-deadlock branch February 9, 2021 05:51
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
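
For readers trying to follow the control flow, here is a heavily simplified sketch of the launchedMain decision described above (the names come from the commit message, but the function shape and parameters are illustrative, and the real code reads launchedMain as a plain bool rather than through a closure):

package sketch

import (
	"context"
	"time"
)

// collectMain decides when it is safe to cancel postMainContext (metrics,
// informers, leader election).  mainDone is closed when the main
// (Operator.Run) goroutine returns; launchedMain reports whether the
// leader-election callback ever started that goroutine.
func collectMain(runContext context.Context, postMainCancel context.CancelFunc,
	launchedMain func() bool, mainDone <-chan struct{}, shutdownTimer <-chan time.Time) {
	<-runContext.Done()
	if !launchedMain() {
		// We never acquired the leader lock, so there is no main goroutine
		// to reap; tear down the post-main machinery immediately.
		postMainCancel()
		return
	}
	select {
	case <-mainDone:
		postMainCancel() // old handling: the main goroutine was collected first
	case <-shutdownTimer:
		postMainCancel() // safety valve against a stuck main goroutine
	}
}
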
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 16, 2021