Bug 1921413: pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels #508
Conversation
@wking: This pull request references Bugzilla bug 1921413, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@wking: This pull request references Bugzilla bug 1921413, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
wking force-pushed from c27f12d to 70a99d5.
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1354905902246793216/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\)\|FAIL'
...
--- FAIL: TestIntegrationCVO_gracefulStepDown (11.07s)
I0128 22:12:10.009539 9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:10.009563 9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:10.009690 9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:10.009699 9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:10.010016 9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:10.010021 9056 cvo.go:371] Waiting on 2 outstanding goroutines.
I0128 22:12:13.380240 9056 cvo.go:392] Collected status notifier goroutine.
I0128 22:12:13.380281 9056 cvo.go:371] Waiting on 4 outstanding goroutines.
I0128 22:12:13.380332 9056 cvo.go:377] Collected sync worker goroutine.
I0128 22:12:13.380338 9056 cvo.go:371] Waiting on 3 outstanding goroutines.
I0128 22:12:13.380711 9056 cvo.go:377] Collected cluster version sync goroutine.
I0128 22:12:13.380720 9056 cvo.go:371] Waiting on 2 outstanding goroutines.
FAIL
...

So something's going on in those tests to bail us out before we actually reap all the children.
/hold

In case this was some sort of e2e-operator race (which we've been bitten by in the past, but not with this exact presentation):

/retest
wking force-pushed from 70a99d5 to ffe5f5e.
Comparing the above logs (which are similar to the second attempt), the two children not getting reaped are the available-updates and upgradeable goroutines:
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/508/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1357195059073650688/build-log.txt | grep 'cvo.go:.*\(Waiting\|Collected\|Shutting down ClusterVersionOperator\|goroutine\)\|FAIL'
...
I0204 05:50:45.897564 9560 cvo.go:342] Launched upgradeable goroutine
I0204 05:50:45.897575 9560 cvo.go:344] Processing work for upgradeable goroutine
I0204 05:50:45.897530 9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907356 9560 cvo.go:378] Waiting on 5 outstanding goroutines.
I0204 05:50:45.907371 9560 cvo.go:384] Collected sync worker goroutine.
I0204 05:50:45.907377 9560 cvo.go:378] Waiting on 4 outstanding goroutines.
I0204 05:50:45.907402 9560 cvo.go:384] Collected status notifier goroutine.
I0204 05:50:45.907425 9560 cvo.go:378] Waiting on 3 outstanding goroutines.
I0204 05:50:45.907430 9560 cvo.go:384] Collected cluster version sync goroutine.
I0204 05:50:45.907433 9560 cvo.go:378] Waiting on 2 outstanding goroutines.
I0204 05:50:46.087418 9560 cvo.go:399] Collected status notifier goroutine.
I0204 05:50:46.087451 9560 cvo.go:378] Waiting on 4 outstanding goroutines.
...

So stuck in processing, although without timestamps, it's hard to know for how long. I'm guessing we are getting stuck in:

func processNextWorkItem(ctx context.Context, queue workqueue.RateLimitingInterface, syncHandler func(context.Context, string) error, syncFailingStatus syncFailingStatusFunc) bool {
    key, quit := queue.Get()

Looking to see if there's a …
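For context, here's a minimal, hedged sketch (not the CVO code itself) of why a worker blocked in queue.Get() needs an explicit ShutDown(): with k8s.io/client-go/util/workqueue, Get() blocks until an item arrives or the queue is shut down.

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
    done := make(chan struct{})

    go func() {
        defer close(done)
        for {
            key, quit := queue.Get() // blocks until an item arrives or ShutDown() is called
            if quit {
                return // the only way this worker ever exits
            }
            fmt.Println("processing", key)
            queue.Done(key)
        }
    }()

    queue.Add("example")
    time.Sleep(100 * time.Millisecond)
    queue.ShutDown() // without this, the worker goroutine blocks in Get() forever
    <-done
}

That matches the symptom above: nothing ever shuts those two queues down, so their workers can never be collected.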
In [1], we had a CVO fail to gracefully release the leader lock, with:
$ grep start.go: namespaces/openshift-cluster-version/pods/cluster-version-operator-968d9fd48-25zkz/cluster-version-operator/cluster-version-operator/logs/current.log
2021-01-26T19:02:48.678276895Z I0126 19:02:48.678120 1 start.go:21] ClusterVersionOperator 4.6.0-202012050130.p0-39a4256
2021-01-26T19:02:49.273117163Z I0126 19:02:49.273074 1 start.go:260] Waiting on 2 outstanding goroutines.
2021-01-26T19:02:49.273512006Z E0126 19:02:49.273474 1 start.go:271] Collected metrics server goroutine: listen tcp 0.0.0.0:9099: bind: address already in use
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273551 1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273569 1 start.go:264] Run context completed; beginning two-minute graceful shutdown period.
2021-01-26T19:02:49.273587622Z I0126 19:02:49.273577 1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:04:49.273938652Z I0126 19:04:49.273877 1 start.go:260] Waiting on 1 outstanding goroutines.
2021-01-26T19:05:39.264746935Z I0126 19:05:39.264690 1 start.go:190] Shutting down due to terminated
So the bind failed, which should be rare, but can happen if the new
CVO is scheduled before the kernel gets around to reaping the port
from a previous CVO pod. The start.go handling notices the metrics
server dying, and immediately cancels runContext, and we go into the
two-minute shutdown grace period. That happened so early that the
context was already done by the time we got into Operator.Run, with
its:
go wait.UntilWithContext(runContext, func(runContext context.Context) {
defer close(workerStopCh)
From the godocs for UntilWithContext:
$ git --no-pager grep -n '// UntilWithContext is syntactic sugar\|f may not be invoked if context\|func JitterUntilWithContext' vendor/k8s.io/apimachinery/pkg/util/wait
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:95:// UntilWithContext is syntactic sugar on top of JitterUntilWithContext
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:183:// Cancel context to stop. f may not be invoked if context is already expired.
vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:184:func JitterUntilWithContext(ctx context.Context, f func(context.Context), period time.Duration, jitterFactor float64, sliding bool) {
So when runContext is already done, we never close workerStopCh, and
we hang forever when we block on it before returning from
Operator.Run.
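A minimal sketch of that hang, reusing the names from the snippet above (runContext, workerStopCh); it reproduces the shape of the bug rather than the real Operator.Run:

package main

import (
    "context"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func run(runContext context.Context) {
    workerStopCh := make(chan struct{})

    go wait.UntilWithContext(runContext, func(runContext context.Context) {
        defer close(workerStopCh) // never reached if f is never invoked
        // ... drive the sync worker until runContext is done ...
    }, time.Second)

    <-workerStopCh // hangs forever when runContext was already done, because f never ran
}

func main() {
    runContext, cancel := context.WithCancel(context.Background())
    cancel()        // the context is already expired before run is called
    run(runContext) // Go's runtime reports "all goroutines are asleep - deadlock!"
}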
This commit addresses that issue by bringing in a resultChannel flow
like we already use in pkg/start/start.go and pkg/cvo/metrics.go. It
will give us logging for the number of outstanding goroutines, and
tell us as each of them is reaped, which will make it easier to
diagnose any other issues with hanging children, while also using
shutdownContext to bail out so a stuck child cannot cause major
problems.
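As a rough illustration, a hedged sketch of that resultChannel pattern (the names and the two-minute shutdownContext mirror the description above; the child bodies are placeholders, not the real workers):

package main

import (
    "context"
    "fmt"
    "time"
)

type asyncResult struct {
    name string
}

func waitForChildren(runContext, shutdownContext context.Context) {
    resultChannel := make(chan asyncResult)
    resultChannelCount := 0

    for _, name := range []string{"sync worker", "status notifier", "cluster version sync"} {
        resultChannelCount++
        go func(name string) {
            <-runContext.Done() // real children do their work here until canceled
            resultChannel <- asyncResult{name: name}
        }(name)
    }

    for resultChannelCount > 0 {
        fmt.Printf("Waiting on %d outstanding goroutines.\n", resultChannelCount)
        select {
        case result := <-resultChannel:
            resultChannelCount--
            fmt.Printf("Collected %s goroutine.\n", result.name)
        case <-shutdownContext.Done():
            fmt.Printf("Abandoning %d uncollected goroutines.\n", resultChannelCount)
            return // safety valve: don't let a stuck child hang us forever
        }
    }
}

func main() {
    runContext, runCancel := context.WithCancel(context.Background())
    shutdownContext, shutdownCancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer shutdownCancel()
    runCancel() // simulate the run context being canceled
    waitForChildren(runContext, shutdownContext)
}

The log format here deliberately mirrors the "Waiting on N outstanding goroutines." and "Collected ... goroutine." lines in the test output above.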
This commit also adds ShutDown calls for availableUpdatesQueue and
upgradeableQueue, so we don't hang those two goroutines in
processNextWorkItem's queue.Get() call. The shutdowns should have
been added when the queues landed in 90e9881 (cvo: Change the core
CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45) and
0452814 (cvo: long running upgradeable sync worker, 2019-09-05), but
the missing calls went unnoticed until this commit tightened down on
the goroutine collection.
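A hedged sketch of the shape of that fix (the Operator fields and Run signature here are illustrative, not the exact CVO code):

package cvo

import (
    "context"

    "k8s.io/client-go/util/workqueue"
)

type Operator struct {
    availableUpdatesQueue workqueue.RateLimitingInterface
    upgradeableQueue      workqueue.RateLimitingInterface
}

func (optr *Operator) Run(ctx context.Context) {
    defer optr.availableUpdatesQueue.ShutDown() // lets the available-updates worker's queue.Get() return quit=true
    defer optr.upgradeableQueue.ShutDown()      // lets the upgradeable worker's queue.Get() return quit=true
    // ... launch the workers, then collect them via the resultChannel flow sketched above ...
    <-ctx.Done()
}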
There is a risk with these cluster-touching children: if we hit
the shutdownContext-timeout safety valve, we might exit Operator.Run
and release the leader lock, and then have our abandoned child
fighting over in-cluster state with some subsequent CVO which just
acquired the leader lock. We might want to write an event or
something to make it more obvious when that happens, but for now, I'm
more interested in closing the context-already-canceled bug.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1921413
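If we ever want that event, a hedged sketch using client-go's record package might look like the following; the recorder wiring, the GracefulShutdownTimeout reason, and the object reference are assumptions for illustration, not part of this PR.

package cvo

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes/scheme"
    typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
    "k8s.io/client-go/tools/record"
)

// warnAbandonedChildren emits a warning event so cluster admins can see that
// the shutdownContext safety valve fired and children were left running.
// In real code the broadcaster would be created once for the process lifetime
// rather than per call, so the event is not dropped during Shutdown.
func warnAbandonedChildren(events typedcorev1.EventsGetter, namespace string, ref *corev1.ObjectReference, outstanding int) {
    broadcaster := record.NewBroadcaster()
    defer broadcaster.Shutdown()
    broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: events.Events(namespace)})
    recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "cluster-version-operator"})
    recorder.Eventf(ref, corev1.EventTypeWarning, "GracefulShutdownTimeout",
        "abandoning %d uncollected goroutines; they may still touch cluster state after the leader lock is released", outstanding)
}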
wking force-pushed from ffe5f5e to 3d5b0e0.
I've pushed ffe5f5e -> 3d5b0e0, adding …
e2e-operator exploded on install, missing compute nodes, and possibly more. e2e passed install and died in its test cases, so the e2e-operator failure must have been an unrelated flake.

/test e2e-agnostic-operator
wking force-pushed from 3d5b0e0 to a9e075a.
/test unit

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/test unit
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1921413 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from early cancels, 2021-01-28, openshift#508) made us more robust to situations where we are canceled after acquiring the leader lock but before we got into Operator.Run's UntilWithContext. However, there was still a bug from cc1921d (pkg/start: Release leader lease on graceful shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock [1].

postMainContext is used for metrics, informers, and the leader election loop. We used to only call postMainCancel after reaping the main goroutine, and obviously that will only work if we've launched the main goroutine. This commit adds a new launchedMain to track that. If launchedMain is true, we get the old handling. If launchedMain is still false when runContext is done, we now call postMainCancel without waiting to reap a non-existent main goroutine. There's also a new postMainCancel when the shutdown timer expires. I don't expect us to ever need that, but it protects us from future bugs like this one.

I've added launchedMain without guarding it behind a lock, and it is touched by both the main Options.run goroutine and the leader-election callback. So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do anything significant, so the fact that we are already tearing down all the post-main stuff won't cause problems. Previous fixes like a9e075a will help with that. But there could still be bugs in Operator.Run. A lock around launchedMain that avoided calling Operator.Run when runContext was already done would protect against that, but seems like overkill in an already complicated goroutine tangle. Without the lock, we just have to field and fix any future Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
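For readers following along, a hedged sketch of that launchedMain flow (the Options fields, mainDone channel, and function names are stand-ins for the real pkg/start wiring):

package main

import (
    "context"
    "time"
)

type Options struct {
    launchedMain bool          // set by the leader-election callback; deliberately not behind a lock (see the race note above)
    mainDone     chan struct{} // closed when the main CVO.Run goroutine exits
}

func (o *Options) collectMain(runContext context.Context, postMainCancel context.CancelFunc) {
    <-runContext.Done()
    shutdownTimer := time.NewTimer(2 * time.Minute)
    defer shutdownTimer.Stop()
    if !o.launchedMain {
        postMainCancel() // never acquired the lock, so there is no main goroutine to reap
        return
    }
    select {
    case <-o.mainDone:
        postMainCancel() // old handling: cancel post-main work once main is reaped
    case <-shutdownTimer.C:
        postMainCancel() // new safety valve when the shutdown timer expires
    }
}

func main() {
    postMainContext, postMainCancel := context.WithCancel(context.Background())
    _ = postMainContext // metrics, informers, and the leader election loop would use this
    runContext, runCancel := context.WithCancel(context.Background())
    o := &Options{mainDone: make(chan struct{})}
    runCancel() // simulate cancellation before the leader lock was ever acquired
    o.collectMain(runContext, postMainCancel)
}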
Builds on #501; possibly review that first.
In rhbz#1921413, we had a CVO fail to gracefully release the leader lock, with the metrics server dying on "listen tcp 0.0.0.0:9099: bind: address already in use".

So the bind failed, which should be rare, but can happen if the new CVO is scheduled before the kernel gets around to reaping the port from a previous CVO pod. The start.go handling notices the metrics server dying, and immediately cancels runContext, and we go into the two-minute shutdown grace period. That happened so early that the context was already done by the time we got into Operator.Run, with its go wait.UntilWithContext(runContext, ...) call. From the godocs for UntilWithContext, f may not be invoked if the context is already expired.

So when runContext is already done, we never close workerStopCh, and we hang forever when we block on it before returning from Operator.Run.

This commit addresses that issue by bringing in a resultChannel flow like we already use in pkg/start/start.go and pkg/cvo/metrics.go. It will give us logging for the number of outstanding goroutines, and tell us as each of them is reaped, which will make it easier to diagnose any other issues with hanging children, while also using shutdownContext to bail out so a stuck child cannot cause major problems.

There is a risk with these cluster-touching children: if we hit the shutdownContext-timeout safety valve, we might exit Operator.Run and release the leader lock, and then have our abandoned child fighting over in-cluster state with some subsequent CVO which just acquired the leader lock. We might want to write an event or something to make it more obvious when that happens, but for now, I'm more interested in closing the context-already-canceled bug.