Bug 1927168: pkg/cvo/sync_worker: Increment Attempt on failed reconciliation #569
Conversation
Avoid situations like [1]:

```console
$ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
I0517 11:30:37.342895 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:36:19.258793 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
I0517 11:39:38.268693 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:45:20.182573 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
```

where we are failing the reconcile, but not incrementing `Attempt`, even though we rely on `Attempt` moving to drive changes in the `ShiftOrder` and `PermuteOrder` calls in `SyncWorker.apply`. This fixes a bug we've had since the permutation logic landed in 1f95ccf (sync: During reconcile, try different parts of the payload and don't hang, 2019-04-17, openshift#166).

As a side benefit, this change will no longer call `InitCOUpdateStartTime` in `SyncWorker.syncOnce` after a failed reconciliation cycle, so if a ClusterOperator goes `Degraded` during reconciliation, we will now complain about it after the `UpdateEffectFailAfterInterval` threshold elapses.
I'm a bit suspicious about `steps := 8` capping `ShiftOrder`, because we have some manifest task-nodes that have a good deal more than that. But blindly increasing it would reduce the amount of permuting `PermuteOrder` was doing. We might have to do something more drastic for reconciliation mode, like collecting manifest failures and eventually failing the sync cycle, but not aborting it early to report that failure. But one step at a time; I'm not going to overhaul all of that in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10
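For illustration, here is a minimal sketch of why a stuck `Attempt` matters. The `shiftOrder` function below is a hypothetical, heavily simplified stand-in for the payload task graph's `ShiftOrder` (the real logic is more involved): it rotates the task order based on the attempt number, so a retry starts from a different part of the payload. With `Attempt` pinned at 0, every reconcile retries the identical order.

```go
package main

import "fmt"

// shiftOrder is a simplified, illustrative stand-in for ShiftOrder:
// it rotates the task-node order based on the attempt number, capped
// at 8 steps (the cap the PR description is suspicious of), so a
// retry begins from a different portion of the payload.
func shiftOrder(nodes []string, attempt int) []string {
	steps := 8
	if len(nodes) < steps {
		steps = len(nodes)
	}
	if steps == 0 {
		return nodes
	}
	shift := attempt % steps
	out := make([]string, 0, len(nodes))
	out = append(out, nodes[shift:]...)
	out = append(out, nodes[:shift]...)
	return out
}

func main() {
	nodes := []string{"a", "b", "c", "d"}
	// With Attempt stuck at 0 (the bug), every reconcile retries the
	// same order; once Attempt increments, the order rotates.
	fmt.Println(shiftOrder(nodes, 0)) // [a b c d]
	fmt.Println(shiftOrder(nodes, 1)) // [b c d a]
}
```

The point of the fix is simply to make `attempt` advance on failure so calls like this stop returning the same ordering on every retry.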
@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Cluster-bot cluster, via:

```console
$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest
$ oc delete crd servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
$ oc -n openshift-cluster-version get pods
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-dbcdd99d4-9hv4f   1/1     Running   0          39m
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | head -n1
I0519 17:10:31.708389 1 start.go:21] ClusterVersionOperator v1.0.0-525-g8650e476-dirty
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | grep 'Running sync.*in state\|Result of work'
I0519 17:14:12.784365 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 0
I0519 17:17:03.746322 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator monitoring is not available Cluster operator console is not available]
I0519 17:17:26.598895 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 1
I0519 17:20:17.556997 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator console is not available]
I0519 17:21:06.994799 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 2
I0519 17:23:57.952505 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:25:26.187900 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 3
I0519 17:28:17.145563 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:31:18.530222 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 4
I0519 17:31:39.119608 1 task_graph.go:555] Result of work: []
I0519 17:34:30.076980 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:34:57.449939 1 task_graph.go:555] Result of work: []
I0519 17:37:48.406792 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:38:15.812221 1 task_graph.go:555] Result of work: []
I0519 17:41:06.769355 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:41:34.154324 1 task_graph.go:555] Result of work: []
I0519 17:44:25.111269 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:44:52.484736 1 task_graph.go:555] Result of work: []
I0519 17:47:43.441953 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:53:25.355226 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-node-tuning-operator/node-tuning-operator" (365 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (645 of 694): the server does not recognize this resource, check extension API servers]
I0519 17:56:24.502884 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 1
I0519 18:02:06.416771 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (273 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-console-operator/console-operator" (631 of 694): the server does not recognize this resource, check extension API servers update context deadline exceeded at 71 of 694]
I0519 18:05:03.089286 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 2
I0519 18:10:45.002771 1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
I0519 18:13:48.313555 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 3
I0519 18:14:15.702618 1 task_graph.go:555] Result of work: []
I0519 18:17:06.660367 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 18:17:34.047241 1 task_graph.go:555] Result of work: []
```

hooray :)
```go
next = time.After(wait.Jitter(interval, 0.2))
```

```go
work.Completed = 0
work.Attempt++
```
👍
```go
}
```

```go
work.Completed++
work.Attempt = 0
```
👍
```go
// track how many times we have tried the current payload in the current
// state
if changed || w.work.State != work.State {
	work.Attempt = 0
```
It is not clear why we need to set `Attempt = 0` here, because reaching this branch means `equalSyncWork` returned false.
We want to reset the count, because we are about to begin our first attempt at the new target.
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LalatenduMohanty, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/retest

Please review the full test history for this PR and help us cut down flakes.
Azure is having a tough day, but that's orthogonal to this PR.

/override ci/prow/e2e-agnostic-upgrade
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1927168 has been moved to the MODIFIED state.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.