Bug 1927168: pkg/cvo/sync_worker: Increment Attempt on failed reconciliation #569
Conversation
Avoid situations like [1]:

```console
$ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
I0517 11:30:37.342895 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:36:19.258793 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
I0517 11:39:38.268693 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:45:20.182573 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
```

where we are failing the reconcile, but not incrementing `Attempt`, even though we rely on `Attempt` moving to drive changes in the `ShiftOrder` and `PermuteOrder` calls in `SyncWorker.apply`. This fixes a bug we've had since the permutation logic landed in 1f95ccf (sync: During reconcile, try different parts of the payload and don't hang, 2019-04-17, openshift#166).

As a side benefit, this change will no longer call `InitCOUpdateStartTime` in `SyncWorker.syncOnce` after a failed reconciliation cycle, so if a ClusterOperator goes `Degraded` during reconciliation, we will now complain about it after the `UpdateEffectFailAfterInterval` threshold elapses.
I'm a bit suspicious about `steps := 8` capping `ShiftOrder`, because we have some manifest task-nodes that have a good deal more than that. But blindly increasing it would reduce the amount of permuting `PermuteOrder` was doing. We might have to do something more drastic for reconciliation mode, like collecting manifest failures and eventually failing the sync cycle, but not aborting it early to report that failure. But one step at a time; I'm not going to overhaul all of that in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10
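For illustration, here is a minimal sketch of why a stuck `Attempt` matters. The `shiftOrder` function below is a hypothetical, heavily simplified stand-in for the payload task graph's `ShiftOrder` (the real logic is more involved): it rotates the task order based on the attempt number, so a retry starts from a different part of the payload. With `Attempt` pinned at 0, every reconcile retries the identical order.

```go
package main

import "fmt"

// shiftOrder is a simplified, illustrative stand-in for ShiftOrder:
// it rotates the task-node order based on the attempt number, capped
// at 8 steps (the cap the PR description is suspicious of), so a
// retry begins from a different portion of the payload.
func shiftOrder(nodes []string, attempt int) []string {
	steps := 8
	if len(nodes) < steps {
		steps = len(nodes)
	}
	if steps == 0 {
		return nodes
	}
	shift := attempt % steps
	out := make([]string, 0, len(nodes))
	out = append(out, nodes[shift:]...)
	out = append(out, nodes[:shift]...)
	return out
}

func main() {
	nodes := []string{"a", "b", "c", "d"}
	// With Attempt stuck at 0 (the bug), every reconcile retries the
	// same order; once Attempt increments, the order rotates.
	fmt.Println(shiftOrder(nodes, 0)) // [a b c d]
	fmt.Println(shiftOrder(nodes, 1)) // [b c d a]
}
```

The point of the fix is simply to make `attempt` advance on failure so calls like this stop returning the same ordering on every retry.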
@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Cluster-bot cluster, via:

```console
$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest
$ oc delete crd servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
$ oc -n openshift-cluster-version get pods
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-dbcdd99d4-9hv4f   1/1     Running   0          39m
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | head -n1
I0519 17:10:31.708389 1 start.go:21] ClusterVersionOperator v1.0.0-525-g8650e476-dirty
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | grep 'Running sync.*in state\|Result of work'
I0519 17:14:12.784365 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 0
I0519 17:17:03.746322 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator monitoring is not available Cluster operator console is not available]
I0519 17:17:26.598895 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 1
I0519 17:20:17.556997 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator console is not available]
I0519 17:21:06.994799 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 2
I0519 17:23:57.952505 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:25:26.187900 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 3
I0519 17:28:17.145563 1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:31:18.530222 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 4
I0519 17:31:39.119608 1 task_graph.go:555] Result of work: []
I0519 17:34:30.076980 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:34:57.449939 1 task_graph.go:555] Result of work: []
I0519 17:37:48.406792 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:38:15.812221 1 task_graph.go:555] Result of work: []
I0519 17:41:06.769355 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:41:34.154324 1 task_graph.go:555] Result of work: []
I0519 17:44:25.111269 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:44:52.484736 1 task_graph.go:555] Result of work: []
I0519 17:47:43.441953 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:53:25.355226 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-node-tuning-operator/node-tuning-operator" (365 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (645 of 694): the server does not recognize this resource, check extension API servers]
I0519 17:56:24.502884 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 1
I0519 18:02:06.416771 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (273 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-console-operator/console-operator" (631 of 694): the server does not recognize this resource, check extension API servers update context deadline exceeded at 71 of 694]
I0519 18:05:03.089286 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 2
I0519 18:10:45.002771 1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
I0519 18:13:48.313555 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 3
I0519 18:14:15.702618 1 task_graph.go:555] Result of work: []
I0519 18:17:06.660367 1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 18:17:34.047241 1 task_graph.go:555] Result of work: []
```

hooray :)
```go
next = time.After(wait.Jitter(interval, 0.2))
```

```go
work.Completed = 0
work.Attempt++
```
👍
```go
}
```

```go
work.Completed++
work.Attempt = 0
```
👍
```go
// track how many times we have tried the current payload in the current
// state
if changed || w.work.State != work.State {
	work.Attempt = 0
```
It is not clear why we need to set `Attempt = 0` here, because reaching this branch means `equalSyncWork` returned false.
We want to reset the count, because we are about to begin our first attempt at the new target.
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LalatenduMohanty, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/retest

Please review the full test history for this PR and help us cut down flakes.
Azure is having a tough day, but that's orthogonal to this PR.

/override ci/prow/e2e-agnostic-upgrade
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1927168 has been moved to the MODIFIED state.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.