
Conversation

@wking wking commented May 19, 2021

Avoid situations like [1]:

$ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
I0517 11:30:37.342895       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:36:19.258793       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
I0517 11:39:38.268693       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:45:20.182573       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

where we are failing the reconcile but not incrementing Attempt, even though we rely on Attempt moving to drive changes in the ShiftOrder and PermuteOrder calls in SyncWorker.apply. This fixes a bug we've had since the permutation logic landed in 1f95ccf (sync: During reconcile, try different parts of the payload and don't hang, 2019-04-17, #166).
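
To illustrate why a stuck Attempt matters, here is a minimal, hypothetical sketch of attempt-driven reordering; the function names echo ShiftOrder and PermuteOrder, but the signatures and logic are illustrative, not the repository's actual code:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // shiftOrder rotates the task list by the attempt number, capped at
    // maxSteps positions. Hypothetical stand-in for the real ShiftOrder.
    func shiftOrder(tasks []string, attempt, maxSteps int) []string {
        shift := (attempt % maxSteps) % len(tasks)
        out := make([]string, 0, len(tasks))
        out = append(out, tasks[shift:]...)
        return append(out, tasks[:shift]...)
    }

    // permuteOrder shuffles the task list deterministically from the
    // attempt number, so each retry explores a different ordering.
    // Hypothetical stand-in for the real PermuteOrder.
    func permuteOrder(tasks []string, attempt int) []string {
        r := rand.New(rand.NewSource(int64(attempt)))
        out := append([]string(nil), tasks...)
        r.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
        return out
    }

    func main() {
        tasks := []string{"a", "b", "c", "d"}
        // With Attempt stuck at 0 (the bug), every cycle would reuse the
        // attempt-0 ordering; incrementing it varies the order each retry.
        for attempt := 0; attempt < 3; attempt++ {
            fmt.Println(attempt, permuteOrder(shiftOrder(tasks, attempt, 8), attempt))
        }
    }

If Attempt stays pinned at 0, both calls return the same order on every cycle, and the worker trips over the same failing manifests in the same places.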

As a side benefit, this change will no longer call InitCOUpdateStartTime in SyncWorker.syncOnce after a failed reconciliation cycle, so if a ClusterOperator goes Degraded during reconciliation, we will now complain about it after the UpdateEffectFailAfterInterval threshold elapses.
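
A rough sketch of the timing logic described here, with hypothetical names and a made-up threshold value (the real bookkeeping lives in InitCOUpdateStartTime and the UpdateEffectFailAfterInterval constant):

    package main

    import (
        "fmt"
        "time"
    )

    // failAfter stands in for the UpdateEffectFailAfterInterval threshold;
    // the value here is invented for illustration.
    const failAfter = 5 * time.Minute

    // shouldComplain reports whether an operator that went Degraded at
    // degradedSince has stayed degraded past the threshold. Re-initializing
    // degradedSince on every sync cycle, as the old code effectively did
    // after a failed reconcile, keeps this returning false forever.
    func shouldComplain(degradedSince, now time.Time) bool {
        return now.Sub(degradedSince) > failAfter
    }

    func main() {
        start := time.Now().Add(-10 * time.Minute)
        fmt.Println(shouldComplain(start, time.Now()))      // true: degraded past the threshold
        fmt.Println(shouldComplain(time.Now(), time.Now())) // false: the clock was just reset
    }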

I'm a bit suspicious about steps := 8 capping ShiftOrder, because we have some manifest task-nodes that have a good deal more than that. But blindly increasing it would reduce the amount of permuting PermuteOrder was doing. We might have to do something more drastic for reconciliation mode, like collecting manifest failures and eventually failing the sync cycle, but not aborting it early to report that failure. But one step at a time; I'm not going to overhaul all of that in this commit.
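
To make the steps := 8 concern concrete, reusing the hypothetical shiftOrder from the sketch above (swap this main in for the earlier one): with the cap, only the first eight entries of a longer list can ever rotate to the front.

    func main() {
        tasks := make([]string, 30) // more task-nodes than the cap of 8
        for i := range tasks {
            tasks[i] = fmt.Sprintf("node-%02d", i)
        }
        leaders := map[string]bool{}
        for attempt := 0; attempt < 100; attempt++ {
            leaders[shiftOrder(tasks, attempt, 8)[0]] = true
        }
        fmt.Println(len(leaders)) // prints 8: nodes 8-29 never lead the run order
    }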

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10

openshift-ci bot commented May 19, 2021

@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

Bug 1927168: pkg/cvo/sync_worker: Increment Attempt on failed reconciliation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/severity-low and bugzilla/valid-bug labels May 19, 2021
@openshift-ci openshift-ci bot added the approved label May 19, 2021

wking commented May 19, 2021

Cluster-bot cluster, via launch openshift/cluster-version-operator#569:

$ oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}' version
4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest
$ oc delete crd servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
$ oc -n openshift-cluster-version get pods
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-dbcdd99d4-9hv4f   1/1     Running   0          39m
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | head -n1
I0519 17:10:31.708389       1 start.go:21] ClusterVersionOperator v1.0.0-525-g8650e476-dirty
$ oc -n openshift-cluster-version logs cluster-version-operator-dbcdd99d4-9hv4f | grep 'Running sync.*in state\|Result of work'
I0519 17:14:12.784365       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 0
I0519 17:17:03.746322       1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator monitoring is not available Cluster operator console is not available]
I0519 17:17:26.598895       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 1
I0519 17:20:17.556997       1 task_graph.go:555] Result of work: [Cluster operator authentication is not available Cluster operator console is not available]
I0519 17:21:06.994799       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 2
I0519 17:23:57.952505       1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:25:26.187900       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 3
I0519 17:28:17.145563       1 task_graph.go:555] Result of work: [Cluster operator authentication is not available]
I0519 17:31:18.530222       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Initializing at attempt 4
I0519 17:31:39.119608       1 task_graph.go:555] Result of work: []
I0519 17:34:30.076980       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:34:57.449939       1 task_graph.go:555] Result of work: []
I0519 17:37:48.406792       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:38:15.812221       1 task_graph.go:555] Result of work: []
I0519 17:41:06.769355       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:41:34.154324       1 task_graph.go:555] Result of work: []
I0519 17:44:25.111269       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:44:52.484736       1 task_graph.go:555] Result of work: []
I0519 17:47:43.441953       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 17:53:25.355226       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-node-tuning-operator/node-tuning-operator" (365 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (645 of 694): the server does not recognize this resource, check extension API servers]
I0519 17:56:24.502884       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 1
I0519 18:02:06.416771       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (273 of 694): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-console-operator/console-operator" (631 of 694): the server does not recognize this resource, check extension API servers update context deadline exceeded at 71 of 694]
I0519 18:05:03.089286       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 2
I0519 18:10:45.002771       1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
I0519 18:13:48.313555       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 3
I0519 18:14:15.702618       1 task_graph.go:555] Result of work: []
I0519 18:17:06.660367       1 sync_worker.go:541] Running sync 4.8.0-0.ci.test-2021-05-19-163242-ci-ln-q0ht5wt-latest (force=false) on generation 1 in state Reconciling at attempt 0
I0519 18:17:34.047241       1 task_graph.go:555] Result of work: []

hooray :)

next = time.After(wait.Jitter(interval, 0.2))

work.Completed = 0
work.Attempt++
Member

👍

}

work.Completed++
work.Attempt = 0
Member

👍

// track how many times we have tried the current payload in the current
// state
if changed || w.work.State != work.State {
work.Attempt = 0
Member

It is not clear why we need to set Attempt = 0 here, because reaching this branch means equalSyncWork returned false.

Member Author

We want to reset the count, because we are about to begin our first attempt at the new target.
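
Taken together, the hunks in this review give the counter a simple lifecycle. A paraphrase with simplified names, not the actual SyncWorker definitions:

    package sketch

    // syncWork paraphrases the fields this PR touches.
    type syncWork struct {
        Completed int
        Attempt   int
    }

    // A new target or state starts counting from the first attempt.
    func (w *syncWork) reset() { w.Attempt = 0 }

    // A successful sync records progress and starts fresh.
    func (w *syncWork) succeed() { w.Completed++; w.Attempt = 0 }

    // A failed sync clears progress and counts the failure, so the next
    // cycle shifts and permutes the manifest order differently (the fix).
    func (w *syncWork) fail() { w.Completed = 0; w.Attempt++ }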

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 19, 2021

openshift-ci bot commented May 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.


wking commented May 19, 2021

Azure is having a tough day:

[image]

but that's orthogonal to this PR.

/override ci/prow/e2e-agnostic-upgrade


openshift-ci bot commented May 19, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

In response to this:

Azure is having a tough day:

[image]

but that's orthogonal to this PR.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 5f7a33d into openshift:master May 19, 2021

openshift-ci bot commented May 19, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1927168 has been moved to the MODIFIED state.

In response to this:

Bug 1927168: pkg/cvo/sync_worker: Increment Attempt on failed reconciliation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the reconciliation-attempt-increments branch May 19, 2021 23:09