
Conversation

@smarterclayton (Contributor) commented Apr 17, 2019

The worst outcome of reconcile would be that a break occurs in a late component and the CVO gets hung before it can reach that component to correct the issue (e.g. 0000_30 is broken due to a change in 0000_70). Because we process in order, we never get to 70 to fix it. The other issue is that a medium to long-term break in 30 can prevent entropy repair at 70, so 70 degrades until 30 recovers.

Instead of running the reconcile graph in order, run it in random blocks at a lower parallelization factor to ensure that we will eventually try all blocks, even if something fundamental is broken. The permutation of the graph ordering is stable for a given payload and sync time (so that when we test locally we are testing the same order a customer would see).

The only gap still left would be a component break that can't be recovered due to ordering (to fix a role binding we need to deploy a deployment, but the lack of the deployment prevents us from sending the role binding), which we can potentially investigate later by randomizing the task order.

We track "number of attempts" in a given state / payload to use as an input to the permutation.
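For illustration, here is a minimal Go sketch of that idea (the names payloadSeed and permuteBlocks are hypothetical, not functions from this PR): a seed derived from the payload and the sync time drives a deterministic shuffle of the reconcile blocks, so the same payload/sync pair always produces the same order.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"time"
)

// payloadSeed returns a stable seed for a given payload version and sync time,
// so a local test and a customer cluster reconcile the blocks in the same order.
func payloadSeed(payload string, syncTime time.Time) int64 {
	h := fnv.New64a()
	h.Write([]byte(payload))
	h.Write([]byte(syncTime.Truncate(time.Hour).UTC().Format(time.RFC3339)))
	return int64(h.Sum64())
}

// permuteBlocks deterministically shuffles the block indices with the seed.
func permuteBlocks(blockCount int, seed int64) []int {
	order := make([]int, blockCount)
	for i := range order {
		order[i] = i
	}
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(order), func(i, j int) { order[i], order[j] = order[j], order[i] })
	return order
}

func main() {
	// "4.1.0-example-payload" is a placeholder payload name for illustration.
	seed := payloadSeed("4.1.0-example-payload", time.Now())
	// Every block still runs eventually, just not always in manifest-file order.
	fmt.Println(permuteBlocks(8, seed))
}
```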

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) labels Apr 17, 2019
@smarterclayton (Contributor, Author)

/assign @abhinavdahiya

I think with this we are in an "ok" state to ship with regard to entropy minimization in the CVO.

@smarterclayton (Contributor, Author)

/retest

sync: During reconcile, try different parts of the payload and don't hang

The worst outcome of reconcile would be that a break occurs in a late
component and the CVO gets hung trying to reach it. Instead, when reconciling
take the graph and run it in random blocks at a lower parallelization
factor to ensure that we will eventually try all blocks, even if something
fundamental is broken. The permutation of the graph ordering is stable
for a given payload and sync time (so that when we test locally we are
testing the same order a customer would see).

The only gap still left would be a component break that can't be recovered
due to ordering (to fix a role binding we need to deploy a deployment, but
the lack of the deployment prevents us from sending the role binding), which
we can potentially investigate later by randomizing the task order.

We track "number of attempts" in a given state / payload to use as an input to the permutation.
@smarterclayton (Contributor, Author)

@wking can you review?

@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label Apr 18, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor, Author)

/retest

not related

@openshift-merge-robot merged commit c7e6a1d into openshift:master Apr 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request May 19, 2021
Avoid situations like [1]:

  $ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
  I0517 11:30:37.342895       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
  I0517 11:36:19.258793       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
  I0517 11:39:38.268693       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
  I0517 11:45:20.182573       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

where we are failing the reconcile but not incrementing Attempt, even
though we rely on Attempt moving to drive changes in the ShiftOrder
and PermuteOrder calls in SyncWorker.apply.  This fixes a bug we've
had since the permutation logic landed in 1f95ccf (sync: During
reconcile, try different parts of the payload and don't hang,
2019-04-17, openshift#166).
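As a rough sketch of the failure mode (a toy rotation standing in for the real ShiftOrder/PermuteOrder, not the CVO's actual code): because the order is keyed entirely on the attempt counter, a failed reconcile that never bumps Attempt retries the manifests in exactly the same order.

```go
package main

import "fmt"

// reconcileOrder stands in for the attempt-driven shift/permutation:
// the resulting order depends only on the attempt number.
func reconcileOrder(blocks []string, attempt int) []string {
	n := len(blocks)
	out := make([]string, n)
	for i := range blocks {
		out[i] = blocks[(i+attempt)%n] // simple rotation as a placeholder
	}
	return out
}

func main() {
	blocks := []string{"0000_30", "0000_50", "0000_70"}
	for retry := 0; retry < 3; retry++ {
		attempt := 0 // the bug: failed reconciles leave Attempt stuck at 0
		fmt.Println(reconcileOrder(blocks, attempt))
	}
	// Every retry prints the same order; incrementing attempt per failed
	// cycle is what lets later blocks get a turn at the front.
}
```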

As a side benefit, this change will no longer call
InitCOUpdateStartTime in SyncWorker.syncOnce after a failed
reconciliation cycle, so if a ClusterOperator goes Degraded during
reconciliation, we will now complain about it after the
UpdateEffectFailAfterInterval threshold elapses.

I'm a bit suspicious about 'steps := 8' capping ShiftOrder, because we
have some manifest task-nodes that have a good deal more than that.
But blindly increasing it would reduce the amount of permuting
PermuteOrder was doing.  We might have to do something more drastic
for reconciliation mode, like collecting manifest failures and
eventually failing the sync cycle, but not aborting it early to report
that failure.  But one step at a time; I'm not going to overhaul all
of that in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10
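To make the 'steps := 8' concern above concrete (a toy model that assumes the shift reduces to attempt modulo steps, which is an assumption rather than a reading of the real code): with the cap at 8, only eight distinct starting offsets ever occur, no matter how many task-nodes the graph has.

```go
package main

import "fmt"

func main() {
	const steps = 8      // the cap under suspicion
	const taskNodes = 30 // assumed graph size, for illustration only

	seen := map[int]bool{}
	for attempt := 0; attempt < 1000; attempt++ {
		seen[attempt%steps] = true // toy model of the capped shift
	}
	fmt.Printf("distinct starting offsets after 1000 attempts: %d of %d task-nodes\n",
		len(seen), taskNodes)
	// Prints 8 of 30: most nodes never get to lead a reconcile pass.
}
```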
LalatenduMohanty pushed a commit to LalatenduMohanty/cluster-version-operator that referenced this pull request May 19, 2021