
Conversation

@smarterclayton (Contributor) commented Apr 17, 2019

The worst outcome of reconcile would be that a break occurs in a late component and the CVO gets hung before it can reach that component to correct the issue (e.g. 0000_30 is broken due to a change in 0000_70). Because we process in order, we never get to 70 to fix it. The other issue is that a medium to long-term break in 30 can prevent entropy repair at 70, so 70 degrades until 30 recovers.

Instead of running the reconcile graph in order, run it in random blocks at a lower parallelization factor to ensure that we will eventually try all blocks, even if something fundamental is broken. The permutation of the graph ordering is stable for a given payload and sync time (so that when we test locally we are testing the same order a customer would see).

The only gap still left would be a component break that can't be recovered due to ordering (to fix a role binding we need to deploy a deployment, but the lack of the deployment prevents us from sending the role binding), which we can potentially investigate later by randomizing the task order.

We track "number of attempts" in a given state / payload to use as an input to the permutation.
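For illustration, here is a minimal Go sketch of that idea (the names payloadSeed and permuteBlocks are hypothetical, not functions from this PR): a seed derived from the payload and the sync time drives a deterministic shuffle of the reconcile blocks, so the same payload/sync pair always produces the same order.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"time"
)

// payloadSeed returns a stable seed for a given payload version and sync time,
// so a local test and a customer cluster reconcile the blocks in the same order.
func payloadSeed(payload string, syncTime time.Time) int64 {
	h := fnv.New64a()
	h.Write([]byte(payload))
	h.Write([]byte(syncTime.Truncate(time.Hour).UTC().Format(time.RFC3339)))
	return int64(h.Sum64())
}

// permuteBlocks deterministically shuffles the block indices with the seed.
func permuteBlocks(blockCount int, seed int64) []int {
	order := make([]int, blockCount)
	for i := range order {
		order[i] = i
	}
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(order), func(i, j int) { order[i], order[j] = order[j], order[i] })
	return order
}

func main() {
	// "4.1.0-example-payload" is a placeholder payload name for illustration.
	seed := payloadSeed("4.1.0-example-payload", time.Now())
	// Every block still runs eventually, just not always in manifest-file order.
	fmt.Println(permuteBlocks(8, seed))
}
```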

@openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) labels Apr 17, 2019
@smarterclayton (Contributor, Author)

/assign @abhinavdahiya

I think with this we are in an "ok" state to ship with regard to entropy minimization in the CVO.

@smarterclayton (Contributor, Author)

/retest

sync: During reconcile, try different parts of the payload and don't hang

The worst outcome of reconcile would be that a break occurs in a late
component and the CVO gets hung trying to reach it. Instead, when reconciling
take the graph and run it in random blocks at a lower parallelization
factor to ensure that we will eventually try all blocks, even if something
fundamental is broken. The permutation of the graph ordering is stable
for a given payload and sync time (so that when we test locally we are
testing the same order a customer would see).

The only gap still left would be a component break that can't be recovered
due to ordering (to fix a role binding we need to deploy a deployment, but
the lack of the deployment prevents us from sending the role binding), which
we can potentially investigate later by randomizing the task order.

We track "number of attempts" in a given state / payload to use as an input to the permutation.
@smarterclayton (Contributor, Author)

@wking can you review?

@abhinavdahiya (Contributor)

/lgtm

@openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label Apr 18, 2019
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor, Author)

/retest

not related

@openshift-merge-robot merged commit c7e6a1d into openshift:master Apr 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request May 19, 2021
Avoid situations like [1]:

  $ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
  I0517 11:30:37.342895       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
  I0517 11:36:19.258793       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
  I0517 11:39:38.268693       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
  I0517 11:45:20.182573       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

where we are failing the reconcile but not incrementing Attempt, even
though we rely on Attempt moving to drive changes in the ShiftOrder
and PermuteOrder calls in SyncWorker.apply.  This fixes a bug we've
had since the permutation logic landed in 1f95ccf (sync: During
reconcile, try different parts of the payload and don't hang,
2019-04-17, openshift#166).
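As a rough sketch of the failure mode (a toy rotation standing in for the real ShiftOrder/PermuteOrder, not the CVO's actual code): because the order is keyed entirely on the attempt counter, a failed reconcile that never bumps Attempt retries the manifests in exactly the same order.

```go
package main

import "fmt"

// reconcileOrder stands in for the attempt-driven shift/permutation:
// the resulting order depends only on the attempt number.
func reconcileOrder(blocks []string, attempt int) []string {
	n := len(blocks)
	out := make([]string, n)
	for i := range blocks {
		out[i] = blocks[(i+attempt)%n] // simple rotation as a placeholder
	}
	return out
}

func main() {
	blocks := []string{"0000_30", "0000_50", "0000_70"}
	for retry := 0; retry < 3; retry++ {
		attempt := 0 // the bug: failed reconciles leave Attempt stuck at 0
		fmt.Println(reconcileOrder(blocks, attempt))
	}
	// Every retry prints the same order; incrementing attempt per failed
	// cycle is what lets later blocks get a turn at the front.
}
```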

As a side benefit, this change will no longer call
InitCOUpdateStartTime in SyncWorker.syncOnce after a failed
reconciliation cycle, so if a ClusterOperator goes Degraded during
reconciliation, we will now complain about it after the
UpdateEffectFailAfterInterval threshold elapses.

I'm a bit suspicious about 'steps := 8' capping ShiftOrder, because we
have some manifest task-nodes that have a good deal more than that.
But blindly increasing it would reduce the amount of permuting
PermuteOrder was doing.  We might have to do something more drastic
for reconciliation mode, like collecting manifest failures and
eventually failing the sync cycle, but not aborting it early to report
that failure.  But one step at a time; I'm not going to overhaul all
of that in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10
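To make the 'steps := 8' concern above concrete (a toy model that assumes the shift reduces to attempt modulo steps, which is an assumption rather than a reading of the real code): with the cap at 8, only eight distinct starting offsets ever occur, no matter how many task-nodes the graph has.

```go
package main

import "fmt"

func main() {
	const steps = 8      // the cap under suspicion
	const taskNodes = 30 // assumed graph size, for illustration only

	seen := map[int]bool{}
	for attempt := 0; attempt < 1000; attempt++ {
		seen[attempt%steps] = true // toy model of the capped shift
	}
	fmt.Printf("distinct starting offsets after 1000 attempts: %d of %d task-nodes\n",
		len(seen), taskNodes)
	// Prints 8 of 30: most nodes never get to lead a reconcile pass.
}
```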
LalatenduMohanty pushed a commit to LalatenduMohanty/cluster-version-operator that referenced this pull request May 19, 2021