sync: During reconcile, try different parts of the payload and don't hang #166
Conversation
/assign @abhinavdahiya I think with this we are in an "ok" state for ship with regards to entropy minimization in the CVO.

/retest
Force-pushed from 85f8620 to 1f95ccf.
@wking can you review?

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/retest (not related)
Avoid situations like [1]:

    $ grep 'Running sync.*in state\|Result of work' cvo.log | tail -n4
    I0517 11:30:37.342895 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
    I0517 11:36:19.258793 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
    I0517 11:39:38.268693 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
    I0517 11:45:20.182573 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

where we are failing the reconcile but not incrementing Attempt, even though we rely on Attempt moving to drive changes in the ShiftOrder and PermuteOrder calls in SyncWorker.apply. This fixes a bug we've had since the permutation logic landed in 1f95ccf (sync: During reconcile, try different parts of the payload and don't hang, 2019-04-17, openshift#166).

As a side benefit, this change will no longer call InitCOUpdateStartTime in SyncWorker.syncOnce after a failed reconciliation cycle, so if a ClusterOperator goes Degraded during reconciliation, we will now complain about it after the UpdateEffectFailAfterInterval threshold elapses.

I'm a bit suspicious about 'steps := 8' capping ShiftOrder, because we have some manifest task-nodes with a good deal more than that. But blindly increasing it would reduce the amount of permuting PermuteOrder was doing. We might have to do something more drastic for reconciliation mode, like collecting manifest failures and eventually failing the sync cycle, but not aborting it early to report that failure. One step at a time, though; I'm not going to overhaul all of that in this commit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c10
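To make the failure mode concrete, here is a minimal, hypothetical sketch (not the CVO's actual code; `permute` is an illustrative stand-in for the ShiftOrder/PermuteOrder logic) of an attempt-seeded ordering. If a failed cycle never increments the attempt counter, the seed never changes, so every cycle retries the manifests in exactly the same order:

```go
package main

import (
	"fmt"
	"math/rand"
)

// permute is a hypothetical stand-in for ShiftOrder/PermuteOrder: it
// deterministically reorders tasks using the attempt number as the seed.
func permute(tasks []string, attempt int64) []string {
	r := rand.New(rand.NewSource(attempt))
	out := append([]string(nil), tasks...)
	r.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	tasks := []string{"0000_30", "0000_50", "0000_70", "0000_90"}
	// The bug: a failed reconcile that leaves attempt at 0 replays the
	// same order forever. Only when attempt moves does the order change.
	for _, attempt := range []int64{0, 0, 1} {
		fmt.Println(attempt, permute(tasks, attempt))
	}
}
```

The first two iterations print an identical order, which is exactly the repeated "at attempt 0" pattern visible in the log above.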
The worst outcome of reconcile would be that a break occurs in a late component and the CVO gets hung trying to reach it to correct the issue (e.g. 0000_30 is broken due to a change in 0000_70, but because we process in order we never get to 70 to fix it). The other issue is that a medium- to long-term break in 30 can prevent entropy repair at 70, so 70 degrades until 30 recovers.
Instead of running the reconcile graph in order, run it in random blocks at a lower parallelization factor to ensure that we will eventually try all blocks, even if something fundamental is broken. The permutation of the graph ordering is stable for a given payload and sync time (so that when we test locally we are testing the same order a customer would see).
The only gap still left would be a component break that can't be recovered due to ordering (to fix a role binding we need to deploy a deployment, but the lack of the deployment prevents us from sending the role binding), which we can potentially investigate later by randomizing the task order.
We track the "number of attempts" in a given state / payload to seed the input.
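A minimal sketch of the strategy described above, under assumed names (`stableSeed`, the block splitting, and the payload string are illustrative, not the CVO's actual API): derive a stable seed from the payload identity and attempt count, split the manifest tasks into blocks, shuffle the block order deterministically, and run each block at a reduced parallelization factor.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// stableSeed derives a deterministic seed from the payload identity and
// the attempt count, so the same inputs always produce the same order.
func stableSeed(payload string, attempt int) int64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d", payload, attempt)
	return int64(h.Sum64())
}

func main() {
	tasks := []string{"0000_10", "0000_30", "0000_50", "0000_70", "0000_80", "0000_90"}

	// Split the task list into fixed-size blocks.
	const blockSize = 2
	var blocks [][]string
	for i := 0; i < len(tasks); i += blockSize {
		end := i + blockSize
		if end > len(tasks) {
			end = len(tasks)
		}
		blocks = append(blocks, tasks[i:end])
	}

	// Shuffle the block order deterministically: the same payload and
	// attempt always yield the same permutation, so a local test run
	// exercises the same order a customer's cluster would.
	r := rand.New(rand.NewSource(stableSeed("4.1.0-payload", 0)))
	r.Shuffle(len(blocks), func(i, j int) { blocks[i], blocks[j] = blocks[j], blocks[i] })

	for _, block := range blocks {
		// Each block would run at a lower parallelization factor so a
		// broken block cannot monopolize the whole sync cycle.
		fmt.Println(block)
	}
}
```

Seeding from the payload plus the attempt count is what makes the attempt increment matter: each failed cycle tries a different block order, while the order stays reproducible for debugging.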