-
Notifications
You must be signed in to change notification settings - Fork 212
lib/resourcebuilder/apps: Only error on Deployment Available=False *and* Progressing=False #430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lib/resourcebuilder/apps: Only error on Deployment Available=False *and* Progressing=False #430
Conversation
…nd* Progressing=False Available=True, Progressing=False is the happy, steady state. Available=True, Progressing=True is a happy update. Available=False, Progressing=True is acceptable outage, e.g. during an update with the Recreate strategy [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664/artifacts/e2e-gcp-upgrade/container-logs/test.log | grep MinimumReplicasUnavailable | head -n1 Aug 6 17:56:00.674: INFO: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:1, UpdatedReplicas:1, ReadyReplicas:0, AvailableReplicas:0, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"dp-7f9df745ff\" is progressing."}}, CollisionCount:(*int32)(nil)} Available=False, Progressing=False is the Deployment controller saying "I cannot deliver my expected service level for this Deployment", so that's when we should be complaining. Fixes noise like: Aug 6 18:03:00.500: INFO: cluster upgrade is Failing: Multiple errors are preventing progress: * Could not update namespace "openshift-service-ca-operator" (467 of 608) * deployment openshift-cluster-machine-approver/machine-approver is not available MinimumReplicasUnavailable: Deployment does not have minimum availability. * deployment openshift-ingress-operator/ingress-operator is not available MinimumReplicasUnavailable: Deployment does not have minimum availability. (the namespace part of that message is a separate issue). [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664
|
I think this looks right, I'm going to trigger an upgrade via clutser bot between 4.5 and this. |
|
Logs in that test run look correct |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: smarterclayton, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
In this space is the upstream kubernetes/kubernetes#93933 fix. In the meantime, the CVO will continue to assume that Deployments which fall into that hole are happy enough, even if they should have been set I think this PR is orthogonal enough that it should not block on any of the above paragraph getting addressed. |
|
Agree, we should carry the fix to the workload controller for error detection, not CVO |
See in a customer cluster [1]:
$ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
[
{
"lastTransitionTime": "2021-04-12T22:04:41Z",
"lastUpdateTime": "2021-04-12T22:04:41Z",
"message": "Deployment has minimum availability.",
"reason": "MinimumReplicasAvailable",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2021-04-19T17:33:30Z",
"lastUpdateTime": "2021-04-19T17:33:30Z",
"message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
"reason": "ProgressDeadlineExceeded",
"status": "False",
"type": "Progressing"
}
]
Previous touch for this logic was 0442094
(lib/resourcebuilder/apps: Only error on Deployment Available=False
*and* Progressing=False, 2020-08-06, openshift#430), where I said:
Available=True, Progressing=False is the happy, steady state.
This commit tightens that back down a bit, to account for cases where
the Deployment controller claims the Deployment is still available,
but has given up on rolling out a requested update [2]. We definitely
want to complain about Deployments like that, because admins might
want to investigate the sticking issue, and possibly delete or
otherwise poke the Deployment so we come back in and reconcile it so
the Deployment controller gives it another attempt.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1951339#c0
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
Available=True,Progressing=Falseis the happy, steady state.Available=True,Progressing=Trueis a happy update.Available=False,Progressing=Trueis acceptable outage, e.g. during an update with theRecreatestrategy:Available=False,Progressing=Falseis the Deployment controller saying "I cannot deliver my expected service level for this Deployment", so that's when we should be complaining. Fixes noise like:(the namespace part of that message is a separate issue).