Conversation

@wking (Member) commented Aug 6, 2020

Available=True, Progressing=False is the happy, steady state.
Available=True, Progressing=True is a happy update.
Available=False, Progressing=True is an acceptable outage, e.g. during an update with the Recreate strategy [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664/artifacts/e2e-gcp-upgrade/container-logs/test.log | grep MinimumReplicasUnavailable | head -n1
Aug  6 17:56:00.674: INFO: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:1, UpdatedReplicas:1, ReadyReplicas:0, AvailableReplicas:0, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"dp-7f9df745ff\" is progressing."}}, CollisionCount:(*int32)(nil)}

Available=False, Progressing=False is the Deployment controller saying "I cannot deliver my expected service level for this Deployment", so that's when we should complain. This fixes noise like:

  Aug  6 18:03:00.500: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
  * Could not update namespace "openshift-service-ca-operator" (467 of 608)
  * deployment openshift-cluster-machine-approver/machine-approver is not available MinimumReplicasUnavailable: Deployment does not have minimum availability.
  * deployment openshift-ingress-operator/ingress-operator is not available MinimumReplicasUnavailable: Deployment does not have minimum availability.

(the namespace part of that message is a separate issue).

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664
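As a minimal Go sketch of the table above (hypothetical helper names; not the actual resourcebuilder diff), the check reduces to complaining only when both conditions are False:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// conditionStatus returns the status of the named Deployment condition,
// defaulting to Unknown when the condition is not present.
func conditionStatus(dep *appsv1.Deployment, condType appsv1.DeploymentConditionType) corev1.ConditionStatus {
	for _, c := range dep.Status.Conditions {
		if c.Type == condType {
			return c.Status
		}
	}
	return corev1.ConditionUnknown
}

// deploymentFailing implements the table above: only Available=False
// combined with Progressing=False is treated as an error.
func deploymentFailing(dep *appsv1.Deployment) bool {
	return conditionStatus(dep, appsv1.DeploymentAvailable) == corev1.ConditionFalse &&
		conditionStatus(dep, appsv1.DeploymentProgressing) == corev1.ConditionFalse
}

func main() {
	// Available=False, Progressing=True: the acceptable Recreate-rollout outage.
	dep := &appsv1.Deployment{
		Status: appsv1.DeploymentStatus{
			Conditions: []appsv1.DeploymentCondition{
				{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionFalse},
				{Type: appsv1.DeploymentProgressing, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Println(deploymentFailing(dep)) // false
}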

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 6, 2020
@smarterclayton (Contributor):

I think this looks right, I'm going to trigger an upgrade via cluster-bot between 4.5 and this.

@smarterclayton (Contributor):

Logs in that test run look correct

@smarterclayton (Contributor):

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 12, 2020
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor):

/retest

@wking (Member, Author) commented Aug 12, 2020

The upstream fix in this space is kubernetes/kubernetes#93933. Until that lands, the CVO will continue to assume that Deployments which fall into that hole are happy enough, even when they should have been marked ProgressDeadlineExceeded and blocked the CVO's manifest-graph application. It's hard to work around that in the CVO without implementing our own deployment controller that monitors the backing ReplicaSets directly. I think we should ignore the issue for now and carry the upstream patch in our local deployment controllers instead of carrying a workaround in the CVO.

I think this PR is orthogonal enough that it should not block on anything in the paragraph above being addressed.

@openshift-merge-robot merged commit 71aef74 into openshift:master Aug 12, 2020
@wking deleted the drop-available-deployment-check branch August 13, 2020 01:14
@smarterclayton (Contributor):

Agreed; we should carry the fix in the workload controller for error detection, not the CVO.

wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 20, 2021
Seen in a customer cluster [1]:

  $ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
  [
    {
      "lastTransitionTime": "2021-04-12T22:04:41Z",
      "lastUpdateTime": "2021-04-12T22:04:41Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-04-19T17:33:30Z",
      "lastUpdateTime": "2021-04-19T17:33:30Z",
      "message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ]

Previous touch for this logic was 0442094
(lib/resourcebuilder/apps: Only error on Deployment Available=False
*and* Progressing=False, 2020-08-06, openshift#430), where I said:

  Available=True, Progressing=False is the happy, steady state.

This commit tightens that back down a bit, to account for cases where
the Deployment controller claims the Deployment is still available but
has given up on rolling out a requested update [2].  We definitely
want to complain about Deployments like that, because admins may want
to investigate the sticking issue, and possibly delete or otherwise
poke the Deployment so that we come back in and reconcile it, giving
the Deployment controller another attempt.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1951339#c0
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
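A hedged Go sketch of that tightened check (reusing the hypothetical conditionStatus helper's imports from the earlier sketch; not the commit's actual diff): even with Available=True, a Progressing=False condition with reason ProgressDeadlineExceeded marks a stuck rollout worth reporting:

  // deploymentStuck reports a Deployment whose controller has given up on a
  // rollout (Progressing=False, reason ProgressDeadlineExceeded), even while
  // the old ReplicaSet keeps the Deployment Available=True.
  func deploymentStuck(dep *appsv1.Deployment) bool {
  	for _, c := range dep.Status.Conditions {
  		if c.Type == appsv1.DeploymentProgressing &&
  			c.Status == corev1.ConditionFalse &&
  			c.Reason == "ProgressDeadlineExceeded" {
  			return true
  		}
  	}
  	return false
  }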