Conversation

@wking (Member) commented Aug 6, 2020

Available=True, Progressing=False is the happy, steady state.
Available=True, Progressing=True is a happy update.
Available=False, Progressing=True is an acceptable outage, e.g. during an update with the Recreate strategy [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664/artifacts/e2e-gcp-upgrade/container-logs/test.log | grep MinimumReplicasUnavailable | head -n1
Aug  6 17:56:00.674: INFO: deployment status: v1.DeploymentStatus{ObservedGeneration:1, Replicas:1, UpdatedReplicas:1, ReadyReplicas:0, AvailableReplicas:0, UnavailableReplicas:1, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"False", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"MinimumReplicasUnavailable", Message:"Deployment does not have minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, LastTransitionTime:v1.Time{Time:time.Time{wall:0x0, ext:63732333358, loc:(*time.Location)(0x9e74040)}}, Reason:"ReplicaSetUpdated", Message:"ReplicaSet \"dp-7f9df745ff\" is progressing."}}, CollisionCount:(*int32)(nil)}

Available=False, Progressing=False is the Deployment controller saying "I cannot deliver my expected service level for this Deployment", so that's when we should complain. This fixes noise like:

  Aug  6 18:03:00.500: INFO: cluster upgrade is Failing: Multiple errors are preventing progress:
  * Could not update namespace "openshift-service-ca-operator" (467 of 608)
  * deployment openshift-cluster-machine-approver/machine-approver is not available MinimumReplicasUnavailable: Deployment does not have minimum availability.
  * deployment openshift-ingress-operator/ingress-operator is not available MinimumReplicasUnavailable: Deployment does not have minimum availability.

(the namespace part of that message is a separate issue).

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1291426211527921664
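As a minimal Go sketch of the table above (hypothetical helper names; not the actual resourcebuilder diff), the check reduces to complaining only when both conditions are False:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// conditionStatus returns the status of the named Deployment condition,
// defaulting to Unknown when the condition is not present.
func conditionStatus(dep *appsv1.Deployment, condType appsv1.DeploymentConditionType) corev1.ConditionStatus {
	for _, c := range dep.Status.Conditions {
		if c.Type == condType {
			return c.Status
		}
	}
	return corev1.ConditionUnknown
}

// deploymentFailing implements the table above: only Available=False
// combined with Progressing=False is treated as an error.
func deploymentFailing(dep *appsv1.Deployment) bool {
	return conditionStatus(dep, appsv1.DeploymentAvailable) == corev1.ConditionFalse &&
		conditionStatus(dep, appsv1.DeploymentProgressing) == corev1.ConditionFalse
}

func main() {
	// Available=False, Progressing=True: the acceptable Recreate-rollout outage.
	dep := &appsv1.Deployment{
		Status: appsv1.DeploymentStatus{
			Conditions: []appsv1.DeploymentCondition{
				{Type: appsv1.DeploymentAvailable, Status: corev1.ConditionFalse},
				{Type: appsv1.DeploymentProgressing, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Println(deploymentFailing(dep)) // false
}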

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 6, 2020
@smarterclayton (Contributor):

I think this looks right, I'm going to trigger an upgrade via cluster-bot between 4.5 and this.

@smarterclayton (Contributor):

Logs in that test run look correct

@smarterclayton (Contributor):

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 12, 2020
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor):

/retest

@wking (Member, Author) commented Aug 12, 2020

The upstream fix in this space is kubernetes/kubernetes#93933. Until that lands, the CVO will continue to assume that Deployments which fall into that hole are happy enough, even when they should have been marked ProgressDeadlineExceeded and blocked the CVO's manifest-graph application. It's hard to work around that in the CVO without implementing our own deployment controller that monitors the backing ReplicaSets directly. I think we should ignore the issue for now and carry the upstream patch in our local deployment controllers instead of carrying a workaround in the CVO.

I think this PR is orthogonal enough that it should not block on anything in the paragraph above being addressed.

@openshift-merge-robot merged commit 71aef74 into openshift:master Aug 12, 2020
@wking deleted the drop-available-deployment-check branch August 13, 2020 01:14
@smarterclayton (Contributor):

Agreed; we should carry the fix in the workload controller for error detection, not the CVO.

wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 20, 2021
Seen in a customer cluster [1]:

  $ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
  [
    {
      "lastTransitionTime": "2021-04-12T22:04:41Z",
      "lastUpdateTime": "2021-04-12T22:04:41Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-04-19T17:33:30Z",
      "lastUpdateTime": "2021-04-19T17:33:30Z",
      "message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ]

Previous touch for this logic was 0442094
(lib/resourcebuilder/apps: Only error on Deployment Available=False
*and* Progressing=False, 2020-08-06, openshift#430), where I said:

  Available=True, Progressing=False is the happy, steady state.

This commit tightens that back down a bit, to account for cases where
the Deployment controller claims the Deployment is still available but
has given up on rolling out a requested update [2].  We definitely
want to complain about Deployments like that, because admins may want
to investigate the sticking issue, and possibly delete or otherwise
poke the Deployment so that we come back in and reconcile it, giving
the Deployment controller another attempt.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1951339#c0
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
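A hedged Go sketch of that tightened check (reusing the hypothetical conditionStatus helper's imports from the earlier sketch; not the commit's actual diff): even with Available=True, a Progressing=False condition with reason ProgressDeadlineExceeded marks a stuck rollout worth reporting:

  // deploymentStuck reports a Deployment whose controller has given up on a
  // rollout (Progressing=False, reason ProgressDeadlineExceeded), even while
  // the old ReplicaSet keeps the Deployment Available=True.
  func deploymentStuck(dep *appsv1.Deployment) bool {
  	for _, c := range dep.Status.Conditions {
  		if c.Type == appsv1.DeploymentProgressing &&
  			c.Status == corev1.ConditionFalse &&
  			c.Reason == "ProgressDeadlineExceeded" {
  			return true
  		}
  	}
  	return false
  }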