config/v1/types_cluster_operator: Clarify Available and Degraded severity #916

wking · 2021-04-29T23:27:43Z

Available=False is really bad. Possibly a page-at-midnight thing. "Hey, your registry is down, so any new pods based on local images will fail to launch" or "ingress is down, so your users cannot reach you". If it's not a page-at-midnight thing, it's at least going to be the first batch of things admins should look at when they get from-the-spout alerts.

Degraded=True is not great, but you should be able to survive with reduced quality-of-service until an admin wakes up in the morning.

…rity

Available=False is really bad. Possibly a page-at-midnight thing. "Hey, your registry is down, so any new pods based on local images will fail to launch" or "ingress is down, so your users cannot reach you". If it's not a page-at-midnight thing, it's at least going to be the first batch of things admins should look at when they get the from-the-spout alerts [1].

Degraded=True is not great, but you should be able to survive with reduced quality-of-service until an admin wakes up in the morning.

[1]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic#h.rvrk9gcasjzh Rob Ewaschuk, My Philosophy on Alerting
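To make the distinction above concrete, here is a minimal Go sketch of how a consumer might translate the two conditions into an alert severity. Everything in it is illustrative: the `Condition` struct is a pared-down stand-in rather than the real API type, `Severity` is a hypothetical helper, and the severity names `critical` and `warning` are assumptions chosen to match the "page at midnight" versus "wait until morning" framing.

```go
package main

import "fmt"

// Condition is a hypothetical, pared-down stand-in for a ClusterOperator
// status condition; only the two fields this sketch needs are modeled.
type Condition struct {
	Type   string // "Available" or "Degraded"
	Status string // "True", "False", or "Unknown"
}

// Severity returns an illustrative alert severity for a set of conditions.
// The labels are assumptions: Available=False is treated as page-worthy,
// Degraded=True as something that can wait for working hours.
func Severity(conds []Condition) string {
	severity := "none"
	for _, c := range conds {
		switch {
		case c.Type == "Available" && c.Status == "False":
			// Worst case: nothing else can outrank this, so return early.
			return "critical"
		case c.Type == "Degraded" && c.Status == "True":
			severity = "warning"
		}
	}
	return severity
}

func main() {
	degradedOnly := []Condition{{"Available", "True"}, {"Degraded", "True"}}
	unavailable := []Condition{{"Available", "False"}, {"Degraded", "True"}}
	fmt.Println(Severity(degradedOnly)) // warning
	fmt.Println(Severity(unavailable))  // critical
}
```

Returning early on Available=False keeps the more urgent condition from being masked by a concurrent Degraded=True.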

smarterclayton · 2021-04-29T23:47:20Z

/lgtm
/approve

thanks

openshift-ci-robot · 2021-04-29T23:47:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136). That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate commit. For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [4], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

This commit brings back

* fb5257d (install/0000_90_cluster-version-operator_02_servicemonitor: Soften ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1 (install/0000_90_cluster-version-operator_02_servicemonitor: Update ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9 (pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550). But that one had its own conflicts in metrics.go [5], and the conflicts with this commit were orthogonal context issues, so moving this back to 4.7 first won't make it much harder to bring back openshift#550 and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776 group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
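The effect of relaxing the alert's 'for' duration can be checked against the timeline in the commit message above. The Go sketch below is a deliberately simplified model: it assumes the Degraded condition held continuously from the reported pending time until the alert stopped firing, and it ignores rule-evaluation intervals, so its timings will not match Prometheus to the second. `WouldFire` and `wouldFireClock` are hypothetical helpers, not part of any OpenShift or Prometheus API.

```go
package main

import (
	"fmt"
	"time"
)

// WouldFire reports whether a condition that holds continuously from
// pendingAt to clearedAt lasts long enough to move an alert from pending
// to firing, given the rule's "for" duration. Evaluation intervals are
// ignored, so this is an approximation.
func WouldFire(pendingAt, clearedAt time.Time, forDuration time.Duration) bool {
	return clearedAt.Sub(pendingAt) >= forDuration
}

// wouldFireClock is a convenience wrapper taking HH:MM:SS clock strings
// on the same day and a "for" duration in whole minutes.
func wouldFireClock(pendingAt, clearedAt string, forMinutes int) bool {
	const layout = "15:04:05"
	p, err := time.Parse(layout, pendingAt)
	if err != nil {
		panic(err)
	}
	c, err := time.Parse(layout, clearedAt)
	if err != nil {
		panic(err)
	}
	return WouldFire(p, c, time.Duration(forMinutes)*time.Minute)
}

func main() {
	// Times taken from the CI run described above: pending at 5:00:15Z,
	// stopped firing at 5:10:23Z, roughly ten minutes of Degraded=True.
	fmt.Println(wouldFireClock("05:00:15", "05:10:23", 10)) // true: fires under the old 10m
	fmt.Println(wouldFireClock("05:00:15", "05:10:23", 30)) // false: stays pending under 30m
}
```

Under this model, the install-time blip is long enough to trip a 10m rule but falls well short of 30m, which is the behavior the commit is after.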

openshift-ci-robot requested review from adambkaplan and deads2k April 29, 2021 23:27

wking mentioned this pull request Apr 29, 2021

Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up openshift/cluster-version-operator#550

Merged

openshift-ci-robot assigned smarterclayton Apr 29, 2021

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels Apr 29, 2021

smarterclayton added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) label Apr 29, 2021

openshift-merge-robot merged commit e5fb810 into openshift:master Apr 29, 2021

wking deleted the available-degraded-severity branch April 29, 2021 23:54

wking mentioned this pull request May 7, 2021

Bug 1957991: install/0000_90_cluster-version-operator_02_servicemonitor: Soften ClusterOperatorDegraded openshift/cluster-version-operator#554

Merged
