@wking
Member

@wking wking commented Apr 29, 2021

Available=False is really bad. Possibly a page-at-midnight thing. "Hey, your registry is down, so any new pods based on local images will fail to launch" or "ingress is down, so your users cannot reach you". If it's not a page-at-midnight thing, it's at least going to be in the first batch of things admins should look at when they get the from-the-spout alerts.

Degraded=True is not great, but you should be able to survive with reduced quality-of-service until an admin wakes up in the morning.

…rity

Available=False is really bad.  Possibly a page-at-midnight thing.
"Hey, your registry is down, so any new pods based on local images
will fail to launch" or "ingress is down, so your users cannot reach
you".  If it's not a page-at-midnight thing, it's at least going to be
the first batch of things admins should look at when they get the
from-the-spout alerts [1].

Degraded=True is not great, but you should be able to survive with
reduced quality-of-service until an admin wakes up in the morning.

[1]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic#h.rvrk9gcasjzh
     Rob Ewaschuk, My Philosophy on Alerting
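
To make the distinction above concrete, here is a minimal sketch of how the two conditions could map onto alert severities. The function name `suggested_severity` and the severity string "critical" are assumptions for illustration only; the commit message itself only says "page-at-midnight" for Available=False and "not great" for Degraded=True.

```python
def suggested_severity(conditions):
    """Return a hypothetical alert severity for a ClusterOperator.

    conditions maps condition type to status string, for example
    {"Available": "False", "Degraded": "True"}.
    """
    # Available=False: the operand is down, possibly worth a page at midnight.
    if conditions.get("Available") == "False":
        return "critical"  # assumed label, not taken from the commit message
    # Degraded=True: reduced quality-of-service, can wait until morning.
    if conditions.get("Degraded") == "True":
        return "warning"
    # Neither condition is in a bad state, so nothing to alert on.
    return None
```

Checking Available first means an operator that is both unavailable and degraded is treated as the more urgent case.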
@smarterclayton
Contributor

/lgtm
/approve

thanks

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 29, 2021
@smarterclayton smarterclayton added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Apr 29, 2021
@openshift-merge-robot openshift-merge-robot merged commit e5fb810 into openshift:master Apr 29, 2021
@wking wking deleted the available-degraded-severity branch April 29, 2021 23:54
wking added a commit to wking/cluster-version-operator that referenced this pull request May 7, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
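
As a rough illustration of why a longer 'for' helps here, the sketch below models an idealized alert that goes pending when its condition becomes true and starts firing once the condition has held for the whole 'for' window. The helper `firing_time` is hypothetical, the model ignores rule-evaluation and scrape intervals, and the sampled timestamps above do not line up exactly with a 10m window, so treat the numbers as illustrative only.

```python
from datetime import timedelta


def firing_time(pending_start, condition_end, for_duration):
    """Return how long an idealized alert fires.

    The alert starts firing at pending_start + for_duration and stops
    when the condition clears at condition_end. If the condition clears
    first, the alert never fires and the result is zero.
    """
    fire_at = pending_start + for_duration
    return max(condition_end - fire_at, timedelta(0))


# Times of day (UTC) taken from the timeline in the commit message.
pending_start = timedelta(hours=5, minutes=0, seconds=15)
condition_end = timedelta(hours=5, minutes=10, seconds=23)

for minutes in (10, 30):
    fired = firing_time(pending_start, condition_end, timedelta(minutes=minutes))
    print(f"for: {minutes}m -> fires for {fired.total_seconds():.0f}s")
```

Under this model the 10m window lets the alert fire for a few seconds just after install completes, while the 30m window never lets it leave pending.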
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 8, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m.

This commit brings back

* fb5257d
  (install/0000_90_cluster-version-operator_02_servicemonitor: Soften
  ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1
  (install/0000_90_cluster-version-operator_02_servicemonitor: Update
  ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9
(pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550).
But that one had its own conflicts in metrics.go [5], and the
conflicts with this commit were orthogonal context issues, so moving
this back to 4.7 first won't make it much harder to bring back openshift#550
and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
4 participants