Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up #550
Conversation
@wking: This pull request references Bugzilla bug 1834551, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed 2bcc3eb to 5dfb480.
/bugzilla refresh
@wking: This pull request references Bugzilla bug 1834551, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
ClusterOperatorDown is based on cluster_operator_up, but we also have
ClusterOperatorDegraded based on
cluster_operator_conditions{condition="Degraded"}. Firing
ClusterOperatorDown for operators which are Available=True and
Degraded=True confuses users [1]. ClusterOperatorDegraded will also
be firing. With this commit, I'm adjusting cluster_operator_up to
only care about Available, to decouple the two alerts and bring them
in line with their existing "has not been available" and "has been
degraded" descriptions.
However, IsOperatorStatusConditionTrue requires the condition to be
present, and cluster_operator_conditions only creates entries when the
conditions are present. To guard against the Degraded-unset condition
in ClusterOperatorDegraded, I'm covering with an 'or' [2] and 'group
by' [3] guard. So we should have the following cases:
* Available unset or != True: ClusterOperatorDown will fire.
* Available=True, Degraded unset or != False: ClusterOperatorDegraded
will fire. Firing on unset is new in this commit. Not firing
ClusterOperatorDown here is new in this commit.
* Available=True, Degraded=False: Neither alert fires.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1834551#c0
[2]: https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators
[3]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
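For reference, a sketch of the two alert expressions as they should look after this change. The Degraded guard matches the manifest hunk reviewed below; the surrounding "== 0" / "== 1" comparisons and the framing of ClusterOperatorDown are my reading of the existing rules, not text quoted from the final manifest:

# ClusterOperatorDown: Available unset or != True.
cluster_operator_up{job="cluster-version-operator"} == 0

# ClusterOperatorDegraded: Degraded=True, or Degraded unset.  The
# "or on (name) / group by (name)" arm emits a value of 1 for any
# operator that reports no Degraded condition series at all.
(
  cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
  or on (name)
  group by (name) (cluster_operator_up{job="cluster-version-operator"})
) == 1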
Force-pushed 5dfb480 to 90539f9.
(
  cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
  or on (name)
  group by (name) (cluster_operator_up{job="cluster-version-operator"})
Having a hard time reading this:
"if operator is degraded OR is up == 1 for ten minutes report degraded?"
Shouldn't that be
" if operator is degraded == 1 OR up == 0 for ten minutes report degraded"?
I'm not sure what the OR portion is doing, so I'm not saying it's incorrect, but according to PR description ClusterOperatorDegraded should only fire when Degraded true or unset. And if up == 0 only ClusterOperatorDown should fire which is handled above.
We have a bunch of ClusterOperators. Grouping into sets:
a. Don't set Available, or set Available!=True.
b. Set Available=True.
c. Don't set Degraded or set it to an unrecognized value.
d. Set Degraded=True.
e. Set Degraded=False.
a∪b is all operators. c∪d∪e is all operators.
cluster_operator_up{job="cluster-version-operator"} will be zero for (a) and one for (b). ClusterOperatorDown will fire for (a) but not for (b). Great.
cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} will be one for (d) and zero for (e), but misses (c). group by collapses the cluster_operator_up value to one, regardless of (a) vs. (b). So when the (c) operators fall through to the or, they hit the group by's value of one, and come out of the (A or B) block with one, just like (d). So ClusterOperatorDegraded will fire for (c) and (d) but not for (e). Great.
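To make that concrete, a sketch of how each arm evaluates. The operator names and the "=>" result notation are illustrative (mimicking the Prometheus expression browser), not output from a real cluster:

# (d) Degraded=True: matched by the left arm with value 1, so ClusterOperatorDegraded fires.
cluster_operator_conditions{name="authentication", condition="Degraded"}   => 1
# (e) Degraded=False: matched by the left arm with value 0, so no alert.
cluster_operator_conditions{name="etcd", condition="Degraded"}             => 0
# (c) Degraded unset: no left-arm series for that name, so "or on (name)" falls
# through to the right arm, and "group by (name)" always emits 1 whether
# cluster_operator_up was 0 or 1, so the alert fires.
group by (name) (cluster_operator_up{name="machine-config"})               => 1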
oh that's right. i hate group by.
This LGTM, what docs do we need to change besides the one in this repo and maybe the api doc?
Do we need to change any docs? These alerts are ideally self-documenting, although see the improvements in flight for 4.9 in #547. I don't think this PR makes any changes to our guidance for what operators should put in their ClusterOperator conditions.
the critical vs warning alert comparison i think is an appropriate alignment. i'd like that in api docs if possible.
/lgtm
i'll look at gating based on aggregate time spent degraded
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: smarterclayton, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
I've floated openshift/api#916. |
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/550/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1387908284446412800/build-log.txt | grep 'Step e2e'
INFO[2021-04-29T23:20:13Z] Step e2e-agnostic-operator-ipi-conf succeeded after 10s.
INFO[2021-04-29T23:20:23Z] Step e2e-agnostic-operator-ipi-conf-azure succeeded after 10s.
INFO[2021-04-29T23:20:33Z] Step e2e-agnostic-operator-ipi-install-monitoringpvc succeeded after 10s.
INFO[2021-04-29T23:20:43Z] Step e2e-agnostic-operator-ipi-install-rbac succeeded after 10s.
INFO[2021-04-29T23:59:03Z] Step e2e-agnostic-operator-ipi-install-install succeeded after 38m20s.
INFO[2021-04-30T00:00:23Z] Step e2e-agnostic-operator-e2e-test succeeded after 1m20s.
INFO[2021-04-30T00:01:13Z] Step e2e-agnostic-operator-gather-core-dump succeeded after 50s.
INFO[2021-04-30T00:04:23Z] Step e2e-agnostic-operator-gather-must-gather succeeded after 3m10s.
INFO[2021-04-30T00:06:13Z] Step e2e-agnostic-operator-gather-extra succeeded after 1m50s.
INFO[2021-04-30T00:07:43Z] Step e2e-agnostic-operator-gather-audit-logs succeeded after 1m30s.
INFO[2021-04-30T00:42:53Z] Step e2e-agnostic-operator-ipi-deprovision-deprovision failed after 35m10s.
This PR didn't break teardown.
/override ci/prow/e2e-agnostic-operator
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-operator
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Or e2e teardown.
/override ci/prow/e2e-agnostic
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 1834551 has been moved to the MODIFIED state.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.7
@wking: #550 failed to apply on top of branch "release-4.7":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast as possible without blocking on "has the in-cluster resource leveled?" since way back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136). That can lead to ClusterOperatorDown and ClusterOperatorDegraded firing during install, as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

  alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a separate commit. For ClusterOperatorDegraded, the degraded condition should not be particularly urgent [4], so we should be fine bumping it to 'warning' and using 'for: 30m' or something more relaxed than the current 10m.

This commit brings back:

* fb5257d (install/0000_90_cluster-version-operator_02_servicemonitor: Soften ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1 (install/0000_90_cluster-version-operator_02_servicemonitor: Update ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9 (pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550). But that one had its own conflicts in metrics.go [5], and the conflicts with this commit were orthogonal context issues, so moving this back to 4.7 first won't make it much harder to bring back openshift#550 and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
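As a side note on reference [2] above: the query pasted there leans on Prometheus's built-in ALERTS meta-metric, which is standard Prometheus behavior rather than anything introduced by this PR. A sketch of how it reads the timeline:

# Every pending or firing alert is exposed through the built-in ALERTS series
# with an alertstate label ("pending" or "firing").  Grouping by alertstate
# collapses the per-operator series, so the pending -> firing -> resolved
# transitions above can be read straight off the resulting graph.
group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})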