
Conversation

@wking
Member

@wking wking commented Apr 26, 2021

ClusterOperatorDown is based on cluster_operator_up, but we also have ClusterOperatorDegraded based on cluster_operator_conditions{condition="Degraded"}. Firing ClusterOperatorDown for operators which are Available=True and Degraded=True confuses users. ClusterOperatorDegraded will also be firing. With this commit, I'm adjusting cluster_operator_up to only care about Available, to decouple the two alerts and bring them in line with their existing "has not been available" and "has been degraded" descriptions.

However, IsOperatorStatusConditionTrue requires the condition to be present, and cluster_operator_conditions only creates entries when the conditions are present. To guard against the Degraded-unset case in ClusterOperatorDegraded, I'm covering it with an or and group by guard. So we should have the following cases (a rough sketch of the resulting rules follows this list):

  • Available unset or != True: ClusterOperatorDown will fire.
  • Available=True, Degraded unset or != False: ClusterOperatorDegraded will fire. Firing on unset is new in this commit. Not firing ClusterOperatorDown here is new in this commit.
  • Available=True, Degraded=False: Neither alert fires.
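A rough sketch of how the two rules could read after this change (the expressions follow the description above; the for durations and severities here are placeholders rather than the exact values shipped in the manifests):

- alert: ClusterOperatorDown
  # case 1: Available unset or != True; cluster_operator_up now ignores Degraded
  expr: cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m  # placeholder duration
  labels:
    severity: critical  # placeholder severity
- alert: ClusterOperatorDegraded
  # case 2: Degraded=True, or Degraded unset; the 'or on (name) group by (name)'
  # guard supplies a 1 for operators that have no Degraded series at all
  expr: |
    (
      cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
      or on (name)
      group by (name) (cluster_operator_up{job="cluster-version-operator"})
    ) == 1
  for: 10m  # placeholder duration
  labels:
    severity: warning  # placeholder severity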

@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1834551, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low (Referenced Bugzilla bug's severity is low for the branch this PR is targeting) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) labels Apr 26, 2021
@openshift-ci-robot openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Apr 26, 2021
@wking wking force-pushed the decouple-degraded-from-ClusterOperatorDown branch 2 times, most recently from 2bcc3eb to 5dfb480 on April 26, 2021 20:36
@wking
Member Author

wking commented Apr 26, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) label and removed the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) label Apr 26, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1834551, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ClusterOperatorDown is based on cluster_operator_up, but we also have
ClusterOperatorDegraded based on
cluster_operator_conditions{condition="Degraded"}.  Firing
ClusterOperatorDown for operators which are Available=True and
Degraded=True confuses users [1].  ClusterOperatorDegraded will also
be firing.  With this commit, I'm adjusting cluster_operator_up to
only care about Available, to decouple the two alerts and bring them
in line with their existing "has not been available" and "has been
degraded" descriptions.

However, IsOperatorStatusConditionTrue requires the condition to be
present, and cluster_operator_conditions only creates entries when the
conditions are present.  To guard against the Degraded-unset condition
in ClusterOperatorDegraded, I'm covering with an 'or' [2] and 'group
by' [3] guard.  So we should have the following cases:

* Available unset or != True: ClusterOperatorDown will fire.
* Available=True, Degraded unset or != False: ClusterOperatorDegraded
  will fire.  Firing on unset is new in this commit.  Not firing
  ClusterOperatorDown here is new in this commit.
* Available=True, Degraded=False: Neither alert fires.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1834551#c0
[2]: https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators
[3]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
@wking wking force-pushed the decouple-degraded-from-ClusterOperatorDown branch from 5dfb480 to 90539f9 on April 26, 2021 20:41
(
cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
or on (name)
group by (name) (cluster_operator_up{job="cluster-version-operator"})
Contributor

Having a hard time reading this:

"if operator is degraded OR is up == 1 for ten minutes report degraded?"

Shouldn't that be

" if operator is degraded == 1 OR up == 0 for ten minutes report degraded"?

Contributor

I'm not sure what the OR portion is doing, so I'm not saying it's incorrect, but according to the PR description ClusterOperatorDegraded should only fire when Degraded is true or unset. And if up == 0, only ClusterOperatorDown should fire, which is handled above.

Member Author

We have a bunch of ClusterOperators. Grouping into sets:

a. Don't set Available, or set Available!=True.
b. Set Available=True.
c. Don't set Degraded or set it to an unrecognized value.
d. Set Degraded=True.
e. Set Degraded=False.

a∪b is all operators. c∪d∪e is all operators.

cluster_operator_up{job="cluster-version-operator"} will be zero for (a) and one for (b). ClusterOperatorDown will fire for (a) but not for (b). Great.

cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} will be one for (d) and zero for (e), but misses (c). group by collapses the cluster_operator_up value to one, regardless of (a) vs. (b). So when the (c) operators fall through to the or, they hit the group by's one, and come out of the (A or B) block with one, just like (d). So ClusterOperatorDegraded will fire for (c) and (d) but not for (e). Great.
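Spelling the guarded block out in full (a sketch; the trailing == 1 comparison is assumed here to close the block, since the hunk above is cut off), annotated with the sets above:

(
  # 1 for (d), 0 for (e); produces no matching series for (c)
  cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
  or on (name)
  # fallback for (c): group by collapses cluster_operator_up to 1,
  # regardless of (a) vs. (b)
  group by (name) (cluster_operator_up{job="cluster-version-operator"})
) == 1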

Contributor

oh that's right. i hate group by.

@smarterclayton
Contributor

This LGTM, what docs do we need to change besides the one in this repo and maybe the api doc?

@wking
Member Author

wking commented Apr 29, 2021

This LGTM, what docs do we need to change besides the one in this repo and maybe the api doc?

Do we need to change any docs? These alerts are ideally self-documenting, although see the improvements in flight for 4.9 in #547. I don't think this PR makes any changes to our guidance for what operators should put in their ClusterOperator conditions.

@smarterclayton
Contributor

the critical vs warning alert comparison i think is an appropriate alignment. i’d like that in api docs if possible.

/lgtm
/approve

i’ll look at gating based on aggregate time spent degraded

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label Apr 29, 2021
@smarterclayton
Contributor

/retest

@wking
Member Author

wking commented Apr 29, 2021

the critical vs warning alert comparison i think is an appropriate alignment. i’d like that in api docs if possible.

I've floated openshift/api#916.

@wking
Member Author

wking commented Apr 30, 2021

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/550/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1387908284446412800/build-log.txt | grep 'Step e2e'
INFO[2021-04-29T23:20:13Z] Step e2e-agnostic-operator-ipi-conf succeeded after 10s.
INFO[2021-04-29T23:20:23Z] Step e2e-agnostic-operator-ipi-conf-azure succeeded after 10s.
INFO[2021-04-29T23:20:33Z] Step e2e-agnostic-operator-ipi-install-monitoringpvc succeeded after 10s.
INFO[2021-04-29T23:20:43Z] Step e2e-agnostic-operator-ipi-install-rbac succeeded after 10s.
INFO[2021-04-29T23:59:03Z] Step e2e-agnostic-operator-ipi-install-install succeeded after 38m20s.
INFO[2021-04-30T00:00:23Z] Step e2e-agnostic-operator-e2e-test succeeded after 1m20s.
INFO[2021-04-30T00:01:13Z] Step e2e-agnostic-operator-gather-core-dump succeeded after 50s.
INFO[2021-04-30T00:04:23Z] Step e2e-agnostic-operator-gather-must-gather succeeded after 3m10s.
INFO[2021-04-30T00:06:13Z] Step e2e-agnostic-operator-gather-extra succeeded after 1m50s.
INFO[2021-04-30T00:07:43Z] Step e2e-agnostic-operator-gather-audit-logs succeeded after 1m30s.
INFO[2021-04-30T00:42:53Z] Step e2e-agnostic-operator-ipi-deprovision-deprovision failed after 35m10s.

This PR didn't break teardown.

/override ci/prow/e2e-agnostic-operator

@openshift-ci-robot
Contributor

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-operator


In response to this:

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/550/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1387908284446412800/build-log.txt | grep 'Step e2e'
INFO[2021-04-29T23:20:13Z] Step e2e-agnostic-operator-ipi-conf succeeded after 10s.
INFO[2021-04-29T23:20:23Z] Step e2e-agnostic-operator-ipi-conf-azure succeeded after 10s.
INFO[2021-04-29T23:20:33Z] Step e2e-agnostic-operator-ipi-install-monitoringpvc succeeded after 10s.
INFO[2021-04-29T23:20:43Z] Step e2e-agnostic-operator-ipi-install-rbac succeeded after 10s.
INFO[2021-04-29T23:59:03Z] Step e2e-agnostic-operator-ipi-install-install succeeded after 38m20s.
INFO[2021-04-30T00:00:23Z] Step e2e-agnostic-operator-e2e-test succeeded after 1m20s.
INFO[2021-04-30T00:01:13Z] Step e2e-agnostic-operator-gather-core-dump succeeded after 50s.
INFO[2021-04-30T00:04:23Z] Step e2e-agnostic-operator-gather-must-gather succeeded after 3m10s.
INFO[2021-04-30T00:06:13Z] Step e2e-agnostic-operator-gather-extra succeeded after 1m50s.
INFO[2021-04-30T00:07:43Z] Step e2e-agnostic-operator-gather-audit-logs succeeded after 1m30s.
INFO[2021-04-30T00:42:53Z] Step e2e-agnostic-operator-ipi-deprovision-deprovision failed after 35m10s.

This PR didn't break teardown.

/override ci/prow/e2e-agnostic-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Apr 30, 2021

Or e2e teardown.

/override ci/prow/e2e-agnostic

@openshift-ci-robot
Contributor

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic


In response to this:

Or e2e.

/override ci/prow/e2e-agnostic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Apr 30, 2021

@openshift-merge-robot openshift-merge-robot merged commit 7e75440 into openshift:master Apr 30, 2021
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1834551 has been moved to the MODIFIED state.


In response to this:

Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the decouple-degraded-from-ClusterOperatorDown branch April 30, 2021 13:32
@wking
Member Author

wking commented Jun 8, 2021

/cherrypick release-4.7

@openshift-cherrypick-robot

@wking: #550 failed to apply on top of branch "release-4.7":

Applying: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up
Using index info to reconstruct a base tree...
M	pkg/cvo/metrics.go
M	pkg/cvo/metrics_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/metrics_test.go
Auto-merging pkg/cvo/metrics.go
CONFLICT (content): Merge conflict in pkg/cvo/metrics.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pkg/cvo/metrics: Ignore Degraded for cluster_operator_up
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

/cherrypick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 8, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m (a rough sketch of such a rule follows this commit message).

This commit brings back

* fb5257d
  (install/0000_90_cluster-version-operator_02_servicemonitor: Soften
  ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1
  (install/0000_90_cluster-version-operator_02_servicemonitor: Update
  ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9
(pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550).
But that one had its own conflicts in metrics.go [5], and the
conflicts with this commit were orthogonal context issues, so moving
this back to 4.7 first won't make it much harder to bring back openshift#550
and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
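A minimal sketch of just the fields that softening touches, in PrometheusRule-style YAML (the branch's existing expr is left out on purpose, since this backport does not change it, and the exact shipped values may differ):

- alert: ClusterOperatorDegraded
  # expr: unchanged on release-4.7; this backport deliberately does not
  # pull in the openshift#550 expression change
  for: 30m              # relaxed from the previous 10m
  labels:
    severity: warning   # softened per fb5257d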