
Conversation

@wking
Member

@wking wking commented Apr 26, 2021

ClusterOperatorDown is based on cluster_operator_up, but we also have ClusterOperatorDegraded based on cluster_operator_conditions{condition="Degraded"}. Firing ClusterOperatorDown for operators which are Available=True and Degraded=True confuses users. ClusterOperatorDegraded will also be firing. With this commit, I'm adjusting cluster_operator_up to only care about Available, to decouple the two alerts and bring them in line with their existing "has not been available" and "has been degraded" descriptions.

However, IsOperatorStatusConditionTrue requires the condition to be present, and cluster_operator_conditions only creates entries when the conditions are present. To guard against the Degraded-unset case in ClusterOperatorDegraded, I'm covering it with an or and group by guard. So we should have the following cases (a rough sketch of the resulting rules follows this list):

  • Available unset or != True: ClusterOperatorDown will fire.
  • Available=True, Degraded unset or != False: ClusterOperatorDegraded will fire. Firing on unset is new in this commit. Not firing ClusterOperatorDown here is new in this commit.
  • Available=True, Degraded=False: Neither alert fires.
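A rough sketch of how the two rules could read after this change (the expressions follow the description above; the for durations and severities here are placeholders rather than the exact values shipped in the manifests):

- alert: ClusterOperatorDown
  # case 1: Available unset or != True; cluster_operator_up now ignores Degraded
  expr: cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m  # placeholder duration
  labels:
    severity: critical  # placeholder severity
- alert: ClusterOperatorDegraded
  # case 2: Degraded=True, or Degraded unset; the 'or on (name) group by (name)'
  # guard supplies a 1 for operators that have no Degraded series at all
  expr: |
    (
      cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
      or on (name)
      group by (name) (cluster_operator_up{job="cluster-version-operator"})
    ) == 1
  for: 10m  # placeholder duration
  labels:
    severity: warning  # placeholder severity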

@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1834551, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low (Referenced Bugzilla bug's severity is low for the branch this PR is targeting) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) labels Apr 26, 2021
@openshift-ci-robot openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Apr 26, 2021
@wking wking force-pushed the decouple-degraded-from-ClusterOperatorDown branch 2 times, most recently from 2bcc3eb to 5dfb480 on April 26, 2021 20:36
@wking
Member Author

wking commented Apr 26, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) label and removed the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) label Apr 26, 2021
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1834551, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ClusterOperatorDown is based on cluster_operator_up, but we also have
ClusterOperatorDegraded based on
cluster_operator_conditions{condition="Degraded"}.  Firing
ClusterOperatorDown for operators which are Available=True and
Degraded=True confuses users [1].  ClusterOperatorDegraded will also
be firing.  With this commit, I'm adjusting cluster_operator_up to
only care about Available, to decouple the two alerts and bring them
in line with their existing "has not been available" and "has been
degraded" descriptions.

However, IsOperatorStatusConditionTrue requires the condition to be
present, and cluster_operator_conditions only creates entries when the
conditions are present.  To guard against the Degraded-unset condition
in ClusterOperatorDegraded, I'm covering with an 'or' [2] and 'group
by' [3] guard.  So we should have the following cases:

* Available unset or != True: ClusterOperatorDown will fire.
* Available=True, Degraded unset or != False: ClusterOperatorDegraded
  will fire.  Firing on unset is new in this commit.  Not firing
  ClusterOperatorDown here is new in this commit.
* Available=True, Degraded=False: Neither alert fires.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1834551#c0
[2]: https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators
[3]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
@wking wking force-pushed the decouple-degraded-from-ClusterOperatorDown branch from 5dfb480 to 90539f9 on April 26, 2021 20:41
(
cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
or on (name)
group by (name) (cluster_operator_up{job="cluster-version-operator"})
Contributor

Having a hard time reading this:

"if operator is degraded OR is up == 1 for ten minutes report degraded?"

Shouldn't that be

" if operator is degraded == 1 OR up == 0 for ten minutes report degraded"?

Contributor

I'm not sure what the OR portion is doing, so I'm not saying it's incorrect, but according to the PR description ClusterOperatorDegraded should only fire when Degraded is true or unset. And if up == 0, only ClusterOperatorDown should fire, which is handled above.

Member Author

We have a bunch of ClusterOperators. Grouping into sets:

a. Don't set Available, or set Available!=True.
b. Set Available=True.
c. Don't set Degraded or set it to an unrecognized value.
d. Set Degraded=True.
e. Set Degraded=False.

a∪b is all operators. c∪d∪e is all operators.

cluster_operator_up{job="cluster-version-operator"} will be zero for (a) and one for (b). ClusterOperatorDown will fire for (a) but not for (b). Great.

cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} will be one for (d) and zero for (e), but misses (c). group by collapses the cluster_operator_up value to one, regardless of (a) vs. (b). So when the (c) operators fall through to the or, they hit the group by's one, and come out of the (A or B) block with one, just like (d). So ClusterOperatorDegraded will fire for (c) and (d) but not for (e). Great.
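Spelling the guarded block out in full (a sketch; the trailing == 1 comparison is assumed here to close the block, since the hunk above is cut off), annotated with the sets above:

(
  # 1 for (d), 0 for (e); produces no matching series for (c)
  cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
  or on (name)
  # fallback for (c): group by collapses cluster_operator_up to 1,
  # regardless of (a) vs. (b)
  group by (name) (cluster_operator_up{job="cluster-version-operator"})
) == 1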

Contributor

oh that's right. i hate group by.

@smarterclayton
Contributor

This LGTM, what docs do we need to change besides the one in this repo and maybe the api doc?

@wking
Member Author

wking commented Apr 29, 2021

This LGTM, what docs do we need to change besides the one in this repo and maybe the api doc?

Do we need to change any docs? These alerts are ideally self-documenting, although see the improvements in flight for 4.9 in #547. I don't think this PR makes any changes to our guidance for what operators should put in their ClusterOperator conditions.

@smarterclayton
Contributor

the critical vs warning alert comparison i think is an appropriate alignment. i’d like that in api docs if possible.

/lgtm
/approve

i’ll look at gating based on aggregate time spent degraded

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label Apr 29, 2021
@smarterclayton
Contributor

/retest

@wking
Member Author

wking commented Apr 29, 2021

the critical vs warning alert comparison i think is an appropriate alignment. i’d like that in api docs if possible.

I've floated openshift/api#916.

@wking
Member Author

wking commented Apr 30, 2021

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/550/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1387908284446412800/build-log.txt | grep 'Step e2e'
INFO[2021-04-29T23:20:13Z] Step e2e-agnostic-operator-ipi-conf succeeded after 10s.
INFO[2021-04-29T23:20:23Z] Step e2e-agnostic-operator-ipi-conf-azure succeeded after 10s.
INFO[2021-04-29T23:20:33Z] Step e2e-agnostic-operator-ipi-install-monitoringpvc succeeded after 10s.
INFO[2021-04-29T23:20:43Z] Step e2e-agnostic-operator-ipi-install-rbac succeeded after 10s.
INFO[2021-04-29T23:59:03Z] Step e2e-agnostic-operator-ipi-install-install succeeded after 38m20s.
INFO[2021-04-30T00:00:23Z] Step e2e-agnostic-operator-e2e-test succeeded after 1m20s.
INFO[2021-04-30T00:01:13Z] Step e2e-agnostic-operator-gather-core-dump succeeded after 50s.
INFO[2021-04-30T00:04:23Z] Step e2e-agnostic-operator-gather-must-gather succeeded after 3m10s.
INFO[2021-04-30T00:06:13Z] Step e2e-agnostic-operator-gather-extra succeeded after 1m50s.
INFO[2021-04-30T00:07:43Z] Step e2e-agnostic-operator-gather-audit-logs succeeded after 1m30s.
INFO[2021-04-30T00:42:53Z] Step e2e-agnostic-operator-ipi-deprovision-deprovision failed after 35m10s.

This PR didn't break teardown.

/override ci/prow/e2e-agnostic-operator

@openshift-ci-robot
Contributor

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-operator


In response to this:

e2e-operator:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/550/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1387908284446412800/build-log.txt | grep 'Step e2e'
INFO[2021-04-29T23:20:13Z] Step e2e-agnostic-operator-ipi-conf succeeded after 10s.
INFO[2021-04-29T23:20:23Z] Step e2e-agnostic-operator-ipi-conf-azure succeeded after 10s.
INFO[2021-04-29T23:20:33Z] Step e2e-agnostic-operator-ipi-install-monitoringpvc succeeded after 10s.
INFO[2021-04-29T23:20:43Z] Step e2e-agnostic-operator-ipi-install-rbac succeeded after 10s.
INFO[2021-04-29T23:59:03Z] Step e2e-agnostic-operator-ipi-install-install succeeded after 38m20s.
INFO[2021-04-30T00:00:23Z] Step e2e-agnostic-operator-e2e-test succeeded after 1m20s.
INFO[2021-04-30T00:01:13Z] Step e2e-agnostic-operator-gather-core-dump succeeded after 50s.
INFO[2021-04-30T00:04:23Z] Step e2e-agnostic-operator-gather-must-gather succeeded after 3m10s.
INFO[2021-04-30T00:06:13Z] Step e2e-agnostic-operator-gather-extra succeeded after 1m50s.
INFO[2021-04-30T00:07:43Z] Step e2e-agnostic-operator-gather-audit-logs succeeded after 1m30s.
INFO[2021-04-30T00:42:53Z] Step e2e-agnostic-operator-ipi-deprovision-deprovision failed after 35m10s.

This PR didn't break teardown.

/override ci/prow/e2e-agnostic-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Apr 30, 2021

Or e2e teardown.

/override ci/prow/e2e-agnostic

@openshift-ci-robot
Contributor

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic


In response to this:

Or e2e.

/override ci/prow/e2e-agnostic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Apr 30, 2021

@openshift-merge-robot openshift-merge-robot merged commit 7e75440 into openshift:master Apr 30, 2021
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1834551 has been moved to the MODIFIED state.


In response to this:

Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the decouple-degraded-from-ClusterOperatorDown branch April 30, 2021 13:32
@wking
Member Author

wking commented Jun 8, 2021

/cherrypick release-4.7

@openshift-cherrypick-robot

@wking: #550 failed to apply on top of branch "release-4.7":

Applying: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up
Using index info to reconstruct a base tree...
M	pkg/cvo/metrics.go
M	pkg/cvo/metrics_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/metrics_test.go
Auto-merging pkg/cvo/metrics.go
CONFLICT (content): Merge conflict in pkg/cvo/metrics.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 pkg/cvo/metrics: Ignore Degraded for cluster_operator_up
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

/cherrypick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 8, 2021
…usterOperatorDegraded

During install, the CVO has pushed manifests into the cluster as fast
as possible without blocking on "has the in-cluster resource leveled?"
since way back in b0b4902 (clusteroperator: Don't block on failing
during initialization, 2019-03-11, openshift#136).  That can lead to
ClusterOperatorDown and ClusterOperatorDegraded firing during install,
as we see in [1], where:

* ClusterOperatorDegraded started pending at 5:00:15Z [2].
* Install completed at 5:09:58Z [3].
* ClusterOperatorDegraded started firing at 5:10:04Z [2].
* ClusterOperatorDegraded stopped firing at 5:10:23Z [2].
* The e2e suite complained about [1]:

    alert ClusterOperatorDegraded fired for 15 seconds with labels: {... name="authentication"...} (open bug: https://bugzilla.redhat.com/show_bug.cgi?id=1939580)

ClusterOperatorDown is similar, but I'll leave addressing it to a
separate commit.  For ClusterOperatorDegraded, the degraded condition
should not be particularly urgent [4], so we should be fine bumping it
to 'warning' and using 'for: 30m' or something more relaxed than the
current 10m (a rough sketch of such a rule follows this commit message).

This commit brings back

* fb5257d
  (install/0000_90_cluster-version-operator_02_servicemonitor: Soften
  ClusterOperatorDegraded, 2021-05-06, openshift#554) and
* 92ed7f1
  (install/0000_90_cluster-version-operator_02_servicemonitor: Update
  ClusterOperatorDegraded message to 30m, 2021-05-08, openshift#556).

There are some conflicts, because I am not bringing back 90539f9
(pkg/cvo/metrics: Ignore Degraded for cluster_operator_up, 2021-04-26, openshift#550).
But that one had its own conflicts in metrics.go [5], and the
conflicts with this commit were orthogonal context issues, so moving
this back to 4.7 first won't make it much harder to bring back openshift#550
and such later on, if we decide to do that.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776
     group by (alertstate) (ALERTS{alertname="ClusterOperatorDegraded"})
[3]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.8/1389436726862155776/artifacts/e2e-aws-upi/clusterversion.json
[4]: openshift/api#916
[5]: openshift#550 (comment)
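A minimal sketch of just the fields that softening touches, in PrometheusRule-style YAML (the branch's existing expr is left out on purpose, since this backport does not change it, and the exact shipped values may differ):

- alert: ClusterOperatorDegraded
  # expr: unchanged on release-4.7; this backport deliberately does not
  # pull in the openshift#550 expression change
  for: 30m              # relaxed from the previous 10m
  labels:
    severity: warning   # softened per fb5257d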