
Conversation

@wking wking commented Nov 21, 2022

To give users sub-component granularity about why they're getting a critical alert.

I'm still avoiding the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time-series. For message-level granularity, users still have to follow the oc ... or web-console links from the alert description.
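
(An illustrative sketch of the kind of query the new label enables; the reason value below is one that appears later in this thread, used purely as an example:)

# slice the down signal by sub-component reason
cluster_operator_up{job="cluster-version-operator"} == 0
# or keep only a specific reason while triaging:
cluster_operator_up{job="cluster-version-operator",reason="UpdatingPrometheusOperatorFailed"} == 0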

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2022
g.Set(0)
isUp := 0
reason := "NoAvailableCondition"
if condition := resourcemerge.FindOperatorStatusCondition(op.Status.Conditions, configv1.OperatorAvailable); condition != nil {
@Nikokolas3270 Nikokolas3270 Nov 21, 2022

Metric cluster_operator_up is equivalent to metric cluster_operator_conditions{condition="Available"}.
What is the point of keeping it?
Also, the semantics of the cluster_operator_conditions metric are clear (I do not have to look at the code producing that metric to know what it does); the semantics of the cluster_operator_up metric are less clear.

wking (Member Author)

Metric cluster_operator_up is equivalent to metric cluster_operator_conditions{condition="Available"}.

cluster_operator_up also includes the operator version from the ClusterOperator. And it detects "ClusterOperator exists but Available is unset", where cluster_operator_conditions has no time-series.
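
(A sketch of how that gap could be surfaced with the two metrics as described above, assuming both are being scraped:)

# ClusterOperators that exist but expose no Available condition time-series
group by (name) (cluster_operator_up)
unless on (name)
group by (name) (cluster_operator_conditions{condition="Available"})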

wking (Member Author)

Also semantic of cluster_operator_conditions metric is clear (I do not have to look at the code producing that metric to know what it does)...

I'm completely onboard with improving help-text for cluster_operator_up.

wking (Member Author)

I've expanded cluster_operator_up's help text in 9f587ce -> edf2574.

@Nikokolas3270 Nikokolas3270 Nov 22, 2022

cluster_operator_up also includes the operator version from the ClusterOperator.

Couldn't the cluster_operator_conditions metric ship the version as well?

And it detects "ClusterOperator exists but Available is unset", where cluster_operator_conditions has no time-series.

Well, here cluster_operator_up will be set to 0 whether the Available condition is unset or set to 0. Only the cluster_operator_conditions{condition="Available"} metric can be used to distinguish between the two.

I'm completely onboard with improving help-text for cluster_operator_up.

Thanks for that, but as I said, just from its name I know what cluster_operator_conditions does; I have to dig further to know what cluster_operator_up does.

Last, please note that in another request (OSD-8320), we have been asked to use the reason of the Degraded condition in conjunction with cluster_operator_up to inhibit the ClusterOperatorDown alert.
Maybe this is a mistake, as it would be strange to pick up the value from one metric (cluster_operator_conditions{condition="Available"}) and the reason from another (cluster_operator_conditions{condition="Degraded"}); but this is puzzling to me and needs to be clarified.

wking (Member Author)

Couldn't the cluster_operator_conditions metric ship the version as well?

It could, but the version is ClusterOperator-scoped, and not condition-scoped. cluster_operator_up is ClusterOperator-scoped, so it's a good fit there. cluster_operator_conditions is condition-scoped, so it's not a good fit there (would you copy the same version on to time-series for every one of the operator's conditions?). You could obviously also add a new ClusterOperator-scoped metric whose sole purpose was exposing the version information, but 🤷, that seems like a lot of churn, and I'm not clear on the benefit.
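
(For illustration only: a common PromQL join idiom showing what condition-scoped consumers would have to do to pick up a ClusterOperator-scoped version label. The "* 0 + 1" keeps the left-hand values intact.)

# hypothetical join copying the ClusterOperator-scoped version label
# onto condition-scoped series
cluster_operator_conditions{condition="Available"}
* on (name) group_left (version)
  (cluster_operator_up{job="cluster-version-operator"} * 0 + 1)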

Only the cluster_operator_conditions{condition="Available"} metric can be used to distinguish between the two.

With this pull request, you can make that distinction by seeing if the cluster_operator_up == 0 reason is NoAvailableCondition or otherwise. But as far as "when to page an admin?" goes, I don't think that distinction is all that important. Both are bad enough that we want to be paging admins.
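
(A sketch of that distinction, using the NoAvailableCondition reason this pull request introduces:)

# Available condition missing entirely:
cluster_operator_up{reason="NoAvailableCondition"} == 0
# Available condition present but not True:
cluster_operator_up{reason!="NoAvailableCondition"} == 0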

I know what cluster_operator_conditions does. I have to dig further to know what cluster_operator_up does.

I don't mind if other cluster-version-operator approvers want to override me, but personally I think "many metrics consumers are able to intuit the semantics without having to read docs" is a useful property, worth aspiring to, but not something that must be delivered in order for a metric to exist.

Last, please note that in another request (OSD-8320), we have been asked to use the reason of the Degraded condition in conjunction with cluster_operator_up to inhibit the ClusterOperatorDown alert.

That seems like an anti-pattern to me, since Degraded=True is theoretically orthogonal to Available=False. And it's Available=False (or absent) that drives ClusterOperatorDown. We've had it decoupled from Degraded since #550. That said, my push-back is just that I don't want to re-mix Degradedness into ClusterOperatorDown; I have no problem at all if consumers like OSD decide to join ClusterOperatorDown with Degraded information when deciding if/how to respond.

@Nikokolas3270 Nikokolas3270 Nov 23, 2022

Thanks for the feedback, Trevor.
As my remark is more a matter of preference, I am OK to close this discussion.
My worries were linked to OSD-8320, which asks to use a different condition for the reason. From what I understand, using the reason coming from the Available condition is perfectly fine with the silences defined there:
openshift/configure-alertmanager-operator@db85a6d
(although I wonder if it would be better to not silence the alert and let CAD process it by sending an automated SL to the customer)

@wking wking force-pushed the cluster_operator_up-reason branch 3 times, most recently from 70f969d to 917170d on November 22, 2022 at 00:59

wking commented Nov 22, 2022

Feeding e2e-agnostic into PromeCIeus, ALERTS{alertname="ClusterOperatorDown"} shows reason on pending alerts.

[screenshot: PromeCIeus graph of pending ClusterOperatorDown alerts]

However, sharding by reason means the alert may not fire if an operator stays Available=False but flips the reason string around (e.g. in this CI run, monitoring went from UpdatingPrometheusOperatorFailed to UpdatingPrometheusAdapterFailed). It's not clear to me how to balance wanting the extra detail against the risk of not firing while the reason is churning.

description: The {{ "{{ $labels.name }}" }} operator may be down or disabled because {{ "${{ $labels.reason }}" }}, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
expr: |
-  max by (namespace, name) (cluster_operator_up{job="cluster-version-operator"} == 0)
+  max by (namespace, name, reason) (cluster_operator_up{job="cluster-version-operator"} == 0)
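
(One conceivable mitigation, sketched here and not what this pull request does: keep the alert keyed only on (namespace, name), so a churning reason cannot reset the pending timer, and surface the current reason some other way, e.g. only in the description:)

# hypothetical churn-proof variant: the reason label is aggregated away,
# so flipping reason strings cannot reset the alert's pending timer
max by (namespace, name) (cluster_operator_up{job="cluster-version-operator"} == 0)
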
Member

I'm still avoiding the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time-series. For message-level granularity, users still have to follow the oc ... or web-console links from the alert description.

I am not clear how much of a cardinality increase we are looking at by adding reason. I am still not clear about the motivation.

wking (Member Author)

I am not clear how much of a cardinality increase we are looking at by adding reason.

We already have reason cardinality on cluster_operator_conditions, and that seems ok. Checking 4.11 Telemetry:

sort_desc(
  count by (name) (
    count by (name, reason) (
      (cluster_operator_conditions{condition="Available"} == 0)
      + on (_id) group_left ()
      group by (_id) (cluster_version{type="current",version=~"4[.]11[.][0-9]*"})
    )
  )
)

gives:

{name="authentication"}  16
{name="monitoring"}  10
{name="storage"}  6
{name="console"}  5
{name="image-registry"}  3
...

so the cardinality increase would be mostly limited to those particular operators.

message cardinality will be higher, potentially including things like request port numbers, API response IDs, and that sort of thing.

I am still not clear about the motivation.

Cluster admins like SRE-P get a ClusterOperatorDown alert. They wonder if it's a true-positive (actual cluster pain; needs quick manual intervention) or a false-positive (the operator is over-reacting; the cluster is actually ok, or will soon self-heal to become ok). Splitting ClusterOperatorDown by name gives you some granularity (e.g. you can decide whether a particular component like etcd is often a true-positive or often a false-positive). But splitting by (name, reason) tuples gives more granularity, so you can more precisely separate likely-to-be-a-false-positive from likely-to-be-a-true-positive.
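
(A sketch of that triage, assuming the reason label from this pull request is in place; Prometheus's built-in ALERTS series carries the alert's labels:)

# hypothetical triage query: which (name, reason) pairs fire most often?
count by (name, reason) (
  ALERTS{alertname="ClusterOperatorDown", alertstate="firing"}
)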

wking (Member Author)

{name="authentication"}  16

Digging into this one out of curiosity:

group by (reason) (
  (cluster_operator_conditions{name="authentication",condition="Available"} == 0)
  + on (_id) group_left ()
  group by (_id) (cluster_version{type="current",version=~"4[.]11[.][0-9]*"})
)

gives:

{reason="OAuthServerDeployment_NoPod::OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable"}  1
{reason="WellKnown_NotReady"}  1
{reason="OAuthServerDeployment_NoPod"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady"}  1
{reason="APIServices_PreconditionNotReady"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1
{reason="APIServerDeployment_NoPod"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady::WellKnown_NotReady"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1
{reason="APIServices_Error"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1

So if reason cardinality is a concern, we can reduce it somewhat (at the cost of reduced granularity) by collecting multiple-reason conditions under MultipleReasons (like the cluster-version operator does for some of its conditions) instead of concatenating on ::.
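
(Purely illustrative: the collapsing described above would happen in the producing operator, but label_replace can approximate the effect on the query side. With one Available series per operator per cluster, the rewrite should not collide labelsets.)

# hypothetical: fold concatenated reasons like "A::B" into one bucket
label_replace(
  cluster_operator_conditions{condition="Available"} == 0,
  "reason", "MultipleReasons", "reason", ".*::.*"
)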

Member

Cluster admins like SRE-P get a ClusterOperatorDown alert. They wonder if it's a true-positive (actual cluster pain; needs quick manual intervention) or a false-positive (the operator is over-reacting; the cluster is actually ok, or will soon self-heal to become ok). Splitting ClusterOperatorDown by name gives you some granularity (e.g. you can decide whether a particular component like etcd is often a true-positive or often a false-positive). But splitting by (name, reason) tuples gives more granularity, so you can more precisely separate likely-to-be-a-false-positive from likely-to-be-a-true-positive.

I'm not entirely convinced about this. Do we want to optimize for what is basically an OpenShift bug (an operator being down should never be a false-positive to be ignored, right)? I feel that we'd just add some tech jargon to admin-facing messages; after admin-ing the CI clusters myself for some time, I'm not sure I found that too useful, to be honest.

That said, the cost of adding the labels does not seem too high, and I can imagine they can be useful for troubleshooting, diagnosis, or maybe conditional risks.


I'm not entirely convinced about this. Do we want to optimize for what is basically an OpenShift bug (an operator being down should never be a false-positive to be ignored, right)? I feel that we'd just add some tech jargon to admin-facing messages; after admin-ing the CI clusters myself for some time, I'm not sure I found that too useful, to be honest.

It is true that the aim of adding the reason is to silence the alert for some reasons, as done here:
openshift/configure-alertmanager-operator@db85a6d

It would (maybe) be better to not silence the alert and let CAD (Cluster Anomaly Detection) deal with the alert and send the proper SL to the customer.
@wking & @petr-muller : what do you think about that?
Maybe we should open a card on our side to track that suggestion.

wking (Member Author)

It would (maybe) be better to not silence the alert and let CAD (Cluster Anomaly Detection) deal with the alert and send the proper SL to the customer.

An issue with automatically service-logging on these alerts is that, while sometimes they are things the customer can fix (hooray), sometimes they are things that SRE-P should fix for the customer (that's why customers pay us to manage clusters). For example:

  • Sometimes the alert is really a false-positive. The appropriate action is to file an OCPBUGS ticket, which then gets fixed, and the fixes ship, and then clusters get updated to pick up the fix. That can still be a long turnaround time, and because some managed offerings like OSD defer control-plane patch updates to the customer, the turnaround time can be very slow. Silences give service delivery something to de-noise in the meantime. Customers don't want to be bothered by this loop.
  • Sometimes the alert is a true-positive, but it's something SRE-P is supposed to fix without bothering the customer. E.g. monitoring's UpdatingPrometheusOperatorFailed is probably not something that the customer wants to be bothered about.

To give users sub-component granularity about why they're getting a
critical alert.

I'm still avoiding the cardinality hit of including the full message
in the metric, because we don't want to load Prometheus down with
*that* many time-series.  For message-level granularity, users still
have to follow the 'oc ...' or web-console links from the alert
description.
@wking wking force-pushed the cluster_operator_up-reason branch from 917170d to bbcc33d on November 22, 2022 at 19:25
annotations:
  summary: Cluster operator has not been available for 10 minutes.
- description: The {{ "{{ $labels.name }}" }} operator may be down or disabled, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
+ description: The {{ "{{ $labels.name }}" }} operator may be down or disabled because {{ "${{ $labels.reason }}" }}, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
Member

Looking at your authentication examples below, will we produce UI messages like "...down or disabled because OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady..."?

If yes, is that desirable?

wking (Member Author)

Err... I guess? I could see truncating at 50 characters or something if we felt like ClusterOperator writers couldn't be trusted. I could also see asking the auth and other folks to consolidate to MultipleReasons or similar, as a less open-ended slug.

Member

Length is one thing, but this also feels like techy clutter in what is otherwise a natural-language sentence.

Member

I also have concerns about putting in words which only OpenShift engineers can understand. However, it is better than not surfacing the reasons at all; see e.g. https://issues.redhat.com/browse/OSD-8320. If this turns out to be a bad idea, we can revert it.


@wking wking changed the title pkg/cvo/metrics: Add 'reason' to cluster_operator_up OTA-844: pkg/cvo/metrics: Add 'reason' to cluster_operator_up Dec 7, 2022
@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2022

openshift-ci bot commented Dec 16, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller (Member)

/retest

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 9c23665 and 2 for PR HEAD bbcc33d in total


wking commented Dec 17, 2022

cache-kube-api-reused-connections is orthogonal.

/override ci/prow/e2e-agnostic-upgrade-into-change


openshift-ci bot commented Dec 17, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade-into-change


In response to this:

cache-kube-api-reused-connections is orthogonal.

/override ci/prow/e2e-agnostic-upgrade-into-change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


openshift-ci bot commented Dec 17, 2022

@wking: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 75c5b85 into openshift:master Dec 17, 2022
@wking wking deleted the cluster_operator_up-reason branch August 29, 2023 18:04
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 31, 2023
…om ${{

Fixing a ClusterOperatorDown typo from bbcc33d (pkg/cvo/metrics:
Add 'reason' to cluster_operator_up, 2022-11-21, openshift#868) that I'd
carried forward into ClusterReleaseNotAccepted in 734b9c5
(install/0000_90_cluster-version-operator_02_servicemonitor: Add
ClusterReleaseNotAccepted, 2023-02-23, openshift#906) [1].

[1]: https://issues.redhat.com/browse/OCPBUGS-8079
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Mar 13, 2024
…om ${{

Fixing a ClusterOperatorDown typo from bbcc33d (pkg/cvo/metrics:
Add 'reason' to cluster_operator_up, 2022-11-21, openshift#868) that I'd
carried forward into ClusterReleaseNotAccepted in 734b9c5
(install/0000_90_cluster-version-operator_02_servicemonitor: Add
ClusterReleaseNotAccepted, 2023-02-23, openshift#906) [1].

[1]: https://issues.redhat.com/browse/OCPBUGS-8079