OTA-844: pkg/cvo/metrics: Add 'reason' to cluster_operator_up #868
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -87,9 +87,9 @@ spec: | |
| - alert: ClusterOperatorDown | ||
| annotations: | ||
| summary: Cluster operator has not been available for 10 minutes. | ||
| description: The {{ "{{ $labels.name }}" }} operator may be down or disabled, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}. | ||
| description: The {{ "{{ $labels.name }}" }} operator may be down or disabled because {{ "{{ $labels.reason }}" }}, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}. | ||
| expr: | | ||
| max by (namespace, name) (cluster_operator_up{job="cluster-version-operator"} == 0) | ||
| max by (namespace, name, reason) (cluster_operator_up{job="cluster-version-operator"} == 0) | ||
|
I am not clear how much cardinality increase we are looking at by adding reason. I am still not clear about the motivation.
We already have gives: so the cardinality increase would be mostly limited to those particular operators.
Cluster admins like SRE-P get a
Digging into this one out of curiosity: gives: So if
I'm not entirely convinced about this. Do we want to optimize for what is basically an OpenShift bug (operator down should never be a false positive to be ignored, right)? I feel that we'd just add some tech jargon to admin-facing messages; after admin-ing the CI clusters myself for some time, I'm not sure I found that too useful, tbh. That said, the cost of adding the labels does not seem too high, and I can imagine them being useful for troubleshooting, diagnosis, or maybe conditional risks.
It is true that the aim of adding the reason is to silence the alert for some reasons as done here: It would (maybe) be better to not silence the alert and let CAD (Cluster Anomaly Detection) deal with the alert and send the proper SL to the customer.
One issue with automatically service-logging on these alerts is that, while sometimes they are things the customer can fix (hooray), sometimes they are things that SRE-P should fix for the customer (that's why customers pay us to manage clusters). For example:
|
||
| for: 10m | ||
| labels: | ||
| severity: critical | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -94,8 +94,8 @@ version for 'cluster', or empty for 'initial'. | |
| }, []string{"name"}), | ||
| clusterOperatorUp: prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "cluster_operator_up", | ||
| Help: "Reports key highlights of the active cluster operators.", | ||
| }, []string{"name", "version"}), | ||
| Help: "1 if a cluster operator is Available=True. 0 otherwise, including if a cluster operator sets no Available condition. The 'version' label tracks the 'operator' version. The 'reason' label is passed through from the Available condition, unless the cluster operator sets no Available condition, in which case NoAvailableCondition is used.", | ||
| }, []string{"name", "version", "reason"}), | ||
| clusterOperatorConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "cluster_operator_conditions", | ||
| Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.", | ||
|
|
@@ -339,7 +339,7 @@ func (m *operatorMetrics) Describe(ch chan<- *prometheus.Desc) { | |
| ch <- m.version.WithLabelValues("", "", "", "").Desc() | ||
| ch <- m.availableUpdates.WithLabelValues("", "").Desc() | ||
| ch <- m.capability.WithLabelValues("").Desc() | ||
| ch <- m.clusterOperatorUp.WithLabelValues("", "").Desc() | ||
| ch <- m.clusterOperatorUp.WithLabelValues("", "", "").Desc() | ||
| ch <- m.clusterOperatorConditions.WithLabelValues("", "", "").Desc() | ||
| ch <- m.clusterOperatorConditionTransitions.WithLabelValues("", "").Desc() | ||
| ch <- m.clusterInstaller.WithLabelValues("", "", "").Desc() | ||
|
|
@@ -489,12 +489,16 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) { | |
| if version == "" { | ||
| klog.V(2).Infof("ClusterOperator %s is not setting the 'operator' version", op.Name) | ||
| } | ||
| g := m.clusterOperatorUp.WithLabelValues(op.Name, version) | ||
| if resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable) { | ||
| g.Set(1) | ||
| } else { | ||
| g.Set(0) | ||
| var isUp float64 | ||
| reason := "NoAvailableCondition" | ||
| if condition := resourcemerge.FindOperatorStatusCondition(op.Status.Conditions, configv1.OperatorAvailable); condition != nil { | ||
|
Metric
I'm completely onboard with improving help-text for
Couldn't the
Well, here
Thanks for that, but as said, just with its name, I know what
Last, please note that in some other request (OSD-8320), we have been asked to use the reason of the Degraded condition in conjunction with
It could, but the version is ClusterOperator-scoped, and not condition-scoped.
With this pull request, you can make that distinction by seeing if the
I don't mind if other cluster-version-operator approvers want to override me, but personally I see "many metrics consumers are able to intuit the semantics without having to read docs" as a useful property, worthy of aspiring to, but not something that must be delivered in order for a metric to exist.
That seems like an anti-pattern to me, since
Thanks for the feedback Trevor. |
||
| reason = condition.Reason | ||
| if condition.Status == configv1.ConditionTrue { | ||
| isUp = 1 | ||
| } | ||
| } | ||
| g := m.clusterOperatorUp.WithLabelValues(op.Name, version, reason) | ||
| g.Set(isUp) | ||
| ch <- g | ||
| for _, condition := range op.Status.Conditions { | ||
| if condition.Status != configv1.ConditionFalse && condition.Status != configv1.ConditionTrue { | ||
|
|
||
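The new `Collect` logic in the hunk above can be read in isolation as a small pure function. The following is a minimal sketch, not the cluster-version-operator code: `Condition`, `findCondition`, and `upAndReason` are hypothetical stand-ins for `configv1.ClusterOperatorStatusCondition`, `resourcemerge.FindOperatorStatusCondition`, and the inlined logic, and the sample reasons are illustrative only.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for the ClusterOperator status
// condition type; only the fields used by the hunk are modeled.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// findCondition returns a pointer to the first condition of the given type,
// or nil if no such condition is set.
func findCondition(conditions []Condition, conditionType string) *Condition {
	for i := range conditions {
		if conditions[i].Type == conditionType {
			return &conditions[i]
		}
	}
	return nil
}

// upAndReason mirrors the diff: the gauge value is 1 only when Available=True,
// and the reason is passed through from the Available condition, falling back
// to "NoAvailableCondition" when the operator sets no Available condition.
func upAndReason(conditions []Condition) (float64, string) {
	var isUp float64
	reason := "NoAvailableCondition"
	if c := findCondition(conditions, "Available"); c != nil {
		reason = c.Reason
		if c.Status == "True" {
			isUp = 1
		}
	}
	return isUp, reason
}

func main() {
	// Illustrative inputs; the reason strings are assumptions.
	samples := []struct {
		name       string
		conditions []Condition
	}{
		{"no-conditions", nil},
		{"available", []Condition{{Type: "Available", Status: "True", Reason: "AsExpected"}}},
		{"unavailable", []Condition{{Type: "Available", Status: "False", Reason: "WellKnown_NotReady"}}},
	}
	for _, s := range samples {
		up, reason := upAndReason(s.conditions)
		fmt.Printf("%s: cluster_operator_up=%v reason=%q\n", s.name, up, reason)
	}
}
```

Note that an `Available` condition with status `Unknown` still passes its reason through while reporting 0, which matches the new help text ("1 if a cluster operator is Available=True. 0 otherwise").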
Looking at your authentication examples below, will we produce UI messages like "...down or disabled because OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady..."?
If yes, is that desirable?
err... I guess? I could see truncating at 50 characters or something if we felt like ClusterOperator writers couldn't be trusted. I could also see asking the auth and other folks to consolidate to `MultipleReasons` or similar, as a less open-ended slug.
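For a sense of what the 50-character cap floated here might look like, the following is a rough sketch only. `truncateReason` is a hypothetical helper that is not part of this pull request, and the limit of 50 is just the number mentioned in the comment.

```go
package main

import "fmt"

// truncateReason is a hypothetical helper: it caps reason at max runes so
// that an overly long reason slug cannot dominate the alert description.
// Reasons at or under the limit are returned unchanged.
func truncateReason(reason string, max int) string {
	if max <= 0 {
		return ""
	}
	runes := []rune(reason)
	if len(runes) <= max {
		return reason
	}
	return string(runes[:max])
}

func main() {
	// The long reason quoted earlier in this thread.
	long := "OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"
	fmt.Println(truncateReason(long, 50))
	fmt.Println(truncateReason("NoAvailableCondition", 50))
}
```

A blind cut like this keeps the label short but can land mid-word, which is part of why consolidating to a fixed slug may be the cleaner option.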
Length is one thing, but this also feels like techy clutter in what is otherwise a natural-language sentence.
I also have concerns about including words that only OpenShift engineers can understand. However, it is better than not including the reasons, e.g. https://issues.redhat.com/browse/OSD-8320. If this turns out to be a bad idea, we can revert it.