Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -87,9 +87,9 @@ spec:
- alert: ClusterOperatorDown
annotations:
summary: Cluster operator has not been available for 10 minutes.
description: The {{ "{{ $labels.name }}" }} operator may be down or disabled, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
description: The {{ "{{ $labels.name }}" }} operator may be down or disabled because {{ "${{ $labels.reason }}" }}, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ "{{ $labels.name }}" }}'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at your authentication examples below, will we produce UI messages like "...down or disable because OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady...".

If yes, is that desirable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err... I guess? I could see truncating on 50 characters or something if we felt like ClusterOperator writers couldn't be trusted. I could also see asking the auth and other folks to consolidate to MultipleReasons or similar, as a less open-ended slug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Length is one thing but this also feels like techy clutter in what is otherwise a natural language sequence

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have also concern about putting words which can only OpenShift engineers can understand. However it is not better than not putting the reasons e.g. https://issues.redhat.com/browse/OSD-8320. If this turn out to be a bad idea we can revert it back.

expr: |
max by (namespace, name) (cluster_operator_up{job="cluster-version-operator"} == 0)
max by (namespace, name, reason) (cluster_operator_up{job="cluster-version-operator"} == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still avoiding the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time-series. For message-level granularity, users still have to follow the oc ... or web-console links from the alert description.

I am not clear how much cardinality increase we are looking at by adding reason. I am still not clear about the motivation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear how much cardinality increase we are looking at by adding reason.

We already have reason cardinality on cluster_operator_conditions, and that seems ok. Checking 4.11 Telemetry:

sort_desc(
  count by (name) (
    count by (name, reason) (
      (cluster_operator_conditions{condition="Available"} == 0)
      + on (_id) group_left ()
      group by (_id) (cluster_version{type="current",version=~"4[.]11[.][0-9]*"})
    )
  )
)

gives:

{name="authentication"}  16
{name="monitoring"}  10
{name="storage"}  6
{name="console"}  5
{name="image-registry"}  3
...

so the cardinality increase would be mostly limited to those particular operators.

message cardinality will be higher, potentially including things like request port numbers, API response IDs, and that sort of thing.

I am still not clear about the motivation.

Cluster admins like SRE-P get a ClusterOperatorDown alert. They wonder if it's a true-positive (actual cluster pain; needs quick manual intervention) or a false-positive (operator is over-reacting, cluster is actually ok, or will soon self-heal to become ok). Splitting ClusterOperatorDown by name gives you some granularity (e.g. you can decide whether a particular component like etcd is often a true-positive or often a false-positive. But splitting by (name, reason) tuples gives more granularity, so you can more precisely separate likely-to-be-a-false-positive from likely-to-be-a-true-positive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{name="authentication"}  16

Digging into this one out of curiousity:

group by (reason) (
  (cluster_operator_conditions{name="authentication",condition="Available"} == 0)
  + on (_id) group_left ()
  group by (_id) (cluster_version{type="current",version=~"4[.]11[.][0-9]*"})
)

gives:

{reason="OAuthServerDeployment_NoPod::OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable"}  1
{reason="WellKnown_NotReady"}  1
{reason="OAuthServerDeployment_NoPod"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady"}  1
{reason="APIServices_PreconditionNotReady"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1
{reason="APIServerDeployment_NoPod"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady::WellKnown_NotReady"}  1
{reason="APIServerDeployment_NoPod::APIServices_PreconditionNotReady::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1
{reason="APIServices_Error"}  1
{reason="OAuthServerRouteEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::WellKnown_NotReady"}  1

So if reason cardinality is a concern, we can reduce it somewhat (at the cost of reduced granularity) by collecting multiple-reason conditions under MultipleReasons (like the cluster-version operator does for some of its conditions) instead of concatenating on ::.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cluster admins like SRE-P get a ClusterOperatorDown alert. They wonder if it's a true-positive (actual cluster pain; needs quick manual intervention) or a false-positive (operator is over-reacting, cluster is actually ok, or will soon self-heal to become ok). Splitting ClusterOperatorDown by name gives you some granularity (e.g. you can decide whether a particular component like etcd is often a true-positive or often a false-positive. But splitting by (name, reason) tuples gives more granularity, so you can more precisely separate likely-to-be-a-false-positive from likely-to-be-a-true-positive.

I'm not entirely convinced about this. Do we want to optimize for what is basically an OpenShift bug (operator down should never be a false-positive thing to be ignored, right?). I feel that we'd just add some tech jargon to admin-facing messages, after admin-ing the CI clusters myself for some time I'm not sure if I found that too useful tbh.

That said, it does not seem the cost of adding the labels is too high, and I can imagine they can be useful when troubleshooting, diagnosis or maybe conditional risks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely convinced about this. Do we want to optimize for what is basically an OpenShift bug (operator down should never be a false-positive thing to be ignored, right?). I feel that we'd just add some tech jargon to admin-facing messages, after admin-ing the CI clusters myself for some time I'm not sure if I found that too useful tbh.

It is true that the aim of adding the reason is to silence the alert for some reasons as done here:
openshift/configure-alertmanager-operator@db85a6d

It would (maybe) be better to not silence the alert and let CAD (Cluster Anomaly Detection) deal with the alert and send the proper SL to the customer.
@wking & @petr-muller : what do you think about that?
Maybe we should open a card on our side to track that suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would (maybe) be better to not silence the alert and let CAD (Cluster Anomaly Detection) deal with the alert and send the proper SL to the customer.

An issues with automatically service-logging on these alerts is that, while sometimes they are things the customer can fix (hooray). sometimes they are things that SRE-P should fix for the customer (that's why customers pay us to manage clusters). For example:

  • Sometimes the alert is really a false-positive. The appropriate action is to file an OCPBUGS ticket, which then gets fixed, and the fixes ship, and then clusters get updated to pick up the fix. That can still be a long turnaround time, and because some managed offerings like OSD defer control-plane patch updates to the customer, the turnaround time can be very slow. Silences give service delivery something to de-noise in the meantime. Customers don't want to be bothered by this loop.
  • Sometimes the alert is a true-positive, but it's something SRE-P is supposed to fix without bothering the customer. E.g. monitoring's UpdatingPrometheusOperatorFailed is probably not something that the customer wants to be bothered about.

for: 10m
labels:
severity: critical
Expand Down
20 changes: 12 additions & 8 deletions pkg/cvo/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -94,8 +94,8 @@ version for 'cluster', or empty for 'initial'.
}, []string{"name"}),
clusterOperatorUp: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "cluster_operator_up",
Help: "Reports key highlights of the active cluster operators.",
}, []string{"name", "version"}),
Help: "1 if a cluster operator is Available=True. 0 otherwise, including if a cluster operator sets no Available condition. The 'version' label tracks the 'operator' version. The 'reason' label is passed through from the Available condition, unless the cluster operator sets no Available condition, in which case NoAvailableCondition is used.",
}, []string{"name", "version", "reason"}),
clusterOperatorConditions: prometheus.NewGaugeVec(prometheus.GaugeOpts{
Name: "cluster_operator_conditions",
Help: "Report the conditions for active cluster operators. 0 is False and 1 is True.",
Expand Down Expand Up @@ -339,7 +339,7 @@ func (m *operatorMetrics) Describe(ch chan<- *prometheus.Desc) {
ch <- m.version.WithLabelValues("", "", "", "").Desc()
ch <- m.availableUpdates.WithLabelValues("", "").Desc()
ch <- m.capability.WithLabelValues("").Desc()
ch <- m.clusterOperatorUp.WithLabelValues("", "").Desc()
ch <- m.clusterOperatorUp.WithLabelValues("", "", "").Desc()
ch <- m.clusterOperatorConditions.WithLabelValues("", "", "").Desc()
ch <- m.clusterOperatorConditionTransitions.WithLabelValues("", "").Desc()
ch <- m.clusterInstaller.WithLabelValues("", "", "").Desc()
Expand Down Expand Up @@ -489,12 +489,16 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
if version == "" {
klog.V(2).Infof("ClusterOperator %s is not setting the 'operator' version", op.Name)
}
g := m.clusterOperatorUp.WithLabelValues(op.Name, version)
if resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable) {
g.Set(1)
} else {
g.Set(0)
var isUp float64
reason := "NoAvailableCondition"
if condition := resourcemerge.FindOperatorStatusCondition(op.Status.Conditions, configv1.OperatorAvailable); condition != nil {
Copy link

@Nikokolas3270 Nikokolas3270 Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric cluster_operator_up is equivalent to metric cluster_operator_conditions{condition="Available"}.
What is the point of keeping it?
Also semantic of cluster_operator_conditions metric is clear (I do not have to look at the code producing that metric to know what it does); semantic of metric cluster_operator_up is less clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric cluster_operator_up is equivalent to metric cluster_operator_conditions{condition="Available"}.

cluster_operator_up also includes the operator version from the ClusterOperator. And it detects "ClusterOperator exists but Available is unset", where cluster_operator_conditions has no time-series.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also semantic of cluster_operator_conditions metric is clear (I do not have to look at the code producing that metric to know what it does)...

I'm completely onboard with improving help-text for cluster_operator_up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've expanded cluster_operator_up's help text in 9f587ce -> edf2574.

Copy link

@Nikokolas3270 Nikokolas3270 Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cluster_operator_up also includes the operator version from the ClusterOperator.

Couldn't the cluster_operator_conditions metric ship the version as well?

And it detects "ClusterOperator exists but Available is unset", where cluster_operator_conditions has no time-series.

Well, here cluster_operator_up will be set to 0 whether the Available condition is unset to set to 0. Only cluster_operator_conditions{condition="Available"} metric can be use to distinguish it from being unset or set to 0.

I'm completely onboard with improving help-text for cluster_operator_up.

Thanks for that, but as said, just with its name, I know what cluster_operator_conditions does. I have to dig further to know what cluster_operator_up does.

Last, please note that in some other request (OSD-8320), we have been asked to use the reason of the Degraded condition in conjunction with cluster_operator_up to inhibit ClusterOperatorDown alert.
Maybe this is a misstake as it would be strange to pick up the value from one metric (cluster_operator_conditions{condition="Available"}) and the reason from an other (cluster_operator_conditions{condition="Degraded"}); but hits is puzzling to me and need to be clarified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the cluster_operator_conditions metric ship the version as well?

It could, but the version is ClusterOperator-scoped, and not condition-scoped. cluster_operator_up is ClusterOperator-scoped, so it's a good fit there. cluster_operator_conditions is condition-scoped, so it's not a good fit there (would you copy the same version on to time-series for every one of the operator's conditions?). You could obviously also add a new ClusterOperator-scoped metric whose sole purpose was exposing the version information, but 🤷, that seems like a lot of churn, and I'm not clear on the benefit.

Only cluster_operator_conditions{condition="Available"} metric can be use to distinguish it from being unset or set to 0.

With this pull request, you can make that distinction by seeing if the cluster_operator_up == 0 reason is NoAvailableCondition or otherwise. But as far as "when to page an admin?" goes, I don't think that distinction is all that important. Both are bad enough that we want to be paging admins.

I know what cluster_operator_conditions does. I have to dig further to know what cluster_operator_up does.

I don't mind if other cluster-version-operator approvers want to override me, but personally "many metrics consumers are able to intuit the semantics without having to read docs" is a useful property, and worthy of aspiring to, but not something that must be delivered in order for a metric to exist.

Last, please note that in some other request (OSD-8320), we have been asked to use the reason of the Degraded condition in conjunction with cluster_operator_up to inhibit ClusterOperatorDown alert.

That seems like an anti-pattern to me, since Degraded=True is theoretically orthogonal to Available=False. And it's Available=False (or absent) that are driving ClusterOperatorDown. We've had it decoupled from Degraded since #550. That said, my push-back is just that I don't want to re-mix Degradedness into ClusterOperatorDown; I have no problem at all if consumers like OSD decide to join ClusterOperatorDown with Degraded information when deciding if/how to respond.

Copy link

@Nikokolas3270 Nikokolas3270 Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback Trevor.
As my remark is more a matter of preference, I am ok to close this discussion.
My worries were linked to OSD-8320 which is asking to use a different condition for the reason. From what I understand using the reason coming from the Available condition is perfectly fine with the silences defined there:
openshift/configure-alertmanager-operator@db85a6d
(although I wonder if it would be better to not silence the alert and let CAD process it by sending an automated SL to the customer)

reason = condition.Reason
if condition.Status == configv1.ConditionTrue {
isUp = 1
}
}
g := m.clusterOperatorUp.WithLabelValues(op.Name, version, reason)
g.Set(isUp)
ch <- g
for _, condition := range op.Status.Conditions {
if condition.Status != configv1.ConditionFalse && condition.Status != configv1.ConditionTrue {
Expand Down
2 changes: 1 addition & 1 deletion pkg/cvo/metrics_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -193,7 +193,7 @@ func Test_operatorMetrics_Collect(t *testing.T) {
t.Fatalf("Unexpected metrics %s", spew.Sdump(metrics))
}
expectMetric(t, metrics[0], 0, map[string]string{"type": "current", "version": "", "image": "", "from_version": ""})
expectMetric(t, metrics[1], 0, map[string]string{"name": "test", "version": "10.1.5-1"})
expectMetric(t, metrics[1], 0, map[string]string{"name": "test", "version": "10.1.5-1", "reason": "NoAvailableCondition"})
expectMetric(t, metrics[2], 1, map[string]string{"type": ""})
},
},
Expand Down