Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,11 @@ spec:
annotations:
message: Cluster operator {{ "{{ $labels.name }}" }} has been degraded for 10 minutes. Operator is degraded because {{ "{{ $labels.reason }}" }} and cluster upgrades will be unstable.
expr: |
cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} == 1
(
cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"}
or on (name)
group by (name) (cluster_operator_up{job="cluster-version-operator"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a hard time reading this:

"if operator is degraded OR is up == 1 for ten minutes report degraded?"

Shouldn't that be

" if operator is degraded == 1 OR up == 0 for ten minutes report degraded"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the OR portion is doing, so I'm not saying it's incorrect, but according to PR description ClusterOperatorDegraded should only fire when Degraded true or unset. And if up == 0 only ClusterOperatorDown should fire which is handled above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a bunch of ClusterOperators. Grouping into sets:

a. Don't set Available, or set Available!=True.
b. Set Available=True.
c. Don't set Degraded or set it to an unrecognized value.
d. Set Degraded=True.
e. Set Degraded=False.

a∪b is all operators. c∪d∪e is all operators.

cluster_operator_up{job="cluster-version-operator"} will be zero for (a) and one for (b). ClusterOperatorDown will fire for (a) but not for (b). Great.

cluster_operator_conditions{job="cluster-version-operator", condition="Degraded"} will be one for (d) and zero for (e), but misses (c). group by collapses the cluster_operator_up value to one, regardless of (a) vs. (b). So when those (c) clusters fall through to the or, they hit the group by one, and come out of the (A or B) block with one just like (d). So ClusterOperatorDegraded will fire for (c) and (d) but not for (e). Great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's right. i hate group by.

) == 1
for: 10m
labels:
severity: critical
Expand Down
4 changes: 1 addition & 3 deletions pkg/cvo/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -358,9 +358,7 @@ func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
klog.V(4).Infof("ClusterOperator %s is not setting the 'operator' version", op.Name)
}
g := m.clusterOperatorUp.WithLabelValues(op.Name, version)
failing := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorDegraded)
available := resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable)
if available && !failing {
if resourcemerge.IsOperatorStatusConditionTrue(op.Status.Conditions, configv1.OperatorAvailable) {
g.Set(1)
} else {
g.Set(0)
Expand Down
64 changes: 62 additions & 2 deletions pkg/cvo/metrics_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -170,7 +170,67 @@ func Test_operatorMetrics_Collect(t *testing.T) {
},
},
{
name: "collects cluster operator status failure",
name: "collects cluster operator without conditions",
optr: &Operator{
coLister: &coLister{
Items: []*configv1.ClusterOperator{
{
ObjectMeta: metav1.ObjectMeta{
Name: "test",
},
Status: configv1.ClusterOperatorStatus{
Versions: []configv1.OperandVersion{
{Name: "operator", Version: "10.1.5-1"},
{Name: "operand", Version: "10.1.5-2"},
},
},
},
},
},
},
wants: func(t *testing.T, metrics []prometheus.Metric) {
if len(metrics) != 3 {
t.Fatalf("Unexpected metrics %s", spew.Sdump(metrics))
}
expectMetric(t, metrics[0], 0, map[string]string{"type": "current", "version": "", "image": "", "from_version": ""})
expectMetric(t, metrics[1], 0, map[string]string{"name": "test", "version": "10.1.5-1"})
expectMetric(t, metrics[2], 1, map[string]string{"type": ""})
},
},
{
name: "collects cluster operator unavailable",
optr: &Operator{
coLister: &coLister{
Items: []*configv1.ClusterOperator{
{
ObjectMeta: metav1.ObjectMeta{
Name: "test",
},
Status: configv1.ClusterOperatorStatus{
Versions: []configv1.OperandVersion{
{Name: "operator", Version: "10.1.5-1"},
{Name: "operand", Version: "10.1.5-2"},
},
Conditions: []configv1.ClusterOperatorStatusCondition{
{Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse},
},
},
},
},
},
},
wants: func(t *testing.T, metrics []prometheus.Metric) {
if len(metrics) != 4 {
t.Fatalf("Unexpected metrics %s", spew.Sdump(metrics))
}
expectMetric(t, metrics[0], 0, map[string]string{"type": "current", "version": "", "image": "", "from_version": ""})
expectMetric(t, metrics[1], 0, map[string]string{"name": "test", "version": "10.1.5-1"})
expectMetric(t, metrics[2], 0, map[string]string{"name": "test", "condition": "Available"})
expectMetric(t, metrics[3], 1, map[string]string{"type": ""})
},
},
{
name: "collects cluster operator degraded",
optr: &Operator{
coLister: &coLister{
Items: []*configv1.ClusterOperator{
Expand All @@ -197,7 +257,7 @@ func Test_operatorMetrics_Collect(t *testing.T) {
t.Fatalf("Unexpected metrics %s", spew.Sdump(metrics))
}
expectMetric(t, metrics[0], 0, map[string]string{"type": "current", "version": "", "image": "", "from_version": ""})
expectMetric(t, metrics[1], 0, map[string]string{"name": "test", "version": "10.1.5-1"})
expectMetric(t, metrics[1], 1, map[string]string{"name": "test", "version": "10.1.5-1"})
expectMetric(t, metrics[2], 1, map[string]string{"name": "test", "condition": "Available"})
expectMetric(t, metrics[3], 1, map[string]string{"name": "test", "condition": "Degraded"})
expectMetric(t, metrics[4], 1, map[string]string{"type": ""})
Expand Down