2 changes: 2 additions & 0 deletions pkg/monitor/monitorapi/types.go
@@ -251,6 +251,8 @@ const (
FailedToAuthenticateWithOpenShiftUser IntervalReason = "FailedToAuthenticateWithOpenShiftUser"
FailedContactingAPIReason IntervalReason = "FailedContactingAPI"

UnhealthyReason IntervalReason = "Unhealthy"

UpgradeStartedReason IntervalReason = "UpgradeStarted"
UpgradeVersionReason IntervalReason = "UpgradeVersion"
UpgradeRollbackReason IntervalReason = "UpgradeRollback"
@@ -502,6 +502,9 @@ func NewUniversalPathologicalEventMatchers(kubeConfig *rest.Config, finalInterva
twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)

prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)

Contributor:
The function name indicates this is intended to relax matching only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. It should probably be moved down into that function.
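
A minimal sketch of the move being suggested, assuming the upgrade-specific constructor is the NewUpgradePathologicalEventMatchers seen in the tests below and that it builds on the universal registry; everything except the two registration lines is assumed, not taken from this diff:

```go
// Hypothetical placement sketch, not the PR's actual change: the surrounding
// function body and return type are assumed; only the two registration lines
// mirror the diff above.
func NewUpgradePathologicalEventMatchers(kubeConfig *rest.Config, finalIntervals monitorapi.Intervals) *AllowedPathologicalEventRegistry {
	// Assumed: the upgrade registry starts from the universal matchers.
	registry := NewUniversalPathologicalEventMatchers(kubeConfig, finalIntervals)

	// ... other upgrade-only matchers ...

	// Moved here from NewUniversalPathologicalEventMatchers so the relaxation
	// applies only to upgrade jobs.
	prometheusReadinessMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
	registry.AddPathologicalEventMatcherOrDie(prometheusReadinessMatcher)

	return registry
}
```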

Member Author:
I saw this fail in what look to be non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the DuringUpgrades part so it isn't misleading.

PLMK if you still think this should be moved to upgrade jobs exclusively.

Contributor:
Is there an explanation for why prom would be getting killed in a non-upgrade job? As far as I understand this, that would be quite unexpected and would still get flagged. Let me know if you have that job run handy; I'm curious.

Member Author:
From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it looks like a liveness probe failure:

Time | Namespace | Source | Pod | Reason | Message
-- | -- | -- | -- | -- | --
22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out
22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted

The Prometheus container in the affected pod seems fine, but the other containers in the pod appear to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021?).

Contributor:
I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router which opens a TCP connection and then drops it after a successful connect.

Contributor:
The link to the job above seemingly didn't work, but it looks like a different symptom that would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as that's the only place we expect this 10-minute delay.

Member Author (@rexagod, Nov 4, 2025):
I see; unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, as there's no functionality disruption reported?).

Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register a PathologicalNew error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.

Time | Namespace | Source | Pod | Reason | Message
-- | -- | -- | -- | -- | --
22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out
22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted
22:57:39 (x7) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe failed: command timed out
23:02:09 (x55) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1

Contributor (@dgoodwin, Nov 4, 2025):
Let's keep this upgrade-specific unless there's a very clear explanation for why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that window are not the kind of thing we want to hide.

This dualstack job has a known bug related to that network disruption.

registry.AddPathologicalEventMatcherOrDie(prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher)

return registry
}

@@ -1171,3 +1174,66 @@ func newCrioReloadedTooOftenEventMatcher(finalInternals monitorapi.Intervals) Ev
allowIfWithinIntervals: crioReloadedIntervals,
}
}

func newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals monitorapi.Intervals) EventMatcher {
podNamePrefix := "prometheus-k8s"
podNamespace := "openshift-monitoring"
messageHumanizedSubstring := "Readiness probe errored"
messageReason := monitorapi.UnhealthyReason
matcher := &SimplePathologicalEventMatcher{
name: "PrometheusReadinessProbeErrorsDuringUpgrades",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + podNamespace + `$`),
monitorapi.LocatorPodKey: regexp.MustCompile(`^` + podNamePrefix + `-[01]$`),
},
messageReasonRegex: regexp.MustCompile(`^` + string(messageReason) + `$`),
messageHumanRegex: regexp.MustCompile(messageHumanizedSubstring),
jira: "https://issues.redhat.com/browse/OCPBUGS-62703",
}

// Sanity check in case no `finalIntervals` are provided.
if len(finalIntervals) == 0 {
matcher.neverAllow = true
return matcher
}

/*
05:50:32 openshift-monitoring kubelet prometheus-k8s-1
Unhealthy
Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 58577e7deb7b8ae87b8029b9988fa268613748d0743ce989748f27e52b199ef5 is running failed: container process not found

05:53:52 (x25) openshift-monitoring kubelet prometheus-k8s-0
Unhealthy
Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1

11:44:16 (x56) openshift-monitoring kubelet prometheus-k8s-0
Unhealthy
Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1
*/
testIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
return eventInterval.Locator.Type == monitorapi.LocatorTypePod &&
eventInterval.Locator.Keys[monitorapi.LocatorNamespaceKey] == podNamespace &&
strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
eventInterval.Message.Reason == messageReason &&
strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
})

Contributor:
Unless I'm missing something, looping through all the intervals isn't required here. It effectively duplicates the matcher logic above; I think all you really need is to return the matcher with matcher.repeatThresholdOverride = 100 set, and the framework should handle the rest as far as I can tell. Checking whether the matcher will match anything before setting its threshold looks unnecessary to me.
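
A minimal sketch of the simplification being suggested, assuming the framework applies repeatThresholdOverride only to events the matcher actually matches; the function name (with DuringUpgrades dropped, per the earlier discussion) and the omission of the finalIntervals parameter are illustrative:

```go
// Sketch of the suggested simplification: no interval pre-scan, just the
// matcher with a raised repeat threshold. Field names mirror the matcher
// constructed above in this diff.
func newPrometheusReadinessProbeErrorsPathologicalEventMatcher() EventMatcher {
	return &SimplePathologicalEventMatcher{
		name: "PrometheusReadinessProbeErrors",
		locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
			monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^openshift-monitoring$`),
			monitorapi.LocatorPodKey:       regexp.MustCompile(`^prometheus-k8s-[01]$`),
		},
		messageReasonRegex: regexp.MustCompile(`^` + string(monitorapi.UnhealthyReason) + `$`),
		messageHumanRegex:  regexp.MustCompile(`Readiness probe errored`),
		jira:               "https://issues.redhat.com/browse/OCPBUGS-62703",
		// Roughly the 600s termination grace period / 5s probe interval.
		repeatThresholdOverride: 100,
	}
}
```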


if len(testIntervals) > 0 {
/*
Readiness probes run throughout the container lifecycle, including termination.
Prometheus pods may take some time to stop, and can therefore generate more kubelet probe events than the default threshold (20) permits.
With a termination grace period of 600s, these pods may produce probe errors (e.g. the web service has stopped but the process is still running), which is expected during upgrades.

To address this, set the threshold to 100 (approximately the 600s termination grace period divided by the 5s probe interval): high enough to allow a large number of readiness probe errors during the upgrade, but not so high that we would miss a real problem.
The job below hit ~60 readiness errors during the upgrade, which makes sense to ignore:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade/1977094149035266048
However, the job below hit readiness errors 774 times during the upgrade, which should be caught:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade-test/1975691393640697856
*/
matcher.repeatThresholdOverride = 100
} else {
matcher.neverAllow = true
}

return matcher
}
@@ -2,6 +2,7 @@ package pathologicaleventlibrary

import (
_ "embed"
"fmt"
"testing"
"time"

@@ -666,3 +667,117 @@ func TestPathologicalEventsTopologyAwareHintsDisabled(t *testing.T) {
})
}
}

func TestPathologicalEventsPrometheusReadinessProbeErrorsDuringUpgrades(t *testing.T) {
const namespace = "openshift-monitoring"

unhealthyReasonPathologicalMessageWithHumanMessage := func(humanMessage string, repetitionCount int) monitorapi.Message {
return monitorapi.Message{
Reason: monitorapi.UnhealthyReason,
HumanMessage: humanMessage,
Annotations: map[monitorapi.AnnotationKey]string{
monitorapi.AnnotationCount: fmt.Sprintf("%d", repetitionCount),
monitorapi.AnnotationPathological: "true",
},
}
}

openshiftMonitoringNsLocatorWithPodKey := func(pod string) monitorapi.Locator {
return monitorapi.Locator{
Type: monitorapi.LocatorTypePod,
Keys: map[monitorapi.LocatorKey]string{
monitorapi.LocatorNamespaceKey: namespace,
monitorapi.LocatorPodKey: pod,
},
}
}

tests := []struct {
name string
intervals []monitorapi.Interval
expectedMessage string
}{
{
name: "Readiness probe error (stopping container) on first Prometheus pod",
intervals: []monitorapi.Interval{
{
Condition: monitorapi.Condition{
Locator: openshiftMonitoringNsLocatorWithPodKey("prometheus-k8s-0"),
Message: unhealthyReasonPathologicalMessageWithHumanMessage("Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1", 100),
},
},
},
},
{
name: "Readiness probe error (terminated container) on second Prometheus pod",
intervals: []monitorapi.Interval{
{
Condition: monitorapi.Condition{
Locator: openshiftMonitoringNsLocatorWithPodKey("prometheus-k8s-1"),
Message: unhealthyReasonPathologicalMessageWithHumanMessage("Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 58577e7deb7b8ae87b8029b9988fa268613748d0743ce989748f27e52b199ef5 is running failed: container process not found", 100),
},
},
},
},
{
name: "Readiness probe error (stopping container, different human message) on second Prometheus pod",
intervals: []monitorapi.Interval{
{
Condition: monitorapi.Condition{
Locator: openshiftMonitoringNsLocatorWithPodKey("prometheus-k8s-1"),
Message: unhealthyReasonPathologicalMessageWithHumanMessage("Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1", 100),
},
},
},
},
{
name: "Readiness probe error (stopping container, different human message) on non-existent Prometheus pod should not be ignored",
intervals: []monitorapi.Interval{
{
Condition: monitorapi.Condition{
Locator: openshiftMonitoringNsLocatorWithPodKey("prometheus-k8s-2"),
Message: unhealthyReasonPathologicalMessageWithHumanMessage("Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1", 100),
},
},
},
expectedMessage: "1 events happened too frequently\n\nevent happened 100 times, something is wrong: namespace/openshift-monitoring pod/prometheus-k8s-2 - reason/Unhealthy Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (00:00:00Z) result=reject ",
},
{
name: "Readiness probe error (stopping container, different human message) on a Prometheus pod should not be ignored above the acceptable limit",
intervals: []monitorapi.Interval{
{
Condition: monitorapi.Condition{
Locator: openshiftMonitoringNsLocatorWithPodKey("prometheus-k8s-1"),
Message: unhealthyReasonPathologicalMessageWithHumanMessage("Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1", 101),
},
},
},
expectedMessage: "1 events happened too frequently\n\nevent happened 101 times, something is wrong: namespace/openshift-monitoring pod/prometheus-k8s-1 - reason/Unhealthy Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (00:00:00Z) result=reject ",
},
}

for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
events := monitorapi.Intervals(test.intervals)
evaluator := duplicateEventsEvaluator{
registry: NewUpgradePathologicalEventMatchers(nil, events),
}

testName := "events should not repeat"
junits := evaluator.testDuplicatedEvents(testName, false, events, nil, false)
jUnitName := getJUnitName(testName, namespace)
for _, junit := range junits {
if junit.Name == jUnitName {
if test.expectedMessage != "" {
require.NotNil(t, junit.FailureOutput, "expected junit to have failure output")
require.Equal(t, test.expectedMessage, junit.FailureOutput.Output)
} else {
require.Nil(t, junit.FailureOutput, "expected success but got failure output for junit: %s", junit.Name)
}

break
}
}
})
}
}