pkg/monitor/monitorapi/types.go (2 additions, 0 deletions)
@@ -251,6 +251,8 @@ const (
FailedToAuthenticateWithOpenShiftUser IntervalReason = "FailedToAuthenticateWithOpenShiftUser"
FailedContactingAPIReason IntervalReason = "FailedContactingAPI"

UnhealthyReason IntervalReason = "Unhealthy"

UpgradeStartedReason IntervalReason = "UpgradeStarted"
UpgradeVersionReason IntervalReason = "UpgradeVersion"
UpgradeRollbackReason IntervalReason = "UpgradeRollback"

@@ -502,6 +502,9 @@ func NewUniversalPathologicalEventMatchers(kubeConfig *rest.Config, finalInterva
twoNodeEtcdEndpointsMatcher := newTwoNodeEtcdEndpointsConfigMissingEventMatcher(finalIntervals)
registry.AddPathologicalEventMatcherOrDie(twoNodeEtcdEndpointsMatcher)

prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
Contributor:

The function name indicates this is intended to relax the check only during upgrades, but it's added to the universal set here, not the upgrade-specific set below. It should probably be moved down into that function.
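
If it does move, a minimal sketch of what that could look like, assuming the upgrade-specific registry is built by a function along the lines of NewUpgradePathologicalEventMatchers and that it layers on top of the universal set (both are assumptions inferred from the "upgrade specific set below" mentioned here, not confirmed names):

```go
// Sketch only: the function name, return type, and its relationship to the
// universal registry are assumptions based on the comment above.
func NewUpgradePathologicalEventMatchers(kubeConfig *rest.Config, finalIntervals monitorapi.Intervals) *AllowedPathologicalEventRegistry {
	// Assumed: upgrade jobs start from the universal matchers and add
	// upgrade-only exceptions on top.
	registry := NewUniversalPathologicalEventMatchers(kubeConfig, finalIntervals)

	// Register the Prometheus readiness-probe exception only for upgrade jobs.
	prometheusMatcher := newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals)
	registry.AddPathologicalEventMatcherOrDie(prometheusMatcher)

	return registry
}
```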

Member Author:

I saw this fail for what look to be non-upgrade jobs, such as periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance. I should probably rename this to drop the DuringUpgrades part so it isn't misleading.

PLMK if you still think this should be moved to upgrade jobs exclusively.

Contributor:

Is there an explanation for why Prometheus would be getting killed in a non-upgrade job? As far as I understand this, that would be quite unexpected and should still get flagged. Let me know if you have that job run handy, I'm curious.

Member Author:

From [openshift-origin-30372-nightly-4.21-e2e-agent-ha-dualstack-conformance], it looks like a liveness probe failure:

| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |

The Prometheus container in the affected pod seems fine, but other containers in the pod seem to be facing connection issues (possibly due to https://issues.redhat.com/browse/OCPBUGS-32021?).

Contributor:

I doubt that https://issues.redhat.com/browse/OCPBUGS-32021 is involved here: the error logs come from the router which opens a TCP connection and then drops it after a successful connect.

Contributor:

The link to the job above seemingly didn't work, but it looks like it's probably a different symptom and would not hit your matcher as defined. I think it's best to make this an upgrade-specific exception, as the upgrade is the only place we expect this 10-minute delay.

Member Author (@rexagod, Nov 4, 2025):

I see. Unlike https://issues.redhat.com/browse/OCPBUGS-5916, we are indeed able to establish a successful connection after retries (though IIUC the connection heals in the linked ticket as well, since no functional disruption is reported?).

Ah, my bad. I've appended more context to the snippet (PTAL below), and here's the event-filter link. As you can see, the readiness error will register a PathologicalNew error if not explicitly ignored (>20 pings), and AFAICT this will be caught by the matcher.

| Time | Namespace | Source | Pod | Reason | Message |
| --- | --- | --- | --- | --- | --- |
| 22:57:35 (x6) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Liveness probe failed: command timed out |
| 22:57:35 | openshift-monitoring | kubelet | prometheus-k8s-0 | Killing | Container prometheus failed liveness probe, will be restarted |
| 22:57:39 (x7) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe failed: command timed out |
| 23:02:09 (x55) | openshift-monitoring | kubelet | prometheus-k8s-0 | Unhealthy | Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 |

Contributor (@dgoodwin, Nov 4, 2025):

Let's keep this upgrade-specific unless there's a very clear explanation for why this is expected and OK in a non-upgrade job, which AFAICT there is not. Looking at the intervals from your job run, you can see there is mass disruption at the time we lose the readiness probe. Problems occurring during that window are not the kind of thing we want to hide.

This dualstack job has a known bug related to that network disruption.
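
If scoping by job type alone isn't enough, another way to keep the allowance narrow (shown only as an illustration) would be the allowIfWithinIntervals mechanism already used by newCrioReloadedTooOftenEventMatcher later in this file. A rough sketch, assuming the Upgrade* interval reasons adequately bound the upgrade window, which is an assumption about how those intervals are recorded; restrictToUpgradeWindows is a hypothetical helper:

```go
// Sketch: restrict the allowance to upgrade windows, mirroring the
// allowIfWithinIntervals pattern used by newCrioReloadedTooOftenEventMatcher.
// Whether the Upgrade* reasons produce intervals that actually span the
// upgrade window is an assumption here.
func restrictToUpgradeWindows(matcher *SimplePathologicalEventMatcher, finalIntervals monitorapi.Intervals) {
	upgradeWindows := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
		return eventInterval.Message.Reason == monitorapi.UpgradeStartedReason ||
			eventInterval.Message.Reason == monitorapi.UpgradeRollbackReason
	})
	// With no upgrade window present (i.e. a non-upgrade job), never allow the event.
	if len(upgradeWindows) == 0 {
		matcher.neverAllow = true
		return
	}
	matcher.allowIfWithinIntervals = upgradeWindows
}
```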

registry.AddPathologicalEventMatcherOrDie(prometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher)

return registry
}

@@ -1171,3 +1174,62 @@ func newCrioReloadedTooOftenEventMatcher(finalInternals monitorapi.Intervals) Ev
allowIfWithinIntervals: crioReloadedIntervals,
}
}

func newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(finalIntervals monitorapi.Intervals) EventMatcher {
podNamePrefix := "prometheus-k8s"
podNamespace := "openshift-monitoring"
messageHumanizedSubstring := "Readiness probe errored"
messageReason := monitorapi.UnhealthyReason
matcher := &SimplePathologicalEventMatcher{
name: "PrometheusReadinessProbeErrorsDuringUpgrades",
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + podNamespace + `$`),
monitorapi.LocatorPodKey: regexp.MustCompile(`^` + podNamePrefix + `-[01]$`),
},
messageReasonRegex: regexp.MustCompile(`^` + string(messageReason) + `$`),
messageHumanRegex: regexp.MustCompile(messageHumanizedSubstring),
jira: "https://issues.redhat.com/browse/OCPBUGS-62703",
}

// Sanity check in case no `finalIntervals` are provided.
if len(finalIntervals) == 0 {
matcher.neverAllow = true
return matcher
}

/*
05:50:32 openshift-monitoring kubelet prometheus-k8s-1
Unhealthy
Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 58577e7deb7b8ae87b8029b9988fa268613748d0743ce989748f27e52b199ef5 is running failed: container process not found

05:53:52 (x25) openshift-monitoring kubelet prometheus-k8s-0
Unhealthy
Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1

11:44:16 (x56) openshift-monitoring kubelet prometheus-k8s-0
Unhealthy
Readiness probe errored and resulted in unknown state: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1
*/
testIntervals := finalIntervals.Filter(func(eventInterval monitorapi.Interval) bool {
return eventInterval.Locator.Type == monitorapi.LocatorTypePod &&
eventInterval.Locator.Keys[monitorapi.LocatorNamespaceKey] == podNamespace &&
strings.HasPrefix(eventInterval.Locator.Keys[monitorapi.LocatorPodKey], podNamePrefix) &&
eventInterval.Message.Reason == messageReason &&
strings.Contains(eventInterval.Message.HumanMessage, messageHumanizedSubstring)
})
Contributor:

Unless I'm missing something, looping through all the intervals here is not required. It effectively duplicates the matcher logic above; I think all you really need is to return the matcher with matcher.repeatThresholdOverride = 100 set, and the framework should handle the rest as far as I can tell. Checking whether the matcher will match anything before setting its threshold looks unnecessary to me.
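
For reference, a minimal sketch of the simplification being suggested here, using the same field names as the code under review (whether the framework applies the threshold without the pre-scan is per the note above, not verified here):

```go
// Sketch of the suggested simplification: no interval pre-scan, just return
// the matcher with the higher repeat threshold.
func newPrometheusReadinessProbeErrorsDuringUpgradesPathologicalEventMatcher(_ monitorapi.Intervals) EventMatcher {
	return &SimplePathologicalEventMatcher{
		name: "PrometheusReadinessProbeErrorsDuringUpgrades",
		locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{
			monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^openshift-monitoring$`),
			monitorapi.LocatorPodKey:       regexp.MustCompile(`^prometheus-k8s-[01]$`),
		},
		messageReasonRegex: regexp.MustCompile(`^` + string(monitorapi.UnhealthyReason) + `$`),
		messageHumanRegex:  regexp.MustCompile(`Readiness probe errored`),
		jira:               "https://issues.redhat.com/browse/OCPBUGS-62703",
		// Allow up to 100 repeats during the upgrade (see the threshold discussion below).
		repeatThresholdOverride: 100,
	}
}
```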


if len(testIntervals) > 0 {
// Readiness probe errors are expected during upgrades, allow a higher threshold.
Contributor:

Maybe be explicit that readiness probes run during the entire lifecycle of the container (including termination) and that Prometheus may take "some time" to stop (hence the default termination grace period of 600s), which explains the probe errors (e.g., the web service is stopped but the process is still running).

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes
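
A possible expanded wording for that code comment, as a sketch (the 600s figure is taken from the remark above and assumed to be the pod's terminationGracePeriodSeconds):

```go
// Readiness probes keep running for the entire container lifecycle, including
// termination (see the Kubernetes probe docs). Prometheus can take a while to
// shut down (terminationGracePeriodSeconds is 600s here), so during that
// window the web service is already stopped while the process is still
// running, which is why the readiness probe errors out.
```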

// Set the threshold to 100 to allow for a high number of readiness probe errors
Contributor:

600s (termination grace period) / 5s (probe interval) = 120, but I agree that 100 is a good enough value.

// during the upgrade, but not so high that we would miss a real problem.
// For example, this job hit ~60 readiness errors during the upgrade, which is typical:
// https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.20-upgrade-from-stable-4.19-e2e-aws-ovn-upgrade/1977094149035266048
// However, this single-node RT job hit readiness errors 774 times during the upgrade:
// https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-metal-ovn-single-node-rt-upgrade-test/1975691393640697856
matcher.repeatThresholdOverride = 100
} else {
matcher.neverAllow = true
}

return matcher
}