-
Notifications
You must be signed in to change notification settings - Fork 425
OCPBUGS-33896: upgrade status: polish alert insights
#1787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-33896: upgrade status: polish alert insights
#1787
Conversation
petr-muller
commented
May 27, 2024
- skip alerts without required labels
- add context on why we show the insight (started firing during update or is known to affect updates)
- skip alerts with info level
- explicitly mention the alert does not have a runbook
|
@petr-muller: This pull request references Jira Issue OCPBUGS-33896, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
4a964fd to
937c941
Compare
937c941 to
5e4f462
Compare
|
/retest |
|
/test e2e-aws-ovn-serial |
| var description string | ||
| switch { | ||
| case startedAt.After(alert.ActiveAt) && !allowedAlerts.Contains(alertName): | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: possibly shift this to an else: continue at the end, so we don't have to wonder if we've missed a case and left an empty description without a continue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this a bit, introduced named conditions which hopefully help with readability
| if alert.Annotations.Description == "" { | ||
| description += " The alert has no description." | ||
| } else { | ||
| description += " The alert description is: " + alert.Annotations.Description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the deep past (e.g. openshift/cluster-version-operator#547), alerts used message instead of the split summary/description. No worries if you don't want to address that in this pull, and checking the alert fixture data, maybe there are no relevant alerts that still do things the old way, in which case no need to handle it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! 4.16.0-rc.2 payload shows we still have a few manifests that use message so I added handling this in the code. Some of our alerts use all fields (message, description, summary) so I try to include them all, when present. Also added fixtures to exercise this case.
$ rg 'message: ' *prom*yaml
0000_90_cluster-authentication-operator_03_prometheusrule.yaml
17: message: >-
0000_90_ingress-operator_03_prometheusrules.yaml
25: message: "HAProxy reloads are failing on {{ $labels.pod }}. Router is not respecting recently created or modified routes"
34: message: "HAProxy metrics are reporting that HAProxy is down on pod {{ $labels.namespace }} / {{ $labels.pod }}"
43: message: |
54: message: |
78: message: "Ingress {{ $labels.namespace }}/{{ $labels.name }} is missing the IngressClassName for 1 day."
87: message: "Route {{ $labels.namespace }}/{{ $labels.name }} is owned by an unmanaged Ingress."
0000_50_cluster-storage-operator_12_prometheusrules.yaml
28: message: "StorageClass count check is failing (there should not be more than one default StorageClass)"
55: Events of the Pods should contain exact error message: "oc describe pod -n <pod namespace> <pod name>".
7dc7a60 to
ba9dd40
Compare
- skip alerts without required labels - add context on why we show the insight (started firing during update or is known to affect updates) - skip alerts with info level - explicitly mention the alert does not have a runbook - fix `shortDuration` for more cases, including `now`, add tests - handle also `message` annotation on alerts
ba9dd40 to
7515161
Compare
wking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: petr-muller, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@petr-muller: Jira Issue OCPBUGS-33896: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-33896 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-cli-container-v4.17.0-202406010114.p0.g0bea059.assembly.stream.el9 for distgit openshift-enterprise-cli. |