OCPBUGS-33896: status/inspect-alerts: handle non-200 by Thanos
#1782
@petr-muller: This pull request references Jira Issue OCPBUGS-33896, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
    	}

    -	return body, err
    +	return body, resp.StatusCode, err
Can we put the != http.StatusOK check here, and return an error in the case where we failed? Having some details on why would also be interesting, but dumping HTML to the terminal might be too chatty. I wonder if we could set Accept to say "we want JSON back, but if you can't do that, text/plain" to get something more terminal-compatible? Or maybe Thanos isn't that media-type aware? Or maybe it's the router that's giving us an error (like "none of the Pods backing this Service are Ready=True"), and not Thanos?
I did it this way because getWithBearer sounded like a low-level HTTP method which I would not expect to have opinions on status codes (maybe something would want to call it and handle retries or something).
Router's "service is currently down" page is what I assume was happening.
I'm not entirely convinced about high-level vs. low-level, but the 8fd03af logging gives folks calling with -v8 or higher access to the body, and that's the main thing I was hoping for. And getWithBearer is internal, so we can shift things around later if we change our minds. Feel free to mark this thread resolved :)
I would second putting the check here. No reason really to extend the interface, but it's really a nit, feel free to ignore.
As for Thanos responses, it implements the Prometheus query API. The documentation there says "The API response format is JSON".
I have put the check into getWithBearer as you advise, but while doing so I noticed that the iteration over ingress hosts from a route did not make much sense the way it was done: the iteration aborted on failure and continued on success, so it actually required all options to succeed and then returned the last one. I guess this would not normally be an issue because there would be just a single option, but I still changed the method to search for the first success, and to return an error only if all options failed.
Yeah good catch 👍
5a3bf95 to 67de2a6
f827f29 to 8fd03af
wking
left a comment
/lgtm
    	if err != nil {
    		return alertBytes, err
    	}
    	glogBody("Response Body", alertBytes)
Should this log the response even for a StatusOK code?
I think it is useful - it is logged only on high verbosity levels so it should not pollute any outputs unless you ask for a lot of detail.
/lgtm
8fd03af to 718e672
Previously, the method iterated over URIs from a route, but instead of searching for a success it actually searched until the first failure, which is against the point of iterating over possible URIs in the first place. Refactor the method so that it does not return on error, and returns on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name notation.
718e672 to 51487f9
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jan--f, petr-muller, wking
@petr-muller: Jira Issue OCPBUGS-33896: Some pull requests linked via external trackers have merged:
The following pull requests linked via external trackers have not merged:
These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with
Jira Issue OCPBUGS-33896 has not been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER] This PR has been included in build openshift-enterprise-cli-container-v4.17.0-202405300213.p0.gd5b5b3a.assembly.stream.el9 for distgit openshift-enterprise-cli.
We have seen instances where Thanos responses contained HTML, apparently because it was briefly down. We can be slightly more robust to that by detecting a non-OK status code right away, instead of passing the content body out.
Also improve debugging by allowing HTTP call details and the response body to be emitted on higher verbosity settings, consistent with client-go calls to apiserver.
Additionally, improve the error handling by refactoring the getWithBearer method. Previously, it iterated over URIs from a route, but instead of searching for a success it actually searched until the first failure, which is against the point of iterating over possible URIs in the first place. Refactor the method so that it does not immediately return on error, and returns on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name notation.