Skip to content

Conversation

@petr-muller
Copy link
Member

@petr-muller petr-muller commented May 23, 2024

We have seen instances where Thanos responses contained html, apparently by because it is briefly down. We can be slightly more robust to that by detecting non-OK status code right away, instead of passing content body out:

Ignoring alerts in 'Update Health'. Error unmarshaling alerts: %w invalid character '<' looking for beginning of value

Also improve debugging by allowing to emit http call details and response body on higher verbosity settings, consistent with client-go calls to apiserver.

Additionally, improve the error handling by refactoring the getWithBearer method, Previously, it iterated over URIs from a route but instead of searching for success it actually searched until first failure, which is against the point of iterating over possible URIs in the first place. Refactor the method so that it does not immediately return on error, and return on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name notation.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 23, 2024
@openshift-ci-robot
Copy link

@petr-muller: This pull request references Jira Issue OCPBUGS-33896, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @evakhoni

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

We have seen instances where Thanos responses contained html, apparently by because it is briefly down. We can be slightly more robust to that by detecting non-OK status code right away, instead of passing content body out.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from evakhoni May 23, 2024 22:05
@openshift-ci-robot
Copy link

@petr-muller: This pull request references Jira Issue OCPBUGS-33896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @evakhoni

Details

In response to this:

We have seen instances where Thanos responses contained html, apparently by because it is briefly down. We can be slightly more robust to that by detecting non-OK status code right away, instead of passing content body out:

Ignoring alerts in 'Update Health'. Error unmarshaling alerts: %w invalid character '<' looking for beginning of value

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jan--f and rexagod May 23, 2024 22:06
}

return body, err
return body, resp.StatusCode, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the != http.StatusOK check here, and return an error in the case where we failed? Having some details on why would also be interesting, but dumping HTML to the terminal might be too chatty. I wonder if we could set Accept to say "we want JSON back, but if you can't do that, text/plain" to get something more terminal-compatible? Or maybe Thanos isn't that media-type aware? Or maybe it's the router that's giving us an error (like "none of the Pods backing this Service are Ready=True", and not Thanos?)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because getWithBearer sounded like a low-level HTTP method which I would not expect to have opinions on status codes (maybe something would want to call it and handle retries or something).

Router's "service is currently down" page is what I assume was happening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely convinced about high-level vs. low-level, but the 8fd03af logging gives folks calling with -v8 or higher access to the body, and that's the main think I was hoping for. And getWithBearer is internal, so we can shift things around later if we change our minds. Feel free to mark this thread resolved :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would second putting the check here. No reason really to extend the interface, but its really a nit, feel free to ignore.

As for thanos responses, it implements the Prometheus query api. There it says The API response format is JSON

Copy link
Member Author

@petr-muller petr-muller May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put the check to getWithBearer as you advise, but while I did it, I noticed that the iteration over ingress hosts from a route does not make much sense the way it was done - the iteration aborted on failure and continued on success, so it actually required all options to succeed and then returned the last one. I guess this would not be an issue normally because there would be just a single option, but still changed the method to search for a first success, and only return with an error if all options failed with an error.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good catch 👍

@petr-muller petr-muller force-pushed the ocpbugs-33896-alerts-robustify-gets branch from 5a3bf95 to 67de2a6 Compare May 23, 2024 23:17
@petr-muller petr-muller force-pushed the ocpbugs-33896-alerts-robustify-gets branch 2 times, most recently from f827f29 to 8fd03af Compare May 23, 2024 23:42
Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2024
if err != nil {
return alertBytes, err
}
glogBody("Response Body", alertBytes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this log the response even for a StatusOK code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is useful - it is logged only on high verbosity levels so it should not pollute any outputs unless you ask for a lot of detail.

@jan--f
Copy link

jan--f commented May 29, 2024

/lgtm
/hold
just adding a hold in case you want to address any nits. Feel free to unhold any time.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 29, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2024
@petr-muller petr-muller force-pushed the ocpbugs-33896-alerts-robustify-gets branch from 8fd03af to 718e672 Compare May 29, 2024 14:19
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 29, 2024
Previously, the method iterated over URIs from a route but instead of searching for success it actually searched until first failures, which is against the point of iterating over possible URIs in the first place.

Refactor the method so that it does not return on error, and return on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name nnotation.
@petr-muller petr-muller force-pushed the ocpbugs-33896-alerts-robustify-gets branch from 718e672 to 51487f9 Compare May 29, 2024 14:22
@jan--f
Copy link

jan--f commented May 29, 2024

/lgtm
/unhold

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 29, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit d5b5b3a into openshift:master May 29, 2024
@openshift-ci-robot
Copy link

@petr-muller: Jira Issue OCPBUGS-33896: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-33896 has not been moved to the MODIFIED state.

Details

In response to this:

We have seen instances where Thanos responses contained html, apparently by because it is briefly down. We can be slightly more robust to that by detecting non-OK status code right away, instead of passing content body out:

Ignoring alerts in 'Update Health'. Error unmarshaling alerts: %w invalid character '<' looking for beginning of value

Also improve debugging by allowing to emit http call details and response body on higher verbosity settings, consistent with client-go calls to apiserver.

Additionally, improve the error handling by refactoring the getWithBearer method, Previously, it iterated over URIs from a route but instead of searching for success it actually searched until first failure, which is against the point of iterating over possible URIs in the first place. Refactor the method so that it does not immediately return on error, and return on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name notation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build openshift-enterprise-cli-container-v4.17.0-202405300213.p0.gd5b5b3a.assembly.stream.el9 for distgit openshift-enterprise-cli.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants