Contributor

@sallyom sallyom commented Nov 24, 2019

@sallyom sallyom force-pushed the ensure-metrics-available branch from 7578d03 to d583fb2 Compare November 26, 2019 17:47
Contributor Author

sallyom commented Nov 26, 2019

@marun thanks for the review. I've updated the PR and it's ready for another look. You can also see this working in the cluster-kube-controller-manager-operator PR.

@sallyom sallyom force-pushed the ensure-metrics-available branch from d583fb2 to 60688ba Compare November 26, 2019 22:40
@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 27, 2019
Contributor

@marun marun left a comment

+1 to the separation of concerns you are pursuing.

What do you think of adding an e2e test that exercises the code? Some of my comments (e.g. exiting polling on recoverable errors) would be challenging to discover from code alone but would become immediately apparent when executed (e.g. failing immediately when the configmap doesn't contain the cabundle). I'd also prefer to have these helpers be maintainable in library-go rather than requiring co-development with one or more operators.

@sallyom sallyom force-pushed the ensure-metrics-available branch 2 times, most recently from e3c1277 to dd5e01f Compare November 27, 2019 14:07

@soltysh soltysh left a comment

@sallyom since these are meant as testing helpers, please move them to pkg/test/library/metrics. I don't want anyone accidentally discovering these and using them for any purpose other than testing.

@sallyom sallyom force-pushed the ensure-metrics-available branch 15 times, most recently from ccef27b to be5ca46 Compare November 28, 2019 01:43
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 6, 2020
}
for _, secret := range secrets.Items {
	if secret.Type != corev1.SecretTypeServiceAccountToken ||
		!strings.HasPrefix(secret.Name, "prometheus-") {
Contributor

Would it make sense to specify the more explicit prometheus-k8s- prefix? Tokens sourced from secrets prefixed with both prometheus-k8s- and prometheus-operator- support querying today, but I'm not sure that will always be true.

Contributor Author

done
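
A minimal sketch of the narrower secret filter discussed in this thread. It assumes a stand-in `secret` type in place of `corev1.Secret` so it runs without Kubernetes dependencies; `findTokenSecret` and the secret names in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// secret is a stand-in for the corev1.Secret fields used by the filter.
type secret struct {
	Name string
	Type string
}

// serviceAccountTokenType mirrors the value of corev1.SecretTypeServiceAccountToken.
const serviceAccountTokenType = "kubernetes.io/service-account-token"

// findTokenSecret returns the first service account token secret whose name
// starts with prefix.
func findTokenSecret(secrets []secret, prefix string) (secret, bool) {
	for _, s := range secrets {
		if s.Type != serviceAccountTokenType || !strings.HasPrefix(s.Name, prefix) {
			continue
		}
		return s, true
	}
	return secret{}, false
}

func main() {
	// Hypothetical secret names, for illustration only.
	secrets := []secret{
		{Name: "prometheus-operator-token-aaaaa", Type: serviceAccountTokenType},
		{Name: "prometheus-k8s-tls", Type: "kubernetes.io/tls"},
		{Name: "prometheus-k8s-token-bbbbb", Type: serviceAccountTokenType},
	}
	s, ok := findTokenSecret(secrets, "prometheus-k8s-")
	fmt.Println(s.Name, ok)
}
```

With the broader `prometheus-` prefix the first match would be the prometheus-operator token; the explicit `prometheus-k8s-` prefix skips it.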

Contributor

marun commented Jan 7, 2020

@sallyom Nice work! I've updated openshift/service-ca-operator#90 to use it and tests are passing locally. Once this PR merges I'll add another helper that checks that metrics are being collected for a given namespace (e.g. checkMetricsCollected(t, config, namespace)).

	return nil, err
}

route, err := rc.RouteV1().Routes("openshift-monitoring").Get("prometheus-k8s", metav1.GetOptions{})
Contributor

Not something to block the merge, since the code works as-is, but I seem to recall @s-urbaniak mentioning that it would be preferable to query thanos-querier. However, my naive attempts to use the thanos-querier route were unsuccessful with tokens sourced from prometheus-k8s- and thanos-querier- prefixed secrets.

Contributor

Indeed, the prometheus-k8s route is still maintained, but effectively deprecated. Thanos Querier is gated by the same oauth-proxy sidecar as prometheus-k8s, hence the same token secrets should apply; otherwise we have a bug on our side.

Contributor

@s-urbaniak No bug. I was requesting targets via the thanos-querier route and it wasn't succeeding, but I realize now that it was a 404 rather than an auth failure, presumably because Thanos doesn't support querying targets. Performing an up query instead works fine.

@sallyom Please update to use the thanos-querier route.
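
For reference, an `up` query of the kind mentioned above goes through the Prometheus HTTP API instant-query endpoint, `/api/v1/query`. The sketch below builds such a request and decodes a response without any client library. The helper names, the host, and the sample response in `main` are assumptions for illustration; the PR itself uses the Prometheus Go client.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// queryResponse covers the parts of an instant-query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// newQueryRequest builds an authenticated instant query against host.
func newQueryRequest(host, token, promQL string) (*http.Request, error) {
	u := url.URL{
		Scheme:   "https",
		Host:     host,
		Path:     "/api/v1/query",
		RawQuery: url.Values{"query": {promQL}}.Encode(),
	}
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

// countSeries returns how many series a successful query response contains.
func countSeries(body []byte) (int, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query status %q", resp.Status)
	}
	return len(resp.Data.Result), nil
}

func main() {
	// Hypothetical host and token; a real test would read them from the cluster.
	req, err := newQueryRequest("thanos-querier.example.test", "TOKEN", "up")
	if err != nil {
		fmt.Println("building request:", err)
		return
	}
	fmt.Println(req.URL.String())

	// Hand-written sample response, not captured from a cluster.
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{"job":"demo"},"value":[1578400000,"1"]}]}}`)
	n, err := countSeries(sample)
	fmt.Println("series:", n, "err:", err)
}
```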

Contributor Author

I hit a snag: the thanos-querier route works for a query, but it does not work when searching for an alert, so I have to use the prometheus-k8s route for alerts (as I'm doing in the kube-controller-manager-operator PR). I added a switch and a parameter to NewPrometheusClient that accepts either 'query' or 'alert'.
@s-urbaniak is that expected? Is the thanos-querier route not meant for alerts? I get a 404 for /alerts.
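
The switch described in this comment might look roughly like the sketch below. `routeNameFor` is a hypothetical helper, not the signature of `NewPrometheusClient` in the PR; it only captures the mapping stated above, with 'query' using the thanos-querier route and 'alert' using the prometheus-k8s route.

```go
package main

import "fmt"

// routeNameFor maps the kind of request to the route in the
// openshift-monitoring namespace that serves it.
func routeNameFor(kind string) (string, error) {
	switch kind {
	case "query":
		return "thanos-querier", nil
	case "alert":
		return "prometheus-k8s", nil
	default:
		return "", fmt.Errorf("unsupported client kind %q", kind)
	}
}

func main() {
	for _, kind := range []string{"query", "alert", "targets"} {
		name, err := routeNameFor(kind)
		fmt.Println(kind, name, err)
	}
}
```

Returning an error for unknown kinds keeps a typo in the parameter from silently selecting a default route.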

Contributor Author

I found that you can pass the entire alert as a query, like so:
ALERTS{alertname="PodDisruptionBudgetAtLimit",alertstate="pending",namespace="pdbns",poddisruptionbudget="pdbname",prometheus="openshift-monitoring/k8s",service="kube-state-metrics",severity="warning"}==1
instead of using the deprecated prometheus-k8s route. From what I can find, you have to pass the full alert, since partial responses return an error with how we have Thanos set up. That is stretching my knowledge of Prometheus/Thanos, so correct me if you can.

@sallyom do you need to pass all of the properties in that query or only those you care about? Seems quite cumbersome at first.

Contributor

@sallyom indeed, thanos querier does not support /alerts as it is "only" responsible for the query part. It doesn't scrape targets either as it is "just" a proxy in front of prometheus, hence /targets doesn't exist here either. Your ALERTS query is fine from my point of view, but passing just ALERTS{alertname="PodDisruptionBudgetAtLimit"} should work too.
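
Since only the labels of interest need to appear in the selector, the `ALERTS` query can be built from a small label map. The sketch below is an illustration with a hypothetical helper name, `alertsQuery`; it sorts the label names so the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alertsQuery builds an ALERTS selector that matches the given labels exactly.
func alertsQuery(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output

	matchers := make([]string, 0, len(keys))
	for _, k := range keys {
		// %q produces a double-quoted, escaped string literal.
		matchers = append(matchers, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return "ALERTS{" + strings.Join(matchers, ",") + "}"
}

func main() {
	fmt.Println(alertsQuery(map[string]string{
		"alertname": "PodDisruptionBudgetAtLimit",
	}))
	fmt.Println(alertsQuery(map[string]string{
		"alertname":  "PodDisruptionBudgetAtLimit",
		"alertstate": "pending",
		"namespace":  "pdbns",
	}))
}
```

The resulting string is passed as an ordinary query, so it works through the thanos-querier route even though `/alerts` does not.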

@sallyom sallyom force-pushed the ensure-metrics-available branch from b7a7638 to af5efab Compare January 7, 2020 15:16
	}
	host = route.Status.Ingress[0].Host
case "alert":
	route, err := rc.RouteV1().Routes("openshift-monitoring").Get("prometheus-k8s", metav1.GetOptions{})
Contributor

@marun marun Jan 7, 2020

@s-urbaniak This code suggests that thanos-querier doesn't support querying for alerts via the Alerts method of the prometheus client. Is there another way for tests to query for alerts similar to how an up query can replace use of the Targets method of the prometheus client? Or is testing of alerts dependent on the deprecated prometheus-k8s route?

Contributor Author

^^ lol I found that out also, see above

Contributor

as mentioned above you can use the ALERTS query to get the desired result. Indeed, the /alerts endpoint is not accessible via Thanos Querier, as it is "just" responsible for querying metrics.

@sallyom sallyom force-pushed the ensure-metrics-available branch 4 times, most recently from 564ed58 to 6ed7cdc Compare January 7, 2020 18:46
Contributor

marun commented Jan 8, 2020

LGTM

Contributor

mfojtik commented Jan 8, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2020
	KeepAlive: 30 * time.Second,
}).DialContext,
TLSHandshakeTimeout: 10 * time.Second,
TLSClientConfig:     &tls.Config{InsecureSkipVerify: true},
Contributor

You cannot send a bearer token to a destination without verifying the server you're speaking to.

Contributor

@marun marun Jan 8, 2020

What potential risk(s) are you concerned about? As the test prefix of this path indicates, this code is intended only to support e2e testing of metrics collection.

Contributor

deads2k commented Jan 8, 2020

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2020
@sallyom sallyom force-pushed the ensure-metrics-available branch from 6ed7cdc to 9b4817f Compare January 9, 2020 19:01
@sallyom sallyom force-pushed the ensure-metrics-available branch from 9b4817f to 23320a9 Compare January 9, 2020 19:03
Contributor Author

sallyom commented Jan 9, 2020

@deads2k ptal, thanks! And thanks for pointing me to the correct router-ca configmap; that did the trick. I was initially using the CA generated by service-ca-operator, not the router-ca.
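
A minimal sketch of the general approach that replaces `InsecureSkipVerify: true`: trust a CA bundle explicitly so the server is verified before the bearer token is sent. It assumes the PEM bytes have already been read from the router-ca configmap mentioned above; that loading step is omitted, and `transportForCABundle` is a hypothetical name rather than the function in the PR.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// transportForCABundle returns an HTTP transport that only trusts servers
// whose certificates chain to a certificate in caPEM.
func transportForCABundle(caPEM []byte) (*http.Transport, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &http.Transport{
		TLSHandshakeTimeout: 10 * time.Second,
		// RootCAs replaces the system trust store for this transport;
		// InsecureSkipVerify stays at its default of false.
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}, nil
}

func main() {
	// An empty or malformed bundle is rejected up front rather than
	// falling back to an unverified connection.
	_, err := transportForCABundle([]byte("not a PEM bundle"))
	fmt.Println("err:", err)
}
```

Failing when the bundle contains no certificates matters here: a nil `RootCAs` would otherwise make the client fall back to the host's default trust store.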

Contributor

deads2k commented Jan 13, 2020

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, mfojtik, sallyom, soltysh

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 13, 2020
Contributor

mfojtik commented Jan 13, 2020

/lgtm

@stevekuznetsov stevekuznetsov added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2020
@openshift-merge-robot openshift-merge-robot merged commit f2ca9aa into openshift:master Jan 13, 2020
bertinatto pushed a commit to bertinatto/library-go that referenced this pull request Jul 2, 2020
add helper functions for running prometheus query