Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool #2802
Conversation
Skipping CI for Draft Pull Request.
I see a context.TODO() further down in here too. Seems like you might want to pivot to using cmd.Context() where you need a Context, and cmd.Context().Done() when you need a stopping <-chan struct{}. And then something higher up the stack (or here?) to trigger these components when it's time to gracefully step down? Or maybe something more detailed? For example, the cluster-version operator sets up a few separate contexts and a TERM catcher so we can:
- Catch the TERM and cancel `runContext`.
- Collect all the goroutines that had been using `runContext`, to ensure we no longer have any goroutines managing in-cluster operands.
- Cancel `postMainContext`, which was used for informers and the leader lease.
- Cancel `shutdownContext`, which lets callers know when they can give up on a graceful child-goroutine reap and stop blocking on it.
Anyhow, no need to jump straight to something like that, but adding additional, decoupled shutdown channels like you're doing here seems like a step away from a unified end-goal.
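For illustration, a minimal sketch of the `cmd.Context()` approach, assuming a cobra command; the component and function names here are placeholders, not the MCO's actual wiring:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"github.com/spf13/cobra"
)

// startMetricsListener stands in for any component that wants a stop channel;
// it only illustrates consuming the shared Done() channel.
func startMetricsListener(stopCh <-chan struct{}) {}

func runStartCmd(cmd *cobra.Command, args []string) error {
	// Derive a context that is cancelled on SIGTERM/SIGINT so every component
	// sees the same shutdown signal instead of its own ad-hoc channel.
	runCtx, stop := signal.NotifyContext(cmd.Context(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	// Components that need a Context take runCtx directly; components that
	// expect a <-chan struct{} take runCtx.Done().
	go startMetricsListener(runCtx.Done())

	<-runCtx.Done()
	return nil
}

func main() {
	cmd := &cobra.Command{Use: "example", RunE: runStartCmd}
	_ = cmd.ExecuteContext(context.Background())
}
```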
I'm pretty sure you want summary and description, but not message. For example, see openshift/cluster-version-operator#547, which was successfully verified by QE, and these monitoring docs, which call for summary and description, but do not mention the older message.
Firing warning alerts immediately seems like it's wound a bit too tightly. We can go a while before the lack of certificate rotation becomes a problem (significant fractions of a year?). So setting a large-ish `for` here seems useful, to avoid desensitizing folks by alerting every time they briefly pause. Of course, you could have someone leaving pools paused for the bulk of that time, then very briefly unpause, have Prom happen to scrape, and then pause again before the pool had time to rotate out many certs. Which is one reason "are we paused now, and have we been paused for a bit?" is only a rough proxy for "are we at risk of certs expiring?". And that makes picking an appropriate `for` difficult. But if we stick with the paused-pool angle on this, I'd expect at least `for: 60m` on this alert, based on our consistency guidelines.
Also, if the alerting config allows changing severity, I think it should change to critical after a certain number of warning alerts have fired. This is to emphasize that unpausing the pool is necessary to keep the cluster healthy.
you will probably want some aggregation here to avoid twitching off and on if/when the machine-config controller container rolls. Because that will give you a small gap where `mcc_important_config_paused` is not being exported, and the new container may come up with different injected scrape labels, and both of those may shuffle the time series matching this expression, leading to new alerts (after cooking through the alert's `for`). I addressed the scrape-labels portion for a CVO-side alert in openshift/cluster-version-operator#643, but missed the metrics-downtime portion. Something like:

```
sum by (description, path, pool) (max_over_time(mcc_important_config_paused[5m])) > 0
```

should address both issues, but while the `max_over_time(...[5m])` smearing protects you from alert-churning leader blips, it also makes the alert less responsive to the metric actually getting cleared. So something like "`mcc_important_config_paused` looks bad, or some signal that the machine-config controller is currently not exporting metrics" would be better.
pkg/controller/common/metrics.go
I think about the machine-config controller often enough to recognize the `mcc` acronym. But many folks browsing available metrics on their cluster will not. Can we write this out as `machine_config_controller_content_paused` or some such?
pkg/controller/common/metrics.go
Is there a benefit to the desc abbreviation over a full description?
You can also avoid landing the description in the metric itself by using path-to-description lookup logic in the alert's description. I'm not familiar enough with Prom or the alert stack to know if that's worth the trouble.
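To make both suggestions concrete, here is a rough sketch of what the metric definition could look like; the metric name, help text, and label set are illustrative assumptions (the labels are taken from the alert expression above), not the final code in this PR:

```go
import "github.com/prometheus/client_golang/prometheus"

// Spelled-out metric name instead of the mcc acronym, with the description
// carried as a full label rather than an abbreviated desc.
var machineConfigControllerContentPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_config_controller_content_paused",
		Help: "Set to 1 when updated content (for example a new kubelet CA) cannot be applied because the MachineConfigPool is paused.",
	},
	[]string{"path", "pool", "description"},
)

func registerMCCMetrics() error {
	// Returning the error leaves it to the caller to decide whether a failed
	// registration is fatal.
	return prometheus.Register(machineConfigControllerContentPaused)
}
```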
Also, I think we are not yet checking whether the pool is paused or not. Ideally we should be performing the file checks and alerting when a pool is paused.
Force-pushed 02d36ac to 8fe1bcf.
Force-pushed f85abb8 to 0c6ea9e.
Ok I think I've got this updated with the suggested feedback and ready for review again.
test/helpers/utils.go
Something to be mindful about is if you run this in an SNO context, this function will prematurely error because the control plane may go offline.
To work around that, you'd need to do something like: https://github.com/openshift/machine-config-operator/blob/master/test/e2e-single-node/sno_mcd_test.go#L355-L374 or: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-b240aacd848b27ec33331f84d7e3fc0a4bc20045f45f2cd536727396a756e97bR195-R223.
The short version is:
- Ignore the error you get from the `cs.MachineConfigPools().Get()`.
- Run the clock out.
- If you're still getting an error from `cs.MachineConfigPools().Get()` after you've run the clock out, then fail.
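As a rough sketch of that pattern (using the apimachinery wait helpers; the helper name and timings are placeholders, and the test should use the repo's existing helpers once they merge):

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitIgnoringAPIBlips keeps calling get() until it succeeds, tolerating
// transient API-server outages (e.g. an SNO control plane going offline).
// It only reports failure if the timeout expires with get() still erroring.
func waitIgnoringAPIBlips(timeout time.Duration, get func() error) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		if err := get(); err != nil {
			// The control plane may be briefly unreachable; keep polling.
			return false, nil
		}
		return true, nil
	})
}
```

The caller would wrap the `cs.MachineConfigPools().Get()` call in the `get` closure and fail the test only if the helper returns a non-nil error.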
Excellent. Thanks. I'll rearrange the test after your helpers merge. I did run the test against single node, so I know it works (aside from handling the control plane conditions you're describing).
Agree with Zack that for SNO the cluster itself can be unreachable, because everything is running on the same node and when that node reboots it will be unreachable for a while. However, for this particular test case I wouldn't worry much, because applying the kubelet cert shouldn't cause a drain or reboot. Ideally, cluster availability shouldn't be impacted, and if it is, something is wrong.
pkg/controller/common/metrics.go
This looks very similar to daemon.StartMetricsListener. I wonder if there's an opportunity for consolidating the boilerplate of starting the listener, e.g.:
```go
func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
	// Boilerplate
	if err := register(); err != nil {
		// Handle error
	}
	// More boilerplate
}
```

Then the controller and daemon can call this and pass in their registration funcs.
I've refactored StartMetricsListener to take a function argument as suggested, and adjusted the daemon to call the shared version in controller/common.
pkg/controller/common/metrics.go
This logs that an error occurred during metrics registration and then continues starting the metrics listener. Does it make sense to continue trying to start the metrics listener if registration fails? Or would it make more sense to return early?
I adjusted it to return if it fails to register, but I didn't have it return the error all the way up; I assumed we didn't want a metrics-registration failure to be fatal or to impact anything other than the metrics handler.
If I'm wrong and we do want it to be fatal for some reason, let me know :)
In terms of error-handling, I think this is fine for now; we can revisit bubbling the error further later. A further improvement would be adding an additional error line (or tweaking the existing one) to say something like, "could not start metrics listener", just so it's clear that the metrics listener is not running.
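Putting the pieces together, a minimal sketch of what the consolidated listener could look like; the handler path, TLS handling, and logging are simplified and illustrative, not the actual MCO code:

```go
package common

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"k8s.io/klog/v2"
)

// StartMetricsListener runs register(), then serves Prometheus metrics on addr
// until stopCh is closed.
func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
	if err := register(); err != nil {
		klog.Errorf("could not register metrics, not starting metrics listener: %v", err)
		return
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	server := &http.Server{Addr: addr, Handler: mux}

	// Shut the server down gracefully once the stop channel closes.
	go func() {
		<-stopCh
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := server.Shutdown(ctx); err != nil {
			klog.Errorf("error stopping metrics listener: %v", err)
		}
	}()

	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		klog.Errorf("metrics listener exited: %v", err)
	}
}
```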
In practice, a failure during metric registration is a bug in your code (you registered the same metric twice, for instance) and should be treated as a hard failure. That's why `prometheus.MustRegister()` exists: so you don't have to worry about the potential error.
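For reference, the panic-on-error form that comment refers to, using the illustrative collector from the earlier sketch:

```go
func init() {
	// MustRegister panics if registration fails (for example, a duplicate
	// collector), so registration bugs surface as hard failures at startup
	// rather than as a silently missing metric.
	prometheus.MustRegister(machineConfigControllerContentPaused)
}
```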
test/e2e/mcc_test.go
Does this have to target the master MachineConfigPool? I ask because a common pattern in the e2e test suite is that we create an ephemeral infra pool for the purpose of the test, assign a single worker node to that pool, then apply the MachineConfig. The reason is that we can apply the MachineConfig to a single node (and reset when we're done) much faster than rolling it out to the entire MachineConfigPool.
Here's an example: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-32df39b541567d57b84246d0d48a44792350b5af97d060cd0e44c33949c07370R55-R96
So I did originally have it set up the "infra" way, but:
- I flipped it to master when I was testing against single node (because if a node is a member of multiple pools, master seems to supersede it).
- There is no escaping the certificate rotation -- all unpaused pools will get it -- I don't have a way to confine a certificate rotation to certain pools (well, aside from pause, which is what I am testing)
It probably does save a little bit of time having just one node be paused, otherwise we'd have to wait once we unpause.
I'll put it back to infra (and maybe copy this test as it sits into the single node tests if we care?)
I think for this particular test we don't really need to create an infra pool, because cert rotation takes place for all pools in the cluster. For test purposes it is fine to pause only the master pool if it helps with the SNO case too. We will need to add this test case to single node as well, if that's not already done.
Force-pushed 0c6ea9e to db32c8d.
Force-pushed 779b852 to f371cf0.
Added some more detailed comments to those functions as requested and re-pushed.

/retest-required

/test e2e-agnostic-upgrade

/test e2e-gcp-upgrade
| "github.com/stretchr/testify/require" | ||
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
| ) | ||
|
|
Why add an e2e test for an alert if there's a unit test? Shouldn't the unit test be sufficient?
The unit test is good for checking the metric itself; the e2e test was to make sure that monitoring would be able to retrieve it (that secrets, auth, etc. were all set up right).
Long term we talked about having it talk to the thanos query endpoint to make sure monitoring actually got it (like Simon was saying here: #2802 (comment))
/test e2e-agnostic-upgrade

1 similar comment

/test e2e-agnostic-upgrade

@jkyros: The following tests failed. Full PR test history is available on the PR dashboard.

/test e2e-agnostic-upgrade

/lgtm
```go
func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	// There can be multiple certificates in the file
	for {
```
wait sorry - do we really want an infinite loop here? could we make this a little more explicit?
Future John will come back to this later to add clarifying comments which is fine with me.
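For readers following along, the conventional shape of such a loop with an explicit termination condition looks roughly like this (a sketch of the standard pem.Decode pattern, not necessarily the exact code in this PR):

```go
import (
	"crypto/x509"
	"encoding/pem"
)

// GetCertificatesFromPEMBundle parses every CERTIFICATE block in a PEM bundle.
func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	// There can be multiple certificates in the file; pem.Decode returns a nil
	// block once no PEM data remains, which ends the loop.
	for len(pemBytes) > 0 {
		block, rest := pem.Decode(pemBytes)
		if block == nil {
			// Remaining bytes are not parseable PEM; stop.
			break
		}
		pemBytes = rest
		if block.Type != "CERTIFICATE" {
			// Skip non-certificate blocks (keys, etc.).
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, err
		}
		certs = append(certs, cert)
	}
	return certs, nil
}
```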
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cheesesashimi, jkyros, kikisdeliveryservice, sinnykumari, yuqi-zhang.
/skip

/hold cancel

/retest-required
Please review the full test history for this PR and help us cut down flakes.
Make our resourcemerge fork update a container's Resources.Requests, un-revert #2802
…ineConfigControllerPausedPoolKubeletCA runbook URIs. The broken URIs were originally from 2c44c12 (Add plumbing for mcc metrics handler, 2022-02-25, openshift#2802)
Soft revert of openshift#2802. This should no longer be needed since the MCD will always sync the cert bundle to disk. If things go wrong, the MCD should degrade.
- What I did
In manifests:
In controller:
References:
Added an alert that fires when the kubelet-ca certificate is pending in a paused pool.
I don't have tests yet.
- How to verify it
To fire the alert:
```
oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}' kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator
```

To check that the metric is being exported:

```
TOKEN=`oc sa get-token prometheus-k8s -n openshift-monitoring`
POD=`oc get pods -n openshift-machine-config-operator | grep controller | awk '{print $1}'`
oc rsh $POD curl -k -H "Authorization: Bearer $TOKEN" https://localhost:9001/metrics
```

To stop the alert:
- Description for the changelog
Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool