
Conversation

@jkyros (Member) commented Oct 16, 2021

- What I did

  1. Added metrics reporting to the machine-config-controller, which previously lacked that capability, by adding:
  • In manifests:

    • Cluster Roles
    • Cluster Role Bindings
    • ServiceMonitor for metrics
    • Service for metrics
    • oauth-proxy container for machine-config-controller deployment
    • mcc-proxy-tls secret for machine-config-controller
  • In controller:

    • metrics handler function in machine-config-controller common
    • machine config lister in node_controller and node_controller_test
  • References:

    • For the handler I cribbed off of: 557303f
    • And then to add oauth: 3ab692f
  2. Added an alert that fires when the kubelet-ca certificate is pending in a paused pool:

    • added a GaugeVec (MCCImportantConfigPaused) for important pending config
    • added a map in node_controller to store the table of "important config names" : "important config files" (see the sketch after this list)
    • added functions to check for pending files and clear the alerts once the pools have been unpaused
  3. I don't have tests yet
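As a sketch of that table and check (a hypothetical illustration -- the metric labels, file path, and function names are assumptions, not the PR's actual code):

package common

import (
	"github.com/prometheus/client_golang/prometheus"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// MCCImportantConfigPaused is the GaugeVec described above; the label set
// here is illustrative.
var MCCImportantConfigPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mcc_important_config_paused",
		Help: "Set to 1 when an important config file is pending in a paused pool.",
	},
	[]string{"pool", "path", "description"},
)

// importantConfigFiles maps "important config names" to "important config
// files"; the real table may differ.
var importantConfigFiles = map[string]string{
	"kubelet-ca": "/etc/kubernetes/kubelet-ca.crt",
}

// syncImportantConfigMetrics raises the gauge for each important file that
// is pending on a paused pool, and clears it once the pool is unpaused.
func syncImportantConfigMetrics(pool *mcfgv1.MachineConfigPool, pending map[string]bool) {
	for desc, path := range importantConfigFiles {
		value := 0.0
		if pool.Spec.Paused && pending[path] {
			value = 1.0
		}
		MCCImportantConfigPaused.WithLabelValues(pool.Name, path, desc).Set(value)
	}
}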

- How to verify it

  1. To fire the alert:

    • build a cluster
    • oc edit mcp worker
    • change spec.paused: false to spec.paused: true
    • Trigger a certificate rotation:

      oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}' kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator

    • Observe the prometheus metrics:

      TOKEN=`oc sa get-token prometheus-k8s -n openshift-monitoring`
      POD=`oc get pods -n openshift-machine-config-operator | grep controller | awk '{print $1}'`
      oc rsh $POD curl -k -H "Authorization: Bearer $TOKEN" https://localhost:9001/metrics
    • look for mcc_important_config_paused
    • (or watch the web UI, they show up there too)
  2. To stop the alert:

    • oc edit mcp worker
    • change spec.paused: true to spec.paused: false
    • observe the prometheus metrics again -- the metric drops to 0 and the alert stops firing

- Description for the changelog
Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool

openshift-ci bot (Contributor) commented Oct 16, 2021

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2021
@wking (Member) commented Oct 19, 2021

I see a context.TODO() further down in here too. Seems like you might want to pivot to using cmd.Context() where you need a Context, and cmd.Context().Done() when you need a stopping <-chan struct{}. And then something higher up the stack (or here?) to trigger these components when it's time to gracefully step down? Or maybe something more detailed? For example, the cluster-version operator sets up a few separate contexts and a TERM catcher so we can:

  1. Catch the TERM and cancel runContext.
  2. Collect all the goroutines that had been using runContext, to ensure we no longer have any goroutines managing in-cluster operands.
  3. Cancel postMainContext, which was used for informers and the leader lease.
  4. Cancel shutdownContext, which lets callers know when they can give up on a graceful child-goroutine reap and stop blocking on it.

Anyhow, no need to jump straight to something like that, but adding additional, decoupled shutdown channels like you're doing here seems like a step away from a unified end-goal.
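For reference, a minimal Go sketch of that layered shutdown (names and ordering are illustrative, not the CVO's actual code):

package main

import (
	"context"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// shutdownCtx bounds how long callers block on a graceful reap.
	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
	defer shutdownCancel()

	// postMainCtx outlives runCtx; informers and the leader lease use it.
	postMainCtx, postMainCancel := context.WithCancel(shutdownCtx)
	// runCtx is what the operand-managing goroutines watch.
	runCtx, runCancel := context.WithCancel(postMainCtx)

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-runCtx.Done() // stand-in for a goroutine managing in-cluster operands
	}()

	<-term      // 1. catch the TERM...
	runCancel() // ...and cancel runCtx
	wg.Wait()   // 2. collect the goroutines that used runCtx
	postMainCancel() // 3. stop informers and release the leader lease
	// 4. shutdownCtx is cancelled last (via the deferred cancel), telling
	// callers to stop blocking on a graceful reap.
}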

Member:

I'm pretty sure you want summary and description, but not message. For example, see openshift/cluster-version-operator#547, which was successfully verified by QE, and these monitoring docs, which call for summary and description, but do not mention the older message.

Member:

Firing warning alerts immediately seems like it's wound a bit too tightly. We can go a while before the lack of certificate rotation becomes a problem (significant fractions of a year?). So setting a large-ish for here seems useful to avoid desensitizing folks by alerting every time they briefly pause. Of course, you could have someone leaving pools paused for the bulk of that time, then very briefly unpause, have Prom happen to scrape, and pause again before the pool had time to rotate out many certs. That is one reason "are we paused now, and have we been paused for a bit?" is only a rough proxy for "are we at risk of certs expiring?", and it makes picking an appropriate for difficult. But if we stick with the paused-pool angle on this, I'd expect at least for: 60m on this alert, based on our consistency guidelines.

Contributor:

Also, I think that if the alerting config allows changing severity, the alert should escalate to critical after a certain number of warning alerts have fired. This is to emphasize that unpausing the pool is necessary to keep the cluster healthy.

Member:

you will probably want some aggregation here to avoid twitching off and on if/when the machine-config controller container rolls. That roll will give you a small gap in mcc_important_config_paused, and the new container may come up with different injected scrape labels; both of those may shuffle the time series matching this expression, leading to new alerts (after cooking through for). I addressed the scrape-labels portion for a CVO-side alert in openshift/cluster-version-operator#643, but missed the metrics-downtime portion. Something like:

sum by (description, path, pool) (max_over_time(mcc_important_config_paused[5m])) > 0

should address both issues, but while the max_over_time(...[5m]) smearing protects you from alert-churning leader blips, it also makes the alert less responsive to the metric actually getting cleared. So something like "mcc_important_config_paused looks bad, or some signal that the machine-config controller is currently not exporting metrics" would be better.
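For instance (my own sketch, not something this PR settled on -- absent() returns a series exactly when no mcc_important_config_paused series exist at all):

sum by (description, path, pool) (max_over_time(mcc_important_config_paused[5m])) > 0
  or absent(mcc_important_config_paused)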

Member:

I think about the machine-config controller often enough to recognize the mcc acronym. But many folks browsing available metrics on their cluster will not. Can we write this out as machine_config_controller_content_paused or some such?
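For illustration, a hypothetical spelled-out registration (the name is the suggestion above; the labels are assumptions):

package common

import "github.com/prometheus/client_golang/prometheus"

// Spelled out so folks browsing available metrics don't need to know the
// "mcc" acronym; this name is a suggestion, not the PR's final choice.
var machineConfigControllerContentPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_config_controller_content_paused",
		Help: "Set to 1 when important config is pending on a paused pool.",
	},
	[]string{"pool", "path", "description"},
)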

Member:

Is there a benefit to the desc abbreviation over a full description?

You can also avoid embedding the description in the metric itself by using path-to-description lookup logic in the alert's description. I'm not familiar enough with Prom or the alert stack to know if that's worth the trouble.

@sinnykumari (Contributor) commented Oct 21, 2021

Also, I think we are not yet checking whether the pool is paused. Ideally we should perform the file checks and alert only when a pool is paused.

@jkyros force-pushed the mco-74-controller-alert-certificate branch from 02d36ac to 8fe1bcf on November 16, 2021
@jkyros force-pushed the mco-74-controller-alert-certificate branch 2 times, most recently from f85abb8 to 0c6ea9e on November 24, 2021
@jkyros (Member, author) commented Nov 24, 2021

Ok, I think I've got this updated with the suggested feedback and it's ready for review again:

  • Used existing context/stop channel for metrics handler

  • Adjusted alert name/variables to be clearer, fixed fields, added future runbook link

  • Smoothed out the alerting period

  • Broke the file diff function out into helpers and added tests

  • Focused on just kubelet-ca.crt

    • I know previous feedback here was leaning 'one alert for all config', but subsequent discussions seemed to lean back the other way to 'just the certificate for now' when we talked about keeping state for the alerts somewhere.
    • If, as you're going through this, you think we want to go back the other way, I really just want to do what we think is right.
  • The e2e test is of less concern right now than everything else, but I've included it.

    • It works, but I really would love a more elegant way than "rsh into a pod" to check that metrics endpoint.
  • I'm still trying to find a more concrete answer as to how long after a pool is paused (blocking certificate rotation) the 'bad stuff' happens, so we can set the critical time threshold appropriately

@cheesesashimi (Member) commented Dec 8, 2021

Something to be mindful of: if you run this in an SNO context, this function will error prematurely because the control plane may go offline.

To work around that, you'd need to do something like: https://github.com/openshift/machine-config-operator/blob/master/test/e2e-single-node/sno_mcd_test.go#L355-L374 or: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-b240aacd848b27ec33331f84d7e3fc0a4bc20045f45f2cd536727396a756e97bR195-R223.

The short version is:

  1. Ignore the error you get from the cs.MachineConfigPools().Get().
  2. Run the clock out.
  3. If you're still getting an error from cs.MachineConfigPools().Get() after you've run the clock out, then fail (see the sketch below).
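A rough sketch of that pattern (interval, timeout, and helper names are illustrative), using k8s.io/apimachinery's wait helpers:

package e2e

import (
	"testing"
	"time"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPoolUpdated is a hypothetical helper; get wraps whatever
// cs.MachineConfigPools().Get(...) call the test already makes.
func waitForPoolUpdated(t *testing.T, get func() (*mcfgv1.MachineConfigPool, error)) {
	t.Helper()
	if err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		pool, err := get()
		if err != nil {
			// Steps 1-2: ignore the error and run the clock out; on SNO
			// the control plane may be temporarily unreachable.
			return false, nil
		}
		return pool.Status.UpdatedMachineCount == pool.Status.MachineCount, nil
	}); err != nil {
		// Step 3: still failing after the timeout, so fail the test.
		t.Fatalf("pool never converged: %v", err)
	}
}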

Member (author):

Excellent. Thanks. I'll rearrange the test after your helpers merge. I did run the test against single node, so I know it works (aside from handling the control plane conditions you're describing).

@sinnykumari (Contributor) commented Dec 10, 2021

Agree with Zack that for SNO the cluster itself can be unreachable, because everything runs on the same node, and when the node reboots it will be unreachable for a while. However, for this particular test case I'm not too worried, because applying the kubelet cert shouldn't cause a drain or reboot. Ideally, cluster availability shouldn't be impacted, and if it is, something is wrong.

Member:

This looks very similar to daemon.StartMetricsListener. I wonder if there's an opportunity for consolidating the boilerplate of starting the listener, e.g.:

func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
  // Boilerplate
  if err := register(); err != nil {
    // Handle error
  }
  // More boilerplate
}

Then the controller and daemon can call this and pass in their registration funcs.
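For what it's worth, a fuller sketch of that consolidation (a minimal sketch assuming promhttp and klog; the shutdown and TLS details are placeholders, not the PR's actual code):

package common

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"k8s.io/klog/v2"
)

// StartMetricsListener registers the caller's metrics and serves them on
// addr until stopCh closes.
func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
	// If registration fails, don't bother starting the listener.
	if err := register(); err != nil {
		klog.Errorf("unable to register metrics: %v; metrics listener not started", err)
		return
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	server := &http.Server{Addr: addr, Handler: mux}

	go func() {
		<-stopCh
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := server.Shutdown(ctx); err != nil {
			klog.Errorf("error stopping metrics listener: %v", err)
		}
	}()

	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		klog.Errorf("metrics listener exited: %v", err)
	}
}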

Member (author):

I've refactored StartMetricsListener to take a function argument as suggested, and adjusted daemon to call it from controller/common.

Member:

This logs that an error occurred during metrics registration and then continues starting the metrics listener. Does it make sense to continue trying to start the metrics listener if registration fails? Or would it make more sense to return early?

Member (author):

I adjusted it to return if it fails to register, but I didn't have it return the error all the way up; I assumed we didn't want a metrics-registration failure to be fatal or to impact anything besides the metrics handler.

If I'm wrong and we do want it to be fatal for some reason, let me know :)

Member:

In terms of error-handling, I think this is fine for now; we can revisit bubbling the error further later. A further improvement would be adding an additional error line (or tweaking the existing one) to say something like, "could not start metrics listener", just so it's clear that the metrics listener is not running.

Contributor:

In practice, a failure during metric registration is a bug in your code (you registered the same metric twice for instance) and should be treated as a hard failure. Which is why prometheus.MustRegister() exists so you don't have to worry about the potential error.

Member:

Does this have to target the master MachineConfigPool? I ask because a common pattern in the e2e test suite is that we create an ephemeral infra pool for the purpose of the test, assign a single worker node to that pool, then apply the MachineConfig. The reason is because we can apply the MachineConfig to a single node (and reset when we're done) much faster than rolling it out to the entire MachineConfigPool.

Here's an example: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-32df39b541567d57b84246d0d48a44792350b5af97d060cd0e44c33949c07370R55-R96
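Roughly, the ephemeral pool in that pattern looks like this (a hedged sketch following the usual MCO selector conventions, not copied from that test):

package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// newInfraPool builds the throwaway pool; a single worker node gets the
// node-role.kubernetes.io/infra label to join it for the test.
func newInfraPool() *mcfgv1.MachineConfigPool {
	return &mcfgv1.MachineConfigPool{
		ObjectMeta: metav1.ObjectMeta{Name: "infra"},
		Spec: mcfgv1.MachineConfigPoolSpec{
			// Render both worker and infra MachineConfigs into this pool...
			MachineConfigSelector: &metav1.LabelSelector{
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      "machineconfiguration.openshift.io/role",
					Operator: metav1.LabelSelectorOpIn,
					Values:   []string{"worker", "infra"},
				}},
			},
			// ...but only on nodes labeled into the pool for the test.
			NodeSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"node-role.kubernetes.io/infra": ""},
			},
		},
	}
}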

Member (author):

So I did originally have it set up the "infra" way, but:

  1. I flipped it to master when I was testing against single node (because if a node is a member of multiple pools, master seems to supersede it).
  2. There is no escaping the certificate rotation -- all unpaused pools will get it -- I don't have a way to confine a certificate rotation to certain pools (well, aside from pause, which is what I am testing)

It probably does save a little bit of time having just one node be paused; otherwise we'd have to wait for the whole pool once we unpause.

I'll put it back to infra (and maybe copy this test as it sits into the single node tests if we care?)

@sinnykumari (Contributor) commented Dec 10, 2021

I think for this particular test we don't really need to create an infra pool, because cert rotation takes place for all pools in the cluster. For test purposes it is fine to pause only the master pool if that helps with the SNO case too. We will need to add this test case to single node as well, if that's not already done.

@jkyros force-pushed the mco-74-controller-alert-certificate branch from 0c6ea9e to db32c8d on December 10, 2021
@jkyros marked this pull request as ready for review on December 10, 2021
@jkyros force-pushed the mco-74-controller-alert-certificate branch from 779b852 to f371cf0 on February 28, 2022
@jkyros (Member, author) commented Feb 28, 2022

Added some more detailed comments to those functions as requested and re-pushed.

@jkyros (Member, author) commented Mar 3, 2022

/retest-required

@jkyros (Member, author) commented Mar 3, 2022

/test e2e-agnostic-upgrade

@kikisdeliveryservice (Contributor):

/test e2e-gcp-upgrade
/test e2e-vsphere-upgrade
/test e2e-agnostic-upgrade

"github.com/stretchr/testify/require"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

Contributor:

Why add an e2e test for an alert if there's a unit test? Shouldn't the unit test be sufficient?

Member (author):

The unit test is good for us checking the metric, the e2e test was to make sure that monitoring would be able to retrieve it (secrets, auth, etc were all set up right).

Long term we talked about having it talk to the thanos query endpoint to make sure monitoring actually got it (like Simon was saying here: #2802 (comment))

@jkyros (Member, author) commented Mar 10, 2022

/test e2e-agnostic-upgrade

@jkyros (Member, author) commented Mar 11, 2022

/test e2e-agnostic-upgrade

openshift-ci bot (Contributor) commented Mar 11, 2022

@jkyros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                            Commit   Required  Rerun command
ci/prow/e2e-aws-workers-rhel8        f371cf0  false     /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-serial               f371cf0  false     /test e2e-aws-serial
ci/prow/e2e-aws-workers-rhel7        f371cf0  false     /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-upgrade-single-node  f371cf0  false     /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-single-node          f371cf0  false     /test e2e-aws-single-node
ci/prow/e2e-metal-ipi                f371cf0  false     /test e2e-metal-ipi
ci/prow/e2e-gcp-op-single-node       f371cf0  false     /test e2e-gcp-op-single-node
ci/prow/e2e-ovn-step-registry        f371cf0  false     /test e2e-ovn-step-registry
ci/prow/e2e-aws-disruptive           f371cf0  false     /test e2e-aws-disruptive


@jkyros (Member, author) commented Mar 12, 2022

/test e2e-agnostic-upgrade

@cheesesashimi (Member):

/lgtm
/approve

func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
var certs []*x509.Certificate
// There can be multiple certificates in the file
for {
Contributor:

wait sorry - do we really want an infinite loop here? could we make this a little more explicit?

Contributor:

Future John will come back to this later to add clarifying comments, which is fine with me.
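For posterity, the loop is bounded in practice: pem.Decode returns nil once no PEM blocks remain. The usual idiom (probably close to the PR's code, though not copied verbatim):

package common

import (
	"crypto/x509"
	"encoding/pem"
)

func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	// There can be multiple certificates in the file
	for {
		block, rest := pem.Decode(pemBytes)
		if block == nil {
			// No PEM data left, so the loop terminates here.
			break
		}
		pemBytes = rest
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, err
		}
		certs = append(certs, cert)
	}
	return certs, nil
}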

@kikisdeliveryservice (Contributor):

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2022
openshift-ci bot (Contributor) commented Mar 16, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, jkyros, kikisdeliveryservice, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice (Contributor):

/skip

@kikisdeliveryservice (Contributor):

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 16, 2022
@openshift-bot (Contributor):

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 57267b7 into openshift:master Mar 17, 2022
openshift-merge-robot added a commit that referenced this pull request Mar 21, 2022
Make our resourcemerge fork update a container's Resources.Requests, un-revert #2802
wking added a commit to wking/machine-config-operator that referenced this pull request Dec 17, 2022
…ineConfigControllerPausedPoolKubeletCA runbook URIs

The broken URIs were originally from 2c44c12 (Add plumbing for mcc
metrics handler, 2022-02-25, openshift#2802)
yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Mar 5, 2023
Soft revert of
openshift#2802.

This should no longer be needed since the MCD will always sync the cert
bundle to disk. If things go wrong, the MCD should degrade.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Mar 15, 2023
Soft revert of
openshift#2802.

This should no longer be needed since the MCD will always sync the cert
bundle to disk. If things go wrong, the MCD should degrade.