
Conversation

@jkyros (Member) commented Oct 16, 2021

- What I did

  1. Added metrics reporting to the machine-config-controller, which previously lacked that capability, by adding:
  • In manifests:

    • Cluster Roles
    • Cluster Role Bindings
    • ServiceMonitor for metrics
    • Service for metrics
    • oauth-proxy container for machine-config-controller deployment
    • mcc-proxy-tls secret for machine-config-controller
  • In controller:

    • metrics handler function in machine-config-controller common
    • machine config lister in node_controller and node_controller_test
  • References:

    • For the handler I cribbed off of: 557303f
    • And then to add oauth: 3ab692f
  2. Added an alert that fires when the kubelet-ca certificate is pending in a paused pool:

    • added a GaugeVec (MCCImportantConfigPaused) for important pending config
    • added a map in node_controller to store the table of "important config names" : "important config files" (see the sketch after this list)
    • added functions to check for pending files and clear the alerts once the pools have been unpaused
  3. I don't have tests yet
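As a sketch of that table and check (a hypothetical illustration -- the metric labels, file path, and function names are assumptions, not the PR's actual code):

package common

import (
	"github.com/prometheus/client_golang/prometheus"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// MCCImportantConfigPaused is the GaugeVec described above; the label set
// here is illustrative.
var MCCImportantConfigPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "mcc_important_config_paused",
		Help: "Set to 1 when an important config file is pending in a paused pool.",
	},
	[]string{"pool", "path", "description"},
)

// importantConfigFiles maps "important config names" to "important config
// files"; the real table may differ.
var importantConfigFiles = map[string]string{
	"kubelet-ca": "/etc/kubernetes/kubelet-ca.crt",
}

// syncImportantConfigMetrics raises the gauge for each important file that
// is pending on a paused pool, and clears it once the pool is unpaused.
func syncImportantConfigMetrics(pool *mcfgv1.MachineConfigPool, pending map[string]bool) {
	for desc, path := range importantConfigFiles {
		value := 0.0
		if pool.Spec.Paused && pending[path] {
			value = 1.0
		}
		MCCImportantConfigPaused.WithLabelValues(pool.Name, path, desc).Set(value)
	}
}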

- How to verify it

  1. To fire the alert:

    • build a cluster
    • oc edit mcp worker
    • change spec.paused: false to spec.paused: true
    • Trigger a certificate rotation:

      oc patch secret -p='{"metadata": {"annotations": {"auth.openshift.io/certificate-not-after": null}}}' kube-apiserver-to-kubelet-signer -n openshift-kube-apiserver-operator

    • Observe the prometheus metrics:

      TOKEN=`oc sa get-token prometheus-k8s -n openshift-monitoring`
      POD=`oc get pods -n openshift-machine-config-operator | grep controller | awk '{print $1}'`
      oc rsh $POD curl -k -H "Authorization: Bearer $TOKEN" https://localhost:9001/metrics
    • look for mcc_important_config_paused
    • (or watch the web UI, they show up there too)
  2. To stop the alert:

    • oc edit mcp worker
    • change spec.paused: true to spec.paused: false
    • observe the prometheus metrics again -- the metric drops to 0 and the alert stops firing

- Description for the changelog
Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool

openshift-ci bot (Contributor) commented Oct 16, 2021

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2021
@wking (Member) commented Oct 19, 2021

I see a context.TODO() further down in here too. Seems like you might want to pivot to using cmd.Context() where you need a Context, and cmd.Context().Done() when you need a stopping <-chan struct{}. And then something higher up the stack (or here?) to trigger these components when it's time to gracefully step down? Or maybe something more detailed? For example, the cluster-version operator sets up a few separate contexts and a TERM catcher so we can:

  1. Catch the TERM and cancel runContext.
  2. Collect all the goroutines that had been using runContext, to ensure we no longer have any goroutines managing in-cluster operands.
  3. Cancel postMainContext, which was used for informers and the leader lease.
  4. Cancel shutdownContext, which lets callers know when they can give up on a graceful child-goroutine reap and stop blocking on it.

Anyhow, no need to jump straight to something like that, but adding additional, decoupled shutdown channels like you're doing here seems like a step away from a unified end-goal.
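For reference, a minimal Go sketch of that layered shutdown (names and ordering are illustrative, not the CVO's actual code):

package main

import (
	"context"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// shutdownCtx bounds how long callers block on a graceful reap.
	shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
	defer shutdownCancel()

	// postMainCtx outlives runCtx; informers and the leader lease use it.
	postMainCtx, postMainCancel := context.WithCancel(shutdownCtx)
	// runCtx is what the operand-managing goroutines watch.
	runCtx, runCancel := context.WithCancel(postMainCtx)

	term := make(chan os.Signal, 1)
	signal.Notify(term, syscall.SIGTERM)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-runCtx.Done() // stand-in for a goroutine managing in-cluster operands
	}()

	<-term      // 1. catch the TERM...
	runCancel() // ...and cancel runCtx
	wg.Wait()   // 2. collect the goroutines that used runCtx
	postMainCancel() // 3. stop informers and release the leader lease
	// 4. shutdownCtx is cancelled last (via the deferred cancel), telling
	// callers to stop blocking on a graceful reap.
}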

Member:

I'm pretty sure you want summary and description, but not message. For example, see openshift/cluster-version-operator#547, which was successfully verified by QE, and these monitoring docs, which call for summary and description, but do not mention the older message.

Member:

Firing warning alerts immediately seems like it's wound a bit too tightly. We can go a while before the lack of certificate rotation becomes a problem (significant fractions of a year?). So setting a large-ish for here seems useful to avoid desensitizing folks by alerting every time they briefly pause. Of course, you could have someone leaving pools paused for the bulk of that time, then very briefly unpause, have Prom happen to scrape, and pause again before the pool had time to rotate out many certs. That is one reason "are we paused now, and have we been paused for a bit?" is only a rough proxy for "are we at risk of certs expiring?", and it makes picking an appropriate for difficult. But if we stick with the paused-pool angle on this, I'd expect at least for: 60m on this alert, based on our consistency guidelines.

Contributor:

Also, I think that if the alerting config allows changing severity, the alert should escalate to critical after a certain number of warning alerts have fired. This is to emphasize that unpausing the pool is necessary to keep the cluster healthy.

Member:

you will probably want some aggregation here to avoid twitching off and on if/when the machine-config controller container rolls. That roll will give you a small gap in mcc_important_config_paused, and the new container may come up with different injected scrape labels; both of those may shuffle the time series matching this expression, leading to new alerts (after cooking through for). I addressed the scrape-labels portion for a CVO-side alert in openshift/cluster-version-operator#643, but missed the metrics-downtime portion. Something like:

sum by (description, path, pool) (max_over_time(mcc_important_config_paused[5m])) > 0

should address both issues, but while the max_over_time(...[5m]) smearing protects you from alert-churning leader blips, it also makes the alert less responsive to the metric actually getting cleared. So something like "mcc_important_config_paused looks bad, or some signal that the machine-config controller is currently not exporting metrics" would be better.
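For instance (my own sketch, not something this PR settled on -- absent() returns a series exactly when no mcc_important_config_paused series exist at all):

sum by (description, path, pool) (max_over_time(mcc_important_config_paused[5m])) > 0
  or absent(mcc_important_config_paused)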

Member:

I think about the machine-config controller often enough to recognize the mcc acronym. But many folks browsing available metrics on their cluster will not. Can we write this out as machine_config_controller_content_paused or some such?
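For illustration, a hypothetical spelled-out registration (the name is the suggestion above; the labels are assumptions):

package common

import "github.com/prometheus/client_golang/prometheus"

// Spelled out so folks browsing available metrics don't need to know the
// "mcc" acronym; this name is a suggestion, not the PR's final choice.
var machineConfigControllerContentPaused = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "machine_config_controller_content_paused",
		Help: "Set to 1 when important config is pending on a paused pool.",
	},
	[]string{"pool", "path", "description"},
)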

Member:

Is there a benefit to the desc abbreviation over a full description?

You can also avoid embedding the description in the metric itself by using path-to-description lookup logic in the alert's description. I'm not familiar enough with Prom or the alert stack to know if that's worth the trouble.

@sinnykumari (Contributor) commented Oct 21, 2021

Also, I think we are not yet checking whether the pool is paused. Ideally we should perform the file checks and alert only when a pool is paused.

@jkyros force-pushed the mco-74-controller-alert-certificate branch from 02d36ac to 8fe1bcf on November 16, 2021
@jkyros force-pushed the mco-74-controller-alert-certificate branch 2 times, most recently from f85abb8 to 0c6ea9e on November 24, 2021
@jkyros (Member, author) commented Nov 24, 2021

Ok, I think I've got this updated with the suggested feedback and it's ready for review again:

  • Used existing context/stop channel for metrics handler

  • Adjusted alert name/variables to be clearer, fixed fields, added future runbook link

  • Smoothed out the alerting period

  • Broke the file diff function out into helpers and added tests

  • Focused on just kubelet-ca.crt

    • I know previous feedback here was leaning 'one alert for all config', but subsequent discussions seemed to lean back the other way to 'just the certificate for now' when we talked about keeping state for the alerts somewhere.
    • If, as you're going through this, you think we want to go back the other way, I really just want to do what we think is right.
  • The e2e test is of less concern right now than everything else, but I've included it.

    • It works, but I really would love a more elegant way than "rsh into a pod" to check that metrics endpoint.
  • I'm still trying to find a more concrete answer as to how long after a pool is paused (blocking certificate rotation) the 'bad stuff' happens, so we can set the critical time threshold appropriately

@cheesesashimi (Member) commented Dec 8, 2021

Something to be mindful of: if you run this in an SNO context, this function will error prematurely because the control plane may go offline.

To work around that, you'd need to do something like: https://github.com/openshift/machine-config-operator/blob/master/test/e2e-single-node/sno_mcd_test.go#L355-L374 or: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-b240aacd848b27ec33331f84d7e3fc0a4bc20045f45f2cd536727396a756e97bR195-R223.

The short version is:

  1. Ignore the error you get from the cs.MachineConfigPools().Get().
  2. Run the clock out.
  3. If you're still getting an error from cs.MachineConfigPools().Get() after you've run the clock out, then fail (see the sketch below).
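A rough sketch of that pattern (interval, timeout, and helper names are illustrative), using k8s.io/apimachinery's wait helpers:

package e2e

import (
	"testing"
	"time"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPoolUpdated is a hypothetical helper; get wraps whatever
// cs.MachineConfigPools().Get(...) call the test already makes.
func waitForPoolUpdated(t *testing.T, get func() (*mcfgv1.MachineConfigPool, error)) {
	t.Helper()
	if err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		pool, err := get()
		if err != nil {
			// Steps 1-2: ignore the error and run the clock out; on SNO
			// the control plane may be temporarily unreachable.
			return false, nil
		}
		return pool.Status.UpdatedMachineCount == pool.Status.MachineCount, nil
	}); err != nil {
		// Step 3: still failing after the timeout, so fail the test.
		t.Fatalf("pool never converged: %v", err)
	}
}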

Member (author):

Excellent. Thanks. I'll rearrange the test after your helpers merge. I did run the test against single node, so I know it works (aside from handling the control plane conditions you're describing).

@sinnykumari (Contributor) commented Dec 10, 2021

Agree with Zack that for SNO the cluster itself can be unreachable, because everything runs on the same node, and when the node reboots it will be unreachable for a while. However, for this particular test case I'm not too worried, because applying the kubelet cert shouldn't cause a drain or reboot. Ideally, cluster availability shouldn't be impacted, and if it is, something is wrong.

Member:

This looks very similar to daemon.StartMetricsListener. I wonder if there's an opportunity for consolidating the boilerplate of starting the listener, e.g.:

func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
  // Boilerplate
  if err := register(); err != nil {
    // Handle error
  }
  // More boilerplate
}

Then the controller and daemon can call this and pass in their registration funcs.
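For what it's worth, a fuller sketch of that consolidation (a minimal sketch assuming promhttp and klog; the shutdown and TLS details are placeholders, not the PR's actual code):

package common

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"k8s.io/klog/v2"
)

// StartMetricsListener registers the caller's metrics and serves them on
// addr until stopCh closes.
func StartMetricsListener(addr string, stopCh <-chan struct{}, register func() error) {
	// If registration fails, don't bother starting the listener.
	if err := register(); err != nil {
		klog.Errorf("unable to register metrics: %v; metrics listener not started", err)
		return
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	server := &http.Server{Addr: addr, Handler: mux}

	go func() {
		<-stopCh
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := server.Shutdown(ctx); err != nil {
			klog.Errorf("error stopping metrics listener: %v", err)
		}
	}()

	if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		klog.Errorf("metrics listener exited: %v", err)
	}
}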

Member (author):

I've refactored StartMetricsListener to take a function argument as suggested, and adjusted daemon to call it from controller/common.

Member:

This logs that an error occurred during metrics registration and then continues starting the metrics listener. Does it make sense to continue trying to start the metrics listener if registration fails? Or would it make more sense to return early?

Member (author):

I adjusted it to return if it fails to register, but I didn't have it return the error all the way up; I assumed we didn't want a metrics-registration failure to be fatal or to impact anything besides the metrics handler.

If I'm wrong and we do want it to be fatal for some reason, let me know :)

Member:

In terms of error-handling, I think this is fine for now; we can revisit bubbling the error further later. A further improvement would be adding an additional error line (or tweaking the existing one) to say something like, "could not start metrics listener", just so it's clear that the metrics listener is not running.

Contributor:

In practice, a failure during metric registration is a bug in your code (you registered the same metric twice for instance) and should be treated as a hard failure. Which is why prometheus.MustRegister() exists so you don't have to worry about the potential error.

Member:

Does this have to target the master MachineConfigPool? I ask because a common pattern in the e2e test suite is that we create an ephemeral infra pool for the purpose of the test, assign a single worker node to that pool, then apply the MachineConfig. The reason is because we can apply the MachineConfig to a single node (and reset when we're done) much faster than rolling it out to the entire MachineConfigPool.

Here's an example: https://github.com/openshift/machine-config-operator/pull/2795/files#diff-32df39b541567d57b84246d0d48a44792350b5af97d060cd0e44c33949c07370R55-R96
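Roughly, the ephemeral pool in that pattern looks like this (a hedged sketch following the usual MCO selector conventions, not copied from that test):

package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// newInfraPool builds the throwaway pool; a single worker node gets the
// node-role.kubernetes.io/infra label to join it for the test.
func newInfraPool() *mcfgv1.MachineConfigPool {
	return &mcfgv1.MachineConfigPool{
		ObjectMeta: metav1.ObjectMeta{Name: "infra"},
		Spec: mcfgv1.MachineConfigPoolSpec{
			// Render both worker and infra MachineConfigs into this pool...
			MachineConfigSelector: &metav1.LabelSelector{
				MatchExpressions: []metav1.LabelSelectorRequirement{{
					Key:      "machineconfiguration.openshift.io/role",
					Operator: metav1.LabelSelectorOpIn,
					Values:   []string{"worker", "infra"},
				}},
			},
			// ...but only on nodes labeled into the pool for the test.
			NodeSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"node-role.kubernetes.io/infra": ""},
			},
		},
	}
}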

Member (author):

So I did originally have it set up the "infra" way, but:

  1. I flipped it to master when I was testing against single node (because if a node is a member of multiple pools, master seems to supersede it).
  2. There is no escaping the certificate rotation -- all unpaused pools will get it -- I don't have a way to confine a certificate rotation to certain pools (well, aside from pause, which is what I am testing)

It probably does save a little bit of time having just one node be paused; otherwise we'd have to wait for the whole pool once we unpause.

I'll put it back to infra (and maybe copy this test as it sits into the single node tests if we care?)

@sinnykumari (Contributor) commented Dec 10, 2021

I think for this particular test we don't really need to create an infra pool, because cert rotation takes place for all pools in the cluster. For test purposes it is fine to pause only the master pool if that helps with the SNO case too. We will need to add this test case to single node as well, if that's not already done.

@jkyros force-pushed the mco-74-controller-alert-certificate branch from 0c6ea9e to db32c8d on December 10, 2021
@jkyros marked this pull request as ready for review on December 10, 2021
@jkyros force-pushed the mco-74-controller-alert-certificate branch from 779b852 to f371cf0 on February 28, 2022
@jkyros (Member, author) commented Feb 28, 2022

Added some more detailed comments to those functions as requested and re-pushed.

@jkyros (Member, author) commented Mar 3, 2022

/retest-required

@jkyros (Member, author) commented Mar 3, 2022

/test e2e-agnostic-upgrade

@kikisdeliveryservice (Contributor):

/test e2e-gcp-upgrade
/test e2e-vsphere-upgrade
/test e2e-agnostic-upgrade

"github.com/stretchr/testify/require"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

Contributor:

Why add an e2e test for an alert if there's a unit test? Shouldn't the unit test be sufficient?

Member (author):

The unit test is good for us checking the metric, the e2e test was to make sure that monitoring would be able to retrieve it (secrets, auth, etc were all set up right).

Long term we talked about having it talk to the thanos query endpoint to make sure monitoring actually got it (like Simon was saying here: #2802 (comment))

@jkyros (Member, author) commented Mar 10, 2022

/test e2e-agnostic-upgrade

@jkyros (Member, author) commented Mar 11, 2022

/test e2e-agnostic-upgrade

openshift-ci bot (Contributor) commented Mar 11, 2022

@jkyros: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                            Commit   Required  Rerun command
ci/prow/e2e-aws-workers-rhel8        f371cf0  false     /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-serial               f371cf0  false     /test e2e-aws-serial
ci/prow/e2e-aws-workers-rhel7        f371cf0  false     /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-upgrade-single-node  f371cf0  false     /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-single-node          f371cf0  false     /test e2e-aws-single-node
ci/prow/e2e-metal-ipi                f371cf0  false     /test e2e-metal-ipi
ci/prow/e2e-gcp-op-single-node       f371cf0  false     /test e2e-gcp-op-single-node
ci/prow/e2e-ovn-step-registry        f371cf0  false     /test e2e-ovn-step-registry
ci/prow/e2e-aws-disruptive           f371cf0  false     /test e2e-aws-disruptive


@jkyros (Member, author) commented Mar 12, 2022

/test e2e-agnostic-upgrade

@cheesesashimi (Member):

/lgtm
/approve

func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
var certs []*x509.Certificate
// There can be multiple certificates in the file
for {
Contributor:

wait sorry - do we really want an infinite loop here? could we make this a little more explicit?

Contributor:

Future John will come back to this later to add clarifying comments, which is fine with me.
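For posterity, the loop is bounded in practice: pem.Decode returns nil once no PEM blocks remain. The usual idiom (probably close to the PR's code, though not copied verbatim):

package common

import (
	"crypto/x509"
	"encoding/pem"
)

func GetCertificatesFromPEMBundle(pemBytes []byte) ([]*x509.Certificate, error) {
	var certs []*x509.Certificate
	// There can be multiple certificates in the file
	for {
		block, rest := pem.Decode(pemBytes)
		if block == nil {
			// No PEM data left, so the loop terminates here.
			break
		}
		pemBytes = rest
		if block.Type != "CERTIFICATE" {
			continue
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			return nil, err
		}
		certs = append(certs, cert)
	}
	return certs, nil
}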

@kikisdeliveryservice (Contributor):

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2022
openshift-ci bot (Contributor) commented Mar 16, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, jkyros, kikisdeliveryservice, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice (Contributor):

/skip

@kikisdeliveryservice (Contributor):

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 16, 2022
@openshift-bot (Contributor):

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 57267b7 into openshift:master Mar 17, 2022
openshift-merge-robot added a commit that referenced this pull request Mar 21, 2022
Make our resourcemerge fork update a container's Resources.Requests, un-revert #2802
wking added a commit to wking/machine-config-operator that referenced this pull request Dec 17, 2022
…ineConfigControllerPausedPoolKubeletCA runbook URIs

The broken URIs were originally from 2c44c12 (Add plumbing for mcc
metrics handler, 2022-02-25, openshift#2802)
yuqi-zhang added a commit to yuqi-zhang/machine-config-operator that referenced this pull request Mar 5, 2023
Soft revert of
openshift#2802.

This should no longer be needed since the MCD will always sync the cert
bundle to disk. If things go wrong, the MCD should degrade.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request Mar 15, 2023
Soft revert of
openshift#2802.

This should no longer be needed since the MCD will always sync the cert
bundle to disk. If things go wrong, the MCD should degrade.