Skip to content

Conversation

@jkyros
Copy link
Member

@jkyros jkyros commented Mar 26, 2022

This is for MCO-119 and corresponds to the alert added with openshift/machine-config-operator#2802

This is also the machine-config-operator's first runbook, so this sets up the OWNERS file.

On one hand, I feel like this is too verbose, on the other hand, given how often the situation for this alert has to be explained, I wanted to include enough "why" so the recipient of the alert fully understood the ramifications.

urldecode the ignition file contents, assumes you have python2):

```console
oc get mc rendered-worker-f318c7173a1785bc3bf846c558a2ad49 -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet-ca.crt")].contents.source}' | python2 -c "import urllib; print(urllib.unquote(raw_input()))" | openssl x509 -text -noout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Python 2 is end-of-life upstream (I dunno what the state is for RHEL). Can we use Python 3? Testing against CI assets:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial/1506535707059949568/artifacts/e2e-aws-serial/gather-extra/artifacts/machineconfigs.json | jq -r '.items[] | select(.metadata.name | startswith("rendered-worker-")).spec.config.storage.files[] | select(.path == "/etc/kubernetes/kubelet-ca.crt").contents.source' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | openssl x509 -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 383882405144258650 (0x553d3037a25685a)
        Signature Algorithm: sha256WithRSAEncryption
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely we can use Python3. Thank you. I did waver on that and leaned toward what I thought was "lowest common denominator". ( I feel old now. Python 3 isn't "new" anymore 😆 )

Aware that machineconfig extraction UX is not the greatest. Long term I have some code I would like to get into oc to extract a file from a machineconfig so users don't have mess with inline python scripts and ridiculous jsonpath expressions.

@kikisdeliveryservice
Copy link

What is the procedure for getting this merged? I left a small change, but is there someone else who should be reviewing this?

@kikisdeliveryservice
Copy link

while we're here @mburke5678 would you mind taking a quick look as well?

- Once a certificate rotation happens, nodes in the paused pool no longer have
have the most recent `kube-apiserver-to-kubelet-signer` (it is in
`kubelet-ca.crt`, which is stuck behind pause)
- So once `kubelet-client` rotates and gets signed with the most recent signer,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might swap once for when in both bullets to avoid confusion and potential translation issues.

nodes in the paused pool cannot verify `kubelet-client` and therefore do not
trust it.

Aside from these symptoms, your nodes should still be okay until

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than these symptoms, your nodes should work normally until the kube-apiserver-to-kubelet-signer expires.

oc -n openshift-kube-apiserver-operator describe secret kube-apiserver-to-kubelet-signer
```

The `kubelet-client` cert (the one that is responsible for `oc logs`, etc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/The/For the

Checking the expiry dates in the MachineConfig yourself is less pleasant because
the bundle is encoded.

You can use something like this to decode it if it's urlencoded:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the bundle is URL-encoded, use the following command with the desired MC to decode it:

oc -n openshift-machine-config-operator get mcp $MCP_NAME -o jsonpath='{.status.configuration.name}'
```

Checking the expiry dates in the MachineConfig yourself is less pleasant because
Copy link

@mburke5678 mburke5678 May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the following commands to check the expiry dates, based on how the kublet-ca.crt bundle is encoded.

oc -n openshift-kube-apiserver describe secret/kubelet-client
```

You can also check the certificates that are present in the bundle on one of

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bundle/kublet-ca.crt bundle

oc get mc rendered-worker-bc1470f2331a3136999e0b49d85e1e21 -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet-ca.crt")].contents.source}' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | openssl x509 -text -noout
```

Or something like this it if is base64 + gzipped:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the bundle is base64-encoded and gzipped, use the following command with the desired MC to decode it:

- yuqi-zhang
- cheesesashimi
- jkyros
- raisaat

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • mburke

Copy link
Member Author

@jkyros jkyros May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, wasn't trying to leave you out 😄

EDIT: apparently I did leave your numbers out the first time.

@mburke5678
Copy link

@jkyros I made a few small comments. Feel free to accept or ignore them as you see fit.

cc @kikisdeliveryservice

@openshift-ci openshift-ci bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label May 10, 2022
@jkyros
Copy link
Member Author

jkyros commented May 10, 2022

@mburke5678 thanks for taking a look. I like your wordings better and the translation issues hadn't occurred to me, thank you!

As per specified procedure, adding owners file matching OWNERS file from
machine-config-operator repository.
@openshift-ci openshift-ci bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label May 10, 2022
@jkyros
Copy link
Member Author

jkyros commented May 10, 2022

@ravitri , @NautiluX would you be able to take a look at this, or is there something else I need to do to get this approved?

@kikisdeliveryservice
Copy link

@jkyros I made a few small comments. Feel free to accept or ignore them as you see fit.

cc @kikisdeliveryservice

Thanks for the great review @mburke5678 !! Thanks for updating @jkyros! Teamwork! :)

This is the first runbook for the machine-config-operator. It is a
runbook for the alert MachineConfigControllerPausedPoolKubeletCA
that fires when a machine config pool is paused and the
kube-apiserver-to-kubelet-signer certificate is nearing its expiry.
@jkyros
Copy link
Member Author

jkyros commented May 17, 2022

Errors have been fixed, thanks @sinnykumari for taking a look.

@kikisdeliveryservice
Copy link

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2022
@kikisdeliveryservice
Copy link

/assign @NautiluX

@kikisdeliveryservice
Copy link

@NautiluX @ravitri Could you PTAL so we can merge this?

@ravitri
Copy link
Member

ravitri commented Jun 22, 2022

@jkyros / @kikisdeliveryservice - Sorry for the delay in reviewing this. This is a great runbook with lot of details 👍. So far, the references we have had were the following:

I think the runbook majorly covers the points mentioned in the given two references as such and we can update these references with the runbook document link ahead too. Overall, LGTM but I will wait for @NautiluX to provide his feedback too. We'll merge it soon after that 👍

@NautiluX
Copy link
Contributor

/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 22, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkyros, kikisdeliveryservice, NautiluX

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 22, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 22, 2022

@jkyros: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 39cfc38 into openshift:master Jun 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants