-
Notifications
You must be signed in to change notification settings - Fork 137
Runbook for MachineConfigControllerPausedPoolKubeletCA, initial machine-config-operator setup #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| urldecode the ignition file contents, assumes you have python2): | ||
|
|
||
| ```console | ||
| oc get mc rendered-worker-f318c7173a1785bc3bf846c558a2ad49 -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet-ca.crt")].contents.source}' | python2 -c "import urllib; print(urllib.unquote(raw_input()))" | openssl x509 -text -noout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Python 2 is end-of-life upstream (I dunno what the state is for RHEL). Can we use Python 3? Testing against CI assets:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-serial/1506535707059949568/artifacts/e2e-aws-serial/gather-extra/artifacts/machineconfigs.json | jq -r '.items[] | select(.metadata.name | startswith("rendered-worker-")).spec.config.storage.files[] | select(.path == "/etc/kubernetes/kubelet-ca.crt").contents.source' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | openssl x509 -text -noout
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 383882405144258650 (0x553d3037a25685a)
Signature Algorithm: sha256WithRSAEncryption
...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely we can use Python3. Thank you. I did waver on that and leaned toward what I thought was "lowest common denominator". ( I feel old now. Python 3 isn't "new" anymore 😆 )
Aware that machineconfig extraction UX is not the greatest. Long term I have some code I would like to get into oc to extract a file from a machineconfig so users don't have mess with inline python scripts and ridiculous jsonpath expressions.
|
What is the procedure for getting this merged? I left a small change, but is there someone else who should be reviewing this? |
|
while we're here @mburke5678 would you mind taking a quick look as well? |
| - Once a certificate rotation happens, nodes in the paused pool no longer have | ||
| have the most recent `kube-apiserver-to-kubelet-signer` (it is in | ||
| `kubelet-ca.crt`, which is stuck behind pause) | ||
| - So once `kubelet-client` rotates and gets signed with the most recent signer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might swap once for when in both bullets to avoid confusion and potential translation issues.
| nodes in the paused pool cannot verify `kubelet-client` and therefore do not | ||
| trust it. | ||
|
|
||
| Aside from these symptoms, your nodes should still be okay until |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than these symptoms, your nodes should work normally until the kube-apiserver-to-kubelet-signer expires.
| oc -n openshift-kube-apiserver-operator describe secret kube-apiserver-to-kubelet-signer | ||
| ``` | ||
|
|
||
| The `kubelet-client` cert (the one that is responsible for `oc logs`, etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/The/For the
| Checking the expiry dates in the MachineConfig yourself is less pleasant because | ||
| the bundle is encoded. | ||
|
|
||
| You can use something like this to decode it if it's urlencoded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the bundle is URL-encoded, use the following command with the desired MC to decode it:
| oc -n openshift-machine-config-operator get mcp $MCP_NAME -o jsonpath='{.status.configuration.name}' | ||
| ``` | ||
|
|
||
| Checking the expiry dates in the MachineConfig yourself is less pleasant because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the following commands to check the expiry dates, based on how the kublet-ca.crt bundle is encoded.
| oc -n openshift-kube-apiserver describe secret/kubelet-client | ||
| ``` | ||
|
|
||
| You can also check the certificates that are present in the bundle on one of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/bundle/kublet-ca.crt bundle
| oc get mc rendered-worker-bc1470f2331a3136999e0b49d85e1e21 -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet-ca.crt")].contents.source}' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | openssl x509 -text -noout | ||
| ``` | ||
|
|
||
| Or something like this it if is base64 + gzipped: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the bundle is base64-encoded and gzipped, use the following command with the desired MC to decode it:
| - yuqi-zhang | ||
| - cheesesashimi | ||
| - jkyros | ||
| - raisaat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- mburke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, wasn't trying to leave you out 😄
EDIT: apparently I did leave your numbers out the first time.
|
@jkyros I made a few small comments. Feel free to accept or ignore them as you see fit. |
|
@mburke5678 thanks for taking a look. I like your wordings better and the translation issues hadn't occurred to me, thank you! |
As per specified procedure, adding owners file matching OWNERS file from machine-config-operator repository.
Thanks for the great review @mburke5678 !! Thanks for updating @jkyros! Teamwork! :) |
alerts/machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md
Outdated
Show resolved
Hide resolved
alerts/machine-config-operator/MachineConfigControllerPausedPoolKubeletCA.md
Outdated
Show resolved
Hide resolved
This is the first runbook for the machine-config-operator. It is a runbook for the alert MachineConfigControllerPausedPoolKubeletCA that fires when a machine config pool is paused and the kube-apiserver-to-kubelet-signer certificate is nearing its expiry.
|
Errors have been fixed, thanks @sinnykumari for taking a look. |
|
/lgtm |
|
/assign @NautiluX |
|
@jkyros / @kikisdeliveryservice - Sorry for the delay in reviewing this. This is a great runbook with lot of details 👍. So far, the references we have had were the following: I think the runbook majorly covers the points mentioned in the given two references as such and we can update these references with the runbook document link ahead too. Overall, LGTM but I will wait for @NautiluX to provide his feedback too. We'll merge it soon after that 👍 |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jkyros, kikisdeliveryservice, NautiluX The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jkyros: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This is for MCO-119 and corresponds to the alert added with openshift/machine-config-operator#2802
This is also the machine-config-operator's first runbook, so this sets up the OWNERS file.
On one hand, I feel like this is too verbose, on the other hand, given how often the situation for this alert has to be explained, I wanted to include enough "why" so the recipient of the alert fully understood the ramifications.