service-ca-rotation: Ensure pod restart before expiry of pre-rotation CA #193
Conversation
I think documentation for support is enough. This should be a rare edge case that won't repeat.

/approve
/assign @sttts

@deads2k Yes, this event won't repeat, but I'm not sure 'rare' is the right word. This requirement affects every 4.x cluster deployed today that is upgraded to a release supporting automated rotation.

I've updated this PR to ensure pod restart before expiry of the pre-rotation CA. tl;dr: The only mechanism to ensure services are using unexpired key material is pod restart, and the only dependable trigger for pod restart is a cluster upgrade. Since upgrades must occur every 12 months, ensuring an upgrade before expiry of key material dictates a CA duration greater than 24 months to cover the worst case.
> material from the current CA risks breaking trust with key material
> issued by the previous CA.
> - The total CA duration can thus be computed as follows:
>   - *D* = *M* + *R* > 2 * *I*
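The arithmetic in the quoted lines can be sketched concretely. This is a minimal illustration using the values proposed in this PR (not code from the operator), with the document's symbols *I*, *M*, *R*, *D*:

```python
# CA duration arithmetic from the proposal (all values in months).
I = 12       # maximum supported interval between cluster upgrades
M = I + 1    # minimum remaining CA duration at rotation; must exceed I
R = M        # interval between rotations, simplified to R = M per the discussion
D = M + R    # total CA validity

assert M > I       # an upgrade is guaranteed before the pre-rotation CA expires
assert R >= M      # the previous CA expires before yet another rotation occurs
assert D > 2 * I   # hence the total duration exceeds 24 months
print(D)  # 26
```

With *I* = 12 this yields the 26-month total duration and 13-month minimum remaining duration cited later in the thread.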
I cannot follow why there is this +.
I've added an extra step that hopefully explains this.
> Red Hat is 12 months.
> - At most 3 OCP releases are supported at one time, and 3 releases are
>   expected in a given year.
> - When the minimum CA duration, *M*, is reached, automatic rotation will
What is "minimum" in this duration? If the validity of the CA is V and we rotate at >= 50%, what is meant by M? 0.5 * V?
Correct. I've updated 'minimum CA duration' to 'minimum remaining CA duration'. Does that make it clearer? And should I update D to be V (variable names are hard, single-letter variable names are harder).
> - When the minimum CA duration, *M*, is reached, automatic rotation will
>   be triggered. *M* must be greater than *I* to ensure that an upgrade
>   occurs before the expiry of the pre-rotation CA.
> - The interval between automated rotations, *R*, must also be greater
where is the difference to M?
Updated to specify that R should be greater than or equal to M. Given that the proposed duration is large, it makes sense to minimize R by simplifying to R = M.
@sttts Updated with a fixup commit, PTAL.
> - The maximum interval, *I*, between upgrades of a cluster supported by
>   Red Hat is 12 months.
There have been whispers of an LTS version; wouldn't that break this assumption?
afaik there will be no such thing as an LTS release that doesn't still require an upgrade at least once a year if only to a patch release. Any upgrade will do, not just between minor versions.
@smarterclayton Can you confirm?
Added mention of LTS and upgrade to patch release in fixup #2
We don't require that customers "upgrade," but in general customers should be updating production clusters more than once a year, given that we are providing them weekly updates (at least micro-updates).
> - T+26m - CA-1 expires. No impact because of the restart at time of upgrade
> - If the service CA validity duration (entire, not remaining) is less than
>   26 months, then automated rotation should be triggered and the cluster
>   administrator will need to manually restart all pods.
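The timeline in the quoted lines can be replayed as a quick sanity check. A hedged sketch, assuming the proposal's 26-month validity, 13-month minimum remaining duration, and a 12-month maximum upgrade interval (months are counted from cluster install):

```python
# Worst-case timeline sketch, in months since cluster install.
CA_VALIDITY = 26           # total validity of CA-1 (D)
MIN_REMAINING = 13         # remaining validity that triggers rotation (M)
MAX_UPGRADE_INTERVAL = 12  # supported clusters upgrade at least this often (I)

rotation = CA_VALIDITY - MIN_REMAINING            # T+13m: CA-2 issued, CA-1 still valid
ca1_expiry = CA_VALIDITY                          # T+26m: CA-1 expires
latest_upgrade = rotation + MAX_UPGRADE_INTERVAL  # T+25m: latest supported upgrade

# The pod restarts that accompany the upgrade happen before CA-1 expires,
# which is why the quoted text says the expiry has no impact.
assert latest_upgrade < ca1_expiry
print(latest_upgrade, ca1_expiry)  # 25 26
```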
s/should/will? Wouldn't it be easier to backport the rotation to a n-1 version so that no manual intervention is necessary and we're sure all pods contain the current certs?
There is no such thing as 'no manual intervention': a one-time manual restart is required for any cluster upgrading to a release that supports automated rotation. There is no other way to ensure that the pre-rotation CA won't expire before a subsequent upgrade, because its remaining duration is by definition less than the maximum supported upgrade interval of 12 months.
Updated 'will' to 'should' in fixup #2
> - The requirement to manually restart pods after a cluster is upgraded to
>   a release that supports automated CA rotation is a one-time thing. All
>   subsequent upgrades and rotations will ensure restart before expiry
>   without manual intervention.
This should not include restarting control-plane pods, whose certs should be rotated by their operators.
Why would it make sense to differentiate between pods running user workloads and control plane pods if manual restart involves restarting all pods?
They should be capable of rotating the certs for their payloads themselves, but I suppose it's safer to restart all anyway...
> the expiry of the pre-rotation CA.
> - *R* must be greater than or equal to the minimum remaining CA
>   duration, *M*, to ensure that an upgrade occurs before a subsequent
>   rotation.
This I don't understand.

With D as the validity of the CA and M as the minimum remaining CA duration, we have D = M + R. M only has an influence on the validity of the previous CA after rotation. With R >= M we enforce that the previous CA will expire before yet another rotation takes place. This means that the cross-signing will break earlier than it otherwise could (i.e., at the moment of the following rotation). Is that intentional?
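The inequality under discussion can be stated as a one-line check: with the previous CA holding M months of remaining validity at rotation, and the next rotation R months later, R >= M means the previous CA expires at or before the next rotation. A sketch with the proposal's values (illustration only, not operator code):

```python
# Ordering of events after a rotation at t = 0 (months). Values from the
# proposal: M = R = 13.
M = 13  # remaining validity of the previous (cross-signed) CA at rotation
R = 13  # interval until the following rotation

prev_ca_expiry = M
next_rotation = R

# With R >= M, the previous CA expires no later than the following rotation,
# so at most two generations of CA are ever trusted simultaneously.
assert prev_ca_expiry <= next_rotation
print(prev_ca_expiry <= next_rotation)  # True
```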
The only requirement is that an upgrade/restart occurs after a rotation. Once that occurs, there is no harm in either a subsequent rotation or the expiry of the pre-rotation CA.
PR implementing the proposed change: openshift/service-ca-operator#106
Should we be discussing metrics and/or alerts that warn users of CA validity compared to pod start time? In short, if the CA is newer than a pod's start time, users should be warned that a pod restart (to pick up new cert material) is required.
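The alert idea above could be prototyped by comparing the CA's issuance time with each pod's start time. A hypothetical sketch (the function name and data shapes are invented for illustration, not the operator's actual API or metric):

```python
from datetime import datetime, timedelta, timezone

def pods_needing_restart(ca_not_before, pods):
    """Return names of pods started before the current CA was issued.

    Such pods may still be serving pre-rotation key material and should
    be restarted to pick up the new certs. `pods` is a list of
    (name, start_time) tuples; all times are timezone-aware datetimes.
    """
    return [name for name, started in pods if started < ca_not_before]

# Hypothetical data: CA rotated an hour ago, one pod predates it.
now = datetime.now(timezone.utc)
ca_rotated = now - timedelta(hours=1)
pods = [("old-pod", now - timedelta(days=2)), ("new-pod", now)]
print(pods_needing_restart(ca_rotated, pods))  # ['old-pod']
```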
> 12 months to 26 months and the minimum CA duration should be extended to 13
> months. These values ensure that pods will be guaranteed to be restarted in
> a cluster supported by Red Hat before the expiry of the pre-rotation CA.
> - The timing of upgrades is key to determining CA duration:
Add a note here that says that all pods in the openshift control plane automatically reload certificates, keys, and ca-bundles without a pod restart. I've received questions.
Done
@marun do we have e2e tests that check
/lgtm hold for minor update, test link, and squash.

Added links to the code in the testing section.

@deads2k Updated, PTAL
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, marun

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/cc @sttts @deads2k @stlaz @mfojtik