
Conversation

marun (Contributor) commented Jan 28, 2020

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 28, 2020
deads2k (Contributor) commented Jan 28, 2020

I think documentation for support is enough. This should be a rare edge case that won't repeat.

/approve

/assign @sttts

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 28, 2020
marun (Contributor, author) commented Jan 28, 2020

@deads2k Agreed that this event won't repeat, but I'm not sure 'rare' is the right word. This requirement affects every 4.x cluster deployed today that is upgraded to a release supporting automated rotation.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 3, 2020
@marun marun changed the title service-ca-rotation: Ensure initial rotation and manual pod restart service-ca-rotation: Ensure pod restart before expiry of pre-rotation CA Feb 3, 2020
marun (Contributor, author) commented Feb 3, 2020

I've updated this PR to ensure pod restart before expiry of the pre-rotation CA.

tl;dr The only mechanism to ensure services are using unexpired key material is pod restart, and the only dependable trigger for pod restart is a cluster upgrade. Since upgrades must occur every 12 months, ensuring upgrade before expiry of key material dictates a CA duration greater than 24 months to cover the worst case.
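To make the worst-case arithmetic concrete, here is a minimal Go sketch (variable names are illustrative only, not from the operator's code) that reproduces the 13-month and 26-month figures discussed later in this thread:

```go
package main

import "fmt"

func main() {
	const upgradeIntervalMonths = 12 // I: maximum supported gap between upgrades

	// M must exceed I so that an upgrade (and the pod restarts it brings)
	// is guaranteed to happen before the pre-rotation CA expires.
	minRemainingMonths := upgradeIntervalMonths + 1 // M = 13

	// R, the interval between rotations, is simplified to equal M
	// (see the review discussion below).
	rotationIntervalMonths := minRemainingMonths // R = 13

	// Total CA validity D = M + R, strictly greater than 2*I.
	caDurationMonths := minRemainingMonths + rotationIntervalMonths // D = 26

	fmt.Printf("I=%d M=%d R=%d D=%d (months)\n",
		upgradeIntervalMonths, minRemainingMonths,
		rotationIntervalMonths, caDurationMonths)
}
```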

material from the current CA risks breaking trust with key material
issued by the previous CA.
- The total CA duration can thus be computed as follows:
- *D* = *M* + *R* > 2 * *I*
Contributor:

I cannot follow why there is this +.

marun (author):

I've added an extra step that hopefully explains this.
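For readers following the math, the extra step presumably amounts to chaining the constraints stated elsewhere in the doc (*M* > *I* and *R* >= *M*):

*D* = *M* + *R* >= *M* + *M* > *I* + *I* = 2 * *I*

i.e., the total CA duration must exceed twice the maximum upgrade interval.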

Red Hat is 12 months.
- At most 3 OCP releases are supported at one time, and 3 releases are
expected in a given year.
- When the minimum CA duration, *M*, is reached, automatic rotation will
Contributor:

What is "minimum" in this duration? If the validity of the CA is V and we rotate at >=50%, what is meant by M? 0.5 * V?

marun (author):

Correct. I've updated 'minimum CA duration' to 'minimum remaining CA duration'. Does that make it clearer? And should I update D to be V (variable names are hard, single-letter variable names are harder).
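To illustrate the 'minimum remaining CA duration' threshold in code, a minimal sketch (the constant, function, and package names are hypothetical, not the service-ca-operator's actual API):

```go
package rotation

import (
	"crypto/x509"
	"time"
)

// minRemainingDuration is the threshold M. With a total validity V of
// 26 months and M of 13 months, rotation fires at roughly 50% of the
// CA's lifetime, i.e. M = 0.5 * V.
const minRemainingDuration = 13 * 30 * 24 * time.Hour // ~13 months

// needsRotation reports whether the current signing CA has less than
// the minimum remaining duration and should therefore be rotated.
func needsRotation(ca *x509.Certificate, now time.Time) bool {
	return ca.NotAfter.Sub(now) < minRemainingDuration
}
```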

- When the minimum CA duration, *M*, is reached, automatic rotation will
be triggered. *M* must be greater than *I* to ensure that an upgrade
occurs before the expiry of the pre-rotation CA.
- The interval between automated rotations, *R*, must also be greater
Contributor:

What is the difference from M?

marun (author):

Updated to specify that R should be greater than or equal to M. Given that the proposed duration is large, it makes sense to minimize R by simplifying to R = M.

marun (Contributor, author) commented Feb 3, 2020

@sttts Updated with a fixup commit, PTAL.

Comment on lines +115 to +119
- The maximum interval, *I*, between upgrades of a cluster supported by
Red Hat is 12 months.
Contributor:

There have been whispers of an LTS version; wouldn't that break this assumption?

marun (author):

AFAIK there will be no such thing as an LTS release that doesn't still require an upgrade at least once a year, if only to a patch release. Any upgrade will do, not just an upgrade between minor versions.

@smarterclayton Can you confirm?

marun (author):

Added mention of LTS and upgrade to patch release in fixup #2

Comment:

We don't require that customers "upgrade", but in general customers should be updating production clusters more than once a year, given that we are providing them weekly updates (at least micro-updates).

- T+26m - CA-1 expires. No impact because of the restart at time of upgrade
- If the service CA validity duration (entire, not remaining) is less than
26 months, then automated rotation should be triggered and the cluster
administrator will need to manually restart all pods.
Contributor:

s/should/will? Wouldn't it be easier to backport the rotation to an n-1 version so that no manual intervention is necessary and we're sure all pods contain the current certs?

marun (author):

There is no such thing as 'no manual intervention': a one-time manual restart is required for any cluster upgrading to a release that supports automated rotation. There is no other way to ensure that the pre-rotation CA won't expire before a subsequent upgrade, because its remaining duration is by definition going to be less than the maximum upgrade interval of 12 months.

marun (author) commented Feb 4, 2020:

Updated 'will' to 'should' in fixup #2.

Comment on lines +166 to +174
- The requirement to manually restart pods after a cluster is upgraded to
a release that supports automated CA rotation is a one-time thing. All
subsequent upgrades and rotations will ensure restart before expiry
without manual intervention.
Contributor:

This should not include restarting control-plane pods, whose certs should be rotated by their operators.

marun (author) commented Feb 4, 2020:

Why would it make sense to differentiate between pods running user workloads and control plane pods if manual restart involves restarting all pods?

Contributor:

They should be capable of rotating the certs for their payloads themselves, but I suppose it's safer to restart all anyway...

the expiry of the pre-rotation CA.
- *R* must be greater than or equal to the minimum remaining CA
duration, *M*, to ensure that an upgrade occurs before a subsequent
rotation.
sttts (Contributor) commented Feb 4, 2020:

This I don't understand.

With D as the validity of the CA and M as the minimum remaining CA duration, we have D = M + R. M only has an influence on the validity of the previous CA after rotation. With R >= M we enforce that the previous CA will expire before yet another rotation takes place. This means that the cross-signed trust will break earlier than it could (i.e., at the moment of the following rotation). Is that intentional?

marun (author) commented Feb 4, 2020:

The only requirement is that an upgrade/restart occurs after a rotation. Once that occurs, there is no harm in either a subsequent rotation or the expiry of the pre-rotation CA.
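For what it's worth, a worked timeline under the proposed values (validity 26 months, M = R = 13 months) makes this concrete; offsets are measured from the creation of CA-1:

- T+0 - CA-1 is created with 26 months of validity.
- T+13m - CA-1's remaining duration reaches M; rotation is triggered. CA-2 takes over signing, and cross-signing preserves trust in material issued by CA-1.
- By roughly T+25m - the next upgrade (and therefore a full pod restart) is due within the 12-month maximum interval, moving all pods to key material from CA-2.
- T+26m - CA-1 expires, harmlessly, since nothing depends on it after the restart. A subsequent rotation at this point (R = M) is equally harmless.

This matches the 'T+26m - CA-1 expires' line quoted above.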

marun (Contributor, author) commented Feb 5, 2020

Fixup #3 updates the section on rotation on upgrade to reflect that upgrade to a release supporting automated rotation will ensure an initial rotation at time of upgrade.

marun (Contributor, author) commented Feb 5, 2020

PR implementing the proposed change: openshift/service-ca-operator#106

sferich888 commented:
Should we be discussing metrics and/or alerts that warn users about CA validity compared to pod start time?

In short, if the CA is newer than a pod's start time, users should be warned that a pod restart (to pick up the new cert material) is required.

  • This may need to be a new enhancement?
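A minimal sketch of the check sferich888 suggests, using client-go; the ConfigMap location and the idea of flagging pods older than the CA are assumptions for illustration, not an existing alert:

```go
package main

import (
	"context"
	"crypto/x509"
	"encoding/pem"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Lists pods that started before the current service CA was issued;
// such pods may still serve certs from the previous CA and would need
// a restart.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The namespace/name of the CA bundle ConfigMap is an assumption
	// for illustration.
	cm, err := client.CoreV1().ConfigMaps("openshift-service-ca").
		Get(context.TODO(), "signing-cabundle", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode([]byte(cm.Data["ca-bundle.crt"]))
	ca, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.StartTime != nil && pod.Status.StartTime.Time.Before(ca.NotBefore) {
			fmt.Printf("%s/%s predates the current service CA\n", pod.Namespace, pod.Name)
		}
	}
}
```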

12 months to 26 months and the minimum CA duration should be extended to 13
months. These values ensure that pods will be guaranteed to be restarted in
a cluster supported by Red Hat before the expiry of the pre-rotation CA.
- The timing of upgrades is key to determining CA duration:
Contributor:

Add a note here that says that all pods in the OpenShift control plane automatically reload certificates, keys, and CA bundles without a pod restart. I've received questions.

marun (author):

Done
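For reference, a minimal sketch of the kind of in-process reloading referred to here, built on fsnotify; the actual OpenShift components use their own dynamic certificate machinery rather than this hand-rolled watcher:

```go
package certreload

import (
	"crypto/tls"
	"log"
	"sync"

	"github.com/fsnotify/fsnotify"
)

// certReloader re-reads a serving cert/key pair whenever the cert file
// changes, so the process picks up rotated material without a restart.
type certReloader struct {
	mu       sync.RWMutex
	cert     *tls.Certificate
	certFile string
	keyFile  string
}

func (r *certReloader) load() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// GetCertificate is plugged into tls.Config so every handshake sees
// the most recently loaded certificate.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

func (r *certReloader) watch() error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	// Watching the file directly misses atomic-rename updates; real
	// implementations also watch the containing directory.
	if err := watcher.Add(r.certFile); err != nil {
		return err
	}
	go func() {
		for range watcher.Events {
			if err := r.load(); err != nil {
				log.Printf("cert reload failed: %v", err)
			}
		}
	}()
	return nil
}
```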

deads2k (Contributor) commented Feb 7, 2020

@marun do we have e2e tests that check

  1. old service ca, old serving cert
  2. old service ca, new serving cert
  3. new service ca, old serving cert
  4. new service ca, new serving cert

If so, can you link them; if not, can you create a bug?
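The four combinations could be verified with a table-driven test along these lines; this is only a sketch with stubbed fixtures, not the e2e code that ended up being linked from the testing section:

```go
package e2e

import (
	"crypto/tls"
	"crypto/x509"
	"testing"
)

// Addresses of services whose serving certs were issued before and
// after rotation. Stubbed here; the real test creates the services.
var oldServiceAddr, newServiceAddr string

// rotateServiceCA would force a rotation and return cert pools that
// trust only the pre-rotation and only the post-rotation CA.
func rotateServiceCA(t *testing.T) (oldCA, newCA *x509.CertPool) {
	t.Helper()
	t.Skip("sketch only: rotation fixture not implemented")
	return nil, nil
}

// verifyTrust dials addr and reports whether a client trusting the
// given CA pool accepts the presented serving certificate.
func verifyTrust(pool *x509.CertPool, addr string) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{RootCAs: pool})
	if err != nil {
		return err
	}
	return conn.Close()
}

func TestServiceCARotationTrustMatrix(t *testing.T) {
	oldCA, newCA := rotateServiceCA(t)
	cases := []struct {
		name string
		pool *x509.CertPool
		addr string
	}{
		{"old service ca, old serving cert", oldCA, oldServiceAddr},
		{"old service ca, new serving cert", oldCA, newServiceAddr},
		{"new service ca, old serving cert", newCA, oldServiceAddr},
		{"new service ca, new serving cert", newCA, newServiceAddr},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if err := verifyTrust(tc.pool, tc.addr); err != nil {
				t.Fatalf("trust check failed: %v", err)
			}
		})
	}
}
```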

deads2k (Contributor) commented Feb 7, 2020

/lgtm
/hold

Hold for minor update, test link, and squash.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
marun (Contributor, author) commented Feb 7, 2020

@marun do we have e2e tests that check

  1. old service ca, old serving cert
  2. old service ca, new serving cert
  3. new service ca, old serving cert
  4. new service ca, new serving cert
    if so, can you link them, if not can you create a bug?

Added links to the code in the testing section.

marun (Contributor, author) commented Feb 7, 2020

@deads2k Updated, PTAL

deads2k (Contributor) commented Feb 7, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
openshift-ci-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, marun

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@deads2k deads2k removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2020
@openshift-merge-robot openshift-merge-robot merged commit 4a86556 into openshift:master Feb 7, 2020
@marun marun deleted the initial-service-ca-rotation branch February 7, 2020 21:12