
Conversation

marun (Contributor) commented Jan 28, 2020

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 28, 2020
deads2k (Contributor) commented Jan 28, 2020

I think documentation for support is enough. This should be a rare edge case that won't repeat.

/approve

/assign @sttts

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 28, 2020
marun (Contributor, author) commented Jan 28, 2020

@deads2k Agreed that this event won't repeat, but I'm not sure 'rare' is the right word. This requirement affects every 4.x cluster deployed today that is upgraded to a release supporting automated rotation.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 3, 2020
@marun marun changed the title service-ca-rotation: Ensure initial rotation and manual pod restart service-ca-rotation: Ensure pod restart before expiry of pre-rotation CA Feb 3, 2020
marun (Contributor, author) commented Feb 3, 2020

I've updated this PR to ensure pod restart before expiry of the pre-rotation CA.

tl;dr The only mechanism to ensure services are using unexpired key material is pod restart, and the only dependable trigger for pod restart is a cluster upgrade. Since upgrades must occur every 12 months, ensuring upgrade before expiry of key material dictates a CA duration greater than 24 months to cover the worst case.
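To make the worst-case arithmetic concrete, here is a minimal Go sketch (variable names are illustrative only, not from the operator's code) that reproduces the 13-month and 26-month figures discussed later in this thread:

```go
package main

import "fmt"

func main() {
	const upgradeIntervalMonths = 12 // I: maximum supported gap between upgrades

	// M must exceed I so that an upgrade (and the pod restarts it brings)
	// is guaranteed to happen before the pre-rotation CA expires.
	minRemainingMonths := upgradeIntervalMonths + 1 // M = 13

	// R, the interval between rotations, is simplified to equal M
	// (see the review discussion below).
	rotationIntervalMonths := minRemainingMonths // R = 13

	// Total CA validity D = M + R, strictly greater than 2*I.
	caDurationMonths := minRemainingMonths + rotationIntervalMonths // D = 26

	fmt.Printf("I=%d M=%d R=%d D=%d (months)\n",
		upgradeIntervalMonths, minRemainingMonths,
		rotationIntervalMonths, caDurationMonths)
}
```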

material from the current CA risks breaking trust with key material
issued by the previous CA.
- The total CA duration can thus be computed as follows:
- *D* = *M* + *R* > 2 * *I*
Contributor:

I cannot follow why there is this +.

marun (author):

I've added an extra step that hopefully explains this.
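For readers following the math, the extra step presumably amounts to chaining the constraints stated elsewhere in the doc (*M* > *I* and *R* >= *M*):

*D* = *M* + *R* >= *M* + *M* > *I* + *I* = 2 * *I*

i.e., the total CA duration must exceed twice the maximum upgrade interval.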

Red Hat is 12 months.
- At most 3 OCP releases are supported at one time, and 3 releases are
expected in a given year.
- When the minimum CA duration, *M*, is reached, automatic rotation will
Contributor:

What is "minimum" in this duration? If the validity of the CA is V and we rotate at >=50%, what is meant by M? 0.5 * V?

marun (author):

Correct. I've updated 'minimum CA duration' to 'minimum remaining CA duration'. Does that make it clearer? And should I update D to be V (variable names are hard, single-letter variable names are harder).
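To illustrate the 'minimum remaining CA duration' threshold in code, a minimal sketch (the constant, function, and package names are hypothetical, not the service-ca-operator's actual API):

```go
package rotation

import (
	"crypto/x509"
	"time"
)

// minRemainingDuration is the threshold M. With a total validity V of
// 26 months and M of 13 months, rotation fires at roughly 50% of the
// CA's lifetime, i.e. M = 0.5 * V.
const minRemainingDuration = 13 * 30 * 24 * time.Hour // ~13 months

// needsRotation reports whether the current signing CA has less than
// the minimum remaining duration and should therefore be rotated.
func needsRotation(ca *x509.Certificate, now time.Time) bool {
	return ca.NotAfter.Sub(now) < minRemainingDuration
}
```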

- When the minimum CA duration, *M*, is reached, automatic rotation will
be triggered. *M* must be greater than *I* to ensure that an upgrade
occurs before the expiry of the pre-rotation CA.
- The interval between automated rotations, *R*, must also be greater
Contributor:

What is the difference from M?

marun (author):

Updated to specify that R should be greater than or equal to M. Given that the proposed duration is large, it makes sense to minimize R by simplifying to R = M.

marun (Contributor, author) commented Feb 3, 2020

@sttts Updated with a fixup commit, PTAL.

Comment on lines +115 to +119
- The maximum interval, *I*, between upgrades of a cluster supported by
Red Hat is 12 months.
Contributor:

There have been whispers of an LTS version; wouldn't that break this assumption?

marun (author):

AFAIK there will be no such thing as an LTS release that doesn't still require an upgrade at least once a year, if only to a patch release. Any upgrade will do, not just an upgrade between minor versions.

@smarterclayton Can you confirm?

marun (author):

Added mention of LTS and upgrade to patch release in fixup #2

Comment:

We don't require that customers "upgrade", but in general customers should be updating production clusters more than once a year, given that we are providing them weekly updates (at least micro-updates).

- T+26m - CA-1 expires. No impact because of the restart at time of upgrade
- If the service CA validity duration (entire, not remaining) is less than
26 months, then automated rotation should be triggered and the cluster
administrator will need to manually restart all pods.
Contributor:

s/should/will? Wouldn't it be easier to backport the rotation to an n-1 version so that no manual intervention is necessary and we're sure all pods contain the current certs?

marun (author):

There is no such thing as 'no manual intervention': a one-time manual restart is required for any cluster upgrading to a release that supports automated rotation. There is no other way to ensure that the pre-rotation CA won't expire before a subsequent upgrade, because its remaining duration is by definition going to be less than the maximum upgrade interval of 12 months.

marun (author) commented Feb 4, 2020:

Updated 'will' to 'should' in fixup #2.

Comment on lines +166 to +174
- The requirement to manually restart pods after a cluster is upgraded to
a release that supports automated CA rotation is a one-time thing. All
subsequent upgrades and rotations will ensure restart before expiry
without manual intervention.
Contributor:

This should not include restarting control-plane pods, whose certs should be rotated by their operators.

marun (author) commented Feb 4, 2020:

Why would it make sense to differentiate between pods running user workloads and control plane pods if manual restart involves restarting all pods?

Contributor:

They should be capable of rotating the certs for their payloads themselves, but I suppose it's safer to restart all anyway...

the expiry of the pre-rotation CA.
- *R* must be greater than or equal to the minimum remaining CA
duration, *M*, to ensure that an upgrade occurs before a subsequent
rotation.
sttts (Contributor) commented Feb 4, 2020:

This I don't understand.

With D as the validity of the CA and M as the minimum remaining CA duration, we have D = M + R. M only has an influence on the validity of the previous CA after rotation. With R >= M we enforce that the previous CA will expire before yet another rotation takes place. This means that the cross-signed trust will break earlier than it could (i.e., at the moment of the following rotation). Is that intentional?

marun (author) commented Feb 4, 2020:

The only requirement is that an upgrade/restart occurs after a rotation. Once that occurs, there is no harm in either a subsequent rotation or the expiry of the pre-rotation CA.
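For what it's worth, a worked timeline under the proposed values (validity 26 months, M = R = 13 months) makes this concrete; offsets are measured from the creation of CA-1:

- T+0 - CA-1 is created with 26 months of validity.
- T+13m - CA-1's remaining duration reaches M; rotation is triggered. CA-2 takes over signing, and cross-signing preserves trust in material issued by CA-1.
- By roughly T+25m - the next upgrade (and therefore a full pod restart) is due within the 12-month maximum interval, moving all pods to key material from CA-2.
- T+26m - CA-1 expires, harmlessly, since nothing depends on it after the restart. A subsequent rotation at this point (R = M) is equally harmless.

This matches the 'T+26m - CA-1 expires' line quoted above.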

marun (Contributor, author) commented Feb 5, 2020

Fixup #3 updates the section on rotation on upgrade to reflect that upgrade to a release supporting automated rotation will ensure an initial rotation at time of upgrade.

marun (Contributor, author) commented Feb 5, 2020

PR implementing the proposed change: openshift/service-ca-operator#106

sferich888 commented:
Should we be discussing metrics and/or alerts that warn users about CA validity compared to pod start time?

In short, if the CA is newer than a pod's start time, users should be warned that a pod restart (to pick up the new cert material) is required.

  • This may need to be a new enhancement?
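A minimal sketch of the check sferich888 suggests, using client-go; the ConfigMap location and the idea of flagging pods older than the CA are assumptions for illustration, not an existing alert:

```go
package main

import (
	"context"
	"crypto/x509"
	"encoding/pem"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Lists pods that started before the current service CA was issued;
// such pods may still serve certs from the previous CA and would need
// a restart.
func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The namespace/name of the CA bundle ConfigMap is an assumption
	// for illustration.
	cm, err := client.CoreV1().ConfigMaps("openshift-service-ca").
		Get(context.TODO(), "signing-cabundle", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode([]byte(cm.Data["ca-bundle.crt"]))
	ca, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.StartTime != nil && pod.Status.StartTime.Time.Before(ca.NotBefore) {
			fmt.Printf("%s/%s predates the current service CA\n", pod.Namespace, pod.Name)
		}
	}
}
```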

12 months to 26 months and the minimum CA duration should be extended to 13
months. These values ensure that pods will be guaranteed to be restarted in
a cluster supported by Red Hat before the expiry of the pre-rotation CA.
- The timing of upgrades is key to determining CA duration:
Contributor:

Add a note here that says that all pods in the OpenShift control plane automatically reload certificates, keys, and CA bundles without a pod restart. I've received questions.

marun (author):

Done
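For reference, a minimal sketch of the kind of in-process reloading referred to here, built on fsnotify; the actual OpenShift components use their own dynamic certificate machinery rather than this hand-rolled watcher:

```go
package certreload

import (
	"crypto/tls"
	"log"
	"sync"

	"github.com/fsnotify/fsnotify"
)

// certReloader re-reads a serving cert/key pair whenever the cert file
// changes, so the process picks up rotated material without a restart.
type certReloader struct {
	mu       sync.RWMutex
	cert     *tls.Certificate
	certFile string
	keyFile  string
}

func (r *certReloader) load() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// GetCertificate is plugged into tls.Config so every handshake sees
// the most recently loaded certificate.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

func (r *certReloader) watch() error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	// Watching the file directly misses atomic-rename updates; real
	// implementations also watch the containing directory.
	if err := watcher.Add(r.certFile); err != nil {
		return err
	}
	go func() {
		for range watcher.Events {
			if err := r.load(); err != nil {
				log.Printf("cert reload failed: %v", err)
			}
		}
	}()
	return nil
}
```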

deads2k (Contributor) commented Feb 7, 2020

@marun do we have e2e tests that check

  1. old service ca, old serving cert
  2. old service ca, new serving cert
  3. new service ca, old serving cert
  4. new service ca, new serving cert

If so, can you link them; if not, can you create a bug?
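The four combinations could be verified with a table-driven test along these lines; this is only a sketch with stubbed fixtures, not the e2e code that ended up being linked from the testing section:

```go
package e2e

import (
	"crypto/tls"
	"crypto/x509"
	"testing"
)

// Addresses of services whose serving certs were issued before and
// after rotation. Stubbed here; the real test creates the services.
var oldServiceAddr, newServiceAddr string

// rotateServiceCA would force a rotation and return cert pools that
// trust only the pre-rotation and only the post-rotation CA.
func rotateServiceCA(t *testing.T) (oldCA, newCA *x509.CertPool) {
	t.Helper()
	t.Skip("sketch only: rotation fixture not implemented")
	return nil, nil
}

// verifyTrust dials addr and reports whether a client trusting the
// given CA pool accepts the presented serving certificate.
func verifyTrust(pool *x509.CertPool, addr string) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{RootCAs: pool})
	if err != nil {
		return err
	}
	return conn.Close()
}

func TestServiceCARotationTrustMatrix(t *testing.T) {
	oldCA, newCA := rotateServiceCA(t)
	cases := []struct {
		name string
		pool *x509.CertPool
		addr string
	}{
		{"old service ca, old serving cert", oldCA, oldServiceAddr},
		{"old service ca, new serving cert", oldCA, newServiceAddr},
		{"new service ca, old serving cert", newCA, oldServiceAddr},
		{"new service ca, new serving cert", newCA, newServiceAddr},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if err := verifyTrust(tc.pool, tc.addr); err != nil {
				t.Fatalf("trust check failed: %v", err)
			}
		})
	}
}
```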

deads2k (Contributor) commented Feb 7, 2020

/lgtm
/hold

Hold for minor update, test link, and squash.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
marun (Contributor, author) commented Feb 7, 2020

@marun do we have e2e tests that check

  1. old service ca, old serving cert
  2. old service ca, new serving cert
  3. new service ca, old serving cert
  4. new service ca, new serving cert
    if so, can you link them, if not can you create a bug?

Added links to the code in the testing section.

marun (Contributor, author) commented Feb 7, 2020

@deads2k Updated, PTAL

deads2k (Contributor) commented Feb 7, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020
openshift-ci-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, marun

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@deads2k deads2k removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2020
@openshift-merge-robot openshift-merge-robot merged commit 4a86556 into openshift:master Feb 7, 2020
@marun marun deleted the initial-service-ca-rotation branch February 7, 2020 21:12