Support --options on upgrade tests to abort in progress #22726
Conversation
/retest

Will be adding a disruption e2e

/retest

1 similar comment

/retest

New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
To better stress test upgrades, add disruption elements for aborting an upgrade part of the way through as well as rebooting random masters. --options=abort-at=PERCENT will cause the upgrade to stop and roll back to the previous version when PERCENT of operators have been upgraded. 100 will be after the upgrade is complete, while 'random' will be at a randomly chosen percent. --options=disrupt-reboot=POLICY causes random periodic reboots of masters during upgrades. If set to 'graceful', the reboot allows clean shutdown. If set to 'force', the machines exit immediately (to simulate power loss).
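As a rough sketch of the option semantics described above (not the actual origin implementation; the type and function names here are hypothetical), an abort-at/disrupt-reboot options string could be parsed along these lines:

```go
// Hypothetical sketch of parsing "abort-at=...,disrupt-reboot=..." options.
package main

import (
	"fmt"
	"math/rand"
	"strconv"
	"strings"
)

type upgradeOptions struct {
	abortAt       int    // percent of operators upgraded before rolling back; 0 means never abort
	disruptReboot string // "", "graceful", or "force"
}

func parseUpgradeOptions(s string) (*upgradeOptions, error) {
	opts := &upgradeOptions{}
	if s == "" {
		return opts, nil
	}
	for _, pair := range strings.Split(s, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("option %q must be of the form KEY=VALUE", pair)
		}
		switch kv[0] {
		case "abort-at":
			if kv[1] == "random" {
				// assumed interpretation: pick a percent in [1,99] so the abort lands mid-upgrade
				opts.abortAt = rand.Intn(99) + 1
				continue
			}
			p, err := strconv.Atoi(kv[1])
			if err != nil || p < 0 || p > 100 {
				return nil, fmt.Errorf("abort-at must be 'random' or an integer 0-100, got %q", kv[1])
			}
			opts.abortAt = p
		case "disrupt-reboot":
			if kv[1] != "graceful" && kv[1] != "force" {
				return nil, fmt.Errorf("disrupt-reboot must be 'graceful' or 'force', got %q", kv[1])
			}
			opts.disruptReboot = kv[1]
		default:
			return nil, fmt.Errorf("unrecognized option %q", kv[0])
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseUpgradeOptions("abort-at=random,disrupt-reboot=graceful")
	if err != nil {
		panic(err)
	}
	fmt.Printf("abort at %d%%, reboot policy %q\n", opts.abortAt, opts.disruptReboot)
}
```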
For 4.2->4.3 and 4.3->4.4. I've left off 4.1->4.2, since 4.1 is pretty old and stable. I've left off 4.4->4.5, because we haven't built a 4.5 nightly yet [1]. This should help catch breakage like the ephemeral-storage request that broke 4.2 -> * updates [2], but didn't turn up in CI because we don't have any jobs testing nightly -> updates.

After this commit we'll have:

* endurance-upgrade-aws-4.3: I'm not really clear on what this does. Seems to use the template from 39e69e2 (add long lived cluster management job and e2e test, 2019-06-12, openshift#3887). Seems to use 4.3-ci -> self updates? I dunno.
* release-openshift-origin-installer-e2e-aws-upgrade-4.3: Lets the release controller or ci-operator or some such choose the source and target version.
* release-openshift-origin-installer-e2e-aws-upgrade-fips-4.3: 4.3-ci penultimate -> 4.3-ci latest on AWS with FIPS enabled.
* release-openshift-origin-installer-e2e-azure-upgrade-4.3: 4.3-ci penultimate -> 4.3-ci latest on Azure.
* release-openshift-origin-installer-e2e-gcp-upgrade-4.3: 4.3-ci penultimate -> 4.3-ci latest on GCP.
* release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3: 4.2-stable -> 4.3-ci on AWS.
* release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3: 4.2-nightly -> 4.3-nightly on AWS. New in this commit.
* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3: 4.2-stable -> 4.3-ci on AWS with TEST_OPTIONS=abort-at=99. For more on abort-at, see openshift/origin@a53efd5e27 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift/origin#22726).
* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3: 4.3-ci penultimate -> 4.3-ci latest on AWS with TEST_OPTIONS=abort-at=random.

and similarly for 4.4.

I'm not entirely clear on how the release informer jobs ingest the version being considered for promotion; maybe these new jobs will end up just being vanilla periodics. But that's probably fine, because all we need is some sort of signal in CI to show that 4.2-nightly -> 4.3 (or whatever) is broken before we give that 4.2 nightly a stable name like 4.2.13 (or whatever). Even if these do run as 4.3 promotion informers, breakage like [2] happened in the 4.2 nightly. So you could still have:

1. 4.2 PR lands and breaks 4.2 -> 4.3.
2. Associated 4.2 nightly promotion goes through all green.
3. Some subsequent 4.3 change lands, and the informing job fails because of the 4.2 change from step 1.

But again, as long as we have some kind of signal (like the one added by this commit), the release admins should hear about it and know that they need the breakage triaged before they give a nightly a stable name and sign the release.

[1]: https://openshift-release.svc.ci.openshift.org/#4.5.0-0.nightly
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1786315#c2
It was added in a53efd5 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift#22726) but never used.
durationToSoftFailure was added in 4447a19 (allow longer upgrade times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411), but didn't get the 2x on rollbacks we've been adding to maximumDuration since a53efd5 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift#22726). That's recently been causing the cluster-version operator's A->B->A rollback CI jobs to time out [1]. This commit catches durationToSoftFailure up with the "2x on rollbacks" approach, and also mentions "aborted" in messages for those types of tests, to help remind folks what's going on. An alternative approach would be to teach clusterUpgrade to treat rollbacks as two separate hops (one for A->B, and another for B->A). But that would be a more involved restructuring, and since we already had the 2x maximumDuration precedent in place, I haven't gone in that direction. [1]: openshift/cluster-version-operator#514 (comment)
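For illustration only, a minimal sketch of the "2x on rollbacks" budgeting described above. The 75-minute soft-failure base comes from the commit quoted here; the hard-failure base and the function name are assumptions, not the actual origin values:

```go
// Illustrative sketch: doubling upgrade-time budgets for aborted (A->B->A) runs.
package main

import (
	"fmt"
	"time"
)

func upgradeBudgets(isRollback bool) (softFailure, maximum time.Duration) {
	// Base budgets for a single A->B hop. 75 minutes matches the quoted commit;
	// the 150-minute hard budget is an assumed placeholder.
	softFailure = 75 * time.Minute
	maximum = 150 * time.Minute
	if isRollback {
		// An aborted A->B->A run effectively performs two hops, so double both budgets.
		softFailure *= 2
		maximum *= 2
	}
	return softFailure, maximum
}

func main() {
	soft, hard := upgradeBudgets(true)
	fmt.Printf("aborted upgrade: soft failure after %s, hard failure after %s\n", soft, hard)
}
```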
Allow a test to specify that you want an upgrade to abort part way through via --options=abort-at=PERCENT. For example, abort-at=10 instructs the upgrade test to abort after it reaches 10% of the test. This allows us to test disruption.
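A minimal sketch of how an abort-at threshold could translate into a rollback trigger, assuming a hypothetical helper that tracks operator upgrade progress; this is illustrative only and not the origin code:

```go
// Illustrative sketch of the abort-at decision during an upgrade.
package main

import "fmt"

// maybeAbort returns true once at least abortAt percent of the cluster
// operators have reached the target version, signalling that the test
// should roll back to the previous version.
func maybeAbort(upgradedOperators, totalOperators, abortAt int) bool {
	if abortAt <= 0 || totalOperators == 0 {
		return false
	}
	percent := upgradedOperators * 100 / totalOperators
	return percent >= abortAt
}

func main() {
	// e.g. --options=abort-at=10: with 30 operators, the rollback would start
	// once 3 of them have been upgraded.
	for upgraded := 0; upgraded <= 30; upgraded++ {
		if maybeAbort(upgraded, 30, 10) {
			fmt.Printf("aborting after %d/30 operators upgraded\n", upgraded)
			break
		}
	}
}
```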