Support --options on upgrade tests to abort in progress #22726
Conversation
/retest

Will be adding a disruption e2e

/retest

1 similar comment

/retest

New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
To better stress test upgrades, add disruption elements for aborting an upgrade part of the way through as well as rebooting random masters. --options=abort-at=PERCENT will cause the upgrade to stop and roll back to the previous version when PERCENT of operators have been upgraded. 100 will be after the upgrade is complete, while 'random' will be at a randomly chosen percent. --options=disrupt-reboot=POLICY causes random periodic reboots of masters during upgrades. If set to 'graceful', the reboot allows clean shutdown. If set to 'force', the machines exit immediately (to simulate power loss).
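As a rough sketch of the option semantics described above (not the actual origin implementation; the type and function names here are hypothetical), an abort-at/disrupt-reboot options string could be parsed along these lines:

```go
// Hypothetical sketch of parsing "abort-at=...,disrupt-reboot=..." options.
package main

import (
	"fmt"
	"math/rand"
	"strconv"
	"strings"
)

type upgradeOptions struct {
	abortAt       int    // percent of operators upgraded before rolling back; 0 means never abort
	disruptReboot string // "", "graceful", or "force"
}

func parseUpgradeOptions(s string) (*upgradeOptions, error) {
	opts := &upgradeOptions{}
	if s == "" {
		return opts, nil
	}
	for _, pair := range strings.Split(s, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("option %q must be of the form KEY=VALUE", pair)
		}
		switch kv[0] {
		case "abort-at":
			if kv[1] == "random" {
				// assumed interpretation: pick a percent in [1,99] so the abort lands mid-upgrade
				opts.abortAt = rand.Intn(99) + 1
				continue
			}
			p, err := strconv.Atoi(kv[1])
			if err != nil || p < 0 || p > 100 {
				return nil, fmt.Errorf("abort-at must be 'random' or an integer 0-100, got %q", kv[1])
			}
			opts.abortAt = p
		case "disrupt-reboot":
			if kv[1] != "graceful" && kv[1] != "force" {
				return nil, fmt.Errorf("disrupt-reboot must be 'graceful' or 'force', got %q", kv[1])
			}
			opts.disruptReboot = kv[1]
		default:
			return nil, fmt.Errorf("unrecognized option %q", kv[0])
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseUpgradeOptions("abort-at=random,disrupt-reboot=graceful")
	if err != nil {
		panic(err)
	}
	fmt.Printf("abort at %d%%, reboot policy %q\n", opts.abortAt, opts.disruptReboot)
}
```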
For 4.2->4.3 and 4.3->4.4. I've left off 4.1->4.2, since 4.1 is pretty old and stable. I've left off 4.4->4.5, because we haven't built a 4.5 nightly yet [1]. This should help catch breakage like the ephemeral-storage request that broke 4.2 -> * updates [2], but didn't turn up in CI because we don't have any jobs testing nightly -> updates.

After this commit we'll have:

* endurance-upgrade-aws-4.3: I'm not really clear on what this does. Seems to use the template from 39e69e2 (add long lived cluster management job and e2e test, 2019-06-12, openshift#3887). Seems to use 4.3-ci -> self updates? I dunno.
* release-openshift-origin-installer-e2e-aws-upgrade-4.3: Lets the release controller or ci-operator or some such choose the source and target version.
* release-openshift-origin-installer-e2e-aws-upgrade-fips-4.3: 4.3-ci penultimate -> 4.3-ci latest on AWS with FIPS enabled.
* release-openshift-origin-installer-e2e-azure-upgrade-4.3: 4.3-ci penultimate -> 4.3-ci latest on Azure.
* release-openshift-origin-installer-e2e-gcp-upgrade-4.3: 4.3-ci penultimate -> 4.3-ci latest on GCP.
* release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3: 4.2-stable -> 4.3-ci on AWS.
* release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3: 4.2-nightly -> 4.3-nightly on AWS. New in this commit.
* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3: 4.2-stable -> 4.3-ci on AWS with TEST_OPTIONS=abort-at=99. For more on abort-at, see openshift/origin@a53efd5e27 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift/origin#22726).
* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3: 4.3-ci penultimate -> 4.3-ci latest on AWS with TEST_OPTIONS=abort-at=random.

and similarly for 4.4.

I'm not entirely clear on how the release informer jobs ingest the version being considered for promotion; maybe these new jobs will end up just being vanilla periodics. But that's probably fine, because all we need is some sort of signal in CI to show that 4.2-nightly -> 4.3 (or whatever) is broken before we give that 4.2 nightly a stable name like 4.2.13 (or whatever). Even if these do run as 4.3 promotion informers, breakage like [2] happened in the 4.2 nightly. So you could still have:

1. 4.2 PR lands and breaks 4.2 -> 4.3.
2. Associated 4.2 nightly promotion goes through all green.
3. Some subsequent 4.3 change lands, and the informing job fails because of the 4.2 change from step 1.

But again, as long as we have some kind of signal (like the one added by this commit), the release admins should hear about it and know that they need the breakage triaged before they give a nightly a stable name and sign the release.

[1]: https://openshift-release.svc.ci.openshift.org/#4.5.0-0.nightly
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1786315#c2
It was added in a53efd5 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift#22726) but never used.
durationToSoftFailure was added in 4447a19 (allow longer upgrade times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411), but didn't get the 2x on rollbacks we've been adding to maximumDuration since a53efd5 (Support --options on upgrade tests to abort in progress, 2019-04-29, openshift#22726). That's recently been causing the cluster-version operator's A->B->A rollback CI jobs to time out [1]. This commit catches durationToSoftFailure up with the "2x on rollbacks" approach, and also mentions "aborted" in messages for those types of tests, to help remind folks what's going on. An alternative approach would be to teach clusterUpgrade to treat rollbacks as two separate hops (one for A->B, and another for B->A). But that would be a more involved restructuring, and since we already had the 2x maximumDuration precedent in place, I haven't gone in that direction. [1]: openshift/cluster-version-operator#514 (comment)
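For illustration only, a minimal sketch of the "2x on rollbacks" budgeting described above. The 75-minute soft-failure base comes from the commit quoted here; the hard-failure base and the function name are assumptions, not the actual origin values:

```go
// Illustrative sketch: doubling upgrade-time budgets for aborted (A->B->A) runs.
package main

import (
	"fmt"
	"time"
)

func upgradeBudgets(isRollback bool) (softFailure, maximum time.Duration) {
	// Base budgets for a single A->B hop. 75 minutes matches the quoted commit;
	// the 150-minute hard budget is an assumed placeholder.
	softFailure = 75 * time.Minute
	maximum = 150 * time.Minute
	if isRollback {
		// An aborted A->B->A run effectively performs two hops, so double both budgets.
		softFailure *= 2
		maximum *= 2
	}
	return softFailure, maximum
}

func main() {
	soft, hard := upgradeBudgets(true)
	fmt.Printf("aborted upgrade: soft failure after %s, hard failure after %s\n", soft, hard)
}
```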
Allow a test to specify that you want an upgrade to abort part way through via --options=abort-at=PERCENT. For example, abort-at=10 instructs the upgrade test to abort after it reaches 10% of the test. This allows us to test disruption.
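A minimal sketch of how an abort-at threshold could translate into a rollback trigger, assuming a hypothetical helper that tracks operator upgrade progress; this is illustrative only and not the origin code:

```go
// Illustrative sketch of the abort-at decision during an upgrade.
package main

import "fmt"

// maybeAbort returns true once at least abortAt percent of the cluster
// operators have reached the target version, signalling that the test
// should roll back to the previous version.
func maybeAbort(upgradedOperators, totalOperators, abortAt int) bool {
	if abortAt <= 0 || totalOperators == 0 {
		return false
	}
	percent := upgradedOperators * 100 / totalOperators
	return percent >= abortAt
}

func main() {
	// e.g. --options=abort-at=10: with 30 operators, the rollback would start
	// once 3 of them have been upgraded.
	for upgraded := 0; upgraded <= 30; upgraded++ {
		if maybeAbort(upgraded, 30, 10) {
			fmt.Printf("aborting after %d/30 operators upgraded\n", upgraded)
			break
		}
	}
}
```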