@smarterclayton
Contributor

Allow a test to specify that an upgrade should abort partway
through. For example,

openshift-tests run-upgrade all ... --option abort-at=10

instructs the upgrade test to abort once it is 10% of the way
through the upgrade. This allows us to test disruption.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 30, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2019
@smarterclayton
Contributor Author

/retest

@smarterclayton smarterclayton added the lgtm Indicates that a PR is ready to be merged. label May 1, 2019
@smarterclayton
Contributor Author

Will be adding a disruption e2e

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

/retest

@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 1, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 1, 2019
To better stress test upgrades, add disruption elements for aborting
an upgrade part of the way through as well as rebooting random masters.

--options=abort-at=PERCENT will cause the upgrade to stop and roll back
to the previous version when PERCENT of operators have been upgraded.
A value of 100 aborts after the upgrade is complete, while 'random'
aborts at a randomly chosen percent.

--options=disrupt-reboot=POLICY causes random periodic reboots of
masters during upgrades. If set to 'graceful', the reboot allows a clean
shutdown. If set to 'force', the machines exit immediately (to simulate
power loss).
@smarterclayton smarterclayton added the lgtm Indicates that a PR is ready to be merged. label May 3, 2019
@openshift-merge-robot openshift-merge-robot merged commit d35551f into openshift:master May 3, 2019
wking added a commit to wking/openshift-release that referenced this pull request Jan 2, 2020
For 4.2->4.3 and 4.3->4.4.  I've left off 4.1->4.2, since 4.1 is
pretty old and stable.  I've left off 4.4->4.5, because we haven't
built a 4.5 nightly yet [1].  This should help catch breakage like the
ephemeral-storage request that broke 4.2 -> * updates [2], but didn't
turn up in CI because we don't have any jobs testing nightly ->
updates.  After this commit we'll have:

* endurance-upgrade-aws-4.3
  I'm not really clear on what this does.  Seems to use the template
  from 39e69e2 (add long lived cluster management job and e2e test,
  2019-06-12, openshift#3887).  Seems to use 4.3-ci -> self updates?  I dunno.

* release-openshift-origin-installer-e2e-aws-upgrade-4.3
  Lets the release controller or ci-operator or some such choose the
  source and target version.

* release-openshift-origin-installer-e2e-aws-upgrade-fips-4.3
  4.3-ci penultimate -> 4.3-ci latest on AWS with FIPS enabled.

* release-openshift-origin-installer-e2e-azure-upgrade-4.3
  4.3-ci penultimate -> 4.3-ci latest on Azure

* release-openshift-origin-installer-e2e-gcp-upgrade-4.3
  4.3-ci penultimate -> 4.3-ci latest on GCP

* release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3
  4.2-stable -> 4.3-ci on AWS.

* release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3
  4.2-nightly -> 4.3-nightly on AWS.  New in this commit.

* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3
  4.2-stable -> 4.3-ci on AWS with TEST_OPTIONS=abort-at=99.  For more
  on abort-at, see openshift/origin@a53efd5e27 (Support --options on
  upgrade tests to abort in progress, 2019-04-29,
  openshift/origin#22726).

* release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.3
  4.3-ci penultimate -> 4.3-ci latest on AWS with TEST_OPTIONS=abort-at=random.

and similarly for 4.4.

I'm not entirely clear on how the release informer jobs ingest the
version being considered for promotion; maybe these new jobs will end
up just being vanilla periodics.  But that's probably fine, because
all we need is some sort of signal in CI to show that 4.2-nightly ->
4.3 (or whatever) is broken before we give that 4.2 nightly a stable
name like 4.2.13 (or whatever).  Even if these do run as 4.3 promotion
informers, breakage like [2] happened in the 4.2 nightly.  So you
could still have:

1. 4.2 PR lands and breaks 4.2 -> 4.3.
2. Associated 4.2 nightly promotion goes through all green.
3. Some subsequent 4.3 change lands, and the informing job fails
   because of the 4.2 change from step 1.

But again, as long as we have some kind of signal (like the one added
by this commit), the release admins should hear about it and know that
they need the breakage triaged before they give a nightly a stable
name and sign the release.

[1]: https://openshift-release.svc.ci.openshift.org/#4.5.0-0.nightly
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1786315#c2
wking added a commit to wking/origin that referenced this pull request Jun 3, 2020
It was added in a53efd5 (Support --options on upgrade tests to
abort in progress, 2019-04-29, openshift#22726) but never used.
wking added a commit to wking/origin that referenced this pull request Mar 17, 2021
durationToSoftFailure was added in 4447a19 (allow longer upgrade
times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411),
but didn't get the 2x on rollbacks we've been adding to maximumDuration
since a53efd5 (Support --options on upgrade tests to abort in
progress, 2019-04-29, openshift#22726).  That's recently been causing the
cluster-version operator's A->B->A rollback CI jobs to time out [1].
This commit catches durationToSoftFailure up with the "2x on
rollbacks" approach, and also mentions "aborted" in messages for those
types of tests, to help remind folks what's going on.

An alternative approach would be to teach clusterUpgrade to treat
rollbacks as two separate hops (one for A->B, and another for B->A).
But that would be a more involved restructuring, and since we already
had the 2x maximumDuration precedent in place, I haven't gone in that
direction.

[1]: openshift/cluster-version-operator#514 (comment)
DavidHurta pushed a commit to DavidHurta/origin that referenced this pull request Mar 2, 2022
DavidHurta pushed a commit to DavidHurta/origin that referenced this pull request Mar 3, 2022
DavidHurta pushed a commit to DavidHurta/origin that referenced this pull request Mar 4, 2022
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
