Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Mar 17, 2021

durationToSoftFailure was added in 4447a19 (#25411), but didn't get the 2x on rollbacks we'e been adding to maximumDuration since a53efd5 (#22726). That's recently been causing the cluster-version operator's A->B->A rollback CI jobs to time out. This commit catches durationToSoftFailure up with the "2x on rollbacks" approach, and also mentions "aborted" in messages for those types of tests, to help remind folks what's going on.

An alternative approach would be to teach clusterUpgrade to treat rollbacks as two separate hops (one for A->B, and another for B->A). But that would be a more involved restructuring, and since we already had the 2x maximumDuration precedent in place, I haven't gone in that direction.

/assign @deads2k @smarterclayton

durationToSoftFailure was added in 4447a19 (allow longer upgrade
times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411),
but didn't get the 2x on rollbacks we'e been adding to maximumDuration
since a53efd5 (Support --options on upgrade tests to abort in
progress, 2019-04-29, openshift#22726).  That's recently been causing the
cluster-version operator's A->B->A rollback CI jobs to time out [1].
This commit catches durationToSoftFailure up with the "2x on
rollbacks" approach, and also mentions "aborted" in messages for those
types of tests, to help remind folks what's going on.

An alternative approach would be to teach clusterUpgrade to treat
rollbacks as two separate hops (one for A->B, and another for B->A).
But that would be a more involved restructuring, and since we already
had the 2x maximumDuration precedent in place, I haven't gone in that
direction.

[1]: openshift/cluster-version-operator#514 (comment)
@smarterclayton
Copy link
Contributor

lgtm, unfortunately can't use cluster bot to test rollback right now. did you get a chance to manually test?

@wking
Copy link
Member Author

wking commented Mar 17, 2021

I've floated openshift/release#16911. If/when that lands, we can launch the test locally here to confirm this is working as expected.

@wking
Copy link
Member Author

wking commented Mar 17, 2021

openshift/release#16911 landed. Pull in a rollback CI job:

/test e2e-aws-upgrade

@wking
Copy link
Member Author

wking commented Mar 18, 2021

The AWS job will fail on an API-server hiccup, but here's the new code:

$ curl -s 'https://prow.ci.openshift.org/log?container=test&id=1372297649784360960&job=pull-ci-openshift-origin-master-e2e-aws-upgrade' | grep -1 'Starting upgrade\|aborted'

Mar 17 22:31:53.446: INFO: Starting upgrade to version= image=registry.build02.ci.openshift.org/ci-op-0nlcgytg/release@sha256:5e4c3bd76ea87ddef0581a8d0595659bba78a9ca326e9e540a62e0fac0e17163 attempt=afb37aec-a6c3-43fe-b582-d077fe50ba44
STEP: continuously hitting pods through the service's LoadBalancer
Mar 17 22:31:53.718: INFO: Upgrade will be aborted and the cluster will roll back to the current version after 100% of operators have upgraded
Mar 17 22:31:53.982: INFO: cluster upgrade is Progressing: Working towards registry.build02.ci.openshift.org/ci-op-0nlcgytg/release@sha256:5e4c3bd76ea87ddef0581a8d0595659bba78a9ca326e9e540a62e0fac0e17163: downloading update
--
Mar 18 00:25:44.427: INFO: cluster upgrade is Progressing: Working towards 4.8.0-0.ci.test-2021-03-17-212401-ci-op-0nlcgytg: 643 of 669 done (96% complete)
Mar 18 00:25:54.426: INFO: Completed aborted upgrade to registry.build02.ci.openshift.org/ci-op-0nlcgytg/release@sha256:2d4154a229980aa89214cbd1c9ea3ebbc0b65992d7e3e18a8eb4312f4a4e4868
Mar 18 00:25:54.520: INFO: Waiting on pools to be upgraded

@wking
Copy link
Member Author

wking commented Mar 18, 2021

Also, 22:31 -> 23:27 -> 00:25 are 1h legs respectively, durationToSoftFailure starts at 75m, and we did not fail a should be fast test-case, so we know that the doubling part is working too.

@smarterclayton
Copy link
Contributor

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 18, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2021
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 19, 2021

@wking: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-upgrade a23e3ea link /test e2e-aws-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 9aaacb2 into openshift:master Mar 19, 2021
@wking wking deleted the rollback-took-too-long branch March 19, 2021 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants