pkg/cvo/internal/operatorstatus: Change nested message #514


Merged

openshift-merge-robot merged 1 commit into openshift:master from jottofar:diff-reason

Mar 17, 2021

Contributor

jottofar commented Feb 4, 2021

Changes the nested message to a different one for cluster operators that are available and not degraded, but have not yet finished updating.
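The distinction described here can be illustrated with a minimal Go sketch. It is not the code from this PR: `operatorState` and `waitReason` are hypothetical names, and the "not available" and "degraded" message strings are assumptions. Only the reason names `ClusterOperatorNotAvailable` and `ClusterOperatorUpdating` and the "is updating versions" wording appear elsewhere in this thread.

```go
package main

import "fmt"

// operatorState is a hypothetical summary of a ClusterOperator's status.
type operatorState struct {
	Available       bool // Available condition is true
	Degraded        bool // Degraded condition is true
	AtTargetVersion bool // status.versions already matches the target release
}

// waitReason returns the reason and message to report while waiting on an
// operator, or empty strings when there is nothing to wait for.
func waitReason(name string, s operatorState) (reason, message string) {
	switch {
	case !s.Available:
		return "ClusterOperatorNotAvailable", fmt.Sprintf("Cluster operator %s is not available", name)
	case s.Degraded:
		// Assumed reason name for the degraded case.
		return "ClusterOperatorDegraded", fmt.Sprintf("Cluster operator %s is degraded", name)
	case !s.AtTargetVersion:
		// Available and not degraded, but still rolling out the new version.
		return "ClusterOperatorUpdating", fmt.Sprintf("Cluster operator %s is updating versions", name)
	}
	return "", ""
}

func main() {
	reason, msg := waitReason("machine-api", operatorState{Available: true, AtTargetVersion: false})
	fmt.Println(reason, "-", msg)
}
```

The point of the separate reason is that an operator that is healthy but still rolling out is no longer reported under the same reason as one that is unavailable.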

openshift-ci-robot requested review from LalatenduMohanty and wking

February 4, 2021 16:08

openshift-ci-robot added the approved label

jottofar force-pushed the diff-reason branch from 43aed97 to 174fd54

February 4, 2021 16:33

Contributor Author

jottofar commented Feb 4, 2021

/test unit

Contributor Author

jottofar commented Feb 8, 2021

/retest

Contributor Author

jottofar commented Feb 16, 2021

/retest

wking reviewed

pkg/cvo/internal/operatorstatus.go (outdated)

jottofar force-pushed the diff-reason branch from 174fd54 to 4d9982e

February 16, 2021 21:54


          pkg/cvo/internal/operatorstatus: Change nested message

ce1eda1

jottofar force-pushed the diff-reason branch from 4d9982e to ce1eda1

February 16, 2021 22:03

Contributor Author

jottofar commented Feb 16, 2021

/retest

Contributor Author

jottofar commented Feb 17, 2021

/test e2e-agnostic

Member

wking commented Feb 24, 2021

/lgtm

openshift-ci-robot assigned wking

openshift-ci-robot added the lgtm label

Contributor

openshift-ci-robot commented Feb 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking


Needs approval from an approver in each of these files:

~~OWNERS~~ [jottofar,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-bot commented Feb 25, 2021

/retest

Please review the full test history for this PR and help us cut down flakes.


Contributor Author

jottofar commented Mar 16, 2021

/test e2e-agnostic-upgrade

Member

wking commented Mar 17, 2021

Most recent update job failed:

pods should never transition back to pending, which is not us (it's being worked in rhbz#1933760).

cluster upgrade should be fast, Upgrade took too long: 125.8349849658, which is from finish splitting each part of the upgrade into distinct junit origin#25417. But:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/514/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1371802866682957824/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
2021-03-16T14:40:36Z 2021-03-16T15:44:25Z Completed 4.8.0-0.ci.test-2021-03-16-123757-ci-op-y8j8drkp
2021-03-16T13:38:31Z 2021-03-16T14:40:36Z Partial 4.8.0-0.ci.test-2021-03-16-124133-ci-op-y8j8drkp
2021-03-16T13:05:46Z 2021-03-16T13:33:50Z Completed 4.8.0-0.ci.test-2021-03-16-123757-ci-op-y8j8drkp

So a ~33 minute hop and an ~1h2m hop. That's under both too-long caps, so must be a bug in their hop-detection logic. To keep the CVO moving while we work on fixing that test:

/override ci/prow/e2e-agnostic-upgrade
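The per-entry durations can be checked directly from the history timestamps printed above. The following is a minimal Go sketch; `historyEntry` and `hopDuration` are hypothetical helper names, and the timestamps are copied from the `jq` output. Computed this way, the entries come to 28m4s, 1h2m5s and 1h3m49s, so the two later entries (the A->B->A round trip) sum to about 125.9 minutes, close to the 125.83 reported by the test.

```go
package main

import (
	"fmt"
	"time"
)

// historyEntry mirrors three of the fields printed by the jq query above.
type historyEntry struct {
	started, completed, state string
}

// hopDuration parses the RFC 3339 timestamps and returns completed - started.
func hopDuration(e historyEntry) (time.Duration, error) {
	start, err := time.Parse(time.RFC3339, e.started)
	if err != nil {
		return 0, err
	}
	end, err := time.Parse(time.RFC3339, e.completed)
	if err != nil {
		return 0, err
	}
	return end.Sub(start), nil
}

func main() {
	// Newest entry first, in the same order as the ClusterVersion history.
	entries := []historyEntry{
		{"2021-03-16T14:40:36Z", "2021-03-16T15:44:25Z", "Completed"},
		{"2021-03-16T13:38:31Z", "2021-03-16T14:40:36Z", "Partial"},
		{"2021-03-16T13:05:46Z", "2021-03-16T13:33:50Z", "Completed"},
	}
	var roundTrip time.Duration
	for i, e := range entries {
		d, err := hopDuration(e)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s %s\n", e.state, d)
		if i < 2 { // the two newest entries form the A->B->A round trip
			roundTrip += d
		}
	}
	fmt.Printf("round trip: %.1f minutes\n", roundTrip.Minutes())
}
```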

Contributor

openshift-ci-robot commented Mar 17, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade


wking added a commit to wking/origin that referenced this pull request


          test/e2e/upgrade: Relax 'too long' soft timeout for rollback jobs

a23e3ea

durationToSoftFailure was added in 4447a19 (allow longer upgrade
times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411),
but didn't get the 2x on rollbacks we've been adding to maximumDuration
since a53efd5 (Support --options on upgrade tests to abort in
progress, 2019-04-29, openshift#22726).  That's recently been causing the
cluster-version operator's A->B->A rollback CI jobs to time out [1].
This commit catches durationToSoftFailure up with the "2x on
rollbacks" approach, and also mentions "aborted" in messages for those
types of tests, to help remind folks what's going on.

An alternative approach would be to teach clusterUpgrade to treat
rollbacks as two separate hops (one for A->B, and another for B->A).
But that would be a more involved restructuring, and since we already
had the 2x maximumDuration precedent in place, I haven't gone in that
direction.

[1]: openshift/cluster-version-operator#514 (comment)

wking mentioned this pull request

test/e2e/upgrade: Relax 'too long' soft timeout for rollback jobs openshift/origin#25977

Merged

openshift-merge-robot merged commit b15b12e into openshift:master

wking mentioned this pull request

pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

Merged

wking added a commit to wking/cluster-version-operator that referenced this pull request


          pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason

ccf5c8b

newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperators that are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.


wking mentioned this pull request

pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason #577

Merged


DavidHurta pushed a commit to DavidHurta/origin that referenced this pull request


          test/e2e/upgrade: Relax 'too long' soft timeout for rollback jobs

7e5a42a

durationToSoftFailure was added in 4447a19 (allow longer upgrade
times to run tests, but continue to fail at 75 minutes, 2020-08-12, openshift#25411),
but didn't get the 2x on rollbacks we've been adding to maximumDuration
since a53efd5 (Support --options on upgrade tests to abort in
progress, 2019-04-29, openshift#22726).  That's recently been causing the
cluster-version operator's A->B->A rollback CI jobs to time out [1].
This commit catches durationToSoftFailure up with the "2x on
rollbacks" approach, and also mentions "aborted" in messages for those
types of tests, to help remind folks what's going on.

An alternative approach would be to teach clusterUpgrade to treat
rollbacks as two separate hops (one for A->B, and another for B->A).
But that would be a more involved restructuring, and since we already
had the 2x maximumDuration precedent in place, I haven't gone in that
direction.

[1]: openshift/cluster-version-operator#514 (comment)



wking added a commit to wking/cluster-version-operator that referenced this pull request


          pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason

28df4d9

newClusterOperatorsNotAvailable is from c2ac20f (status: Report the
operators that have not yet deployed, 2019-04-09, openshift#158).  And the
not-available filtering is from bdd4545 (status: Hide generic
operator status in favor of more specific errors, 2019-05-19, openshift#192).
But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested
message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from
ClusterOperatorNotAvailable to ClusterOperatorUpdating.  And we want
to avoid uncollapsed errors like:

  Multiple errors are preventing progress:
  * Cluster operator machine-api is updating versions
  * Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperators that are in similar
situations.  This commit drops the filtering, because cluster
operators are important.  It does sort those errors to the end of the
list though, so the first error is the non-ClusterOperator error.

TestCVO_ParallelError no longer tests the consolidated error message,
because the consolidation is now restricted to ClusterOperator
resources.  I tried moving the
pkg/cvo/testdata/paralleltest/release-manifests manifests to
ClusterOperator, but then the test struggled with:

  I0802 16:04:18.133935    2005 sync_worker.go:945] Unable to precreate resource clusteroperator

so now TestCVO_ParallelError is exercising the fact that
non-ClusterOperator failures are not aggregated.
