pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason #577

wking · 2021-05-27T00:02:55Z

Builds on #575; you may want to review that first.

newClusterOperatorsNotAvailable is from c2ac20f (#158). And the not-available filtering is from bdd4545 (#192). But in ce1eda1 (#514), we moved "waiting on status.versions" from ClusterOperatorNotAvailable to ClusterOperatorUpdating. And we want to avoid uncollapsed errors like:

Multiple errors are preventing progress:
* Cluster operator machine-api is updating versions
* Cluster operator openshift-apiserver is updating versions

where we are waiting on multiple ClusterOperator which are in similar situations. This commit drops the filtering, because cluster operators are important. It does sort those errors to the end of the list though, so the first error is the non-ClusterOperator error.

vrutkovs · 2021-05-27T08:42:33Z

/uncc
/cc @jottofar

openshift-bot · 2021-09-21T20:44:49Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2021-10-21T21:13:12Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

jottofar · 2021-10-25T18:37:07Z

LGTM but needs rebase.

/remove-lifecycle rotten

wking · 2021-10-26T18:54:52Z

Rebased onto the updated #575 tip with 7e743a6 -> 2c890a7.

jottofar · 2022-05-24T18:56:58Z

/remove-lifecycle stale

Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560) and the older cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400). This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry". The install-time backoff is from fee2d06 (sync: Completely parallelize the initial payload, 2019-03-11, openshift#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots. I'd tried dropping: backoff := st.Backoff and passing st.Backoff directly to ExponentialBackoffWithContext, but it turns out that Step() [1]: ... mutates the provided Backoff to update its Steps and Duration. Luckily, Backoff has no pointer properties, so storing as a local variable is sufficient to give us a fresh copy for the local mutations. [1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step

…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.

To make it easier to understand what each case is about.

newClusterOperatorsNotAvailable is from c2ac20f (status: Report the operators that have not yet deployed, 2019-04-09, openshift#158). And the not-available filtering is from bdd4545 (status: Hide generic operator status in favor of more specific errors, 2019-05-19, openshift#192). But in ce1eda1 (pkg/cvo/internal/operatorstatus: Change nested message, 2021-02-04, openshift#514), we moved "waiting on status.versions" from ClusterOperatorNotAvailable to ClusterOperatorUpdating. And we want to avoid uncollapsed errors like: Multiple errors are preventing progress: * Cluster operator machine-api is updating versions * Cluster operator openshift-apiserver is updating versions where we are waiting on multiple ClusterOperator which are in similar situations. This commit drops the filtering, because cluster operators are important. It does sort those errors to the end of the list though, so the first error is the non-ClusterOperator error. TestCVO_ParallelError no longer tests the consolidated error message, because the consolidation is now restricted to ClusterOperator resources. I tried moving the pkg/cvo/testdata/paralleltest/release-manifests manifests to ClusterOperator, but then the test struggled with: I0802 16:04:18.133935 2005 sync_worker.go:945] Unable to precreate resource clusteroperator so now TestCVO_ParallelError is excercising the fact that non-ClusterOperator failures are not aggregated.

jottofar · 2022-08-04T15:59:58Z

/test e2e-agnostic-upgrade

jottofar · 2022-08-04T19:37:22Z

/lgtm

openshift-ci · 2022-08-04T19:37:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jottofar,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2022-08-04T20:33:34Z

/retest-required

Remaining retests: 2 against base HEAD 9d57cfc and 8 for PR HEAD 28df4d9 in total

openshift-ci-robot · 2022-08-05T00:56:46Z

/retest-required

Remaining retests: 1 against base HEAD 9d57cfc and 7 for PR HEAD 28df4d9 in total

openshift-ci-robot · 2022-08-05T05:13:17Z

/retest-required

Remaining retests: 0 against base HEAD 9d57cfc and 6 for PR HEAD 28df4d9 in total

openshift-ci · 2022-08-05T09:09:42Z

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/golangci-lint	`dd62b5b`	link	true	`/test golangci-lint`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

wking · 2022-08-05T16:23:45Z

Image registry remains available using new connections has been really dead, and it's not just this PR.

/override ci/prow/e2e-agnostic-upgrade

openshift-ci · 2022-08-05T16:25:53Z

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

Image registry remains available using new connections has been really dead, and it's not just this PR.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2022-08-05T16:31:18Z

/hold

I'm taking a closer look at the error.

wking · 2022-08-05T16:46:27Z

Looks like I didn't get my hold on in time. But same error in a parallel pull discussed here, so I'm back to being confident that I'm not introducing this failure mode with this PR.

/hold cancel

openshift-ci bot requested review from smarterclayton and vrutkovs May 27, 2021 00:03

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2021

wking force-pushed the ClusterOperatorUpdating-aggregation branch 8 times, most recently from 4c92b1a to a518161 Compare May 27, 2021 05:11

openshift-ci bot requested review from jottofar and removed request for vrutkovs May 27, 2021 08:42

wking force-pushed the ClusterOperatorUpdating-aggregation branch 2 times, most recently from 6beb1d6 to 7e743a6 Compare May 27, 2021 20:16

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2021

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2021

wking mentioned this pull request Oct 20, 2021

Bug 1951835: Propagate Degraded to update status #662

Merged

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 21, 2021

openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 25, 2021

wking force-pushed the ClusterOperatorUpdating-aggregation branch from 7e743a6 to 2c890a7 Compare October 26, 2021 18:54

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2021

wking force-pushed the ClusterOperatorUpdating-aggregation branch 2 times, most recently from 1807269 to dd62b5b Compare October 29, 2021 01:05

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2022

wking added 3 commits August 2, 2022 12:47

pkg/cvo: Names for Test_SyncWorker_apply sub-tests

a287c8f

To make it easier to understand what each case is about.

wking force-pushed the ClusterOperatorUpdating-aggregation branch from dd62b5b to 098a037 Compare August 2, 2022 19:47

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 2, 2022

wking force-pushed the ClusterOperatorUpdating-aggregation branch 2 times, most recently from 59b65d4 to fe9cad3 Compare August 2, 2022 21:39

wking force-pushed the ClusterOperatorUpdating-aggregation branch from fe9cad3 to 28df4d9 Compare August 3, 2022 04:21

openshift-ci bot assigned jottofar Aug 4, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2022

openshift-ci bot merged commit 0f8c533 into openshift:master Aug 5, 2022

wking deleted the ClusterOperatorUpdating-aggregation branch August 5, 2022 16:45

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2022

This was referenced Aug 5, 2022

Bug 1927168: pkg/payload/task: Fail fast for UpdateError #573

Closed

pkg/payload/task: Disambiguate the ClusterOperatorNotAvailable Progressing message #575

Closed

pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason #577

pkg/cvo/sync_worker: Consolidate all ClusterOperator errors by reason #577

Uh oh!

Conversation

wking commented May 27, 2021

Uh oh!

vrutkovs commented May 27, 2021

Uh oh!

openshift-bot commented Sep 21, 2021

Uh oh!

openshift-bot commented Oct 21, 2021

Uh oh!

jottofar commented Oct 25, 2021

Uh oh!

wking commented Oct 26, 2021

Uh oh!

jottofar commented May 24, 2022

Uh oh!

jottofar commented Aug 4, 2022

Uh oh!

jottofar commented Aug 4, 2022

Uh oh!

openshift-ci bot commented Aug 4, 2022

Uh oh!

openshift-ci-robot commented Aug 4, 2022

Uh oh!

openshift-ci-robot commented Aug 5, 2022

Uh oh!

openshift-ci-robot commented Aug 5, 2022

Uh oh!

openshift-ci bot commented Aug 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wking commented Aug 5, 2022

Uh oh!

openshift-ci bot commented Aug 5, 2022

Uh oh!

wking commented Aug 5, 2022

Uh oh!

wking commented Aug 5, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

openshift-ci bot commented Aug 5, 2022 •

edited

Loading

wking commented Aug 5, 2022 •

edited

Loading