
Conversation

@jottofar jottofar commented Oct 28, 2021

Previously, the CVO used a single goroutine, SyncWorker.Start, to both load an update (if available) and apply the currently loaded release. If preconditions were not met after loading an update (e.g. a verification failure), application of the currently loaded release was blocked, and therefore any changes that might occur to the currently loaded release were also not applied.

This change remedies that by breaking the "load update" logic out into a new method, loadUpdatedPayload, which is invoked from the SyncWorker.Update goroutine. Previously, when this routine recognized that an update was available, it would sync with SyncWorker.Start, which would then load the update and apply it (if no errors occurred). With this change, SyncWorker.Update first loads the update and, only if no errors occur, syncs with SyncWorker.Start, which then starts applying the update. Otherwise SyncWorker.Start continues applying the currently loaded release.
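A minimal sketch of the new flow (simplified types and signatures, assumed for illustration; the real SyncWorker in pkg/cvo/sync_worker.go carries considerably more state and channels):

```go
// Sketch only: simplified stand-ins for the real SyncWorker.
package cvo

import (
	"context"
	"fmt"
)

type payload struct{ version string }

type syncWorker struct {
	applyCh chan *payload // hands an accepted payload to the apply loop in Start
}

// loadUpdatedPayload retrieves and verifies the desired update without
// disturbing the payload that Start is currently applying.
func (w *syncWorker) loadUpdatedPayload(ctx context.Context, desired string) (*payload, error) {
	// ... retrieve image, verify signature, run preconditions ...
	if err := checkPreconditions(ctx, desired); err != nil {
		return nil, fmt.Errorf("preconditions not met for %s: %w", desired, err)
	}
	return &payload{version: desired}, nil
}

// Update is the change-detection goroutine. Only after a successful load
// does it hand the new payload to Start; on any load error, Start simply
// keeps applying the currently loaded release.
func (w *syncWorker) Update(ctx context.Context, desired string) error {
	p, err := w.loadUpdatedPayload(ctx, desired)
	if err != nil {
		return err // current release keeps reconciling
	}
	w.applyCh <- p // Start picks this up and begins applying the update
	return nil
}

func checkPreconditions(ctx context.Context, version string) error { return nil }
```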

Also changed the payload-update event and status-report messaging to accurately reflect what is being done at each step.

openshift-ci bot commented Oct 28, 2021

@jottofar: This pull request references Bugzilla bug 1822752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

Bug 1822752: pkg/cvo: Separate payload load from payload apply

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/severity-low and bugzilla/valid-bug labels Oct 28, 2021
@openshift-ci openshift-ci bot added the approved label Oct 28, 2021
@jottofar jottofar force-pushed the bug-1822752 branch 5 times, most recently from 7872ae8 to 3383d4b on November 2, 2021 20:08
jottofar commented Nov 3, 2021

/test e2e-agnostic

@jottofar jottofar force-pushed the bug-1822752 branch 16 times, most recently from 03da71e to 55b965f on November 10, 2021 20:22
@jottofar jottofar force-pushed the bug-1822752 branch 2 times, most recently from c476904 to cd0916d on November 11, 2021 14:36
wking commented Feb 11, 2022

Nothing remotely close to this code has landed since the previous round of CI:

/override ci/prow/e2e-agnostic
/override ci/prow/e2e-agnostic-operator

openshift-ci bot commented Feb 11, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic, ci/prow/e2e-agnostic-operator


In response to this:

Nothing remotely close to this code has landed since the previous round of CI:

/override ci/prow/e2e-agnostic
/override ci/prow/e2e-agnostic-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 5d414c3 into openshift:master Feb 11, 2022
openshift-ci bot commented Feb 11, 2022

@jottofar: All pull requests linked via external trackers have merged:

Bugzilla bug 1822752 has been moved to the MODIFIED state.


In response to this:

Bug 1822752: pkg/cvo: Separate payload load from payload apply

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Feb 11, 2022

@jottofar: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jottofar
Contributor Author

> don't think we're too worried about a few minutes offset as folks transition into this new code. Anyhow, looks good to me, although it will also be good to see logs from the CVO as we update out of the patched release into other places

From the loki logs, we loaded the new payload at 23:48:21, then that CVO instance shut down as expected and a new CVO running 4.11.0-0.ci.test-2022-02-04-230202-ci-op-lb302rqg-latest came up. ReleaseAccepted is really when the "current" CVO accepted the release, but even so, it only differs by seconds.

2022-02-04 23:48:21 | I0204 23:48:21.604372 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci.test-2022-02-04-225838-ci-op-lb302rqg-initial ...
2022-02-04 23:48:21 | I0204 23:48:21.604546 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PreconditionsPassed' preconditions passed for payload loaded version="4.11.0-0.ci.test-2022-02-04-230202-ci-op-lb302rqg-latest"...

jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Feb 21, 2022
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Feb 21, 2022
openshift-merge-robot added a commit that referenced this pull request Feb 22, 2022
Bug 1822752: pkg/cvo: Fix ups from separating load from apply #683
wking pushed a commit to wking/cluster-version-operator that referenced this pull request Mar 17, 2022
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Apr 28, 2022
since this will cause equalSyncWork to continually return
true and no reattempt to load the payload will occur until
the desired update is updated by the user again, e.g.
image change or change to force.

As a result, no attempt is made to recheck precondition
failures which may have been resolved and therefore
result in a successful payload load.

To take the specific
https://bugzilla.redhat.com/show_bug.cgi?id=2072389 bug
failure as an example: because no recheck is made on
the RecentEtcdBackup precondition, CVO does not detect
that the etcd backup has been completed and that it is
safe to continue with the update. In addition, as a result
of change
openshift#683,
the etcd operator must check ReleaseAccepted!=true rather
than Failing=true to trigger the start of the backup.
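As an illustration of that last point, a hedged sketch of such a check against the ClusterVersion conditions; the helper and the literal "ReleaseAccepted" condition-type string are written out by hand here for the example, not taken from the etcd operator's actual code:

```go
// Sketch only: how a consumer such as the etcd operator might detect a
// rejected release via ClusterVersion conditions.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// findCondition returns the condition with the given type, or nil.
func findCondition(cv *configv1.ClusterVersion, t configv1.ClusterStatusConditionType) *configv1.ClusterOperatorStatusCondition {
	for i := range cv.Status.Conditions {
		if cv.Status.Conditions[i].Type == t {
			return &cv.Status.Conditions[i]
		}
	}
	return nil
}

// releaseNotAccepted reports whether the desired release has not been
// accepted (for example, a failing precondition such as RecentEtcdBackup).
func releaseNotAccepted(cv *configv1.ClusterVersion) bool {
	c := findCondition(cv, configv1.ClusterStatusConditionType("ReleaseAccepted"))
	return c != nil && c.Status != configv1.ConditionTrue
}

func main() {
	cv := &configv1.ClusterVersion{}
	fmt.Println("release not accepted:", releaseNotAccepted(cv))
}
```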
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Apr 28, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
Synchronize the CVO's Upgradeable status more often for a limited
period when and after the precondition checks start to fail. We don't
want to check more frequently forever, in case the precondition checks
are failing because of a bigger problem that isn't going to resolve
itself shortly after the start of the upgrade.

A precondition check can fail for at least
`optr.minimumUpdateCheckInterval` because of `Upgradeable==false`.
That `Upgradeable==false` condition may already have been resolved
(an operator reporting `Upgradeable==false` only momentarily), yet the
upgrade still waits up to `optr.minimumUpdateCheckInterval` for the
next synchronization. Synchronizing the Upgradeable status again while
precondition checks are failing speeds up an upgrade that is stuck on
a precondition check that has since been resolved. We don't want to
check forever in case the precondition checks keep failing for a long
time due to a bigger problem.

This commit is part of a fix for bug [1].

The bug was caused by the slow syncing of the CVO's Upgradeable status
when a precondition check fails, combined with the less frequent
running of the precondition checks. The frequency of the precondition
checks was fixed by [2], which fixed the etcd backup halting an
upgrade for a prolonged time [3]. The problem of `Upgradeable==false`
due to the `ErrorCheckingOperatorCompatibility` error caused by the
OLM operator [1] was fixed by [3]. However, the main root cause of the
`ErrorCheckingOperatorCompatibility` error probably still remained.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611

[2] openshift#683

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348

[4] openshift#766
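A rough sketch of the time-windowed resync the message describes; apart from the `minimumUpdateCheckInterval` idea, the interval values and names here are invented for the example:

```go
// Sketch only: resync the Upgradeable status more often for a bounded
// window after preconditions start failing. Names and values are illustrative.
package main

import (
	"fmt"
	"time"
)

const (
	minimumUpdateCheckInterval = 15 * time.Minute // normal resync cadence (illustrative value)
	failingResyncInterval      = 1 * time.Minute  // faster cadence while preconditions fail
	failingResyncWindow        = 30 * time.Minute // stop checking faster after this long
)

type upgradeableSyncer struct {
	preconditionsFailingSince time.Time // zero when preconditions are passing
}

// nextInterval returns how long to wait before the next Upgradeable sync.
func (s *upgradeableSyncer) nextInterval(now time.Time) time.Duration {
	if s.preconditionsFailingSince.IsZero() {
		return minimumUpdateCheckInterval
	}
	if now.Sub(s.preconditionsFailingSince) <= failingResyncWindow {
		// Recently started failing: the blocking condition (for example a
		// momentary Upgradeable==false) may already be resolved, so recheck soon.
		return failingResyncInterval
	}
	// Failing for a long time: likely a bigger problem, back off to normal cadence.
	return minimumUpdateCheckInterval
}

func main() {
	s := &upgradeableSyncer{preconditionsFailingSince: time.Now().Add(-5 * time.Minute)}
	fmt.Println("next Upgradeable sync in", s.nextInterval(time.Now()))
}
```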
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 23, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 24, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
petr-muller pushed a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 3, 2023
When we separated payload load from payload apply (openshift#683), the context
used for the retrieval changed as well. It went from one that was
constrained by syncTimeout (2-4 minutes) [1] to the unconstrained
shutdownContext [2]. However, if "force" is specified we explicitly set a
2-minute timeout in RetrievePayload. This commit creates a new context
with a reasonable timeout for RetrievePayload regardless of "force".

[1]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605

[2]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
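A sketch of the shape of that fix; the retriever type and timeout constant below are placeholders for the example, not the CVO's actual RetrievePayload signature or syncTimeout value:

```go
// Sketch only: give the payload retrieval its own bounded context instead
// of inheriting the unconstrained shutdown context.
package main

import (
	"context"
	"fmt"
	"time"
)

const retrieveTimeout = 4 * time.Minute // stand-in for a syncTimeout-like bound

type payloadRetriever func(ctx context.Context, image string) error

// retrievePayloadWithTimeout bounds the retrieval regardless of whether
// "force" was requested.
func retrievePayloadWithTimeout(parent context.Context, retrieve payloadRetriever, image string) error {
	ctx, cancel := context.WithTimeout(parent, retrieveTimeout)
	defer cancel()
	return retrieve(ctx, image)
}

func main() {
	err := retrievePayloadWithTimeout(context.Background(), func(ctx context.Context, image string) error {
		deadline, _ := ctx.Deadline()
		fmt.Println("retrieving", image, "with deadline", deadline)
		return nil
	}, "quay.io/example/release:4.11")
	fmt.Println("err:", err)
}
```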
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2023
…erReleaseNotAccepted

We grew a ReleaseAccepted condition in 7221c93 (pkg/cvo: Separate
payload load from payload apply, 2021-10-28, openshift#683), which landed in
4.11 [1] and was backported to 4.10.8 [2].  However, in order to
notice a ReleaseAccepted!=True condition, users would need to be
checking 'oc adm upgrade' [3] or watching the web-console interface
[4].  With this change, we add an alert, so admins can have
push-notification to supplement those polling approaches.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1822752#c49
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991#c7
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=2065507
[4]: https://issues.redhat.com//browse/OCPBUGS-3069
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Mar 20, 2023