Bug 1822752: pkg/cvo: Separate payload load from payload apply #683
Conversation
@jottofar: This pull request references Bugzilla bug 1822752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug. No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.
Force-pushed from 7872ae8 to 3383d4b.
/test e2e-agnostic
Force-pushed from 03da71e to 55b965f.
Force-pushed from c476904 to cd0916d.
Nothing remotely close to this code has landed since the previous round of CI: /override ci/prow/e2e-agnostic
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic, ci/prow/e2e-agnostic-operator.
@jottofar: All pull requests linked via external trackers have merged: Bugzilla bug 1822752 has been moved to the MODIFIED state.
@jottofar: All tests passed!
From the Loki logs, we loaded the new payload at 23:48:21, then that CVO instance shut down as expected and a new CVO is running.
Bug 1822752: pkg/cvo: Fix ups from separating load from apply #683
Since this will cause `equalSyncWork` to continually return true, no reattempt to load the payload will occur until the desired update is changed by the user again, e.g. an image change or a change to force. As a result, no attempt is made to recheck precondition failures which may have since been resolved and would now allow a successful payload load. To take the specific https://bugzilla.redhat.com/show_bug.cgi?id=2072389 failure as an example: because no recheck is made of the RecentEtcdBackup precondition, the CVO does not detect that the etcd backup has completed and that it is safe to continue with the update. In addition, as a result of change openshift#683, the etcd operator must check ReleaseAccepted!=true rather than Failing=true to trigger the start of the backup.
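A minimal Go sketch of the comparison problem described above. `SyncWork`, its fields, and the `lastLoadFailed` flag are illustrative stand-ins, not the repository's actual types or its eventual fix:

```go
package main

import (
	"fmt"
	"reflect"
)

// SyncWork is a simplified stand-in for the CVO's internal sync-work
// state; the field set here is illustrative only.
type SyncWork struct {
	Image string
	Force bool
}

// equalSyncWorkNaive mirrors the problematic behavior: two works with
// the same desired image and force flag compare equal, so a payload
// whose load failed on a precondition is never retried until the user
// edits the desired update.
func equalSyncWorkNaive(a, b SyncWork) bool {
	return reflect.DeepEqual(a, b)
}

// equalSyncWork adds the missing ingredient: if the last load attempt
// failed (e.g. on the RecentEtcdBackup precondition), the works are
// treated as unequal so the loader re-runs and can notice that the
// precondition has since been satisfied.
func equalSyncWork(a, b SyncWork, lastLoadFailed bool) bool {
	if lastLoadFailed {
		return false // force a reload and a fresh precondition check
	}
	return reflect.DeepEqual(a, b)
}

func main() {
	w := SyncWork{Image: "example.com/release:4.11.0", Force: false}
	fmt.Println(equalSyncWorkNaive(w, w))  // true: load is never retried
	fmt.Println(equalSyncWork(w, w, true)) // false: retried while loads fail
}
```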
Synchronize the CVO's Upgradeable status more often for a given period of time when and after the precondition checks start to fail. We don't want to check this more frequently forever, in case the precondition checks are failing because of a bigger problem that wasn't quickly solved at the start of the upgrade.

A precondition check can fail for at least `optr.minimumUpdateCheckInterval` because of `Upgradeable==false`. That `Upgradeable==false` condition may already have been resolved during the interval (an operator reporting `Upgradeable==false` only momentarily), resulting in a wait of up to `optr.minimumUpdateCheckInterval` for the next synchronization. Synchronizing the upgradeable status again when precondition checks fail speeds up initializing an upgrade stuck on a precondition check that has potentially been solved. We don't want to check this forever in case the precondition checks fail for a long time due to a bigger problem.

This commit is part of a fix for bug [1]. The bug was caused by slow syncing of the CVO's Upgradeable status when a precondition check fails, combined with less frequent running of the precondition checks. The frequency of precondition checks was fixed by [2], fixing the etcd backup halting an upgrade for a prolonged time [3]. The problem of `Upgradeable==false` due to the `ErrorCheckingOperatorCompatibility` error caused by the OLM operator [1] was fixed by [4]. However, the main root cause of the `ErrorCheckingOperatorCompatibility` error probably still remains.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611
[2] openshift#683
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348
[4] openshift#766
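A minimal sketch of the interval selection this commit message describes. The constant values and names below (the shorter interval, the window length, `minimumUpdateCheckInterval`) are assumptions for illustration, not the values the commit actually chose:

```go
package main

import (
	"fmt"
	"time"
)

const (
	minimumUpdateCheckInterval = 4 * time.Minute  // stand-in for optr.minimumUpdateCheckInterval
	failingRecheckInterval     = 15 * time.Second // assumed faster interval while failing
	failingRecheckWindow       = 10 * time.Minute // assumed bounded window of faster checks
)

// upgradeableSyncInterval returns a shorter sync interval for a bounded
// window after precondition checks start failing, then falls back to the
// normal interval so a long-lived failure does not cause faster checking
// forever.
func upgradeableSyncInterval(now, firstFailure time.Time, failing bool) time.Duration {
	if failing && now.Sub(firstFailure) < failingRecheckWindow {
		return failingRecheckInterval
	}
	return minimumUpdateCheckInterval
}

func main() {
	start := time.Now()
	fmt.Println(upgradeableSyncInterval(start.Add(time.Minute), start, true)) // 15s: inside the window
	fmt.Println(upgradeableSyncInterval(start.Add(time.Hour), start, true))   // 4m: window expired
	fmt.Println(upgradeableSyncInterval(start, start, false))                 // 4m: checks passing
}
```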
When we separated payload load from payload apply (openshift#683), the context used for the retrieval changed as well. It went from one constrained by syncTimeout (2-4 minutes) [1] to the unconstrained shutdownContext [2]. However, if "force" is specified we explicitly set a 2 minute timeout in RetrievePayload. This commit creates a new context with a reasonable timeout for RetrievePayload regardless of "force".

[1] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605
[2] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
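A sketch of the shape of that fix: derive a bounded child context from the long-lived shutdown context before retrieving the payload, whether or not "force" is set. `retrievePayload` and the two-minute bound are illustrative stand-ins for the real `RetrievePayload` signature and whatever timeout the commit settled on:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retrievePayload fakes a payload download that honors cancellation.
func retrievePayload(ctx context.Context, image string) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // retrieval cancelled or timed out
	case <-time.After(50 * time.Millisecond):
		return nil // pretend the payload downloaded quickly
	}
}

func main() {
	shutdownCtx := context.Background() // stands in for the unconstrained shutdownContext

	// Bound the retrieval regardless of force, restoring a
	// syncTimeout-style constraint like the one #683 dropped.
	ctx, cancel := context.WithTimeout(shutdownCtx, 2*time.Minute)
	defer cancel()

	if err := retrievePayload(ctx, "example.com/release:4.11.0"); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Println("payload retrieval timed out")
			return
		}
		fmt.Println("payload retrieval failed:", err)
		return
	}
	fmt.Println("payload retrieved")
}
```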
…erReleaseNotAccepted

We grew a ReleaseAccepted condition in 7221c93 (pkg/cvo: Separate payload load from payload apply, 2021-10-28, openshift#683), which landed in 4.11 [1] and was backported to 4.10.8 [2]. However, in order to notice a ReleaseAccepted!=True condition, users would need to be checking 'oc adm upgrade' [3] or watching the web-console interface [4]. With this change, we add an alert, so admins have a push notification to supplement those polling approaches.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1822752#c49
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991#c7
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=2065507
[4]: https://issues.redhat.com/browse/OCPBUGS-3069
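The alert itself ships as a monitoring rule, but the condition it keys off can be sketched in Go. `Condition` below is a simplified, hypothetical stand-in for the OpenShift API condition type carried on ClusterVersion status:

```go
package main

import "fmt"

// Condition is a simplified stand-in for the ClusterVersion status
// condition type from openshift/api.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Message string
}

// releaseNotAccepted reports whether the cluster is advertising a
// rejected release payload: a ReleaseAccepted condition that is
// present and not True, which is what the new alert surfaces.
func releaseNotAccepted(conds []Condition) (bool, string) {
	for _, c := range conds {
		if c.Type == "ReleaseAccepted" && c.Status != "True" {
			return true, c.Message
		}
	}
	return false, ""
}

func main() {
	conds := []Condition{{
		Type:    "ReleaseAccepted",
		Status:  "False",
		Message: "Preconditions failed for the requested payload",
	}}
	if bad, msg := releaseNotAccepted(conds); bad {
		fmt.Println("would notify the admin:", msg)
	}
}
```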
Previously the CVO used a single goroutine, `SyncWorker.Start`, to both load an update (if available) and apply the currently loaded release. If, after loading an update, preconditions are not met (e.g. a verification failure), application of the currently loaded release is blocked, and therefore any changes that might occur to the currently loaded release are also not applied.

This change remedies that by breaking out the "load update" logic into a new method, `loadUpdatedPayload`. This new method is invoked from the goroutine `SyncWorker.Update`. Previously, when this routine recognized that an update was available, it would sync with `SyncWorker.Start`, which would then load the update and apply it (if no errors occurred). With this change, `SyncWorker.Update` first loads the update and, only if no errors occur, syncs with `SyncWorker.Start`, which causes it to start applying the update. Otherwise `SyncWorker.Start` continues applying the currently loaded release.

Also changed the payload updating event and report messaging to accurately reflect what is really being done at each step.
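A minimal sketch, not the CVO's actual implementation, of the load-then-apply split that description outlines: the update path loads first and only hands a successfully loaded payload to the apply loop, so a failed load leaves the apply loop working on the previously loaded release. All names here are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type payload struct{ version string }

// loadUpdatedPayload is a hypothetical stand-in for the new method of
// the same name; a nil error means verification and preconditions passed.
func loadUpdatedPayload(image string) (*payload, error) {
	if image == "registry.example/bad" {
		return nil, errors.New("precondition failed: release not verified")
	}
	return &payload{version: image}, nil
}

func main() {
	loaded := make(chan *payload, 1)

	// Apply loop (standing in for SyncWorker.Start): keeps reconciling
	// whichever payload was loaded last.
	go func() {
		current := &payload{version: "4.10.0"}
		for {
			select {
			case p := <-loaded:
				current = p // switch only after a clean load
			default:
			}
			fmt.Println("applying", current.version)
			time.Sleep(20 * time.Millisecond)
		}
	}()

	// Update path (standing in for SyncWorker.Update): load first; on
	// failure the apply loop is never signaled and keeps its release.
	if p, err := loadUpdatedPayload("4.11.0"); err == nil {
		loaded <- p
	} else {
		fmt.Println("load failed, still applying current release:", err)
	}
	time.Sleep(100 * time.Millisecond)
}
```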