
Conversation

@jottofar jottofar commented Oct 28, 2021

Previously, the CVO used a single goroutine, SyncWorker.Start, to both load an update (if available) and apply the currently loaded release. If preconditions were not met after loading an update (e.g. a verification failure), application of the currently loaded release was blocked, and therefore any changes that might occur to the currently loaded release were also not applied.

This change remedies that by breaking the "load update" logic out into a new method, loadUpdatedPayload, which is invoked from the SyncWorker.Update goroutine. Previously, when this routine recognized that an update was available, it would sync with SyncWorker.Start, which would then load the update and apply it (if no errors occurred). With this change, SyncWorker.Update first loads the update and, only if no errors occur, syncs with SyncWorker.Start, which then starts applying the update. Otherwise SyncWorker.Start continues applying the currently loaded release.
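A minimal sketch of the new flow (simplified types and signatures, assumed for illustration; the real SyncWorker in pkg/cvo/sync_worker.go carries considerably more state and channels):

```go
// Sketch only: simplified stand-ins for the real SyncWorker.
package cvo

import (
	"context"
	"fmt"
)

type payload struct{ version string }

type syncWorker struct {
	applyCh chan *payload // hands an accepted payload to the apply loop in Start
}

// loadUpdatedPayload retrieves and verifies the desired update without
// disturbing the payload that Start is currently applying.
func (w *syncWorker) loadUpdatedPayload(ctx context.Context, desired string) (*payload, error) {
	// ... retrieve image, verify signature, run preconditions ...
	if err := checkPreconditions(ctx, desired); err != nil {
		return nil, fmt.Errorf("preconditions not met for %s: %w", desired, err)
	}
	return &payload{version: desired}, nil
}

// Update is the change-detection goroutine. Only after a successful load
// does it hand the new payload to Start; on any load error, Start simply
// keeps applying the currently loaded release.
func (w *syncWorker) Update(ctx context.Context, desired string) error {
	p, err := w.loadUpdatedPayload(ctx, desired)
	if err != nil {
		return err // current release keeps reconciling
	}
	w.applyCh <- p // Start picks this up and begins applying the update
	return nil
}

func checkPreconditions(ctx context.Context, version string) error { return nil }
```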

Also changed the payload-update event and status-report messaging to accurately reflect what is being done at each step.

openshift-ci bot commented Oct 28, 2021

@jottofar: This pull request references Bugzilla bug 1822752, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

Bug 1822752: pkg/cvo: Separate payload load from payload apply

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/severity-low and bugzilla/valid-bug labels Oct 28, 2021
@openshift-ci openshift-ci bot added the approved label Oct 28, 2021
@jottofar jottofar force-pushed the bug-1822752 branch 5 times, most recently from 7872ae8 to 3383d4b on November 2, 2021 20:08
jottofar commented Nov 3, 2021

/test e2e-agnostic

@jottofar jottofar force-pushed the bug-1822752 branch 16 times, most recently from 03da71e to 55b965f on November 10, 2021 20:22
@jottofar jottofar force-pushed the bug-1822752 branch 2 times, most recently from c476904 to cd0916d on November 11, 2021 14:36
wking commented Feb 11, 2022

Nothing remotely close to this code has landed since the previous round of CI:

/override ci/prow/e2e-agnostic
/override ci/prow/e2e-agnostic-operator

openshift-ci bot commented Feb 11, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic, ci/prow/e2e-agnostic-operator


In response to this:

Nothing remotely close to this code has landed since the previous round of CI:

/override ci/prow/e2e-agnostic
/override ci/prow/e2e-agnostic-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 5d414c3 into openshift:master Feb 11, 2022
openshift-ci bot commented Feb 11, 2022

@jottofar: All pull requests linked via external trackers have merged:

Bugzilla bug 1822752 has been moved to the MODIFIED state.


In response to this:

Bug 1822752: pkg/cvo: Separate payload load from payload apply

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Feb 11, 2022

@jottofar: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jottofar
Contributor Author

> don't think we're too worried about a few minutes offset as folks transition into this new code. Anyhow, looks good to me, although it will also be good to see logs from the CVO as we update out of the patched release into other places

From the loki logs, we loaded the new payload at 23:48:21, then that CVO instance shut down as expected and a new CVO running 4.11.0-0.ci.test-2022-02-04-230202-ci-op-lb302rqg-latest came up. ReleaseAccepted is really when the "current" CVO accepted the release, but even so, it only differs by seconds.

2022-02-04 23:48:21 | I0204 23:48:21.604372 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci.test-2022-02-04-225838-ci-op-lb302rqg-initial ...
2022-02-04 23:48:21 | I0204 23:48:21.604546 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PreconditionsPassed' preconditions passed for payload loaded version="4.11.0-0.ci.test-2022-02-04-230202-ci-op-lb302rqg-latest"...

jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Feb 21, 2022
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Feb 21, 2022
openshift-merge-robot added a commit that referenced this pull request Feb 22, 2022
Bug 1822752: pkg/cvo: Fix ups from separating load from apply #683
wking pushed a commit to wking/cluster-version-operator that referenced this pull request Mar 17, 2022
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Apr 28, 2022
since this will cause equalSyncWork to continually return
true and no reattempt to load the payload will occur until
the desired update is updated by the user again, e.g.
image change or change to force.

As a result, no attempt is made to recheck precondition
failures which may have been resolved and therefore
result in a successful payload load.

To take the specific
https://bugzilla.redhat.com/show_bug.cgi?id=2072389 bug
failure as an example: because no recheck is made on
the RecentEtcdBackup precondition, CVO does not detect
that the etcd backup has been completed and that it is
safe to continue with the update. In addition, as a result
of change
openshift#683,
the etcd operator must check ReleaseAccepted!=true rather
than Failing=true to trigger the start of the backup.
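As an illustration of that last point, a hedged sketch of such a check against the ClusterVersion conditions; the helper and the literal "ReleaseAccepted" condition-type string are written out by hand here for the example, not taken from the etcd operator's actual code:

```go
// Sketch only: how a consumer such as the etcd operator might detect a
// rejected release via ClusterVersion conditions.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// findCondition returns the condition with the given type, or nil.
func findCondition(cv *configv1.ClusterVersion, t configv1.ClusterStatusConditionType) *configv1.ClusterOperatorStatusCondition {
	for i := range cv.Status.Conditions {
		if cv.Status.Conditions[i].Type == t {
			return &cv.Status.Conditions[i]
		}
	}
	return nil
}

// releaseNotAccepted reports whether the desired release has not been
// accepted (for example, a failing precondition such as RecentEtcdBackup).
func releaseNotAccepted(cv *configv1.ClusterVersion) bool {
	c := findCondition(cv, configv1.ClusterStatusConditionType("ReleaseAccepted"))
	return c != nil && c.Status != configv1.ConditionTrue
}

func main() {
	cv := &configv1.ClusterVersion{}
	fmt.Println("release not accepted:", releaseNotAccepted(cv))
}
```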
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Apr 28, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
Synchronize the CVO's Upgradeable status more often for a limited
period when and after the precondition checks start to fail. We don't
want to check more frequently forever, in case the precondition checks
are failing because of a bigger problem that isn't going to resolve
itself shortly after the start of the upgrade.

A precondition check can fail for at least
`optr.minimumUpdateCheckInterval` because of `Upgradeable==false`.
That `Upgradeable==false` condition may already have been resolved
(an operator reporting `Upgradeable==false` only momentarily), yet the
upgrade still waits up to `optr.minimumUpdateCheckInterval` for the
next synchronization. Synchronizing the Upgradeable status again while
precondition checks are failing speeds up an upgrade that is stuck on
a precondition check that has since been resolved. We don't want to
check forever in case the precondition checks keep failing for a long
time due to a bigger problem.

This commit is part of a fix for bug [1].

The bug was caused by the slow syncing of the CVO's Upgradeable status
when a precondition check fails, combined with the less frequent
running of the precondition checks. The frequency of the precondition
checks was fixed by [2], which fixed the etcd backup halting an
upgrade for a prolonged time [3]. The problem of `Upgradeable==false`
due to the `ErrorCheckingOperatorCompatibility` error caused by the
OLM operator [1] was fixed by [3]. However, the main root cause of the
`ErrorCheckingOperatorCompatibility` error probably still remained.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611

[2] openshift#683

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348

[4] openshift#766
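A rough sketch of the time-windowed resync the message describes; apart from the `minimumUpdateCheckInterval` idea, the interval values and names here are invented for the example:

```go
// Sketch only: resync the Upgradeable status more often for a bounded
// window after preconditions start failing. Names and values are illustrative.
package main

import (
	"fmt"
	"time"
)

const (
	minimumUpdateCheckInterval = 15 * time.Minute // normal resync cadence (illustrative value)
	failingResyncInterval      = 1 * time.Minute  // faster cadence while preconditions fail
	failingResyncWindow        = 30 * time.Minute // stop checking faster after this long
)

type upgradeableSyncer struct {
	preconditionsFailingSince time.Time // zero when preconditions are passing
}

// nextInterval returns how long to wait before the next Upgradeable sync.
func (s *upgradeableSyncer) nextInterval(now time.Time) time.Duration {
	if s.preconditionsFailingSince.IsZero() {
		return minimumUpdateCheckInterval
	}
	if now.Sub(s.preconditionsFailingSince) <= failingResyncWindow {
		// Recently started failing: the blocking condition (for example a
		// momentary Upgradeable==false) may already be resolved, so recheck soon.
		return failingResyncInterval
	}
	// Failing for a long time: likely a bigger problem, back off to normal cadence.
	return minimumUpdateCheckInterval
}

func main() {
	s := &upgradeableSyncer{preconditionsFailingSince: time.Now().Add(-5 * time.Minute)}
	fmt.Println("next Upgradeable sync in", s.nextInterval(time.Now()))
}
```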
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 23, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 24, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
petr-muller pushed a commit to petr-muller/cluster-version-operator that referenced this pull request Feb 3, 2023
When we separated payload load from payload apply (openshift#683), the context
used for the retrieval changed as well. It went from one that was
constrained by syncTimeout (2-4 minutes) [1] to the unconstrained
shutdownContext [2]. However, if "force" is specified we explicitly set a
2-minute timeout in RetrievePayload. This commit creates a new context
with a reasonable timeout for RetrievePayload regardless of "force".

[1]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605

[2]
https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
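A sketch of the shape of that fix; the retriever type and timeout constant below are placeholders for the example, not the CVO's actual RetrievePayload signature or syncTimeout value:

```go
// Sketch only: give the payload retrieval its own bounded context instead
// of inheriting the unconstrained shutdown context.
package main

import (
	"context"
	"fmt"
	"time"
)

const retrieveTimeout = 4 * time.Minute // stand-in for a syncTimeout-like bound

type payloadRetriever func(ctx context.Context, image string) error

// retrievePayloadWithTimeout bounds the retrieval regardless of whether
// "force" was requested.
func retrievePayloadWithTimeout(parent context.Context, retrieve payloadRetriever, image string) error {
	ctx, cancel := context.WithTimeout(parent, retrieveTimeout)
	defer cancel()
	return retrieve(ctx, image)
}

func main() {
	err := retrievePayloadWithTimeout(context.Background(), func(ctx context.Context, image string) error {
		deadline, _ := ctx.Deadline()
		fmt.Println("retrieving", image, "with deadline", deadline)
		return nil
	}, "quay.io/example/release:4.11")
	fmt.Println("err:", err)
}
```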
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2023
…erReleaseNotAccepted

We grew a ReleaseAccepted condition in 7221c93 (pkg/cvo: Separate
payload load from payload apply, 2021-10-28, openshift#683), which landed in
4.11 [1] and was backported to 4.10.8 [2].  However, in order to
notice a ReleaseAccepted!=True condition, users would need to be
checking 'oc adm upgrade' [3] or watching the web-console interface
[4].  With this change, we add an alert, so admins can have
push-notification to supplement those polling approaches.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1822752#c49
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2064991#c7
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=2065507
[4]: https://issues.redhat.com//browse/OCPBUGS-3069
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Mar 20, 2023