Conversation

@jottofar
Contributor

@jottofar jottofar commented Apr 18, 2022

Do not save the desired update when the payload load fails, since saving it causes equalSyncWork to continually return true and no reattempt to load the payload occurs until the user changes the desired update again, e.g. an image change or a change to force. This was accomplished by moving the work update, including the desired version, to after the desired version's payload has been successfully loaded.

See commit message for more details on precondition failures in general and the RecentEtcdBackup precondition specifically.
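
For illustration only, here is a minimal, self-contained sketch of the idea. All type and function names below (SyncWorker, SyncWork, loadUpdatedPayload, equalSyncWork) are stand-ins borrowed from this discussion; the real code lives in pkg/cvo/sync_worker.go and differs in detail.

package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the CVO's internal types; the real definitions
// live in pkg/cvo/sync_worker.go and differ in detail.
type SyncWork struct {
	DesiredImage string
	Force        bool
	Capabilities []string
}

type payload struct{ image string }

type SyncWorker struct {
	work    *SyncWork
	payload *payload
}

// equalSyncWork is a simplified stand-in for the real comparison.
func equalSyncWork(a, b *SyncWork) bool {
	return a != nil && b != nil && a.DesiredImage == b.DesiredImage && a.Force == b.Force
}

// loadUpdatedPayload stands in for the real payload retrieval, which also
// evaluates preconditions such as RecentEtcdBackup.
func (w *SyncWorker) loadUpdatedPayload(ctx context.Context, work *SyncWork) (*payload, error) {
	return nil, fmt.Errorf("precondition RecentEtcdBackup failed")
}

// Update illustrates the fix: the desired update is stored in w.work only
// after its payload has loaded. Before the fix, w.work was set up front, so a
// later call with the same desired update made equalSyncWork return true and
// the failed load was never retried.
func (w *SyncWorker) Update(ctx context.Context, work *SyncWork) error {
	if equalSyncWork(w.work, work) {
		return nil // nothing changed since the last successful load
	}
	p, err := w.loadUpdatedPayload(ctx, work)
	if err != nil {
		return err // w.work left untouched, so the load will be reattempted
	}
	w.payload = p
	w.work = work
	return nil
}

func main() {
	w := &SyncWorker{}
	err := w.Update(context.Background(), &SyncWork{DesiredImage: "quay.io/example/release:4.11"})
	fmt.Println(err) // the load fails, but the desired update was not saved
}

On the next notification with the same desired update, equalSyncWork still returns false (w.work was never updated), so the payload load, and with it the precondition checks, is attempted again.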

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 18, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 18, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

WIP: Bug 2072389: Recheck upgrade payload precondition failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2022
@jottofar jottofar force-pushed the bug-2072389 branch 2 times, most recently from 939fca7 to f9e9808 Compare April 19, 2022 21:29
@jottofar
Contributor Author

/retitle Bug 2072389: Do not save desired update on load failures

@openshift-ci openshift-ci bot changed the title WIP: Bug 2072389: Recheck upgrade payload precondition failures Bug 2072389: Do not save desired update on load failures Apr 19, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 19, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from shellyyang1989 April 19, 2022 21:33
@jottofar jottofar force-pushed the bug-2072389 branch 3 times, most recently from 27e8319 to d2215c1 Compare April 20, 2022 16:24
@openshift-ci
Contributor

openshift-ci bot commented Apr 20, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jottofar
Contributor Author

/retest

Image: work.Desired.Image,
},
}
w.work = work
Member

setting this before the w.loadUpdatedPayload feels weird. Why do we need this?

Contributor Author

It is weird. I wanted to get Capabilities saved since they can change independently of the payload. Changed it so that only the capabilities are saved, not all of work.

w.work.Capabilities = work.Capabilities
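
Continuing the hypothetical sketch from the PR description above (the names are still stand-ins, not the actual sync_worker.go code), the compromise reads roughly:

// Inside the hypothetical Update: copy only the capability selections before
// the load, because they can change independently of the payload.
if w.work != nil {
	w.work.Capabilities = work.Capabilities
}
p, err := w.loadUpdatedPayload(ctx, work)
if err != nil {
	return err // desired version still not saved; the load will be retried
}
w.payload = p
w.work = work // the full work, including the desired version, is saved only now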

if !versionEqual && oldDesired == nil {
klog.Infof("Propagating initial target version %v to sync worker loop in state %s.", desired, state)
Member

We may want to shift this log block to after the w.loadUpdatedPayload too. We want to keep the "Ignoring detected version change..." bailout block before the w.loadUpdatedPayload.

@jottofar jottofar force-pushed the bug-2072389 branch 2 times, most recently from 9020991 to d4252f0 Compare April 28, 2022 19:51
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Do not save desired update on load failures

Do not save the desired update when its payload fails to load, since saving it causes equalSyncWork to continually return true and no reattempt to load the payload will occur until the desired update is changed by the user again, e.g. an image change or a change to force.

As a result, no attempt is made to recheck precondition failures which may have since been resolved and would therefore allow a successful payload load.

To take the specific failure from https://bugzilla.redhat.com/show_bug.cgi?id=2072389 as an example: because no recheck is made of the RecentEtcdBackup precondition, CVO does not detect that the etcd backup has completed and that it is safe to continue with the update. In addition, as a result of change openshift#683, the etcd operator must check ReleaseAccepted!=true rather than Failing=true to trigger the start of the backup.
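
As a rough, self-contained illustration of that last point (this is not the etcd operator's actual code, and the condition shape is simplified from the real github.com/openshift/api types), a backup trigger keyed on ReleaseAccepted rather than Failing could look like:

package main

import "fmt"

// condition is a simplified stand-in for a ClusterVersion status condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// shouldStartBackup sketches the check described above: after openshift#683 a
// payload rejected by preconditions is reported through ReleaseAccepted, so a
// trigger keyed on Failing==True would never fire for this case.
func shouldStartBackup(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "ReleaseAccepted" {
			return c.Status != "True"
		}
	}
	return false // no ReleaseAccepted condition reported yet
}

func main() {
	conds := []condition{{Type: "ReleaseAccepted", Status: "False"}}
	fmt.Println(shouldStartBackup(conds)) // true: take the etcd backup
}
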
Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-ci
Contributor

openshift-ci bot commented Apr 29, 2022

@jottofar: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 118e938 into openshift:master Apr 29, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 29, 2022

@jottofar: All pull requests linked via external trackers have merged:

Bugzilla bug 2072389 has been moved to the MODIFIED state.


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member

wking commented May 5, 2022

/cherrypick release-4.10

@openshift-cherrypick-robot

@wking: #766 failed to apply on top of branch "release-4.10":

Applying: Do not save desired update on load failures
Using index info to reconstruct a base tree...
M	pkg/cvo/cvo_scenarios_test.go
M	pkg/cvo/sync_worker.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/sync_worker.go
CONFLICT (content): Merge conflict in pkg/cvo/sync_worker.go
Auto-merging pkg/cvo/cvo_scenarios_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Do not save desired update on load failures
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

/cherrypick release-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request May 9, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
Synchronize the CVO's Upgradeable status more often for a given period of time when, and after, the precondition checks start to fail. We don't want to check more frequently forever, in case the precondition checks are failing because of a bigger problem that wasn't quickly resolved at the start of the upgrade.

A precondition check can keep failing for at least `optr.minimumUpdateCheckInterval` because of `Upgradeable==false`, even though the `Upgradeable==false` condition may already have been resolved (an operator reporting it only momentarily), forcing a wait of up to `optr.minimumUpdateCheckInterval` for the next synchronization. Synchronizing the Upgradeable status again while precondition checks are failing speeds up an upgrade stuck on a precondition check that may already have been resolved. We don't want to check forever in case the precondition checks keep failing for a long time due to a bigger problem.

This commit is part of a fix for the bug [1].

The bug was caused by slow syncing of the CVO's Upgradeable status when a precondition check fails, combined with infrequent running of the precondition checks. The frequency of precondition checks was fixed by [2], which fixed the etcd backup halting an upgrade for a prolonged time [3]. The problem of `Upgradeable==false` due to the `ErrorCheckingOperatorCompatibility` error caused by the OLM operator [1] was fixed by [3]. However, the main root cause of the `ErrorCheckingOperatorCompatibility` error probably still remained.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611

[2] openshift#683

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348

[4] openshift#766
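
For illustration, a minimal sketch of the interval logic this message describes; minimumUpdateCheckInterval is named in the message, while the window length, faster cadence, and helper name below are made up for this example:

package main

import (
	"fmt"
	"time"
)

// Stand-in value; the real optr.minimumUpdateCheckInterval is defined in the CVO.
const minimumUpdateCheckInterval = 15 * time.Minute

// upgradeableSyncInterval returns a shorter resync interval for a bounded
// window after precondition checks start failing, and the normal interval
// otherwise, so a long-lived failure does not cause extra load forever.
func upgradeableSyncInterval(now, lastPreconditionFailure time.Time) time.Duration {
	const fasterWindow = 1 * time.Hour     // hypothetical window length
	const fasterInterval = 1 * time.Minute // hypothetical faster cadence
	if !lastPreconditionFailure.IsZero() && now.Sub(lastPreconditionFailure) < fasterWindow {
		return fasterInterval
	}
	return minimumUpdateCheckInterval
}

func main() {
	now := time.Now()
	fmt.Println(upgradeableSyncInterval(now, now.Add(-10*time.Minute))) // 1m0s
	fmt.Println(upgradeableSyncInterval(now, now.Add(-2*time.Hour)))    // 15m0s
}
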
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 23, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 24, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022