Conversation

@jottofar
Contributor

@jottofar jottofar commented Apr 18, 2022

Do not save the desired update when the payload load fails, since saving it causes equalSyncWork to continually return true and no reattempt to load the payload occurs until the user changes the desired update again, e.g. an image change or a change to force. This was accomplished by moving the work update, including the desired version, to after the desired version's payload has been successfully loaded.

See commit message for more details on precondition failures in general and the RecentEtcdBackup precondition specifically.
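
For illustration only, here is a minimal, self-contained sketch of the idea. All type and function names below (SyncWorker, SyncWork, loadUpdatedPayload, equalSyncWork) are stand-ins borrowed from this discussion; the real code lives in pkg/cvo/sync_worker.go and differs in detail.

package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the CVO's internal types; the real definitions
// live in pkg/cvo/sync_worker.go and differ in detail.
type SyncWork struct {
	DesiredImage string
	Force        bool
	Capabilities []string
}

type payload struct{ image string }

type SyncWorker struct {
	work    *SyncWork
	payload *payload
}

// equalSyncWork is a simplified stand-in for the real comparison.
func equalSyncWork(a, b *SyncWork) bool {
	return a != nil && b != nil && a.DesiredImage == b.DesiredImage && a.Force == b.Force
}

// loadUpdatedPayload stands in for the real payload retrieval, which also
// evaluates preconditions such as RecentEtcdBackup.
func (w *SyncWorker) loadUpdatedPayload(ctx context.Context, work *SyncWork) (*payload, error) {
	return nil, fmt.Errorf("precondition RecentEtcdBackup failed")
}

// Update illustrates the fix: the desired update is stored in w.work only
// after its payload has loaded. Before the fix, w.work was set up front, so a
// later call with the same desired update made equalSyncWork return true and
// the failed load was never retried.
func (w *SyncWorker) Update(ctx context.Context, work *SyncWork) error {
	if equalSyncWork(w.work, work) {
		return nil // nothing changed since the last successful load
	}
	p, err := w.loadUpdatedPayload(ctx, work)
	if err != nil {
		return err // w.work left untouched, so the load will be reattempted
	}
	w.payload = p
	w.work = work
	return nil
}

func main() {
	w := &SyncWorker{}
	err := w.Update(context.Background(), &SyncWork{DesiredImage: "quay.io/example/release:4.11"})
	fmt.Println(err) // the load fails, but the desired update was not saved
}

On the next notification with the same desired update, equalSyncWork still returns false (w.work was never updated), so the payload load, and with it the precondition checks, is attempted again.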

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 18, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 18, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

WIP: Bug 2072389: Recheck upgrade payload precondition failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2022
@jottofar jottofar force-pushed the bug-2072389 branch 2 times, most recently from 939fca7 to f9e9808 Compare April 19, 2022 21:29
@jottofar
Contributor Author

/retitle Bug 2072389: Do not save desired update on load failures

@openshift-ci openshift-ci bot changed the title WIP: Bug 2072389: Recheck upgrade payload precondition failures Bug 2072389: Do not save desired update on load failures Apr 19, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 19, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from shellyyang1989 April 19, 2022 21:33
@jottofar jottofar force-pushed the bug-2072389 branch 3 times, most recently from 27e8319 to d2215c1 Compare April 20, 2022 16:24
@openshift-ci
Contributor

openshift-ci bot commented Apr 20, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jottofar
Contributor Author

/retest

Image: work.Desired.Image,
},
}
w.work = work
Member

setting this before the w.loadUpdatedPayload feels weird. Why do we need this?

Contributor Author

It is weird. I wanted to get Capabilities saved since they can change independently of the payload. Changed it so that only the capabilities are saved, not all of work.

w.work.Capabilities = work.Capabilities
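
Continuing the hypothetical sketch from the PR description above (the names are still stand-ins, not the actual sync_worker.go code), the compromise reads roughly:

// Inside the hypothetical Update: copy only the capability selections before
// the load, because they can change independently of the payload.
if w.work != nil {
	w.work.Capabilities = work.Capabilities
}
p, err := w.loadUpdatedPayload(ctx, work)
if err != nil {
	return err // desired version still not saved; the load will be retried
}
w.payload = p
w.work = work // the full work, including the desired version, is saved only now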

if !versionEqual && oldDesired == nil {
klog.Infof("Propagating initial target version %v to sync worker loop in state %s.", desired, state)
Member

We may want to shift this log block to after the w.loadUpdatedPayload too. We want to keep the "Ignoring detected version change..." bailout block before the w.loadUpdatedPayload.

@jottofar jottofar force-pushed the bug-2072389 branch 2 times, most recently from 9020991 to d4252f0 Compare April 28, 2022 19:51
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

@jottofar: This pull request references Bugzilla bug 2072389, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Do not save desired update on load failures

Do not save the desired update when its payload fails to load, since saving it causes equalSyncWork to continually return true and no reattempt to load the payload will occur until the desired update is changed by the user again, e.g. an image change or a change to force.

As a result, no attempt is made to recheck precondition failures which may have since been resolved and would therefore allow a successful payload load.

To take the specific failure from https://bugzilla.redhat.com/show_bug.cgi?id=2072389 as an example: because no recheck is made of the RecentEtcdBackup precondition, CVO does not detect that the etcd backup has completed and that it is safe to continue with the update. In addition, as a result of change openshift#683, the etcd operator must check ReleaseAccepted!=true rather than Failing=true to trigger the start of the backup.
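
As a rough, self-contained illustration of that last point (this is not the etcd operator's actual code, and the condition shape is simplified from the real github.com/openshift/api types), a backup trigger keyed on ReleaseAccepted rather than Failing could look like:

package main

import "fmt"

// condition is a simplified stand-in for a ClusterVersion status condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// shouldStartBackup sketches the check described above: after openshift#683 a
// payload rejected by preconditions is reported through ReleaseAccepted, so a
// trigger keyed on Failing==True would never fire for this case.
func shouldStartBackup(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "ReleaseAccepted" {
			return c.Status != "True"
		}
	}
	return false // no ReleaseAccepted condition reported yet
}

func main() {
	conds := []condition{{Type: "ReleaseAccepted", Status: "False"}}
	fmt.Println(shouldStartBackup(conds)) // true: take the etcd backup
}
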
Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-ci
Contributor

openshift-ci bot commented Apr 29, 2022

@jottofar: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 118e938 into openshift:master Apr 29, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 29, 2022

@jottofar: All pull requests linked via external trackers have merged:

Bugzilla bug 2072389 has been moved to the MODIFIED state.


In response to this:

Bug 2072389: Do not save desired update on load failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member

wking commented May 5, 2022

/cherrypick release-4.10

@openshift-cherrypick-robot

@wking: #766 failed to apply on top of branch "release-4.10":

Applying: Do not save desired update on load failures
Using index info to reconstruct a base tree...
M	pkg/cvo/cvo_scenarios_test.go
M	pkg/cvo/sync_worker.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/sync_worker.go
CONFLICT (content): Merge conflict in pkg/cvo/sync_worker.go
Auto-merging pkg/cvo/cvo_scenarios_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Do not save desired update on load failures
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


In response to this:

/cherrypick release-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request May 9, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
Synchronize the CVO's Upgradeable status more often for a given period of time when, and after, the precondition checks start to fail. We don't want to check more frequently forever, in case the precondition checks are failing because of a bigger problem that wasn't quickly resolved at the start of the upgrade.

A precondition check can keep failing for at least `optr.minimumUpdateCheckInterval` because of `Upgradeable==false`, even though the `Upgradeable==false` condition may already have been resolved (an operator reporting it only momentarily), forcing a wait of up to `optr.minimumUpdateCheckInterval` for the next synchronization. Synchronizing the Upgradeable status again while precondition checks are failing speeds up an upgrade stuck on a precondition check that may already have been resolved. We don't want to check forever in case the precondition checks keep failing for a long time due to a bigger problem.

This commit is part of a fix for the bug [1].

The bug was caused by slow syncing of the CVO's Upgradeable status when a precondition check fails, combined with infrequent running of the precondition checks. The frequency of precondition checks was fixed by [2], which fixed the etcd backup halting an upgrade for a prolonged time [3]. The problem of `Upgradeable==false` due to the `ErrorCheckingOperatorCompatibility` error caused by the OLM operator [1] was fixed by [3]. However, the main root cause of the `ErrorCheckingOperatorCompatibility` error probably still remained.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611

[2] openshift#683

[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348

[4] openshift#766
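
For illustration, a minimal sketch of the interval logic this message describes; minimumUpdateCheckInterval is named in the message, while the window length, faster cadence, and helper name below are made up for this example:

package main

import (
	"fmt"
	"time"
)

// Stand-in value; the real optr.minimumUpdateCheckInterval is defined in the CVO.
const minimumUpdateCheckInterval = 15 * time.Minute

// upgradeableSyncInterval returns a shorter resync interval for a bounded
// window after precondition checks start failing, and the normal interval
// otherwise, so a long-lived failure does not cause extra load forever.
func upgradeableSyncInterval(now, lastPreconditionFailure time.Time) time.Duration {
	const fasterWindow = 1 * time.Hour     // hypothetical window length
	const fasterInterval = 1 * time.Minute // hypothetical faster cadence
	if !lastPreconditionFailure.IsZero() && now.Sub(lastPreconditionFailure) < fasterWindow {
		return fasterInterval
	}
	return minimumUpdateCheckInterval
}

func main() {
	now := time.Now()
	fmt.Println(upgradeableSyncInterval(now, now.Add(-10*time.Minute))) // 1m0s
	fmt.Println(upgradeableSyncInterval(now, now.Add(-2*time.Hour)))    // 15m0s
}
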
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 3, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 4, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 23, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 24, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022
DavidHurta added a commit to DavidHurta/cluster-version-operator that referenced this pull request Aug 29, 2022