Bug 2006611: Upgrade takes too much time when upgrading via --to-image #808
Conversation
Skipping CI for Draft Pull Request.
/uncc jottofar LalatenduMohanty
Force-pushed 5b4e42d to 0982489.
Force-pushed 5450137 to 0bc2e76.
I will edit the title and remove the
/retest
Force-pushed 09dbec4 to 1a5d167.
@Davoska: This pull request references Bugzilla bug 2006611, which is invalid.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@Davoska: This pull request references Bugzilla bug 2006611, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
1 similar comment
/retest
@Davoska: This pull request references Bugzilla bug 2006611, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed 2fe68ca to 2f174bf.
LalatenduMohanty left a comment:
/lgtm
pkg/cvo/upgradeable.go (Outdated)

func shouldSyncUpgradeableDueToPreconditionChecks(optr *Operator, config *configv1.ClusterVersion, u *upgradeable) bool {
	cond := resourcemerge.FindOperatorStatusCondition(config.Status.Conditions, DesiredReleaseAccepted)
	if cond != nil && cond.Reason == "PreconditionChecks" && cond.Status == configv1.ConditionFalse &&
		hasPassedDurationSinceTime(u.At, optr.statusInterval) && !hasPassedDurationSinceTime(cond.LastTransitionTime.Time, optr.minimumUpdateCheckInterval) {
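hasPassedDurationSinceTime is referenced above but not shown in this excerpt. A minimal sketch of what such a helper presumably does, purely for illustration (the real helper in the PR may be implemented differently):

package cvo

import "time"

// hasPassedDurationSinceTime is assumed to report whether more than `duration`
// has elapsed since `t`. Illustrative sketch only, not the PR's actual code.
func hasPassedDurationSinceTime(t time.Time, duration time.Duration) bool {
	return time.Since(t) > duration
}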
optr.statusInterval
Update the comment where this is declared now that we're also using it to limit frequency of upgradeable checks here.
this feels like a separate enough thing, let's just declare a new property for it, or hard-code it here.
or maybe we don't even need the statusInterval too-fast guard at all. Looking for Started syncing upgradeable... in a recent CI run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1557180107716562944/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-69fc9586b6-h8fjp_cluster-version-operator.log | grep 'Started syncing upgradeable' | tail
I0810 04:21:46.656158 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:21:46.656147659 +0000 UTC m=+7759.515637877)
I0810 04:22:37.695673 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:22:37.695665773 +0000 UTC m=+7810.555155995)
I0810 04:26:10.068836 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:26:10.068828676 +0000 UTC m=+8022.928318900)
I0810 04:29:42.441410 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:29:42.44139783 +0000 UTC m=+8235.300888061)
I0810 04:33:14.814429 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:33:14.814418303 +0000 UTC m=+8447.673908518)
I0810 04:34:53.026298 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:34:53.026289695 +0000 UTC m=+8545.885779917)
I0810 04:38:25.399818 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:38:25.399809113 +0000 UTC m=+8758.259299332)
I0810 04:41:57.772956 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:41:57.772946757 +0000 UTC m=+8970.632436978)
I0810 04:45:30.147801 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:45:30.147790368 +0000 UTC m=+9183.007280583)
I0810 04:49:02.520685 1 cvo.go:615] Started syncing upgradeable "openshift-cluster-version/version" (2022-08-10 04:49:02.520678086 +0000 UTC m=+9395.380168305)
I haven't dug into the code to be clearer on the trigger, but it looks like it can already be ~minutes between triggers.
Hmm, in fact:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1557180107716562944/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-69fc9586b6-h8fjp_cluster-version-operator.log | grep -o 'Started syncing upgradeable\|Upgradeable conditions were recently checked' | sort | uniq -c
59 Started syncing upgradeable
34 Upgradeable conditions were recently checked
Even if we completely drop the guards, we're still only ~doubling the number of times we'd sync. If the idea is to noticeably lower latency, we might need to go figure out that trigger so we can trigger more frequently.
I am fine with dropping the guards. If it backfires, we can come back and fix it. I suggested this to @Davoska to make the code future-proof, i.e. to make sure we do not end up in a hot loop.
I would still prefer to merge this as it is, but we can come back later to "go figure out that trigger so we can trigger more frequently", as @wking suggested.
I had misunderstood when upgradeableSync() gets called.
As of this moment, upgradeableSync() only gets called by the worker when a key is added to optr.upgradeableQueue.
cluster-version-operator/pkg/cvo/cvo.go
Lines 388 to 392 in 2f174bf
go func() {
	defer utilruntime.HandleCrash()
	wait.UntilWithContext(runContext, func(runContext context.Context) { optr.worker(runContext, optr.upgradeableQueue, optr.upgradeableSync) }, time.Second)
	resultChannel <- asyncResult{name: "upgradeable"}
}()
The only time a key is added to optr.upgradeableQueue is by the AddFunc and UpdateFunc handlers defined in eventHandler() for the cluster version.
cluster-version-operator/pkg/cvo/cvo.go
Lines 462 to 479 in 2f174bf
func (optr *Operator) eventHandler() cache.ResourceEventHandler {
	workQueueKey := optr.queueKey()
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			optr.queue.Add(workQueueKey)
			optr.availableUpdatesQueue.Add(workQueueKey)
			optr.upgradeableQueue.Add(workQueueKey)
		},
		UpdateFunc: func(old, new interface{}) {
			optr.queue.Add(workQueueKey)
			optr.availableUpdatesQueue.Add(workQueueKey)
			optr.upgradeableQueue.Add(workQueueKey)
		},
		DeleteFunc: func(obj interface{}) {
			optr.queue.Add(workQueueKey)
		},
	}
}
Originally, the equation for calculating the next time the entire body of upgradeableSync() executes would be something like: optr.minimumUpdateCheckInterval + DURATION_TO_NEXT_UPDATE_OF_THE_CLUSTER_VERSION.
A problem could arise again in the future if the cluster version is updated less often while more frequent checks of the Upgradeable status are needed. So dropping the guard still wouldn't be sufficient, as Trevor pointed out, and it would allow potentially unnecessary hot-looping.
However, the time-sensitive part of the code remains the same: initializing an upgrade.
A potential idea could be something like periodically (every ~15 seconds) running optr.upgradeableQueue.Add(optr.queueKey()) at the beginning of an upgrade in the case where the CVO's Upgradeable is False.
In this case, upgradeableSync() would be guaranteed to run at least every 15 seconds, and in the worst case every 30 seconds (15 seconds of the too-fast guard + 15 seconds of the periodic run).
But if we presume there is a sufficient number of updates to the cluster version at the beginning of an upgrade for keys to get added to the upgradeableQueue, we could potentially ignore this.
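A minimal sketch of that periodic re-queue idea, assuming access to the operator's upgradeable work queue and queue key; the function name, the 15-second period, and the stop condition are illustrative assumptions, not code from this PR:

package cvo

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// requeueUpgradeablePeriodically re-adds key to the upgradeable work queue every
// period until ctx is cancelled, so upgradeableSync() would run at least that
// often while an upgrade is being initialized. Illustrative sketch only.
func requeueUpgradeablePeriodically(ctx context.Context, queue workqueue.RateLimitingInterface, key string, period time.Duration) {
	wait.UntilWithContext(ctx, func(context.Context) {
		queue.Add(key)
	}, period)
}

It would presumably be started with something like go requeueUpgradeablePeriodically(runContext, optr.upgradeableQueue, optr.queueKey(), 15*time.Second) at the beginning of an upgrade and stopped by cancelling the context once the upgrade is accepted.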
@wking and I talked about this and we think the solution in this PR is doing the right thing.
In https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/cvo.go#L397 it is calling optr.worker() every second. If there is any work, it gets processed by processNextWorkItem(); if there is not, we just continue checking every second until we get work. So with this change, if there is a change in upgradeable, then within 1 second the CVO will start the process of reconciliation.
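For context, the usual client-go work-queue worker pattern being described here looks roughly like the following; this is an illustrative sketch of the general pattern, not code copied from the CVO:

package cvo

import (
	"context"

	"k8s.io/client-go/util/workqueue"
)

// worker drains the queue until it is shut down, handing each key to sync.
func worker(ctx context.Context, queue workqueue.RateLimitingInterface, sync func(ctx context.Context, key string) error) {
	for processNextWorkItem(ctx, queue, sync) {
	}
}

// processNextWorkItem processes a single queued key and reports whether the
// worker should keep going.
func processNextWorkItem(ctx context.Context, queue workqueue.RateLimitingInterface, sync func(ctx context.Context, key string) error) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := sync(ctx, key.(string)); err != nil {
		queue.AddRateLimited(key) // retry later with backoff
		return true
	}
	queue.Forget(key) // success: reset the rate-limiting history for this key
	return true
}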
/retest
}

-	return optr.syncUpgradeable()
+	return optr.syncUpgradeable(config)
The CVO using the in-cluster ClusterVersion to pass state from one goroutine to another feels sticky... I'm not set against it, but it would be nice if we could find a way to plumb the data we need through without having to pass this thing we just got from the cluster here. syncUpgradeable is already an Operator method, so it should have access to a lot of state already...
I had seen this, but somehow did not spend time figuring out if we can avoid this.
Yes, I wasn't keen on it either. However, syncAvailableUpdates() in pkg/cvo/availableupdates.go also uses the cluster version as a parameter (https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/availableupdates.go#L31), so I ended up using it here as well. (Now I see the commit db150e6 for that change was from 4 years ago.)
Another alternative I can think of is calling config, err := optr.cvLister.Get(optr.name) in syncUpgradeable(), but I'll try to think of a better idea.
I do not think it is a bad idea to pass config in optr.syncUpgradeable(config), because upgradeableSync() is already in a goroutine and it is calling optr.cvLister.Get(optr.name). So I do not think there is much difference between upgradeableSync() or syncUpgradeable() calling optr.cvLister.Get(optr.name).
Right, I don't think it matters if we shift the Get around. What I'd like long-term is no get at all in this pathway. Looking here, here, and here, something like optr.getUpgradeable().Conditions will get you the current conditions without having to leave local CVO memory. Although a lock guarding access to a pointer property seems a bit suspect...
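A minimal sketch of what an in-memory, lock-guarded accessor like that could look like; the type and field names here are assumptions for illustration, not necessarily the CVO's actual ones:

package cvo

import (
	"sync"
	"time"

	configv1 "github.com/openshift/api/config/v1"
)

// upgradeableStatus is an assumed in-memory record of the last upgradeable check.
type upgradeableStatus struct {
	At         time.Time
	Conditions []configv1.ClusterOperatorStatusCondition
}

// Operator here only models the two assumed fields relevant to the sketch.
type Operator struct {
	upgradeableLock sync.Mutex
	upgradeable     *upgradeableStatus
}

// getUpgradeable returns the last computed upgradeable state without touching
// the cluster; callers would read conditions via optr.getUpgradeable().Conditions.
func (optr *Operator) getUpgradeable() *upgradeableStatus {
	optr.upgradeableLock.Lock()
	defer optr.upgradeableLock.Unlock()
	return optr.upgradeable
}

Returning the shared pointer is what makes the locking look suspect: the mutex protects the pointer swap, not the data the caller goes on to read.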
Force-pushed 2f174bf to 5aac7a3.
Synchronize the CVO's Upgradeable status more often for a given period of time when and after the precondition checks start to fail. We don't want to check more frequently forever in case the precondition checks are failing because of a bigger problem that wasn't quickly solved at the start of the upgrade.

A precondition check can be failing for at least `optr.minimumUpdateCheckInterval` because of `Upgradeable==false`. The `Upgradeable==false` could potentially have already been resolved (an operator reporting `Upgradeable==false` just momentarily) during that time, resulting in waiting for up to `optr.minimumUpdateCheckInterval` for the next synchronization. Synchronize the upgradeable status again in case of failing precondition checks to speed up initializing an upgrade stuck on a precondition check that has potentially already been solved. We don't want to check this forever in case the precondition checks keep failing for a long time due to a bigger problem.

This commit is part of a fix for the bug [1]. The bug was caused by slow syncing of the CVO's Upgradeable status when a precondition check fails, combined with less frequent running of the precondition checks. The frequency of precondition checks was fixed by [2], fixing the Etcd backup halting an upgrade for a prolonged time [3]. The problem of `Upgradeable==false` due to the `ErrorCheckingOperatorCompatibility` error caused by the OLM operator [1] was fixed by [4]. However, the main root cause of the `ErrorCheckingOperatorCompatibility` error probably still remains.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2006611
[2] openshift#683
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2072348
[4] openshift#766
Force-pushed 5aac7a3 to bd05174.
/unhold
/retest
LalatenduMohanty left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Davoska, LalatenduMohanty. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/test unit
@Davoska: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@Davoska: All pull requests linked via external trackers have merged: Bugzilla bug 2006611 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This pull request will synchronize the Upgradeable status more frequently when the precondition checks on the payload start failing and will improve the logging regarding synchronization of the upgradeable status. The more frequent synchronization will happen for a given period after the precondition checks start to fail.
TL;DR:
A precondition check originally could be failing for at least `optr.minimumUpdateCheckInterval` because of `Upgradeable==false`. The `Upgradeable==false` could potentially have already been resolved (a cluster operator reporting `Upgradeable==false` just momentarily) during that time, resulting in waiting for up to `optr.minimumUpdateCheckInterval` for the next synchronization. Synchronize the upgradeable status more frequently in case of failing precondition checks to speed up initializing an upgrade stuck on a precondition check that has potentially already been solved. We don't want to check this forever in case the precondition checks keep failing for a long time due to a bigger problem.

The bug was caused by slow syncing of the CVO's Upgradeable status when a precondition check fails, combined with less frequent running of the precondition checks. The frequency of precondition checks was fixed by #683, fixing the Etcd backup halting an upgrade for a prolonged time (https://bugzilla.redhat.com/show_bug.cgi?id=2072348). The problem of `Upgradeable==false` due to `ErrorCheckingOperatorCompatibility` caused by the OLM operator (https://bugzilla.redhat.com/show_bug.cgi?id=2006611) was fixed by #766. However, the main root cause of the `ErrorCheckingOperatorCompatibility` error probably remains. This pull request will synchronize the CVO's Upgradeable status more often for `optr.minimumUpdateCheckInterval` after `ReleaseAccepted` reports false due to the reason `PreconditionChecks`.

The fixes mentioned above exist in releases as follows:
- Bug 1822752: pkg/cvo: Separate payload load from payload apply: `4.8` and newer
- Bug 2072389: Do not save desired update on load failures: `4.9` and newer

For those more curious:
The behavior of the problem:
Precondition checks are run on a new payload before initializing an upgrade (https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/cvo.go#L842-L847). These checks fall into two categories: EtcdBackup and Upgradeable.
The Upgradeable precondition check only checks the CVO's Upgradeable status (https://github.com/openshift/cluster-version-operator/blob/master/pkg/payload/precondition/clusterversion/upgradeable.go#L66-L74).
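A minimal sketch of that kind of check, assuming a standalone helper that inspects the ClusterVersion conditions directly; this is illustrative only, not the code at the linked location:

package clusterversion

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// upgradeableBlocked returns an error when the ClusterVersion carries an
// explicit Upgradeable==False condition, which is what makes the precondition
// check fail. Illustrative sketch only.
func upgradeableBlocked(cv *configv1.ClusterVersion) error {
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorUpgradeable && cond.Status == configv1.ConditionFalse {
			return fmt.Errorf("the cluster is not upgradeable: %s: %s", cond.Reason, cond.Message)
		}
	}
	return nil
}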
The CVO's Upgradeable status originally synchronizes only every `optr.minimumUpdateCheckInterval` (https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/upgradeable.go#L38-L50).

A problem can occur when `Upgradeable==false` is reported for the CVO's status during a synchronization before an upgrade. A precondition check registers the `Upgradeable==false` and halts the upgrade, and then regularly checks whether the `Upgradeable==false` still exists. However, the CVO's Upgradeable status synchronizes less often, which can force the user to wait when initializing an upgrade.

This is normally expected behavior. However, when a cluster operator reports `Upgradeable==false` just momentarily (for example, for 10 seconds at the start of the upgrade) and it gets registered by the CVO, so that the CVO's Upgradeable status gets set to false, then the upgrade can be halted for up to 3 minutes waiting for the next synchronization of the CVO's Upgradeable status. This makes it seem as if the CVO is slow when applying an upgrade.

Bug 2072348 - Upgrade Precondition "EtcdRecentBackup" failed because of "ControllerStarted"
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2072348 (marked as a duplicate of this bug):
Running an Etcd backup at the start of an upgrade could hold up the upgrade for up to three minutes (slow checking of the Etcd backup status). This was fixed by #683, which allows the precondition checks to run more frequently when initializing an upgrade (checking the Etcd backup status more frequently).
Bug 2006611 - CVO resolves the version takes a long time sometimes when upgrading via `--to-image`
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2006611 (this bug):
The problem can also arise from the CVO reporting `Upgradeable==false` with the reason `ErrorCheckingOperatorCompatibility` when a cluster operator reports `Upgradeable==false`, most notably the OLM operator. In the past, `status.desired.version` (when upgrading via `--to-image`) got the value `""` at the start of an upgrade. If this value was empty for a prolonged time (~30 seconds), the OLM operator reported `Upgradeable==false`. If this got picked up by the CVO, then the next synchronization of the CVO's Upgradeable would happen in 2--3 minutes, making it seem like the upgrade is slow.

This behavior was fixed by #766, which ensured that `status.desired.version` doesn't get the empty value and only gets the new version's value after the update is successfully loaded.

Conclusion:
Fortunately, these behaviors were fixed to an extent during development. However, the main cause of the issues mentioned above was slow syncing at the start of an upgrade when a precondition check fails, and that root cause remains, to my knowledge.