OCPBUGS-5505: Set upgradeability check throttling period to 2m #882

petr-muller · 2023-01-09T14:54:14Z

Previously, the throttling reused the minimumUpdateCheckInterval value, which is derived from the full CVO minimum sync period. This value is set between 2m and 4m at CVO startup. This period is unnecessarily long and bad for UX: things happen with a delay, and our own test case expects upgradeability to be propagated in 3 minutes at worst.

Hardcode the throttling to 2m to prevent flapping on flurries but allow changes to propagate deterministically faster. We will still get a bit of non-determinism from sync periods and requeueing, so this change should not cause any periodic API-hammering.

I'd like to backport this change to 4.11, where it causes CI flakes in the [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged test.
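
To make the timing argument in the description concrete, here is a minimal Go sketch of a throttle check. The `throttled` helper and the constant names are illustrative assumptions, not the CVO's actual code; only the 2m to 4m range, the fixed 2m value, and the 3-minute test expectation come from the description.

```go
package main

import (
	"fmt"
	"time"
)

// Durations taken from the description: the old period could be as long as
// 4m, the new one is fixed at 2m, and the test allows 3 minutes at worst.
const (
	oldWorstCasePeriod = 4 * time.Minute
	fixedPeriod        = 2 * time.Minute
	testDeadline       = 3 * time.Minute
)

// throttled reports whether a check should be skipped because the previous
// one ran less than period ago. Hypothetical helper, not the CVO code.
func throttled(sinceLastCheck, period time.Duration) bool {
	return sinceLastCheck < period
}

func main() {
	fmt.Println("4m period, 3m after last check, throttled:", throttled(testDeadline, oldWorstCasePeriod))
	fmt.Println("2m period, 3m after last check, throttled:", throttled(testDeadline, fixedPeriod))
}
```

With a 4m period the check is still throttled at the 3-minute mark, so the test can time out; with the fixed 2m period it is not.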

openshift-ci-robot · 2023-01-09T14:54:22Z

@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.13.0) matches configured target version for branch (4.13.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

petr-muller · 2023-01-09T15:13:11Z

/cc @wking

petr-muller · 2023-01-09T18:34:58Z

/test e2e-agnostic-upgrade-out-of-change
^^^ failed to install

/override e2e-agnostic-upgrade-into-change
^^^ passed everything except the following, which is unrelated

: [sig-api-machinery] disruption/cache-kube-api connection/reused should be available throughout the test
{
  cache-kube-api-reused-connections was unreachable during disruption testing for at least 2s of 1h35m24s (maxAllowed=1s):  
}

openshift-ci · 2023-01-09T18:35:17Z

@petr-muller: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

e2e-agnostic-upgrade-into-change

Only the following failed contexts/checkruns were expected:

ci/prow/e2e-agnostic
ci/prow/e2e-agnostic-operator
ci/prow/e2e-agnostic-upgrade-into-change
ci/prow/e2e-agnostic-upgrade-out-of-change
ci/prow/gofmt
ci/prow/images
ci/prow/lint
ci/prow/unit
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-into-change
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-out-of-change
pull-ci-openshift-cluster-version-operator-master-gofmt
pull-ci-openshift-cluster-version-operator-master-images
pull-ci-openshift-cluster-version-operator-master-lint
pull-ci-openshift-cluster-version-operator-master-unit

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

petr-muller · 2023-01-09T18:36:20Z

/override ci/prow/e2e-agnostic-upgrade-into-change

openshift-ci · 2023-01-09T18:37:00Z

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change

petr-muller · 2023-01-10T13:22:25Z

/retest

wking · 2023-01-10T18:26:20Z

pkg/cvo/upgradeable.go

-// the minimumUpdateCheckInterval since the last synchronization or the precondition
+// the checkThrottlePeriod since the last synchronization or the precondition
 // checks on the payload are failing for less than minimumUpdateCheckInterval, and it has
 // been more than the minimumUpgradeableCheckInterval since the last synchronization.


Do these two minimumUpdateCheckInterval refs need to be updated too? I see shouldSyncUpgradeableDueToPreconditionChecks later in this file still using minimumUpdateCheckInterval, but if minimumUpdateCheckInterval is going to be scoped to upstream Cincinnati-side stuff, I'd expect us to be using an Upgradeable-specific knob for Upgradeable-side stuff. Maybe checkThrottlePeriod should be specific to Upgradeable throttling, and we want a third variable for whatever shouldSyncUpgradeableDueToPreconditionChecks is doing? Or maybe they can both share checkThrottlePeriod? Or something?

@Davoska writes great PR descriptions so #808 gives some context about why the code looks like this. IIUC then the intent is "throttle less aggressively under certain conditions while we'd be throttled otherwise":

The CVO's Upgradeable status originally synchronizes only every optr.minimumUpdateCheckInterval https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/upgradeable.go#L38-L50.

A problem can occur when a Upgradeable==false is reported for the CVO's status during a synchronization before an upgrade. A precondition check registers the Upgradeable==false and halts the upgrade, and then regularly checks if the Upgradeable==false still exists. However, the CVO's Upgradeable status synchronizes less often which can result in necessary waiting time for the user when initializing an upgrade.

It uses !hasPassedDurationSinceTime(cond.LastTransitionTime.Time, optr.minimumUpdateCheckInterval) to implement the "check more often after preconditions were changed" period. It feels like a third name would really be cleanest here, I don't think there's a reason why it needs to be coupled with minimumUpdateCheckInterval.

I also think there are some refactoring opportunities here - seems like we can have logic that returns the throttling period suitable for the current condition and pass that to RecentlyChanged.

/cc @Davoska
I'd appreciate your insights on this, I'm touching your code
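
The condition spelled out in the doc comment under review can be sketched as a single predicate. This is a minimal Go illustration under stated assumptions: the function name `shouldSync`, its parameters, and the sample values are hypothetical, and the real code works on condition timestamps rather than precomputed durations.

```go
package main

import (
	"fmt"
	"time"
)

// shouldSync mirrors the doc comment quoted in the review: sync when the last
// synchronization is older than updateInterval, or when precondition checks
// have been failing for less than updateInterval and the last synchronization
// is older than upgradeableInterval. All names here are illustrative.
func shouldSync(sinceLastSync time.Duration, failing bool, failingFor time.Duration,
	updateInterval, upgradeableInterval time.Duration) bool {
	if sinceLastSync > updateInterval {
		return true
	}
	recentlyFailed := failing && failingFor < updateInterval
	return recentlyFailed && sinceLastSync > upgradeableInterval
}

// Sample values for the demonstration; assumptions, not CVO defaults.
const (
	sampleUpdateInterval      = 3 * time.Minute
	sampleUpgradeableInterval = 15 * time.Second
	sampleSinceLastSync       = 30 * time.Second
	sampleFailingFor          = time.Minute
)

func main() {
	// 30s after the last sync with nothing failing: still throttled.
	fmt.Println(shouldSync(sampleSinceLastSync, false, 0,
		sampleUpdateInterval, sampleUpgradeableInterval))
	// 30s after the last sync, preconditions failing for 1m: sync again.
	fmt.Println(shouldSync(sampleSinceLastSync, true, sampleFailingFor,
		sampleUpdateInterval, sampleUpgradeableInterval))
}
```

Note that `updateInterval` appears in both branches of the sketch, which is the coupling the review questions: the "recently failed" window and the base throttle share one knob.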

petr-muller · 2023-01-11T21:01:57Z

@wking PTAL. Encouraged by your "some day I'd like to break pkg/cvo up into more, smaller sub-packages" comment, I went for a slightly larger refactor of the related code and decoupled the logic from cvo.go better. I extracted the code that drives the "is our Upgradeable recent enough?" question to a dedicated method that is now tested and bundled the upgradeable-related intervals into a helper struct.

petr-muller · 2023-01-12T15:14:22Z

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
{  fail [github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:142]: Jan 11 22:21:26.078: too many pods were waiting: ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-mwnnm,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-vpmpl,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-z574b Ginkgo exit error 1: exit with code 1}


disruption_tests: [sig-network-edge] Verify DNS availability during and after upgrade success
{Jan 11 22:21:26.078: too many pods were waiting: ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-mwnnm,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-vpmpl,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-z574b

Looks unrelated
/retest

petr-muller · 2023-01-13T12:18:20Z

openshift-api-reused-connections was unreachable during disruption testing for at least 4s of 1h39m12s (maxAllowed=2s):
oauth-api-reused-connections was unreachable during disruption testing for at least 4s of 1h39m12s (maxAllowed=2s):

Unrelated
/retest

petr-muller · 2023-01-13T14:35:43Z

/override ci/prow/e2e-agnostic-upgrade-into-change
3rd attempt failed to install, previous two attempts did not indicate a problem with CVO

openshift-ci · 2023-01-13T14:36:00Z

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change

LalatenduMohanty · 2023-01-13T16:43:48Z

pkg/cvo/upgradeable.go

 const (
 	adminAckGateFmt             string = "^ack-[4-5][.]([0-9]{1,})-[^-]"
 	upgradeableAdminAckRequired        = configv1.ClusterStatusConditionType("UpgradeableAdminAckRequired")
+	checkThrottlePeriod                = time.Minute


What is the downside of reducing the checkThrottlePeriod further?

Good question, I'm not entirely sure. Abhinav's message from 0452814 that introduced the throttling does not help too much about why exactly minUpdateCheckInterval was chosen:

Adds another long running sync like the availableUpdates that updates the operator with upgradeable conditions every minUpdateChekInterval.

minUpdateCheckInterval itself happened in this beast #45 and there's too much content there to be useful, the commit message says just:

Avoid reconciling too often (exit before applying) when no spec change recorded

I'm not sure about how expensive the upgradeability checks are - that's one of the two reasons I can imagine throttling is useful (the other is suppressing user-observable flapping in busy periods).

But the refactoring on this PR should make experimenting with this easier ;)

Previously, the throttling reused the `minimumUpdateCheckInterval` value, which is derived from the full CVO minimum sync period. This value is set between 2m and 4m at CVO startup. This period is unnecessarily long and bad for UX: things happen with a delay, and our own test case expects upgradeability to be propagated in 3 minutes at worst. Hardcode the throttling to 2m (the lower bound of the previous behavior) to prevent flapping on flurries but allow changes to propagate deterministically faster. We will still get a bit of non-determinism from sync periods and requeueing, so this change should not cause any periodic API-hammering.

Refactor the code that handles throttling upgradeability checks. Create a new method that computes the duration for which the existing `Upgradeable` status is considered recent enough to not be synced, and simply pass this duration to the `RecentlyChanged` method. The new method is now unit tested, too. Upgradeable-related intervals are now decoupled from unrelated sync intervals and are grouped in a new struct.
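
The shape of the refactor described in this commit message can be sketched as follows. This is a hedged Go illustration, not the merged code: the struct, field, and method names are hypothetical, `recentlyChanged` only stands in for the `RecentlyChanged` method named above, and every value except the 2m base period is a placeholder.

```go
package main

import (
	"fmt"
	"time"
)

// checkIntervals groups the Upgradeable-related intervals so that they no
// longer depend on unrelated sync settings. Field names are hypothetical.
type checkIntervals struct {
	base            time.Duration // normal throttle period (2m per the commit message)
	onRecentFailure time.Duration // shorter period after a recent precondition failure (placeholder)
	failureWindow   time.Duration // how long a failure counts as recent (placeholder)
}

// sampleIntervals returns illustrative values; only base comes from the source.
func sampleIntervals() checkIntervals {
	return checkIntervals{
		base:            2 * time.Minute,
		onRecentFailure: 15 * time.Second,
		failureWindow:   2 * time.Minute,
	}
}

// throttlePeriod returns how long an existing Upgradeable status is
// considered recent enough to skip a new sync.
func (c checkIntervals) throttlePeriod(preconditionsFailing bool, failingFor time.Duration) time.Duration {
	if preconditionsFailing && failingFor < c.failureWindow {
		return c.onRecentFailure
	}
	return c.base
}

// upgradeableStatus is a stand-in for the cached status; only its timestamp matters here.
type upgradeableStatus struct {
	at time.Time
}

// recentlyChanged reports whether the status is younger than period.
func (u upgradeableStatus) recentlyChanged(now time.Time, period time.Duration) bool {
	return now.Sub(u.at) < period
}

func main() {
	intervals := sampleIntervals()
	now := time.Date(2023, 1, 13, 18, 0, 0, 0, time.UTC)
	status := upgradeableStatus{at: now.Add(-time.Minute)} // last synced 1m ago

	normal := intervals.throttlePeriod(false, 0)
	failing := intervals.throttlePeriod(true, 30*time.Second)
	fmt.Println(normal, status.recentlyChanged(now, normal))   // 2m0s true
	fmt.Println(failing, status.recentlyChanged(now, failing)) // 15s false
}
```

Because the period is computed by one small method with plain inputs, it can be covered by a table-driven unit test without constructing a full operator.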

openshift-ci-robot · 2023-01-13T18:10:06Z

@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.13.0) matches configured target version for branch (4.13.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @evakhoni

The bug has been updated to refer to the pull request using the external bug tracker.

wking

/lgtm

openshift-ci · 2023-01-13T18:16:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [petr-muller,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-01-13T20:50:47Z

/retest-required

Remaining retests: 0 against base HEAD a849b95 and 2 for PR HEAD cc94c95 in total

petr-muller · 2023-01-15T22:27:32Z

/retest

petr-muller · 2023-01-16T08:33:51Z

/retest

openshift-ci · 2023-01-16T11:38:52Z

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.

openshift-ci-robot · 2023-01-16T11:42:30Z

@petr-muller: All pull requests linked via external trackers have merged:

openshift/cluster-version-operator#882

Jira Issue OCPBUGS-5505 has been moved to the MODIFIED state.

petr-muller · 2023-01-16T14:20:03Z

/jira cherry-pick release-4.12

petr-muller · 2023-01-16T14:23:25Z

/cherry-pick release-4.12

openshift-cherrypick-robot · 2023-01-16T14:24:10Z

@petr-muller: new pull request created: #884

petr-muller · 2023-01-16T14:29:07Z

/cherry-pick release-4.11

openshift-cherrypick-robot · 2023-01-16T14:29:52Z

@petr-muller: #882 failed to apply on top of branch "release-4.11":

Applying: OCPBUGS-5505: Set upgradeability check throttling period to 2m
Using index info to reconstruct a base tree...
M	pkg/cvo/upgradeable.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/upgradeable.go
CONFLICT (content): Merge conflict in pkg/cvo/upgradeable.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-5505: Set upgradeability check throttling period to 2m
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

openshift-ci bot requested review from LalatenduMohanty, jiajliu and jottofar January 9, 2023 14:54

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 9, 2023

openshift-ci bot requested a review from wking January 9, 2023 15:13

wking reviewed Jan 10, 2023

openshift-ci bot requested review from DavidHurta and removed request for jottofar January 11, 2023 16:06

petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch from 1613831 to cf47bce Compare January 11, 2023 21:05

LalatenduMohanty reviewed Jan 13, 2023

petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch 2 times, most recently from 3b8735d to 8d50fb3 Compare January 13, 2023 18:04

petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch from 8d50fb3 to cc94c95 Compare January 13, 2023 18:05

petr-muller changed the title ~~OCPBUGS-5505: Set upgradeability check throttling period to 1m~~ OCPBUGS-5505: Set upgradeability check throttling period to 2m Jan 13, 2023

openshift-ci bot requested a review from evakhoni January 13, 2023 18:10

wking approved these changes Jan 13, 2023

openshift-ci bot assigned wking Jan 13, 2023

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Jan 13, 2023

openshift-merge-robot merged commit 3f3204c into openshift:master Jan 16, 2023

openshift-cherrypick-robot mentioned this pull request Jan 16, 2023

[release-4.12] OCPBUGS-5879: Set upgradeability check throttling period to 2m #884

Merged

petr-muller mentioned this pull request Jan 16, 2023

[release-4.11] OCPBUGS-5882: Set upgradeability check throttling period to 2m #885

Merged

petr-muller deleted the ocpbugs-5505-determinize-upgradeability-check-throttle branch January 16, 2023 15:07

evakhoni reviewed Jan 17, 2023
