@petr-muller
Member

@petr-muller petr-muller commented Jan 9, 2023

Previously, the throttling reused the minimumUpdateCheckInterval value, which is derived from the full CVO minimum sync period. This value is set between 2m and 4m at CVO startup. This period is unnecessarily long and bad for UX: things happen with a delay, and our own test case expects upgradeability to be propagated in 3 minutes at worst.

Hardcode the throttling to 2m to prevent flapping on flurries but allow changes to propagate deterministically faster. We will still get a bit of non-determinism from sync periods and requeueing, so this change should not cause any periodic API-hammering.

I'd like to backport this change to 4.11 where it causes CI flakes in [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged test.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 9, 2023
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 9, 2023
@petr-muller
Member Author

/cc @wking

@openshift-ci openshift-ci bot requested a review from wking January 9, 2023 15:13
@petr-muller
Member Author

/test e2e-agnostic-upgrade-out-of-change
^^^ failed to install

/override e2e-agnostic-upgrade-into-change
^^^ passed everything except the following, which is unrelated

: [sig-api-machinery] disruption/cache-kube-api connection/reused should be available throughout the test
{
  cache-kube-api-reused-connections was unreachable during disruption testing for at least 2s of 1h35m24s (maxAllowed=1s):  
}

@openshift-ci
Contributor

openshift-ci bot commented Jan 9, 2023

@petr-muller: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • e2e-agnostic-upgrade-into-change

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-agnostic
  • ci/prow/e2e-agnostic-operator
  • ci/prow/e2e-agnostic-upgrade-into-change
  • ci/prow/e2e-agnostic-upgrade-out-of-change
  • ci/prow/gofmt
  • ci/prow/images
  • ci/prow/lint
  • ci/prow/unit
  • pull-ci-openshift-cluster-version-operator-master-e2e-agnostic
  • pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator
  • pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-into-change
  • pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade-out-of-change
  • pull-ci-openshift-cluster-version-operator-master-gofmt
  • pull-ci-openshift-cluster-version-operator-master-images
  • pull-ci-openshift-cluster-version-operator-master-lint
  • pull-ci-openshift-cluster-version-operator-master-unit

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.


@petr-muller
Member Author

/override ci/prow/e2e-agnostic-upgrade-into-change

@openshift-ci
Contributor

openshift-ci bot commented Jan 9, 2023

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change


@petr-muller
Member Author

/retest

- // the minimumUpdateCheckInterval since the last synchronization or the precondition
+ // the checkThrottlePeriod since the last synchronization or the precondition
  // checks on the payload are failing for less than minimumUpdateCheckInterval, and it has
  // been more than the minimumUpgradeableCheckInterval since the last synchronization.
Member

Do these two minimumUpdateCheckInterval refs need to be updated too? I see shouldSyncUpgradeableDueToPreconditionChecks later in this file still using minimumUpdateCheckInterval, but if minimumUpdateCheckInterval is going to be scoped to upstream Cincinnati-side stuff, I'd expect us to be using an Upgradeable-specific knob for Upgradeable-side stuff. Maybe checkThrottlePeriod should be specific to Upgradeable throttling, and we want a third variable for whatever shouldSyncUpgradeableDueToPreconditionChecks is doing? Or maybe they can both share checkThrottlePeriod? Or something?

Member Author

@Davoska writes great PR descriptions so #808 gives some context about why the code looks like this. IIUC then the intent is "throttle less aggressively under certain conditions while we'd be throttled otherwise":

The CVO's Upgradeable status originally synchronizes only every optr.minimumUpdateCheckInterval https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/upgradeable.go#L38-L50.

A problem can occur when an Upgradeable==false is reported for the CVO's status during a synchronization before an upgrade. A precondition check registers the Upgradeable==false and halts the upgrade, and then regularly checks if the Upgradeable==false still exists. However, the CVO's Upgradeable status synchronizes less often, which can result in necessary waiting time for the user when initializing an upgrade.

It uses !hasPassedDurationSinceTime(cond.LastTransitionTime.Time, optr.minimumUpdateCheckInterval) to implement the "check more often after preconditions were changed" period. It feels like a third name would really be cleanest here; I don't think there's a reason why it needs to be coupled with minimumUpdateCheckInterval.

I also think there are some refactoring opportunities here - seems like we can have logic that returns the throttling period suitable for the current condition and pass that to RecentlyChanged.

/cc @Davoska
I'd appreciate your insights on this, I'm touching your code

@openshift-ci openshift-ci bot requested review from DavidHurta and removed request for jottofar January 11, 2023 16:06
@petr-muller
Member Author

@wking PTAL. Encouraged by your "some day I'd like to break pkg/cvo up into more, smaller sub-packages" comment, I went for a slightly larger refactor of the related code and decoupled the logic from cvo.go better. I extracted the code that drives the "is our Upgradeable recent enough?" question to a dedicated method that is now tested and bundled the upgradeable-related intervals into a helper struct.

@petr-muller petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch from 1613831 to cf47bce Compare January 11, 2023 21:05
@petr-muller
Member Author

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
{  fail [github.com/openshift/origin/test/e2e/upgrade/dns/dns.go:142]: Jan 11 22:21:26.078: too many pods were waiting: ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-mwnnm,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-vpmpl,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-z574b Ginkgo exit error 1: exit with code 1}


disruption_tests: [sig-network-edge] Verify DNS availability during and after upgrade success
{Jan 11 22:21:26.078: too many pods were waiting: ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-mwnnm,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-vpmpl,ns/e2e-check-for-dns-availability-6290 pod/dns-test-96fbb3e7-a67f-4371-9fad-6fcfca906eeb-z574b

Looks unrelated
/retest

@petr-muller
Member Author

openshift-api-reused-connections was unreachable during disruption testing for at least 4s of 1h39m12s (maxAllowed=2s):
oauth-api-reused-connections was unreachable during disruption testing for at least 4s of 1h39m12s (maxAllowed=2s):

Unrelated
/retest

@petr-muller
Member Author

/override ci/prow/e2e-agnostic-upgrade-into-change
3rd attempt failed to install, previous two attempts did not indicate a problem with CVO

@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2023

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change


const (
	adminAckGateFmt string = "^ack-[4-5][.]([0-9]{1,})-[^-]"
	upgradeableAdminAckRequired = configv1.ClusterStatusConditionType("UpgradeableAdminAckRequired")
	checkThrottlePeriod = time.Minute
Member

What is the downside of reducing the checkThrottlePeriod further?

Member Author

Good question, I'm not entirely sure. Abhinav's message from 0452814 that introduced the throttling does not say much about why exactly minUpdateCheckInterval was chosen:

Adds another long running sync like the availableUpdates that updates the operator with upgradeable conditions every minUpdateChekInterval.

minUpdateCheckInterval itself happened in this beast #45 and there's too much content there to be useful, the commit message says just:

  • Avoid reconciling too often (exit before applying) when no spec change recorded

I'm not sure about how expensive the upgradeability checks are - that's one of the two reasons I can imagine throttling is useful (the other is suppressing user-observable flapping in busy periods).

Member Author

But the refactoring on this PR should make experimenting with this easier ;)

Previously, the throttling reused the `minimumUpdateCheckInterval` value
which is derived from the full CVO minimum sync period. This value is
set between 2m and 4m at CVO startup. This period is unnecessarily long
and bad for UX: things happen with a delay, and our own test case expects
upgradeability to be propagated in 3 minutes at worst.

Hardcode the throttling to 2m (lower bound of previous behavior) to
prevent flapping on flurries but allow changes to propagate
deterministically faster. We will still get a bit of non-determinism
from sync periods and requeueing, so this change should not cause any
periodic API-hammering.
@petr-muller petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch 2 times, most recently from 3b8735d to 8d50fb3 Compare January 13, 2023 18:04
Refactor the code that handles throttling upgradeability checks. Create
a new method that computes the duration for which the existing
`Upgradeable` status is considered recent enough to not be synced, and
simply pass this duration to the `RecentlyChanged` method. The new
method is now unit tested, too. Upgradeable-related intervals are now
decoupled from unrelated sync intervals and are grouped in a new struct.
@petr-muller petr-muller force-pushed the ocpbugs-5505-determinize-upgradeability-check-throttle branch from 8d50fb3 to cc94c95 Compare January 13, 2023 18:05
@petr-muller petr-muller changed the title OCPBUGS-5505: Set upgradeability check throttling period to 1m OCPBUGS-5505: Set upgradeability check throttling period to 2m Jan 13, 2023
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @evakhoni

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci openshift-ci bot requested a review from evakhoni January 13, 2023 18:10
Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking


@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD a849b95 and 2 for PR HEAD cc94c95 in total

@petr-muller
Member Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Jan 16, 2023

@petr-muller: all tests passed!


@openshift-merge-robot openshift-merge-robot merged commit 3f3204c into openshift:master Jan 16, 2023
@openshift-ci-robot
Contributor

@petr-muller: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-5505 has been moved to the MODIFIED state.


@petr-muller
Member Author

/jira cherry-pick release-4.12

@petr-muller
Member Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@petr-muller: new pull request created: #884


@petr-muller
Member Author

/cherry-pick release-4.11

@openshift-cherrypick-robot

@petr-muller: #882 failed to apply on top of branch "release-4.11":

Applying: OCPBUGS-5505: Set upgradeability check throttling period to 2m
Using index info to reconstruct a base tree...
M	pkg/cvo/upgradeable.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/cvo/upgradeable.go
CONFLICT (content): Merge conflict in pkg/cvo/upgradeable.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-5505: Set upgradeability check throttling period to 2m
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@petr-muller petr-muller deleted the ocpbugs-5505-determinize-upgradeability-check-throttle branch January 16, 2023 15:07