-
Notifications
You must be signed in to change notification settings - Fork 213
OCPBUGS-5505: Set upgradeability check throttling period to 2m #882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-5505: Set upgradeability check throttling period to 2m #882
Conversation
|
@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cc @wking |
|
/test e2e-agnostic-upgrade-out-of-change /override e2e-agnostic-upgrade-into-change |
|
@petr-muller: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/override ci/prow/e2e-agnostic-upgrade-into-change |
|
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/retest |
pkg/cvo/upgradeable.go
Outdated
| // the minimumUpdateCheckInterval since the last synchronization or the precondition | ||
| // the checkThrottlePeriod since the last synchronization or the precondition | ||
| // checks on the payload are failing for less than minimumUpdateCheckInterval, and it has | ||
| // been more than the minimumUpgradeableCheckInterval since the last synchronization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these two minimumUpdateCheckInterval refs need to be updated too? I see shouldSyncUpgradeableDueToPreconditionChecks later in this file still using minimumUpdateCheckInterval, but if minimumUpdateCheckInterval is going to be scoped to upstream Cincinnati-side stuff, I'd expect us to be using an Upgradeable-specific knob for Upgradeable-side stuff. Maybe checkThrottlePeriod should be specific to Upgradeable throttling, and we want a third variable for whatever shouldSyncUpgradeableDueToPreconditionChecks is doing? Or maybe they can both share checkThrottlePeriod? Or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Davoska writes great PR descriptions so #808 gives some context about why the code looks like this. IIUC then the intent is "throttle less aggressively under certain conditions while we'd be throttled otherwise":
The CVO's Upgradeable status originally synchronizes only every optr.minimumUpdateCheckInterval https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/upgradeable.go#L38-L50.
A problem can occur when a Upgradeable==false is reported for the CVO's status during a synchronization before an upgrade. A precondition check registers the Upgradeable==false and halts the upgrade, and then regularly checks if the Upgradeable==false still exists. However, the CVO's Upgradeable status synchronizes less often which can result in necessary waiting time for the user when initializing an upgrade.
It uses !hasPassedDurationSinceTime(cond.LastTransitionTime.Time, optr.minimumUpdateCheckInterval) to implement the "check more often after preconditions were changed" period. It feels like a third name would really be cleanest here, I don't think there's a reason why it needs to be coupled with minimumUpdateCheckInterval.
I also think there are some refactoring opportunities here - seems like we can have logic that returns the throttling period suitable for the current condition and pass that to RecentlyChanged.
/cc @Davoska
I'd appreciate your insights on this, I'm touching your code
|
@wking PTAL. Encouraged by your "some day I'd like to break pkg/cvo up into more, smaller sub-packages" comment, I went for a slightly larger refactor of the related code and decoupled the logic from |
1613831 to
cf47bce
Compare
Looks unrelated |
Unrelated |
|
/override ci/prow/e2e-agnostic-upgrade-into-change |
|
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-upgrade-into-change DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
pkg/cvo/upgradeable.go
Outdated
| const ( | ||
| adminAckGateFmt string = "^ack-[4-5][.]([0-9]{1,})-[^-]" | ||
| upgradeableAdminAckRequired = configv1.ClusterStatusConditionType("UpgradeableAdminAckRequired") | ||
| checkThrottlePeriod = time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the downside of reducing the checkThrottlePeriod further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I'm not entirely sure. Abhinav's message from 0452814 that introduced the throttling does not help too much about why exactly minUpdateCheckInterval was chosen:
Adds another long running sync like the availableUpdates that updates the operator with upgradeable conditions every minUpdateChekInterval.
minUpdateCheckInterval itself happened in this beast #45 and there's too much content there to be useful, the commit message says just:
- Avoid reconciling too often (exit before applying) when no spec change recorded
I'm not sure about how expensive the upgradeability checks are - that's one of the two reasons I can imagine throttling is useful (the other is suppressing user-observable flapping in busy periods).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the refactoring on this PR should make experimenting with this easier ;)
Previously, the throttling reused the `minimumUpdateCheckInterval` value which is derived from the full CVO minimum sync period. This value is set between 2m and 4m at CVO startup. This period is unecessarily long and bad for UX, things happen with a delay and our own testcase expects upgradeability to be propagated in 3 minutes at worst. Hardcode the throttling to 2m (lower bound of previous behavior) to prevent flapping on flurries but allow changes to propagate deterministically faster. We will still get a bit of non-determinisim from sync periods and requeueing, so this change should not cause any periodic API-hammering.
3b8735d to
8d50fb3
Compare
Refactor the code that handles throttling upgradeability checks. Create a new method that computes the duration for which the existing `Upgradeable` status is considered recent enough to not be synced, and simply pass this duration to the `RecentlyChanged` method. The new method is now unit tested, too. Upgradeable-related intervals are now uncoupled to unrelated sync intervals and are grouped in a new struct.
8d50fb3 to
cc94c95
Compare
|
@petr-muller: This pull request references Jira Issue OCPBUGS-5505, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
wking
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: petr-muller, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
1 similar comment
|
/retest |
|
@petr-muller: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@petr-muller: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-5505 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/jira cherry-pick release-4.12 |
|
/cherry-pick release-4.12 |
|
@petr-muller: new pull request created: #884 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cherry-pick release-4.11 |
|
@petr-muller: #882 failed to apply on top of branch "release-4.11": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Previously, the throttling reused the
minimumUpdateCheckIntervalvalue which is derived from the full CVO minimum sync period. This value is set between 2m and 4m at CVO startup. This period is unecessarily long and bad for UX, things happen with a delay and our own testcase expects upgradeability to be propagated in 3 minutes at worst.Hardcode the throttling to 2m to prevent flapping on flurries but allow changes to propagate deterministically faster. We will still get a bit of non-determinisim from sync periods and requeueing, so this change should not cause any periodic API-hammering.
I'd like to backport this change to 4.11 where it causes CI flakes in
[bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledgedtest.