Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes #818

wking · 2022-08-09T22:27:14Z

@deads2k and @stbenjam identified an uneccessary delay:

9:42:00, CVO gives up on Kube API server ClusterOperator
9:42:47, Kube API server operator achieves 4.12
9:46:22, after a cool-off sleep, the CVO starts in on a new manifest graph-walk attempt
9:46:34, CVO notices that the Kube API server ClusterOperator is happy

The 3+ minute delay from 9:42:47 to 9:46:22 is not helpful, and we've probably had delays like this since my old e02d148 (#560), which landed in 4.6.

This commit introduces a "ClusterOperator bumped versions[name=operator]" trigger to break out of the cool-off sleep.

There's plenty of room to be more precise here. For example, you could currently have a versions[name=operator] bump during the sync loop that the CVO did notice, and that queued notification will break from the sleep and trigger a possible useless reconciliation round while we wait on some other resource. You could drain the notification queue before the sleep to avoid that, but you wouldn't want to drain new-work notifications, and I haven't done the work required to be able to split those apart.

I'm only looking at ClusterOperator at the moment, because of the many types the CVO manages, ClusterOperator is the one we most frequently wait on, as large cluster components take their time updating. It's possible but less likely that we'd want similar triggers for additional types in the future (Deployment, etc.), if/when those types develop more elaborate "is the in-cluster resource sufficient happy?" checks.

openshift-ci · 2022-08-09T22:27:30Z

@wking: This pull request references Bugzilla bug 2117033, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.12.0) matches configured target release for branch (4.12.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jiajliu

Details

In response to this:

Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2022-08-09T22:27:44Z

pkg/cvo/cvo.go

+func clusterOperatorInterfaceVersionOrDie(obj interface{}, name string) (*configv1.ClusterOperator, string) {
+	co, ok := obj.(*configv1.ClusterOperator)
+	if !ok {
+		panic(fmt.Sprintf("%v is %T, not a ClusterOperator", obj, obj))


This feels like a hack, but I wasn't able to find a cleaner way to get at the structured information I want. Improvements welcome :)

I am not sure if we need a panic here. If clusterOperatorEventHandler() can only be called on an operator why do we need to check if the obj is not an operator?

openshift-ci · 2022-08-09T22:36:21Z

@wking: This pull request references Bugzilla bug 2117033, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.12.0) matches configured target release for branch (4.12.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jiajliu

Details

In response to this:

Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…ns[name=operator] changes David and Stephen identified an uneccessary delay [1]: * 9:42:00, CVO gives up on Kube API server ClusterOperator [2] * 9:42:47, Kube API server operator achieves 4.12 [3] * 9:46:22, after a cool-off sleep, the CVO starts in on a new manifest graph-walk attempt [4] * 9:46:34, CVO notices that the Kube API server ClusterOperator is happy [5] The 3+ minute delay from 9:42:47 to 9:46:22 is not helpful, and we've probably had delays like this since my old e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-13, openshift#560), which landed in 4.6. This commit introduces a "ClusterOperator bumped versions[name=operator]" trigger to break out of the cool-off sleep. There's plenty of room to be more precise here. For example, you could currently have a versions[name=operator] bump during the sync loop that the CVO did notice, and that queued notification will break from the sleep and trigger a possible useless reconciliation round while we wait on some other resource. You could drain the notification queue before the sleep to avoid that, but you wouldn't want to drain new-work notifications, and I haven't done the work required to be able to split those apart. I'm only looking at ClusterOperator at the moment, because of the many types the CVO manages, ClusterOperator is the one we most frequently wait on, as large cluster components take their time updating. It's possible but less likely that we'd want similar triggers for additional types in the future (Deployment, etc.), if/when those types develop more elaborate "is the in-cluster resource sufficient happy?" checks. The panic-backed type casting in clusterOperatorInterfaceVersionOrDie also feel like a hack, but I wasn't able to find a cleaner way to get at the structured information I want. Improvements welcome :) [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2117033#c1 [2]: From Loki: E0808 09:42:00.022500 1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions [3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696/artifacts/e2e-gcp-sdn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-apiserver versions:' Aug 08 09:33:48.603 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220 Aug 08 09:42:47.917 I clusteroperator/kube-apiserver versions: operator 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220 [4]: From Loki: I0808 09:46:22.998344 1 sync_worker.go:850] apply: 4.12.0-0.ci-2022-08-07-192220 on generation 3 in state Updating at attempt 5 [5]: From Loki: I0808 09:46:34.556374 1 sync_worker.go:973] Done syncing for clusteroperator "kube-apiserver" (107 of 806)

wking · 2022-08-10T04:17:01Z

Update presubmit failed on the likely orthogonal events should not repeat pathologically pod/machine-api-controllers... Readiness probe error: Get "http://10.129.0.14:9442/healthz": dial tcp 10.129.0.14:9442: connect: connection refused. But dropping into Loki with:

{invoker="openshift-internal-ci/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1557150272407474176"} | unpack | namespace="openshift-cluster-version" |~ "clusteroperator \"kube-apiserver|in state|[Nn]otif"

turns up:

...
E0810 01:03:39.710537       1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 816): Cluster operator kube-apiserver is updating versions
I0810 01:07:09.150368       1 sync_worker.go:235] Notify the sync worker: Cluster operator kube-apiserver changed versions[name="operator"] from "4.12.0-0.ci.test-2022-08-09-234236-ci-op-dmywi2n7-initial" to "4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest"
I0810 01:07:09.150539       1 sync_worker.go:864] apply: 4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest on generation 3 in state Updating at attempt 5
...
I0810 01:07:16.712837       1 sync_worker.go:987] Done syncing for clusteroperator "kube-apiserver" (107 of 816)
...

Comparing with the test-suite monitor:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/818/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade/1557150272407474176/artifacts/e2e-agnostic-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-apiserver versions:'
Aug 10 00:58:05.747 I clusteroperator/kube-apiserver versions: raw-internal 4.12.0-0.ci.test-2022-08-09-234236-ci-op-dmywi2n7-initial -> 4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest
Aug 10 01:07:09.155 I clusteroperator/kube-apiserver versions: operator 4.12.0-0.ci.test-2022-08-09-234236-ci-op-dmywi2n7-initial -> 4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest
Aug 10 02:08:47.822 I clusteroperator/kube-apiserver versions: raw-internal 4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest -> 4.12.0-0.ci.test-2022-08-09-234236-ci-op-dmywi2n7-initial
Aug 10 02:17:51.529 I clusteroperator/kube-apiserver versions: operator 4.12.0-0.ci.test-2022-08-09-235304-ci-op-dmywi2n7-latest -> 4.12.0-0.ci.test-2022-08-09-234236-ci-op-dmywi2n7-initial

So the CVO tests are doing A->B->A rollbacks, and I'm looking at the A->B leg in Loki (which is the leg managed by the incoming CVO that has my patch), and the test-suite monitor sees the transition at 01:07:09.155, and the new sync round kicks off at 01:07:09.150539, and the CVO gets deep enough into the manifest graph to notice the kube-apiserver's completion at 01:07:16.712837. So down from possibly-minutes to seconds :tada:

LalatenduMohanty

/lgtm

openshift-ci · 2022-08-10T20:25:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2022-08-10T21:19:39Z

/retest-required

Remaining retests: 2 against base HEAD 02838c7 and 8 for PR HEAD 9222fa9 in total

openshift-ci-robot · 2022-08-11T01:20:23Z

/retest-required

Remaining retests: 1 against base HEAD 02838c7 and 7 for PR HEAD 9222fa9 in total

openshift-ci-robot · 2022-08-11T05:24:47Z

/retest-required

Remaining retests: 0 against base HEAD 02838c7 and 6 for PR HEAD 9222fa9 in total

wking · 2022-08-11T15:41:11Z

pods should successfully create sandboxes by adding pod to network is orthogonal.

/override ci/prow/e2e-agnostic-upgrade

openshift-ci · 2022-08-11T15:45:04Z

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

pods should successfully create sandboxes by adding pod to network is orthogonal.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-08-11T15:45:34Z

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci · 2022-08-11T15:58:01Z

@wking: All pull requests linked via external trackers have merged:

openshift/cluster-version-operator#818

Bugzilla bug 2117033 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking · 2022-08-18T20:33:44Z

/cherrypick release-4.11

openshift-cherrypick-robot · 2022-08-18T20:34:31Z

@wking: new pull request created: #826

Details

In response to this:

/cherrypick release-4.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…ication" formatting Cleaning up a log line I'd fumbled in 9222fa9 (pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes, 2022-08-09, openshift#818). Changes: * Remove redundant "so do not notify about", because "no need to inform about" covers that with more clarity (we're not informing, and even if we did, it would be useless, because there's already a pending notification in place). * Use "Infof" to get the %s formatted in, avoiding: ...no need to inform about: %sCluster operator... `

openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 9, 2022

openshift-ci bot requested a review from jiajliu August 9, 2022 22:27

wking commented Aug 9, 2022

View reviewed changes

openshift-ci bot requested review from LalatenduMohanty and jottofar August 9, 2022 22:27

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2022

wking force-pushed the ClusterOperator-version-trigger branch 2 times, most recently from 8c3faf4 to 7e0bcca Compare August 9, 2022 22:35

wking force-pushed the ClusterOperator-version-trigger branch 3 times, most recently from 2be972c to 482d1bf Compare August 9, 2022 23:02

wking force-pushed the ClusterOperator-version-trigger branch from 482d1bf to 9222fa9 Compare August 9, 2022 23:41

LalatenduMohanty approved these changes Aug 10, 2022

View reviewed changes

openshift-ci bot assigned LalatenduMohanty Aug 10, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2022

openshift-merge-robot merged commit c699d55 into openshift:master Aug 11, 2022

wking deleted the ClusterOperator-version-trigger branch August 11, 2022 15:59

openshift-cherrypick-robot mentioned this pull request Aug 18, 2022

OCPBUGS-306: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes #826

Merged

wking mentioned this pull request Feb 14, 2023

pkg/cvo/sync_worker: Fix "The sync worker already has a pending notification" formatting #898

Merged

Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes #818

Bug 2117033: pkg/cvo/sync_worker: Trigger new sync round on ClusterOperator versions[name=operator] changes #818

Uh oh!

Conversation

wking commented Aug 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Aug 9, 2022

Uh oh!

wking Aug 9, 2022

Choose a reason for hiding this comment

Uh oh!

LalatenduMohanty Aug 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Aug 9, 2022

Uh oh!

wking commented Aug 10, 2022

Uh oh!

LalatenduMohanty left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Aug 10, 2022

Uh oh!

openshift-ci-robot commented Aug 10, 2022

Uh oh!

openshift-ci-robot commented Aug 11, 2022

Uh oh!

openshift-ci-robot commented Aug 11, 2022

Uh oh!

wking commented Aug 11, 2022

Uh oh!

openshift-ci bot commented Aug 11, 2022

Uh oh!

openshift-ci bot commented Aug 11, 2022

Uh oh!

openshift-ci bot commented Aug 11, 2022

Uh oh!

wking commented Aug 18, 2022

Uh oh!

openshift-cherrypick-robot commented Aug 18, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

wking commented Aug 9, 2022 •

edited

Loading

LalatenduMohanty Aug 10, 2022 •

edited

Loading