Conversation

@wking
Member

@wking wking commented May 13, 2021

Like cc9292a (#400), but for ClusterOperator. In that commit message, I explained that we'd continue on past imperfect (e.g. Progressing=True) but still happy-enough Deployments and such, and only later block on the ClusterOperator. But there's really no need to poll the ClusterOperator while we're blocked on it; we can instead fail that sync loop, take the short cool-off break, and come in again with a new pass at the sync cycle.

This will help reduce delays like rhbz#1927168, where a good chunk of the ~5.5-minute delay was waiting for the network ClusterOperator to become happy. Had we instead given up on that sync cycle and started in again with a fresh one, we would likely have recreated the ServiceMonitors CRD more quickly.

The downside is that we will now complain about unavailable, degraded, and unleveled operators more quickly, without giving them time to become happy before complaining in ClusterVersion's status. This is mitigated by most of the returned errors being UpdateEffectNone, which we will render as "waiting on..." Progressing messages. The exceptions are mostly unavailable, which is a serious enough condition that I'm fine complaining about it aggressively, and degraded, which has the fail-after-interval guard keeping us from complaining about it too aggressively.

I'm also shifting the "does not declare expected versions" check ahead of the Get call. Its goal is still to ensure that we fail in CI before shipping an operator with such a manifest, and there's no point in firing off an API GET before failing on a guard that only needs local information.
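To make the new flow concrete, here's a minimal sketch of the single-shot shape, assuming stand-in types (clusterOperator, manifest, and the get callback are illustrative, not the CVO's actual API):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types for illustration only; the real CVO types
// (ClusterOperator, the manifest wrapper, the client) differ.
type clusterOperator struct {
	Available bool
	Versions  map[string]string
}

type manifest struct {
	Name     string
	Versions map[string]string
}

var errUnavailable = errors.New("cluster operator is unavailable")

// checkClusterOperator is a single-shot "is it alive now?" check: no
// wait-for polling. If the operator is not yet happy, the returned error
// fails this sync pass; the caller takes the short cool-off break and
// starts a fresh sync cycle instead of blocking here.
func checkClusterOperator(get func(name string) (*clusterOperator, error), m manifest) error {
	// The guard that only needs local information runs before the GET,
	// so a bad manifest fails in CI without touching the API server.
	if len(m.Versions) == 0 {
		return fmt.Errorf("manifest %q does not declare expected versions", m.Name)
	}
	co, err := get(m.Name)
	if err != nil {
		return err
	}
	if !co.Available {
		return errUnavailable // serious: surfaced aggressively in ClusterVersion status
	}
	for name, want := range m.Versions {
		if got := co.Versions[name]; got != want {
			// Unleveled: rendered as a "waiting on..." Progressing message.
			return fmt.Errorf("waiting on %s: versions[%s] is %q, want %q", m.Name, name, got, want)
		}
	}
	return nil
}

func main() {
	get := func(string) (*clusterOperator, error) {
		return &clusterOperator{Available: true, Versions: map[string]string{"operator": "4.8.0"}}, nil
	}
	fmt.Println(checkClusterOperator(get, manifest{Name: "network", Versions: map[string]string{"operator": "4.8.0"}}))
}
```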

@openshift-ci openshift-ci bot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label May 13, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 13, 2021

@wking: This pull request references Bugzilla bug 1927168, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1927168: pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label May 13, 2021
@openshift-ci openshift-ci bot requested review from smarterclayton and vrutkovs May 13, 2021 05:12
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2021
@wking
Member Author

wking commented May 13, 2021

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 13, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 13, 2021

@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the install-time-degraded-propagation branch 2 times, most recently from 65dd4cd to eba5b25 on May 13, 2021 at 14:47
pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?"

Like cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400), but for
ClusterOperator.  In that commit message, I'd explain that we'd
continue on past imperfect (e.g. Progressing=True) but still
happy-enough Deployments and such and later block on the
ClusterOperator.  But there's really no need to poll the
ClusterOperator while we're blocked on it; we can instead fail that
sync loop, take the short cool-off break, and come in again with a new
pass at the sync cycle.

This will help reduce delays like [1] where a good chunk of the ~5.5
minute delay was waiting for the network ClusterOperator to become
happy.  If instead we give up on that sync cycle and start in again
with a fresh sync cycle, we would have been more likely to recreate
the ServiceMonitors CRD more quickly.

The downside is that we will now complain about unavailable, degraded,
and unleveled operators more quickly, without giving them time to
become happy before complaining in ClusterVersion's status.  This is
mitigated by most of the returned errors being UpdateEffectNone, which
we will render as "waiting on..." Progressing messages.  The
exceptions are mostly unavailable, which is a serious enough condition
that I'm fine complaining about it aggressively, and degraded, which
has the fail-after-interval guard keeping us from complaining about it
too aggressively.

I'm also shifting the "does not declare expected versions" check
before the Get call.  Its goal is still to ensure that we fail in CI
before shipping an operator with such a manifest, and there's no point
in firing off an API GET before failing on a guard that only needs
local information.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927168#c2
@wking wking force-pushed the install-time-degraded-propagation branch from eba5b25 to e02d148 on May 13, 2021 at 17:13
@openshift-ci
Contributor

openshift-ci bot commented May 13, 2021

@wking: This pull request references Bugzilla bug 1927168, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.


In response to this:

Bug 1927168: pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jottofar
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@wking
Member Author

wking commented May 13, 2021

/run failure is rhbz#1959564.

/override ci/prow/e2e-agnostic

@openshift-ci
Contributor

openshift-ci bot commented May 13, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic


In response to this:

/run failure is rhbz#1959564.

/override ci/prow/e2e-agnostic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments

@wking
Member Author

wking commented May 14, 2021

This update job passed everything except cluster teardown, which is orthogonal to this PR. Other update presubmits also seem to have failed in unrelated-to-this-PR modes.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented May 14, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade


In response to this:

This update job passed everything except cluster teardown, which is orthogonal to this PR. Other update presubmits also seem to have failed in unrelated-to-this-PR modes.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit f0ef34b into openshift:master May 14, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 14, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1927168 has been moved to the MODIFIED state.


In response to this:

Bug 1927168: pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the install-time-degraded-propagation branch May 14, 2021 16:07
wking added a commit to wking/cluster-version-operator that referenced this pull request May 24, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.
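A hedged sketch of that marker pattern follows; updateError here is a hypothetical stand-in for the CVO's UpdateError type, not its real definition:

```go
package main

import (
	"errors"
	"fmt"
)

// updateError is a hypothetical stand-in for the CVO's UpdateError:
// it means "we reached the in-cluster object and it is not happy".
type updateError struct{ msg string }

func (e *updateError) Error() string { return e.msg }

// retryable reports whether an error deserves more in-loop retries.
// API-server hiccups and patch conflicts do; an updateError does not:
// fail fast, report the resource state, and start the next sync cycle.
func retryable(err error) bool {
	var ue *updateError
	return !errors.As(err, &ue)
}

func main() {
	fmt.Println(retryable(errors.New("connection refused")))       // true: keep retrying
	fmt.Println(retryable(&updateError{msg: "operator degraded"})) // false: fail fast
}
```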
wking added a commit to wking/cluster-version-operator that referenced this pull request May 25, 2021
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
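A minimal runnable illustration of that value-copy behavior; the st wrapper struct is made up for the example, while wait.Backoff and Step() are the real apimachinery API:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	st := struct{ Backoff wait.Backoff }{
		Backoff: wait.Backoff{Duration: time.Second, Factor: 2, Steps: 3},
	}

	// wait.Backoff has no pointer fields, so plain assignment copies it;
	// Step() below mutates only the copy's Steps and Duration.
	backoff := st.Backoff
	for i := 0; i < 3; i++ {
		fmt.Println(backoff.Step()) // 1s, 2s, 4s (no Jitter configured)
	}

	fmt.Println(st.Backoff.Steps) // still 3: the original is untouched
}
```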
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 9, 2022
…ns[name=operator] changes

David and Stephen identified an unnecessary delay [1]:

* 9:42:00, CVO gives up on Kube API server ClusterOperator [2]
* 9:42:47, Kube API server operator achieves 4.12 [3]
* 9:46:22, after a cool-off sleep, the CVO starts in on a new manifest graph-walk attempt [4]
* 9:46:34, CVO notices that the Kube API server ClusterOperator is happy [5]

The 3+ minute delay from 9:42:47 to 9:46:22 is not helpful, and we've
probably had delays like this since my old e02d148
(pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot
"is it alive now?", 2021-05-13, openshift#560), which landed in 4.6.

This commit introduces a "ClusterOperator bumped
versions[name=operator]" trigger to break out of the cool-off sleep.

There's plenty of room to be more precise here.  For example, you
could currently have a versions[name=operator] bump during the sync
loop that the CVO did notice, and that queued notification will break
from the sleep and trigger a possibly useless reconciliation round
while we wait on some other resource.  You could drain the
notification queue before the sleep to avoid that, but you wouldn't
want to drain new-work notifications, and I haven't done the work
required to be able to split those apart.

I'm only looking at ClusterOperator at the moment because, of the many
types the CVO manages, ClusterOperator is the one we most frequently
wait on, as large cluster components take their time updating.  It's
possible but less likely that we'd want similar triggers for
additional types in the future (Deployment, etc.), if/when those types
develop more elaborate "is the in-cluster resource sufficiently happy?"
checks.

The panic-backed type casting in clusterOperatorInterfaceVersionOrDie
also feels like a hack, but I wasn't able to find a cleaner way to get
at the structured information I want.  Improvements welcome :)

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2117033#c1
[2]: From Loki: E0808 09:42:00.022500       1 task.go:117] error running apply for clusteroperator "kube-apiserver" (107 of 806): Cluster operator kube-apiserver is updating versions
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1556564581915037696/artifacts/e2e-gcp-sdn-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/kube-apiserver versions:'
     Aug 08 09:33:48.603 I clusteroperator/kube-apiserver versions: raw-internal 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
     Aug 08 09:42:47.917 I clusteroperator/kube-apiserver versions: operator 4.11.0-rc.7 -> 4.12.0-0.ci-2022-08-07-192220
[4]: From Loki: I0808 09:46:22.998344       1 sync_worker.go:850] apply: 4.12.0-0.ci-2022-08-07-192220 on generation 3 in state Updating at attempt 5
[5]: From Loki: I0808 09:46:34.556374       1 sync_worker.go:973] Done syncing for clusteroperator "kube-apiserver" (107 of 806)
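A minimal sketch of breaking out of the cool-off sleep on such a trigger; the function and channel names are illustrative, not the CVO's actual sync-worker wiring:

```go
package main

import (
	"context"
	"time"
)

// waitCooloff sleeps for the cool-off interval, but returns early when a
// ClusterOperator versions[name=operator] bump is signalled on bumped,
// so the next manifest graph-walk attempt can start immediately.
func waitCooloff(ctx context.Context, cooloff time.Duration, bumped <-chan struct{}) error {
	t := time.NewTimer(cooloff)
	defer t.Stop()
	select {
	case <-t.C:
		return nil // cool-off expired normally
	case <-bumped:
		return nil // skip the rest of the sleep; start the next sync pass now
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	bumped := make(chan struct{}, 1)
	bumped <- struct{}{} // simulate a versions[name=operator] bump
	_ = waitCooloff(context.Background(), 3*time.Minute, bumped) // returns immediately
}
```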