Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented May 24, 2021

Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (#560) and the older cc9292a (#400). This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry".

The install-time backoff is from fee2d06 (#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots.

@openshift-ci openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 24, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 24, 2021

@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

Details

In response to this:

Bug 1927168: pkg/payload/task: Fail fast for UpdateError

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from jottofar and vrutkovs May 24, 2021 23:44
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2021
@wking wking force-pushed the fast-fail-UpdateError branch 4 times, most recently from 0529fc9 to 08a8f00 Compare May 25, 2021 18:50
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
@wking wking force-pushed the fast-fail-UpdateError branch from 08a8f00 to 6c7cd99 Compare May 25, 2021 22:38
@jottofar
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 26, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

13 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Copy link
Member Author

wking commented May 27, 2021

/hold

while I work out what is going wrong with the unit and e2e-operator jobs.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2021
@vrutkovs
Copy link

/retest

Copy link

@vrutkovs vrutkovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm cancel

Not sure why unit/e2e tests are failing. Changing wait.Backoff.Steps to 2 in setupCVOTest fixes one unit tests, but I don't think its a good idea

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 23, 2021

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/prow/unit 6c7cd99 link /test unit
ci/prow/e2e-agnostic-operator 6c7cd99 link /test e2e-agnostic-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Copy link
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

@openshift-ci openshift-ci bot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Sep 6, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 6, 2021

@openshift-merge-robot: This pull request references Bugzilla bug 1927168, which is invalid:

  • expected the bug to target the "4.10.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/bugzilla refresh

The requirements for Bugzilla bugs have changed, recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 26, 2021
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 26, 2021
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 26, 2021
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 26, 2021
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 29, 2021
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2021
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 4, 2022
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 3, 2022

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot closed this Feb 3, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 3, 2022

@wking: This pull request references Bugzilla bug 1927168. The bug has been updated to no longer refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1927168: pkg/payload/task: Fail fast for UpdateError

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 2, 2022
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
@wking
Copy link
Member Author

wking commented Aug 5, 2022

Ended up landing via #577.

@wking wking deleted the fast-fail-UpdateError branch August 5, 2022 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants