-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 1927168: pkg/payload/task: Fail fast for UpdateError #573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@wking: This pull request references Bugzilla bug 1927168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
0529fc9 to
08a8f00
Compare
Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560) and the older cc9292a (lib/resourcebuilder: Replace wait-for with single-shot "is it alive now?", 2020-07-07, openshift#400). This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry". The install-time backoff is from fee2d06 (sync: Completely parallelize the initial payload, 2019-03-11, openshift#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots. I'd tried dropping: backoff := st.Backoff and passing st.Backoff directly to ExponentialBackoffWithContext, but it turns out that Step() [1]: ... mutates the provided Backoff to update its Steps and Duration. Luckily, Backoff has no pointer properties, so storing as a local variable is sufficient to give us a fresh copy for the local mutations. [1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
08a8f00 to
6c7cd99
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jottofar, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
13 similar comments
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/hold while I work out what is going wrong with the unit and e2e-operator jobs. |
|
/retest |
vrutkovs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm cancel
Not sure why unit/e2e tests are failing. Changing wait.Backoff.Steps to 2 in setupCVOTest fixes one unit tests, but I don't think its a good idea
|
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/bugzilla refresh The requirements for Bugzilla bugs have changed, recalculating validity. |
|
@openshift-merge-robot: This pull request references Bugzilla bug 1927168, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
|
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
|
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
|
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /close |
|
@openshift-bot: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: This pull request references Bugzilla bug 1927168. The bug has been updated to no longer refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…ssing message For example, in a recent test cluster: $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803 2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available 2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out 2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading But the "has not yet successfully rolled out" we used for Progressing is much less direct than "is not available" we used for Failing. 6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.
|
Ended up landing via #577. |
Things like API-server connection errors and patch conflicts deserve some retries before we bubble them up into ClusterVersion conditions. But when we are able to retrieve in-cluster objects and determine that they are not happy, we should exit more quickly so we can complain about the resource state and start in on the next sync cycle. For example, see the recent e02d148 (#560) and the older cc9292a (#400). This commit uses the presence of an UpdateError as a marker for "fail fast; no need to retry".
The install-time backoff is from fee2d06 (#136). I'm not sure if it really wants the same cap as reconcile and update modes, but I've left them the same for now, and future commits to pivot the backoff settings can focus on motivating those pivots.