pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

wking · 2021-02-27T22:43:57Z

Also simplify the undone type from a map to a slice of strings. The only consumers are length checks and nestedMessage formatting, so we don't need to preserve structured access to the current and expected values.

I've also dropped the old, early "Cluster operator %s is still updating" guard. If the ClusterOperator is degraded or unavailable, those are much more serious than being behind in version, so we want those guards coming first. The undone guard is now only the final fallback for available, non-degraded operators that aren't completely satisfactory (and I've pivoted to use ClusterOperatorUpdating for that reason).

I'm also now keeping any available condition data, and including that status, reason, and message in the nested message, because it seemed strange to do that for degraded, but not for the more serious unavailable.

And finally, I've dropped the old mode switch, and instead distributed the mode check to each relevant guard. The "we're completely satisfied" case is now where you end up if none of the "we aren't happy" guards trip. This avoids the need to have an "if we got this far, it must have been because we have undone versions" assumption for the ClusterOperatorUpdating block.

jottofar · 2021-03-02T21:35:31Z

This looks okay to me and unit tests should catch any detail I missed. I know you need to change 2 test cases from using ClusterOperatorNotAvailable to ClusterOperatorUpdating. I will circle back with the lgtm after the tests pass.

wking · 2021-03-22T23:52:17Z

Rebased around #514 and added a fix for the cluster_operator_up metric's version label with 92b76cb -> 5de0be4.

pkg/cvo/internal/operatorstatus.go

…handling Also simplify the 'undone' type from a map to a slice of strings. The only consumers are length checks and nestedMessage formatting, so we don't need to preserve structured access to the current and expected values. I've also dropped the old, early "Cluster operator %s is still updating" guard. If the ClusterOperator is degraded or unavailable, those are much more serious than being behind in version, so we want those guards coming first. The undone guard is now only the final fallback for available, non-degraded operators that aren't completely satisfactory (and I've pivoted to use ClusterOperatorUpdating for that reason). I'm also now keeping any available condition data, and including that status, reason, and message in the nested message, because it seemed strange to do that for degraded, but not for the more serious unavailable. And finally, I've dropped the old mode switch, and instead distributed the mode check to each relevant guard. The "we're completely satisfied" case is now where you end up if none of the "we aren't happy" guards trip. This avoids the need to have an "if we got this far, it must have been because we have undone versions" assumption for the ClusterOperatorUpdating block.

Fixing a TODO from 02a389b (sync: Temporarily stop checking version in ClusterOperator.Status, 2019-01-23, openshift#98). The semantics of the 'operator' entry have been clear since at least openshift/api@40c55f8085 (config/v1/types_cluster_operator: ClusterOperatorStatus doc wordsmithing, 2019-10-28, openshift/api#501).

To make it easier to debug when the difference is in a deeper pointer, e.g. [1]: --- FAIL: Test_waitForOperatorStatusToBeDone (0.02s) --- FAIL: Test_waitForOperatorStatusToBeDone/cluster_operator_reporting_available=true_and_degraded=false,_but_no_versions (0.00s) operatorstatus_test.go:533: Incorrect value returned - expected: &payload.UpdateError{Nested:(*errors.errorString)(0xc00060ae00), UpdateEffect:"None", Reason:"ClusterOperatorUpdating", Message:"Cluster operator test-co is updating versions", Name:"test-co", Task:(*payload.Task)(nil)} returned: &payload.UpdateError{Nested:(*errors.errorString)(0xc000238740), UpdateEffect:"None", Reason:"ClusterOperatorUpdating", Message:"Cluster operator test-co is updating versions", Name:"test-co", Task:(*payload.Task)(nil)} [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/527/pull-ci-openshift-cluster-version-operator-master-unit/1374504389896572928#1:build-log.txt%3A41

jottofar · 2021-04-08T14:23:53Z

/lgtm

openshift-ci-robot · 2021-04-08T14:25:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jottofar,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2021-04-09T02:41:23Z

/retest

smarterclayton · 2021-04-20T23:47:24Z

I am not comfortable with parts of this change - specifically allowing degraded during initialization and allowing us to exit install in a "ok" state with operators degraded (worse, we report available, then go failed after, when we should install, immediately report "Failed" and "Available" at the same time, and then cause installer to exit as "install failed").

I see no justification in the comment for such a change (the commit message is about code changes, not about the intent change). If you can link to such discussion I'd appreciate it, but this doesn't pass my smell test which is that degraded during install may be suppressible, but that reporting available (for automation) and then immediately going degraded punts problems downstream that should be caught right up front. An install completes if it is degraded, but it should not pretend that it is ok.

EDIT: discussion evolved in slack after this comment, the bug https://bugzilla.redhat.com/show_bug.cgi?id=1951835 captures that evolution re: impact of inertia, significantly delayed Failing condition post install, and how initilaizing and reconciling mode transition is important to capture correctly.

openshift-ci-robot requested review from LalatenduMohanty and vrutkovs February 27, 2021 22:44

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2021

wking force-pushed the remove-vestigial-progressing branch 5 times, most recently from 911e462 to 92b76cb Compare February 28, 2021 21:09

wking force-pushed the remove-vestigial-progressing branch from 92b76cb to 5de0be4 Compare March 22, 2021 23:51

wking force-pushed the remove-vestigial-progressing branch 9 times, most recently from 7f7ebe0 to ce44eb9 Compare March 29, 2021 23:48

wking force-pushed the remove-vestigial-progressing branch 2 times, most recently from cc7346b to 4abd57c Compare March 30, 2021 03:25

jottofar reviewed Apr 5, 2021

View reviewed changes

pkg/cvo/internal/operatorstatus.go Show resolved Hide resolved

wking added 3 commits April 8, 2021 07:19

wking force-pushed the remove-vestigial-progressing branch from 4abd57c to 98794a6 Compare April 8, 2021 14:20

openshift-ci-robot assigned jottofar Apr 8, 2021

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2021

openshift-merge-robot merged commit 23bc93b into openshift:master Apr 9, 2021

wking deleted the remove-vestigial-progressing branch April 9, 2021 16:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

Uh oh!

wking commented Feb 27, 2021

Uh oh!

jottofar commented Mar 2, 2021

Uh oh!

wking commented Mar 22, 2021

Uh oh!

Uh oh!

jottofar commented Apr 8, 2021

Uh oh!

openshift-ci-robot commented Apr 8, 2021

Uh oh!

wking commented Apr 9, 2021

Uh oh!

smarterclayton commented Apr 20, 2021 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

pkg/cvo/internal/operatorstatus: Drop deprecated failing/progressing handling #527

Uh oh!

Conversation

wking commented Feb 27, 2021

Uh oh!

jottofar commented Mar 2, 2021

Uh oh!

wking commented Mar 22, 2021

Uh oh!

Uh oh!

jottofar commented Apr 8, 2021

Uh oh!

openshift-ci-robot commented Apr 8, 2021

Uh oh!

wking commented Apr 9, 2021

Uh oh!

smarterclayton commented Apr 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

smarterclayton commented Apr 20, 2021 •

edited

Loading