Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Feb 27, 2021

Also simplify the undone type from a map to a slice of strings. The only consumers are length checks and nestedMessage formatting, so we don't need to preserve structured access to the current and expected values.

I've also dropped the old, early "Cluster operator %s is still updating" guard. If the ClusterOperator is degraded or unavailable, those are much more serious than being behind in version, so we want those guards coming first. The undone guard is now only the final fallback for available, non-degraded operators that aren't completely satisfactory (and I've pivoted to use ClusterOperatorUpdating for that reason).

I'm also now keeping any available condition data, and including that status, reason, and message in the nested message, because it seemed strange to do that for degraded, but not for the more serious unavailable.

And finally, I've dropped the old mode switch, and instead distributed the mode check to each relevant guard. The "we're completely satisfied" case is now where you end up if none of the "we aren't happy" guards trip. This avoids the need to have an "if we got this far, it must have been because we have undone versions" assumption for the ClusterOperatorUpdating block.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2021
@wking wking force-pushed the remove-vestigial-progressing branch 5 times, most recently from 911e462 to 92b76cb Compare February 28, 2021 21:09
@jottofar
Copy link
Contributor

jottofar commented Mar 2, 2021

This looks okay to me and unit tests should catch any detail I missed. I know you need to change 2 test cases from using ClusterOperatorNotAvailable to ClusterOperatorUpdating. I will circle back with the lgtm after the tests pass.

@wking wking force-pushed the remove-vestigial-progressing branch from 92b76cb to 5de0be4 Compare March 22, 2021 23:51
@wking
Copy link
Member Author

wking commented Mar 22, 2021

Rebased around #514 and added a fix for the cluster_operator_up metric's version label with 92b76cb -> 5de0be4.

@wking wking force-pushed the remove-vestigial-progressing branch 9 times, most recently from 7f7ebe0 to ce44eb9 Compare March 29, 2021 23:48
@wking wking force-pushed the remove-vestigial-progressing branch 2 times, most recently from cc7346b to 4abd57c Compare March 30, 2021 03:25
wking added 3 commits April 8, 2021 07:19
…handling

Also simplify the 'undone' type from a map to a slice of strings.  The
only consumers are length checks and nestedMessage formatting, so we
don't need to preserve structured access to the current and expected
values.

I've also dropped the old, early "Cluster operator %s is still
updating" guard.  If the ClusterOperator is degraded or unavailable,
those are much more serious than being behind in version, so we want
those guards coming first.  The undone guard is now only the final
fallback for available, non-degraded operators that aren't completely
satisfactory (and I've pivoted to use ClusterOperatorUpdating for that
reason).

I'm also now keeping any available condition data, and including that
status, reason, and message in the nested message, because it seemed
strange to do that for degraded, but not for the more serious
unavailable.

And finally, I've dropped the old mode switch, and instead distributed
the mode check to each relevant guard.  The "we're completely
satisfied" case is now where you end up if none of the "we aren't
happy" guards trip.  This avoids the need to have an "if we got this
far, it must have been because we have undone versions" assumption for
the ClusterOperatorUpdating block.
Fixing a TODO from 02a389b (sync: Temporarily stop checking version
in ClusterOperator.Status, 2019-01-23, openshift#98).  The semantics of the
'operator' entry have been clear since at least
openshift/api@40c55f8085 (config/v1/types_cluster_operator:
ClusterOperatorStatus doc wordsmithing, 2019-10-28,
openshift/api#501).
To make it easier to debug when the difference is in a deeper pointer,
e.g. [1]:

  --- FAIL: Test_waitForOperatorStatusToBeDone (0.02s)
      --- FAIL: Test_waitForOperatorStatusToBeDone/cluster_operator_reporting_available=true_and_degraded=false,_but_no_versions (0.00s)
          operatorstatus_test.go:533: Incorrect value returned -
              expected: &payload.UpdateError{Nested:(*errors.errorString)(0xc00060ae00), UpdateEffect:"None", Reason:"ClusterOperatorUpdating", Message:"Cluster operator test-co is updating versions", Name:"test-co", Task:(*payload.Task)(nil)}
              returned: &payload.UpdateError{Nested:(*errors.errorString)(0xc000238740), UpdateEffect:"None", Reason:"ClusterOperatorUpdating", Message:"Cluster operator test-co is updating versions", Name:"test-co", Task:(*payload.Task)(nil)}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/527/pull-ci-openshift-cluster-version-operator-master-unit/1374504389896572928#1:build-log.txt%3A41
@wking wking force-pushed the remove-vestigial-progressing branch from 4abd57c to 98794a6 Compare April 8, 2021 14:20
@jottofar
Copy link
Contributor

jottofar commented Apr 8, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2021
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Member Author

wking commented Apr 9, 2021

/retest

@openshift-merge-robot openshift-merge-robot merged commit 23bc93b into openshift:master Apr 9, 2021
@wking wking deleted the remove-vestigial-progressing branch April 9, 2021 16:02
@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 20, 2021

I am not comfortable with parts of this change - specifically allowing degraded during initialization and allowing us to exit install in a "ok" state with operators degraded (worse, we report available, then go failed after, when we should install, immediately report "Failed" and "Available" at the same time, and then cause installer to exit as "install failed").

I see no justification in the comment for such a change (the commit message is about code changes, not about the intent change). If you can link to such discussion I'd appreciate it, but this doesn't pass my smell test which is that degraded during install may be suppressible, but that reporting available (for automation) and then immediately going degraded punts problems downstream that should be caught right up front. An install completes if it is degraded, but it should not pretend that it is ok.

EDIT: discussion evolved in slack after this comment, the bug https://bugzilla.redhat.com/show_bug.cgi?id=1951835 captures that evolution re: impact of inertia, significantly delayed Failing condition post install, and how initilaizing and reconciling mode transition is important to capture correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants