
Conversation

@deads2k (Contributor) commented Oct 5, 2018

@crawford @smarterclayton @abhinavdahiya @derekwaynecarr @openshift/api-review

Let's figure out the shape of the status reporting API. This is a WIP where we can work out the fine details. The story so far.....

Kube and OpenShift workload and core resources have a dedicated slice of conditions under .status.conditions that individually track True, False, and Unknown states. This avoids multiplying states in a state machine, and various ecosystem features have been built around the pattern, like kubectl wait and the CRD spec/status permissions split, along with the concept of status being completely re-derivable from cluster state.
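As a reference for the pattern, here is a minimal sketch of a kube-style condition; the field and type names are illustrative, not the exact ones in this pull:

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// OperatorStatusCondition is a minimal sketch of the kube-style
// conditions pattern discussed above; names here are illustrative,
// not the exact types proposed in this pull.
type OperatorStatusCondition struct {
	// Type names the aspect being reported, e.g. "Available" or "Progressing".
	Type string `json:"type"`
	// Status is one of "True", "False", or "Unknown".
	Status string `json:"status"`
	// LastTransitionTime records when Status last changed.
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
	// Reason is a machine-readable cause; Message is human-readable detail.
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}
```

Tooling like kubectl wait --for=condition=Available keys off exactly this True/False/Unknown shape.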

Since the operatorstatus described here is closely related to the CVO (its only consumer), this pull moves the OperatorStatus type into clusterversion.openshift.io. Also, the spec/status split for custom resources provides a kube-like experience, so I've renamed it to Operator with a filled-in .status and an empty .spec.

One key question I had is why the OperatorStatus is namespace-scoped. It's fairly easy to keep it there, but looking at the operators managed by the CVO, they are largely cluster-scoped, not namespace-scoped:

  1. kube-apiserver
  2. kube-controller-manager
  3. kube-scheduler
  4. openshift-apiserver
  5. openshift-controller-manager
  6. dns
  7. sdn
  8. image registry
  9. templates

Given that, naming at a cluster scope seems to make more sense, since these operators do not scale with the number of namespaces.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 5, 2018
@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 5, 2018
@abhinavdahiya (Contributor) commented Oct 5, 2018

I think we have had previous discussions that there should not be an Operator type with a spec that the operator might have to read.

cc @crawford

OperatorStatus is designed to be write-only for operators and read-only for the CVO and users. Also, OperatorStatus is a separate type because we don't want operators to vendor any CVO-related code, only the operator-status-related API and clients.

@deads2k (Contributor, Author) commented Oct 5, 2018

I think we have had previous discussions that there should not be an Operator type with a spec that the operator might have to read.

I'm not pitching that we fill that in at the moment. An empty spec (as I have it here) will do fine.

OperatorStatus is designed to be write-only for operators and read-only for the CVO and users. Also, OperatorStatus is a separate type because we don't want operators to vendor any CVO-related code, only the operator-status-related API and clients.

It seems fairly likely that this ends up in openshift/api, but with a dynamic client, nothing requires them to vendor it.
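For illustration, a hedged sketch of reading such a status with only the dynamic client, so no CVO code gets vendored; the group/version/resource, namespace, and object name here are assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// GVR, namespace, and name are assumptions for illustration only.
	gvr := schema.GroupVersionResource{
		Group:    "clusterversion.openshift.io",
		Version:  "v1",
		Resource: "operatorstatuses",
	}
	// Note: older client-go versions omit the context argument on Get.
	obj, err := client.Resource(gvr).Namespace("openshift-cluster-version").
		Get(context.TODO(), "kube-apiserver", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Everything arrives as unstructured data; no typed clients required.
	fmt.Println(obj.Object["status"])
}
```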

@deads2k (Contributor, Author) commented Oct 9, 2018

@crawford @smarterclayton comments? Milestone 2 requires operators to set this, so we should try to get the API consistent early.

@crawford (Contributor) left a comment

These commit messages provide zero information for people to review in the future. The PR comments have a decent amount of background though. Can you distill that into commit descriptions?

metav1.ObjectMeta `json:"metadata"`

// Spec holds the intent of how this operator should behave.
Spec OperatorSpec `json:"spec"`
Contributor:

We cannot combine the Status and the Spec into the same object. This was the original design and we pivoted away from it. I don't think there is any need for this parent type.

Author:

We cannot combine the Status and the Spec into the same object. This was the original design and we pivoted away from it. I don't think there is any need for this parent type.

Responded in the main thread with more detail below. The tl;dr is that spec/status is so ingrained in kubernetes that the burden is on this API to provide a reason why it cannot be structured like a "normal" resource, not the other way around.

type OperatorSpec struct {
}

type OperatorStatus struct {
Contributor:

Why is this being moved back into this API (and leaving the other copy in the other API)?

Author:

Why is this being moved back into this API (and leaving the other copy in the other API)?

The copy only exists while we talk about this API. Before it moves out of WIP, we can remove the other.

@crawford (Contributor) commented Oct 9, 2018

Since the operatorstatus described here is closely related to the CVO (its only consumer), this pull moves the OperatorStatus type into clusterversion.openshift.io.

The OperatorStatus is designed to be the common API between all of the operators. Right now, CVO is the only one who cares, but in the future, operators almost certainly will have dependencies between one another. I'd like to keep this in a separate API.

@deads2k (Contributor, Author) commented Oct 9, 2018

We cannot combine the Status and the Spec into the same object. This was the original design and we pivoted away from it. I don't think there is any need for this parent type.

@crawford status is a kubernetes API concept. Resources have spec (intent, empty here) and status (actual state that is fully re-derivable by inspecting the cluster or other state). The concept is so common in kube that you see support for it across multiple components like kubectl and CustomResourceDefinitions. Having this API participate in that pattern makes it more consistent with the rest of the stack.

I think the burden is on this type to explain why an empty spec cannot work.

@deads2k (Contributor, Author) commented Oct 9, 2018

Since the operatorstatus described here is closely related to the CVO (its only consumer), this pull moves the OperatorStatus type into clusterversion.openshift.io.

The OperatorStatus is designed to be the common API between all of the operators. Right now, CVO is the only one who cares, but in the future, operators almost certainly will have dependencies between one another. I'd like to keep this in a separate API.

I'm not aware of plans to try to have a secondary status object be a means to express and discover dependency relationships. It seems like we'd want the dependencies to be directly "owned" by the producer, with consumers being responsible for conforming.

An example of a common status object not working is the relationship between etcd and the apiservers. Etcd takes care of creating and rotating client cert/key pairs, which are secrets. They must be consumed by the apiserver operators to keep their etcd connections alive. If you tried to include them here, you would create a confidential resource that you cannot grant read access to. If you do not include them here, you have to create an alternative to depending on this resource. I think we have to go with the latter.
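To make that alternative concrete, a hedged sketch of the consumer side — an apiserver operator reading the producer-owned secret directly, which can be guarded with ordinary RBAC on that one object. The namespace and secret name are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Namespace and name are illustrative assumptions. The consumer goes
	// straight to the producer-owned secret instead of routing
	// confidential material through a shared status object.
	secret, err := client.CoreV1().Secrets("openshift-etcd").
		Get(context.TODO(), "etcd-client", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("etcd client cert: %d bytes\n", len(secret.Data["tls.crt"]))
}
```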

@crawford (Contributor) commented Oct 9, 2018

I'm not aware of plans to try to have a secondary status object be a means to express and discover dependency relationships.

I wasn't clear. The operator status is designed to allow two operators who have a dependency on one another to communicate; not to declare the dependency. CVO currently enforces ordering in order to work around dependencies, but the eventual goal is to allow operators to coordinate between one another by watching relevant operator statuses.

@deads2k (Contributor, Author) commented Oct 10, 2018

I wasn't clear. The operator status is designed to allow two operators who have a dependency on one another to communicate; not to declare the dependency. CVO currently enforces ordering in order to work around dependencies, but the eventual goal is to allow operators to coordinate between one another by watching relevant operator statuses.

If a consumer knows that it has a dependency on a producer, there is no reason to depend on an intermediate object like this. It could simply go directly to the thing it depends upon.

@smarterclayton (Contributor):

operators almost certainly will have dependencies between one another. I'd like to keep this in a separate API.

If a consumer knows that it has a dependency on a producer, there is no reason to depend on an intermediate object like this

So these two statements are pretty close in intent. Let's work an example:

  1. kube-scheduler operator needs to know when kube-apiserver is updated to v1.14 before it allows scheduler to be updated to v1.14.
  2. kube-scheduler operator must know two things: that the apiserver is at 1.14, and that the update is "complete".

David, what object were you expecting kube-scheduler operator to look at in a dependency case?

Alex, I assume you are implying that kube-scheduler operator would look at the kube apiserver operator object in the status field and look for a generic "version output"? So status.versions = [{"name":"kube-apiserver", "version": "1.14.0"}]?
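As a hedged sketch, that generic "version output" might be nothing more than a named slice on the status; the field names are assumptions, not a settled API:

```go
package sketch

// OperandVersion is a sketch of the generic "version output" floated
// above; the names are assumptions, not a settled API.
type OperandVersion struct {
	Name    string `json:"name"`    // e.g. "kube-apiserver"
	Version string `json:"version"` // e.g. "1.14.0"
}

// A status carrying such a slice would serialize as
//   status.versions = [{"name": "kube-apiserver", "version": "1.14.0"}]
type VersionedStatus struct {
	Versions []OperandVersion `json:"versions,omitempty"`
}
```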

@deads2k (Contributor, Author) commented Oct 10, 2018

David, what object were you expecting kube-scheduler operator to look at in a dependency case?

I wired up the kube and openshift control planes so that the kube-apiserver-operator publishes version information about itself and the openshift-apiserver-operator reads version information from that location. It avoids having an intermediate API, and the same pattern can be used to convey confidential information via secrets.

@crawford (Contributor):

Alex, I assume you are implying that kube-scheduler operator would look at the kube apiserver operator object in the status field and look for a generic "version output"?

Yep. I'm trying to reduce coupling between the operators. If you know how to read the generic status of one operator, you can read it for all of them. The escape hatch for this scheme is the Extension member of the type.

@deads2k (Contributor, Author) commented Oct 10, 2018

Yep. I'm trying to reduce coupling between the operators. If you know how to read the generic status of one operator, you can read it for all of them. The escape hatch for this scheme is the Extension member of the type.

The thing is, if I'm trying to discover that version, I'm already coupled. This doesn't reduce the coupling, it just makes another object that both operators have to go through.

Ultimately, I'll move the object to a different group to fix the internal schema of said object. But I think the idea of a "universal operator information sharing space" is going to be poorly used internally and never used externally.

@smarterclayton (Contributor):

So certain collaboration between operators is going to be confusing if we try to make it generic here - the status of certain config flags or whatever belongs in a different object (which it seems we're all on board with).

Re versions - it might be good for us to talk about that in a design doc, but a few thoughts below:

I'll agree with Alex here that the whole point of the CVO is to take a top-level version and ensure specific, known versions are showing up in other places. It's reasonable that a human will want to look at operator status and know at a glance whether operator X is at the "right" level (whatever that is). I don't know that that means the operators have to report it, but who better than the kube operators to know what version of kube exists and report that back to an admin?

The counter-argument is that most or almost all of the individual versions really need to parrot the top-level version (i.e. OpenShift 4.2.1), while the versions we're talking about here might be kube (v1.13.1) or prometheus (2.4.0) or etcd (3.2.29).

I haven't seen a design for this yet, but I am worried about components outside of the master team family having to look at an object geared for the kube-apiserver. Within the master team - sure, since technically the master team has a "cluster-kube-operator" that just happens to be split into 5 processes - but outside the master team I would have expected an operator to ask "is the control plane at v1.13", not "hey kube-apiserver, what version do you think you're at".

@smarterclayton (Contributor):

Re operator spec:

I think the operator spec here isn't the config of the operator - it's the control point that the SLO respects for specific, limited, generic operator behavior. I.e. it's a generic spec that all operators are expected to follow for things like paused (which is managementState here). It would not include things like config, ever.
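A hedged sketch of that limited control point - a spec that carries only the behavior every operator must honor, never component config; the names and values are assumptions:

```go
package sketch

// ManagementState is a sketch of the generic control point described
// above; the value names are illustrative assumptions.
type ManagementState string

const (
	// Managed: the operator actively reconciles its operand.
	Managed ManagementState = "Managed"
	// Unmanaged: the operator leaves its operand alone ("paused").
	Unmanaged ManagementState = "Unmanaged"
)

// OperatorSpec deliberately stops here: generic operational behavior
// only, no component configuration.
type OperatorSpec struct {
	ManagementState ManagementState `json:"managementState"`
}
```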

Re spec and status:

Given the kube conventions, I argued for this: the primary goal of the object is to be a generic interface for operator-level actions, which includes generic status (human-focused) and possibly an abstract version thing (which obviously we have disagreement on). The status is single-writer from the operator. The spec would only contain things that are necessary for consistent operational behavior across all SLOs, like pausing the operator.

Re: status being namespace vs cluster scoped

Just from a purely mechanical level, I'd rather run oc get operators and see a list than oc get operators --all-namespaces. I think owner refs could handle out-of-namespace cleanup, although I agree namespace cleanup is better. I do see the possibility that we want an oc adm status that combines fetching operators (in which case oc get operators is less of a factor) with fetching the things that might show up on the dashboard, including CVO version, firing alerts, etc.

@deads2k (Contributor, Author) commented Oct 11, 2018

I'll open a different pull for the status structure change and the name change. I'll rebase this on that so we can merge the easier bits and revisit some of the harder ones.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 15, 2018
@deads2k deads2k changed the title [WIP] make operator status more kube-like make operator status more kube-like Oct 15, 2018
@deads2k (Contributor, Author) commented Oct 15, 2018

@crawford updated per our conversation on Friday.

  1. commit message updated to be more specific
  2. OperatorStatus name is untouched, but a status substruct with separate conditions has been added.
  3. CRD updated to have a /status subresource

After this merges, I plan to open more pulls to push further (name of resource, possibly scoping, possibly group), but I think we agree this far, and it will help make integration from other teams easier and reduce overall churn.

@deads2k deads2k removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2018
// OperatorAvailable indicates that the binary supported by the operator is functional and available in the cluster.
OperatorAvailable OperatorStatusConditionType = "Available"

// OperatorProgressing indicates that the operator is actively binary supported by the operator is functional and available in the cluster.
Contributor:

Looks like the wrong godoc.

Is progressing supposed to be "I'm reconciling towards the desired state"? Or reflect the operator status itself?

I think it's acceptable to have different condition types when the definition is not the same. For instance, Deployment uses Progressing False to represent "failing" because Deployment has no permanent-fail state.

The removal of "Done" means that a consumer doesn't have an unambiguous signal about when all possible changes have been applied. I think we should have a positive, affirmative condition on "up to date".

Author:

The removal of "Done" means that a consumer doesn't have an unambiguous signal about when all possible changes have been applied. I think we should have a positive, affirmative condition on "up to date".

As opposed to that being done := available && !progressing && !failed, each of which has a clear individual signal for a different condition?
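A hedged sketch of that derivation; the condition names follow this thread, while the helper itself is illustrative:

```go
package sketch

// condition is the minimal shape needed for the sketch.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// isTrue reports whether the named condition is explicitly True;
// an absent condition counts as not-True.
func isTrue(conditions []condition, condType string) bool {
	for _, c := range conditions {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// done derives the "all changes applied" signal from the individual
// conditions rather than from a dedicated Done condition.
func done(conditions []condition) bool {
	return isTrue(conditions, "Available") &&
		!isTrue(conditions, "Progressing") &&
		!isTrue(conditions, "Failing")
}
```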

// OperatorStatusConditionTypeDegraded indicates that the operator has
// encountered an error that is preventing it from working properly.
OperatorStatusConditionTypeDegraded OperatorStatusConditionType = "Degraded"
// OperatorAvailable indicates that the binary supported by the operator is functional and available in the cluster.
Contributor:

Can you explain binary supported by the operator?

@deads2k (Contributor, Author) commented Oct 15, 2018

Initial comments addressed. Typo, type name, doc.

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 15, 2018
@abhinavdahiya (Contributor):

@deads2k #31 (comment) these functions don't seem to belong in pkg/cvo's API.

Where would you rather recommend we keep them? Some Kubernetes API packages do have versioned helper functions in the APIs.

@deads2k (Contributor, Author) commented Oct 15, 2018

@deads2k #31 (comment) these functions don't seem to belong in pkg/cvo's API.

Where would you rather recommend we keep them? Some Kubernetes API packages do have versioned helper functions in the APIs.

I think that in the next month we will move this to openshift/api. Once we do that, then they will have a home inside of library-go near https://github.com/openshift/library-go/tree/master/pkg/operator/v1alpha1helpers . We can simulate that structure now if you like. I didn't think that it mattered either way.
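For a sense of what those helpers do, a hedged sketch of a set-condition function in the style of the library-go helpers - modeled on, not copied from, that package:

```go
package helpers

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// OperatorStatusCondition is a minimal stand-in for the generated API
// type that the real helpers operate on.
type OperatorStatusCondition struct {
	Type               string
	Status             string
	LastTransitionTime metav1.Time
	Reason             string
	Message            string
}

// SetOperatorCondition inserts or updates a condition in place,
// bumping LastTransitionTime only when Status actually changes.
func SetOperatorCondition(conditions *[]OperatorStatusCondition, newCond OperatorStatusCondition) {
	for i := range *conditions {
		if (*conditions)[i].Type != newCond.Type {
			continue
		}
		if (*conditions)[i].Status == newCond.Status {
			newCond.LastTransitionTime = (*conditions)[i].LastTransitionTime
		} else {
			newCond.LastTransitionTime = metav1.NewTime(time.Now())
		}
		(*conditions)[i] = newCond
		return
	}
	newCond.LastTransitionTime = metav1.NewTime(time.Now())
	*conditions = append(*conditions, newCond)
}
```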

@abhinavdahiya (Contributor):

I am fine if we move these helper functions anywhere out of pkg/cvo; lib or anything.

The spec/status split is a kubernetes standard that is respected in a generic way by
kubectl and CRDs. Making OperatorStatus conform will make it match expectations.
@abhinavdahiya (Contributor):

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2018
@abhinavdahiya (Contributor):

@crawford / @smarterclayton can you take another look and /lgtm

@crawford (Contributor):

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2018
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, crawford, deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit fe673cb into openshift:master Oct 15, 2018
abhinavdahiya added a commit to abhinavdahiya/cluster-version-operator that referenced this pull request Oct 16, 2018
abhinavdahiya added a commit to abhinavdahiya/cluster-version-operator that referenced this pull request Oct 18, 2018
openshift-merge-robot added a commit that referenced this pull request Oct 19, 2018
cvo/internal: catch up the operator status handler to PR #31
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 24, 2020
We have blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up
when an operator goes Degraded=True for a while, we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 25, 2024
We have blocking on this condition since 545c342 (api: make status
substruct on operatorstatus, 2018-10-15, openshift#31) when it was Failing.
We'd softened our install-time handling to act this way back in
b0b4902 (clusteroperator: Don't block on failing during
initialization, 2019-03-11, openshift#136), motivated by install speed [1].
And a degraded operator may slow dependent components in their own
transitions.  But as long as the operator/operand are available at
all, it should not block dependent components from transitioning, so
this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins
when an operator goes Degraded=True for a while, we will just no
longer block updates at that point.  We won't block ReconcilingMode
manifest application either, but since that's already flattened and
permuted, and ClusterOperator tend to be towards the end of their
TaskNode, the impact on ReconcilingMode is minimal (except that we
will no longer go Failing=True in ClusterVersion when the only issue
is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 25, 2024
wking added a commit to wking/cluster-version-operator that referenced this pull request Nov 27, 2024
