api: Update to objects from openshift/api #55
Conversation
Force-pushed from 2b64778 to 0eaaaa0.
openshift/library-go#97 might come in handy.
I still need to add a unit test for the Invalid case here. David and I talked through the conditions and how they're being used here vs other operators. We agreed that we were roughly in the same spot: Available means either "I've reached the desired state you requested" or "my operands are healthy". In the CVO right now we set Available to false when we start applying a new version, and set it to true after we've synced a payload successfully at least once. Arguably we don't have to reset Available to false when we start applying a new version, because Progressing=true means the same thing. That's something I want to think about more. We did talk about how ClusterOperator could be taken into account as a prerequisite for upgrading to a new version. If not all cluster operators are
We can still init status.current if it's unset, but we shouldn't update current until we've completed a sync (right now the code is updating current to the CVO at steps 3-4 when it updates status). We also need to think about what type of status report to provide at step 3: are we Progressing=false with an appropriate message?
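A minimal sketch of the condition handling discussed above, using simplified stand-in types rather than the real openshift/api structs: Progressing covers the in-flight case, Available is declared once a payload has synced successfully, and status.current only advances after a completed sync.

```go
package main

import "fmt"

// Simplified stand-ins for the ClusterVersion status types discussed above;
// the real definitions live in openshift/api.
type Condition struct {
	Type    string
	Status  string
	Message string
}

type Status struct {
	Current    string // last-synced payload; only advanced after a completed sync
	Conditions []Condition
}

// setCondition replaces (or appends) the condition of the given type.
func setCondition(s *Status, c Condition) {
	for i := range s.Conditions {
		if s.Conditions[i].Type == c.Type {
			s.Conditions[i] = c
			return
		}
	}
	s.Conditions = append(s.Conditions, c)
}

// markSyncing reflects "we are applying a new payload": Progressing=true,
// while Available is left alone (it stays true once a payload has synced).
func markSyncing(s *Status, version string) {
	setCondition(s, Condition{Type: "Progressing", Status: "True",
		Message: fmt.Sprintf("Working towards %s", version)})
}

// markSynced reflects a completed sync: only now do we advance Current and
// declare Available=true, Progressing=false.
func markSynced(s *Status, version string) {
	s.Current = version
	setCondition(s, Condition{Type: "Available", Status: "True",
		Message: fmt.Sprintf("Done applying %s", version)})
	setCondition(s, Condition{Type: "Progressing", Status: "False"})
}

func main() {
	var st Status
	markSyncing(&st, "4.0.0-2")
	markSynced(&st, "4.0.0-2")
	fmt.Printf("%+v\n", st)
}
```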
Force-pushed from 0eaaaa0 to 2039018.
Force-pushed from 2039018 to e791c6c.
Updated so that this is relatively complete:
Force-pushed from e791c6c to 4d3dc5a.
(To be clear, I didn't make any of the changes discussed in my wall-of-text comment; I just added the Invalid condition and cleaned up the sync loop.)
 )

-func EnsureClusterOperatorStatus(modified *bool, existing *osv1.ClusterOperator, required osv1.ClusterOperator) {
+func EnsureClusterOperatorStatus(modified *bool, existing *configv1.ClusterOperator, required configv1.ClusterOperator) {
This file should no longer exist in the CVO. openshift/library-go#97 seems like a better place for these functions. For now it's okay as-is.
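For context, a rough sketch of what a helper with this shape typically does: merge the required status into the existing object and record whether anything changed, so callers only issue an update when needed. The types below are simplified stand-ins, not the configv1 types from openshift/api, and the body is an assumption about the pattern rather than the repo's actual implementation.

```go
package main

import (
	"fmt"
	"reflect"
)

// Minimal stand-in for the ClusterOperator object; the real type is
// configv1.ClusterOperator from openshift/api.
type ClusterOperatorStatus struct {
	Version    string
	Conditions []string
}

type ClusterOperator struct {
	Status ClusterOperatorStatus
}

// ensureClusterOperatorStatus copies the required status onto the existing
// object and flags modification, so the caller can skip no-op updates.
func ensureClusterOperatorStatus(modified *bool, existing *ClusterOperator, required ClusterOperator) {
	if !reflect.DeepEqual(existing.Status, required.Status) {
		existing.Status = required.Status
		*modified = true
	}
}

func main() {
	existing := &ClusterOperator{Status: ClusterOperatorStatus{Version: "old"}}
	required := ClusterOperator{Status: ClusterOperatorStatus{Version: "new"}}
	modified := false
	ensureClusterOperatorStatus(&modified, existing, required)
	fmt.Println(modified, existing.Status.Version) // true new
}
```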
Force-pushed from 4d3dc5a to 9ff9b5c.
-	Updates   []cvv1.Update
 	Condition osv1.ClusterOperatorStatusCondition
+	Updates   []configv1.Update
this configv1.Update is a little ambiguous.
Good point. I realized last night too that when we start bringing metadata back from Cincinnati (messages and URLs to display pages), we're going to need to split the update. I was going to suggest ClusterVersionUpdate and AvailableClusterVersionUpdate as the two structs.
^^ These sound better. Maybe file an issue to make sure we don't forget?
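One possible shape for the ClusterVersionUpdate / AvailableClusterVersionUpdate split suggested above, purely illustrative: the field names and the embedding are assumptions, not the types that eventually landed.

```go
package main

import "fmt"

// ClusterVersionUpdate is a hypothetical "what to apply" struct: just the
// fields needed to identify a payload.
type ClusterVersionUpdate struct {
	Version string
	Payload string // image pullspec
}

// AvailableClusterVersionUpdate is a hypothetical "what Cincinnati offered"
// struct: the same identity plus display metadata (messages, URLs) that a UI
// would show but the sync loop does not need.
type AvailableClusterVersionUpdate struct {
	ClusterVersionUpdate
	Message string
	URL     string
}

func main() {
	offer := AvailableClusterVersionUpdate{
		ClusterVersionUpdate: ClusterVersionUpdate{Version: "4.0.1", Payload: "quay.io/example/release:4.0.1"},
		Message:              "Bug fixes",
		URL:                  "https://example.com/errata/4.0.1",
	}
	// The sync loop would only consume the embedded identity portion.
	fmt.Printf("apply %s from %s\n", offer.Version, offer.Payload)
}
```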
	// for fields that have meaning that are incomplete, clear them
	// prevents us from loading clearly malformed payloads
	obj = validation.ClearInvalidFields(obj, errs)
Is this mutating the lister cache?
We deep-copy before making a change in that method.
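The exchange above is about not mutating objects served from an informer's lister cache. Below is a minimal sketch of the deep-copy-then-clear pattern under that assumption, with a simplified config type; the real ClearInvalidFields in this repo's validation package takes the validation errors directly, so the helper here is only illustrative.

```go
package main

import "fmt"

// Simplified stand-in for the config object held in the lister cache.
type Config struct {
	Upstream string
	Channel  string
}

// DeepCopy mirrors the generated DeepCopy methods on Kubernetes API types.
func (c *Config) DeepCopy() *Config {
	out := *c
	return &out
}

// clearInvalidFields copies first, then blanks out fields that failed
// validation, so the shared cache object is never mutated in place.
func clearInvalidFields(obj *Config, invalidFields []string) *Config {
	out := obj.DeepCopy()
	for _, f := range invalidFields {
		switch f {
		case "upstream":
			out.Upstream = ""
		case "channel":
			out.Channel = ""
		}
	}
	return out
}

func main() {
	cached := &Config{Upstream: "not-a-url", Channel: "fast"}
	cleaned := clearInvalidFields(cached, []string{"upstream"})
	fmt.Println(cached.Upstream, "|", cleaned.Upstream) // cache left untouched
}
```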
This looks good. 👍 will
The ClusterOperator is cluster scoped so that users can run `oc get clusteroperators`, and to make its cluster-wide scope explicit.
ClusterVersion/ClusterOperator moved to the openshift/api repo, with minor changes that are vendored back here. This is the reaction PR. The major differences in what went upstream:
* ClusterOperator is now cluster scoped.
* ClusterVersion deserialization doesn't check the UID or URL, which needs to be handled by the operator anyway (subsequent commit).
* The ClusterVersion URL is not a pointer; the suggestion was to create a new field that controls update behavior, such as `updateMode: <None|Retrieve|Auto>`, where Retrieve might be the default.
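As a purely hypothetical illustration of the `updateMode: <None|Retrieve|Auto>` suggestion above (this is a floated idea, not a shipped API), such a field might be typed along these lines:

```go
package main

import "fmt"

// UpdateMode sketches the hypothetical `updateMode` field floated above; the
// names mirror the suggestion, not an actual API.
type UpdateMode string

const (
	// UpdateModeNone: do not contact the upstream update server at all.
	UpdateModeNone UpdateMode = "None"
	// UpdateModeRetrieve: fetch available updates but never apply them
	// automatically (the suggested default).
	UpdateModeRetrieve UpdateMode = "Retrieve"
	// UpdateModeAuto: fetch and automatically apply updates.
	UpdateModeAuto UpdateMode = "Auto"
)

// ClusterVersionSpec here is a simplified stand-in for the real spec type.
type ClusterVersionSpec struct {
	Upstream   string
	Channel    string
	UpdateMode UpdateMode
}

func main() {
	spec := ClusterVersionSpec{Upstream: "https://example.com/graph", Channel: "fast", UpdateMode: UpdateModeRetrieve}
	fmt.Printf("%+v\n", spec)
}
```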
The CVO needs to perform some minimum sanity checking on incoming input and communicate to users when it is blocked. The previous mechanism of rejecting the ClusterVersion on deserialization is only partial validation and doesn't cover all scenarios. In the future we may want to have the CVO register as an admission webhook for its own resource.

Add validation immediately after the CVO loads the object from the cache, verifying that the object we see has no errors. If it does, write an Invalid condition to the status and reset the Progressing condition, then clear the invalid fields so that the sync loop doesn't act on them.

Simplify initial status setting by having the initial status check normalize the object status for a set of conditions, including available updates. This reduces the complexity of the CVO main loop slightly.
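A compressed sketch of the sequence this commit message describes: validate the object loaded from the cache, record an Invalid condition and reset Progressing when errors are found, and clear the offending fields before the rest of the sync acts. The types, condition names, and validation rule below are simplified stand-ins, not the repo's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

type Condition struct {
	Type    string
	Status  string
	Message string
}

// ClusterVersion is a simplified stand-in for the real API type.
type ClusterVersion struct {
	Upstream   string
	Conditions []Condition
}

func (cv *ClusterVersion) DeepCopy() *ClusterVersion {
	out := *cv
	out.Conditions = append([]Condition(nil), cv.Conditions...)
	return &out
}

// validate is a stand-in for the real field validation.
func validate(cv *ClusterVersion) []string {
	var errs []string
	if cv.Upstream != "" && !strings.HasPrefix(cv.Upstream, "https://") {
		errs = append(errs, "spec.upstream must be an https URL")
	}
	return errs
}

func setCondition(cv *ClusterVersion, c Condition) {
	for i := range cv.Conditions {
		if cv.Conditions[i].Type == c.Type {
			cv.Conditions[i] = c
			return
		}
	}
	cv.Conditions = append(cv.Conditions, c)
}

// syncOnce sketches the loop described above: copy, validate, mark Invalid,
// reset Progressing, and clear the invalid field before acting.
func syncOnce(cached *ClusterVersion) *ClusterVersion {
	obj := cached.DeepCopy()
	if errs := validate(obj); len(errs) > 0 {
		setCondition(obj, Condition{Type: "Invalid", Status: "True", Message: strings.Join(errs, ", ")})
		setCondition(obj, Condition{Type: "Progressing", Status: "False", Message: "the spec is invalid"})
		obj.Upstream = "" // clear the invalid field so the rest of the sync ignores it
	}
	return obj
}

func main() {
	got := syncOnce(&ClusterVersion{Upstream: "not-a-url"})
	fmt.Printf("%+v\n", got)
}
```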
Force-pushed from 9ff9b5c to ab4d84a.
I updated the tests, and also made sync_test use a customizable backoff so the tests complete instantly (that's in the api: reaction commit because the package changed).
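A sketch of the "customizable backoff" idea from the comment above, assuming the operator carries a k8s.io/apimachinery wait.Backoff as a field so production code keeps real delays while tests pass zero durations. The Operator struct and method here are hypothetical, not the repo's actual types.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Operator is a simplified stand-in; the point is that the retry backoff is a
// field instead of a hard-coded constant, so tests can zero it out.
type Operator struct {
	backoff wait.Backoff
}

// applyWithRetries retries the apply function using the configured backoff.
func (o *Operator) applyWithRetries(apply func() error) error {
	var lastErr error
	err := wait.ExponentialBackoff(o.backoff, func() (bool, error) {
		if lastErr = apply(); lastErr != nil {
			return false, nil // retry after the next backoff step
		}
		return true, nil
	})
	if err != nil {
		return lastErr
	}
	return nil
}

func main() {
	// Production-style backoff vs. an instant backoff for tests.
	prod := Operator{backoff: wait.Backoff{Duration: time.Second, Factor: 2, Steps: 5}}
	test := Operator{backoff: wait.Backoff{Duration: 0, Factor: 1, Steps: 3}}
	_ = prod

	calls := 0
	err := test.applyWithRetries(func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("transient failure %d", calls)
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>, with no real sleeping in the test case
}
```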
Thanks for that, I missed that in #57 😇 /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, smarterclayton.
/retest

/retest

Please review the full test history for this PR and help us cut down flakes.
This avoids a race between:
* The cluster-version operator pushing its internally-generated default and
* The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1]
Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
By removing the default, the CVO will continue applying the current
version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.). And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4]. I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6]. The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.
While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally. I haven't done that in this
commit, because [8] is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:
* Who deletes their ClusterVersion?
* Even if someone wanted to delete their ClusterVersion, we'll want to
eventually make the CVO an admission gate for ClusterVersion
changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
2018-11-18, openshift#55). So (once we set up that admission gate), someone
determined to remove ClusterVersion would have to disable that
admission config before they could remove the ClusterVersion. That
seems like enough steps that you wouldn't blow ClusterVersion away
by accident.
* If you do successfully remove your ClusterVersion, you're going to
lose all the status and history it contained. If you scale down
your CVO and then remove your ClusterVersion, it's going to be hard
to recover that lost information. If you left the CVO running, we
could presumably cache the whole thing in memory and push it back
into the cluster. That would at least avoid the "compiled-in CVO
defaults differ from user's choices" issue. But we'd still be
fighting with the user over the presence of ClusterVersion, and I'd
rather not.
* Telemetry will continue pushing reports with the last-seen
clusterID, so unless you disabled Telemetry first, future
"ClusterVersion is missing" alerts (not added in this commit, see
earlier paragraph mentioning alerts) would make it out to let us
know that something was being silly.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
This avoids a race between:
* The cluster-version operator pushing its internally-generated
default and
* The cluster-bootstrap process pushing the installer-generated
ClusterVersion into the cluster [1]
Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.). And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4]. I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6]. The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.
While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally. I haven't done that in this
commit, because [8] is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:
* Who deletes their ClusterVersion?
* Even if someone wanted to delete their ClusterVersion, we'll want to
eventually make the CVO an admission gate for ClusterVersion
changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
2018-11-18, openshift#55). So (once we set up that admission gate), someone
determined to remove ClusterVersion would have to disable that
admission config before they could remove the ClusterVersion. That
seems like enough steps that you wouldn't blow ClusterVersion away
by accident.
* If you do successfully remove your ClusterVersion, you're going to
lose all the status and history it contained. If you scale down
your CVO and then remove your ClusterVersion, it's going to be hard
to recover that lost information. If you left the CVO running, we
could presumably cache the whole thing in memory and push it back
into the cluster. That would at least avoid the "compiled-in CVO
defaults differ from user's choices" issue. But we'd still be
fighting with the user over the presence of ClusterVersion, and I'd
rather not.
* Telemetry will continue pushing reports with the last-seen
clusterID, so unless you disabled Telemetry first, future
"ClusterVersion is missing" alerts (not added in this commit, see
earlier paragraph mentioning alerts) would make it out to let us
know that something was being silly.
* We're only removing defaulting from the bootstrap CVO. Once we get
on to the production CVO, we'll still have our previous
default-ClusterVersion-injection behavior.
An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10]. But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
We want to be able to distinguish these conditions, which can be due to internal misconfiguration or external Cincinnati/network errors [1]. The former can be fixed by cluster admins. The latter could go either way.

I dropped the len(upstream) guard from checkForUpdate because there's already an earlier guard in syncAvailableUpdates. The guard I'm removing is from db150e6 (cvo: Perform status updates in a single thread, 2018-11-03, openshift#45). The covering guard is from the later 286641d (api: Update to objects from openshift/api, 2018-11-15, openshift#55).

Personally, I'd rather have GetUpdates return an *Error, so we could dispense with the cast and unused Unknown-reason fallback. But Abhinav wanted the explicit cast in return for a more familiar error type [2].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1685338
[2]: openshift#268 (comment)
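A sketch of the error-handling trade-off described above: GetUpdates returns a plain error, so the caller casts back to a reason-carrying *Error and falls back to an Unknown reason when the cast fails. The names and fields here are simplified assumptions, not the repo's actual types.

```go
package main

import "fmt"

// Error is a stand-in for a reason-carrying error type: the Reason feeds a
// status condition, the Message is for humans.
type Error struct {
	Reason  string
	Message string
}

func (e *Error) Error() string { return e.Message }

// getUpdates returns a plain error, so callers must cast to recover the reason.
func getUpdates(upstream string) ([]string, error) {
	if upstream == "" {
		return nil, &Error{Reason: "NoUpstream", Message: "no upstream server has been set to retrieve updates"}
	}
	return []string{"4.0.1"}, nil
}

func main() {
	_, err := getUpdates("")

	// The explicit cast with an Unknown fallback, as described above.
	reason := "Unknown"
	if typedErr, ok := err.(*Error); ok {
		reason = typedErr.Reason
	}
	fmt.Println(reason, err)
}
```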
The validation logic checks whether the incoming object is valid and, if it isn't, sets the Invalid condition on the object. It then ensures that the config spec has any of the invalid "optional" fields cleared. This is somewhat novel, but it ensures we don't accidentally take action to move the version. An invalid object reconciles, but doesn't progress.

As part of this change, simplify the initial status sync into something that feels more level driven (ensure we always have conditions populated), which makes the CVO loop a bit easier to read.
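A minimal sketch of the "level driven" initialization mentioned above: every pass makes sure a fixed set of conditions exists, adding missing ones with an Unknown status, instead of relying on a one-off initial-status branch. The exact set of condition types below is an assumption for illustration.

```go
package main

import "fmt"

type Condition struct {
	Type   string
	Status string
}

type ClusterVersionStatus struct {
	Conditions []Condition
}

// ensureDefaultConditions adds any missing well-known conditions with an
// Unknown status, so later code can assume they are always populated.
func ensureDefaultConditions(status *ClusterVersionStatus) {
	required := []string{"Available", "Progressing", "Failing", "RetrievedUpdates"}
	for _, t := range required {
		found := false
		for _, c := range status.Conditions {
			if c.Type == t {
				found = true
				break
			}
		}
		if !found {
			status.Conditions = append(status.Conditions, Condition{Type: t, Status: "Unknown"})
		}
	}
}

func main() {
	st := ClusterVersionStatus{Conditions: []Condition{{Type: "Available", Status: "True"}}}
	ensureDefaultConditions(&st)
	fmt.Printf("%+v\n", st)
}
```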