*: read runtime parameters from CVOConfig #2
Merged
Conversation
This is useful for debugging.
It turns out that the order of the comments was incorrect and this mistake was silently ignored. This is precisely why you don't give meaning to comments.
This is very useful for debugging.
```go
// actually require a deep copy, but the code generator (and Go itself) isn't
// advanced enough to determine that.
func (c *CVOConfig) DeepCopyInto(out *CVOConfig) {
	*out = *c
```
Contributor: This does explicitly call deepcopy on ObjectMeta.
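For context on that remark, a generated deepcopy for a type embedding metav1.ObjectMeta delegates to ObjectMeta's own DeepCopyInto after the shallow copy. A minimal sketch of what deepcopy-gen emits (the trimmed-down CVOConfig here is illustrative, not the full type):

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CVOConfig is trimmed down to its metadata for this sketch; the real
// type carries the runtime parameters discussed in this pull request.
type CVOConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

// DeepCopyInto mirrors the generated shape: a shallow copy of the whole
// struct, then an explicit deep copy of ObjectMeta (the point the
// review comment above is confirming).
func (c *CVOConfig) DeepCopyInto(out *CVOConfig) {
	*out = *c
	out.TypeMeta = c.TypeMeta
	c.ObjectMeta.DeepCopyInto(&out.ObjectMeta)
}
```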
abhinavdahiya approved these changes on Aug 2, 2018.
This struct contains the runtime parameters for the CVO. Instead of hard-coding some parameters and passing others as flags, the operator periodically reads this information from the CVOConfig.
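As a rough sketch of that loop (the names and parameter set here are illustrative, not the operator's actual code), the CVO can poll the config on a ticker and atomically swap in the latest values:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// runtimeParams stands in for the parameters carried by CVOConfig;
// the field names are hypothetical.
type runtimeParams struct {
	Channel  string
	Upstream string
}

// watchConfig polls fetch on the given interval and atomically swaps in
// the latest parameters, so readers never block on a refresh and a
// failed fetch keeps the previous values.
func watchConfig(current *atomic.Value, fetch func() (runtimeParams, error), interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p, err := fetch()
			if err != nil {
				fmt.Println("config fetch failed, keeping previous parameters:", err)
				continue
			}
			current.Store(p)
		case <-stop:
			return
		}
	}
}

func main() {
	var current atomic.Value
	current.Store(runtimeParams{Channel: "fast"})
	stop := make(chan struct{})
	// A stand-in fetch; the real operator reads the CVOConfig object
	// from the cluster instead.
	fetch := func() (runtimeParams, error) {
		return runtimeParams{Channel: "stable-4.2", Upstream: "https://example.com/graph"}, nil
	}
	go watchConfig(&current, fetch, 50*time.Millisecond, stop)
	time.Sleep(120 * time.Millisecond)
	fmt.Println(current.Load().(runtimeParams).Channel)
	close(stop)
}
```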
wking added a commit to wking/cluster-version-operator that referenced this pull request on Aug 16, 2019:
This avoids a race between:
* The cluster-version operator pushing its internally-generated default and
* The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1]
Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
By removing the default, the CVO will continue applying the current
version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.). And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4]. I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6]. The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.
While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally. I haven't done that in this
commit, because [8] is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:
* Who deletes their ClusterVersion?
* Even if someone wanted to delete their ClusterVersion, we'll want to
eventually make the CVO an admission gate for ClusterVersion
changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
2018-11-18, openshift#55). So (once we set up that admission gate), someone
determined to remove ClusterVersion would have to disable that
admission config before they could remove the ClusterVersion. That
seems like enough steps that you wouldn't blow ClusterVersion away
by accident.
* If you do successfully remove your ClusterVersion, you're going to
lose all the status and history it contained. If you scale down
your CVO and then remove your ClusterVersion, it's going to be hard
to recover that lost information. If you left the CVO running, we
could presumably cache the whole thing in memory and push it back
into the cluster. That would at least avoid the "compiled-in CVO
defaults differ from user's choices" issue. But we'd still be
fighting with the user over the presence of ClusterVersion, and I'd
rather not.
* Telemetry will continue pushing reports with the last-seen
clusterID, so unless you disabled Telemetry first, future
"ClusterVersion is missing" alerts (not added in this commit, see
earlier paragraph mentioning alerts) would make it out to let us
know that something was being silly.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
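The create-only behavior that makes this a race can be shown with a self-contained sketch (a map stands in for the API server; no client-go here): whichever writer creates the object first wins, and the later create fails with an already-exists error, which cluster-bootstrap treats as nothing to do.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("already exists")

// store mimics the API server's create-only behavior that
// cluster-bootstrap relies on: create fails if the name is taken.
type store struct {
	mu      sync.Mutex
	objects map[string]string
}

func (s *store) create(name, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.objects[name]; ok {
		return errAlreadyExists
	}
	s.objects[name] = value
	return nil
}

func main() {
	s := &store{objects: map[string]string{}}
	// If the CVO's default lands first, cluster-bootstrap's
	// installer-generated ClusterVersion is silently dropped.
	_ = s.create("clusterversion/version", "channel: fast")          // CVO default wins the race
	err := s.create("clusterversion/version", "channel: stable-4.2") // cluster-bootstrap loses
	fmt.Println(s.objects["clusterversion/version"], err)            // channel: fast already exists
}
```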
wking added further commits to wking/cluster-version-operator that referenced this pull request on Aug 17 and Aug 18, 2019, with the same commit message.
wking added a commit to wking/cluster-version-operator that referenced this pull request on Aug 21, 2019:
This avoids a race between:
* The cluster-version operator pushing its internally-generated
default and
* The cluster-bootstrap process pushing the installer-generated
ClusterVersion into the cluster [1]
Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.). And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4]. I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6]. The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.
While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally. I haven't done that in this
commit, because [8] is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:
* Who deletes their ClusterVersion?
* Even if someone wanted to delete their ClusterVersion, we'll want to
eventually make the CVO an admission gate for ClusterVersion
changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
2018-11-18, openshift#55). So (once we set up that admission gate), someone
determined to remove ClusterVersion would have to disable that
admission config before they could remove the ClusterVersion. That
seems like enough steps that you wouldn't blow ClusterVersion away
by accident.
* If you do successfully remove your ClusterVersion, you're going to
lose all the status and history it contained. If you scale down
your CVO and then remove your ClusterVersion, it's going to be hard
to recover that lost information. If you left the CVO running, we
could presumably cache the whole thing in memory and push it back
into the cluster. That would at least avoid the "compiled-in CVO
defaults differ from user's choices" issue. But we'd still be
fighting with the user over the presence of ClusterVersion, and I'd
rather not.
* Telemetry will continue pushing reports with the last-seen
clusterID, so unless you disabled Telemetry first, future
"ClusterVersion is missing" alerts (not added in this commit, see
earlier paragraph mentioning alerts) would make it out to let us
know that something was being silly.
* We're only removing defaulting from the bootstrap CVO. Once we get
on to the production CVO, we'll still have our previous
default-ClusterVersion-injection behavior.
An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10]. But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
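For a concrete picture of the metrics mentioned above, a cluster_version gauge with a type label can be modeled with the Prometheus Go client along these lines (an illustrative sketch with made-up values, not the CVO's actual registration code):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// cluster_version is labeled by "type" ("current", "failure", ...)
	// and by the version string; the values below are made up.
	clusterVersion := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "cluster_version",
		Help: "Reports the cluster version by type.",
	}, []string{"type", "version"})
	prometheus.MustRegister(clusterVersion)

	// "current" comes from the payload the CVO is applying, so it can
	// still be reported when the ClusterVersion object is absent...
	clusterVersion.WithLabelValues("current", "4.2.0").Set(1)
	// ...whereas "failure" and past versions depend on state stored in
	// the ClusterVersion object, so without it they go unreported.

	fmt.Println("registered cluster_version gauge")
}
```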
wking added another commit on Aug 21, 2019, and openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request on Aug 21, 2019, both with the same commit message.