@crawford crawford commented Aug 1, 2018

Instead of hard-coding parameters and passing others as flags,
periodically read this information from the CVOConfig.

It turns out that the order of the comments was incorrect and this
mistake was silently ignored. This is precisely why you don't give
meaning to comments.
This is very useful for debugging.
@crawford crawford requested a review from abhinavdahiya August 1, 2018 23:34
@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 1, 2018
// The fields of CVOConfig don't
// actually require a deep copy, but the code generator (and Go itself) isn't
// advanced enough to determine that.
func (c *CVOConfig) DeepCopyInto(out *CVOConfig) {
	*out = *c
}
Does explicitly call deepcopy on ObjectMeta.

This struct contains the runtime parameters to the CVO.
Instead of hard-coding parameters and passing others as flags,
periodically read this information from the CVOConfig.
@crawford crawford merged commit 16b1dd0 into openshift:master Aug 2, 2018
@crawford crawford deleted the config branch August 2, 2018 01:04
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 16, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated default and
* The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
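The create-if-absent race described above can be sketched in Go. This is an illustrative model, not the actual cluster-bootstrap or CVO code: the store, field names, and channel/clusterID strings are all assumptions standing in for the real Kubernetes API semantics, where a later Create of an existing object fails with AlreadyExists.

```go
package main

import (
	"fmt"
	"sync"
)

// store mimics the API server plus cluster-bootstrap's policy of never
// overriding already-existing objects: the first writer wins, and later
// creates are no-ops.
type store struct {
	mu      sync.Mutex
	objects map[string]string
}

// createIfAbsent returns true only if this call actually created the object.
func (s *store) createIfAbsent(name, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.objects[name]; exists {
		return false // object already present; do not override
	}
	s.objects[name] = value
	return true
}

func main() {
	s := &store{objects: map[string]string{}}

	// If the CVO's internally-generated default lands first...
	s.createIfAbsent("ClusterVersion", "channel=fast clusterID=<random UUID>")
	// ...the installer-generated manifest pushed by cluster-bootstrap loses:
	created := s.createIfAbsent("ClusterVersion", "channel=stable-4.2 clusterID=<metadata.json UUID>")

	fmt.Println("bootstrap create applied:", created)
	fmt.Println("winning object:", s.objects["ClusterVersion"])
}
```

Whichever writer creates the object first determines the channel and clusterID the cluster comes up with, which is why dropping the CVO's default removes the race entirely.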

By removing the default, the CVO will continue applying the current
version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 17, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 18, 2019
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.
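The race this change avoids boils down to two writers with create-if-absent semantics: whoever pushes first wins, and cluster-bootstrap never overrides an existing object. A minimal sketch (an in-memory map stands in for the API server; the object names and values are illustrative):

```go
package main

import "fmt"

// store stands in for the cluster's API server.
type store map[string]string

// createIfAbsent mirrors cluster-bootstrap's behavior of not
// overriding already-existing objects: the first writer wins and
// later writes are silently dropped.
func createIfAbsent(s store, name, value string) bool {
	if _, exists := s[name]; exists {
		return false
	}
	s[name] = value
	return true
}

func main() {
	s := store{}
	// If the CVO's compiled-in default lands first...
	createIfAbsent(s, "version", "channel: fast")
	// ...the installer-generated manifest is silently dropped.
	createIfAbsent(s, "version", "channel: stable-4.2")
	fmt.Println(s["version"]) // prints "channel: fast"
}
```

Removing the bootstrap CVO's default leaves only one writer, so the installer-generated ClusterVersion always wins.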

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Aug 21, 2019