*: read runtime parameters from CVOConfig #2

crawford · 2018-08-01T23:34:42Z

Instead of hard-coding parameters and passing others as flags,
periodically read this information from the CVOConfig.

This is useful for debugging.

It turns out that the order of the comments was incorrect and this mistake was silently ignored. This is precisely why you don't give meaning to comments.

This is very useful for debugging.

abhinavdahiya · 2018-08-02T00:49:19Z

pkg/apis/clusterversion.openshift.io/v1/cvoconfig.go

+// actually require a deep copy, but the code generator (and Go itself) isn't
+// advanced enough to determine that.
+func (c *CVOConfig) DeepCopyInto(out *CVOConfig) {
+	*out = *c


Does explictly call deepcopy on objectmeta.

This struct contains the runtime parameters to the CVO.

Instead of hard-coding parameters and passing others as flags, periodically read this information from the CVOConfig.

This avoids a race between: * The cluster-version operator pushing its internally-generated default and * The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1] Because cluster-bootstrap does not override already-existing objects, when the CVO's default won that race, the cluster would come up with a bogus channel ('fast', instead of the installer's default 'stable-4.2' etc.) and clusterID (causing the reported Telemetry to come in under a different UUID than the one the installer provided in its metadata.json output [2]). By removing the default, the CVO will continue applying the current version and reporting some metrics like cluster_version{type="current"}. But it will not report metrics that depend on the state stored in the ClusterVersion object (past versions, cluster_version{type="failure"}, etc.). And even the metrics it does provide may not make it up to the Telemetry API because the monitoring operator will not be able to pass the Telemeter container the expected clusterID [3,4]. I'm not clear on how the Telemeter config is updated when the ClusterVersion's clusterID *does* change [5], but the monitoring operator is willing to continue without it [6]. The monitoring operator only updates the Telemeter Deployment to *set* (or update) the cluster ID [7]; it never clears an existing cluster ID. While we won't get Telemetry out of the cluster until the cluster-bootstrap process (or somebody else) pushes the ClusterVersion in, we can still alert on it locally. I haven't done that in this commit, because [8] is in flight touching this space. The defaulting logic dates back to the early CVOConfig work in ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2). Clayton expressed concern about recovering from ClusterVersion deletion without the default [9], but: * Who deletes their ClusterVersion? * Even if someone wanted to delete their ClusterVersion, we'll want to eventually make the CVO an admission gate for ClusterVersion changes, ab4d84a (cvo: Validate the CVO prior to acting on it, 2018-11-18, openshift#55). So (once we set up that admission gate), someone determined to remove ClusterVersion would have to disable that admission config before they could remove the ClusterVersion. That seems like enough steps that you wouldn't blow ClusterVersion away by accident. * If you do successfully remove your ClusterVersion, you're going to lose all the status and history it contained. If you scale down your CVO and then remove your ClusterVersion, it's going to be hard to recover that lost information. If you left the CVO running, we could presumably cache the whole thing in memory and push it back into the cluster. That would at least avoid the "compiled-in CVO defaults differ from user's choices" issue. But we'd still be fighting with the user over the presence of ClusterVersion, and I'd rather not. * Telemetry will continue pushing reports with the last-seen clusterID, so unless you disabled Telemetry first, future "ClusterVersion is missing" alerts (not added in this commit, see earlier paragraph mentioning alerts) would make it out to let us know that something was being silly. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786 [2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18 [3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62 [4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229 [5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370 [6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376 [7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763 [8]: openshift#232 [9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12

This avoids a race between: * The cluster-version operator pushing its internally-generated default and * The cluster-bootstrap process pushing the installer-generated ClusterVersion into the cluster [1] Because cluster-bootstrap does not override already-existing objects, when the CVO's default won that race, the cluster would come up with a bogus channel ('fast', instead of the installer's default 'stable-4.2' etc.) and clusterID (causing the reported Telemetry to come in under a different UUID than the one the installer provided in its metadata.json output [2]). By removing the default (during bootstrap), the CVO will continue applying the current version and reporting some metrics like cluster_version{type="current"}. But it will not report metrics that depend on the state stored in the ClusterVersion object (past versions, cluster_version{type="failure"}, etc.). And even the metrics it does provide may not make it up to the Telemetry API because the monitoring operator will not be able to pass the Telemeter container the expected clusterID [3,4]. I'm not clear on how the Telemeter config is updated when the ClusterVersion's clusterID *does* change [5], but the monitoring operator is willing to continue without it [6]. The monitoring operator only updates the Telemeter Deployment to *set* (or update) the cluster ID [7]; it never clears an existing cluster ID. While we won't get Telemetry out of the cluster until the cluster-bootstrap process (or somebody else) pushes the ClusterVersion in, we can still alert on it locally. I haven't done that in this commit, because [8] is in flight touching this space. The defaulting logic dates back to the early CVOConfig work in ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2). Clayton expressed concern about recovering from ClusterVersion deletion without the default [9], but: * Who deletes their ClusterVersion? * Even if someone wanted to delete their ClusterVersion, we'll want to eventually make the CVO an admission gate for ClusterVersion changes, ab4d84a (cvo: Validate the CVO prior to acting on it, 2018-11-18, openshift#55). So (once we set up that admission gate), someone determined to remove ClusterVersion would have to disable that admission config before they could remove the ClusterVersion. That seems like enough steps that you wouldn't blow ClusterVersion away by accident. * If you do successfully remove your ClusterVersion, you're going to lose all the status and history it contained. If you scale down your CVO and then remove your ClusterVersion, it's going to be hard to recover that lost information. If you left the CVO running, we could presumably cache the whole thing in memory and push it back into the cluster. That would at least avoid the "compiled-in CVO defaults differ from user's choices" issue. But we'd still be fighting with the user over the presence of ClusterVersion, and I'd rather not. * Telemetry will continue pushing reports with the last-seen clusterID, so unless you disabled Telemetry first, future "ClusterVersion is missing" alerts (not added in this commit, see earlier paragraph mentioning alerts) would make it out to let us know that something was being silly. * We're only removing defaulting from the bootstrap CVO. Once we get on to the production CVO, we'll still have our previous default-ClusterVersion-injection behavior. An alternative approach we considered was passing a default cluster ID in via a CVO command line option [10]. But Abhinav wants bootstrap operators to be loading their config from on-disk manifests as much as possible [11], and this commit effectively ensures we load ClusterVersion from the installer-provided manifest during bootstrapping. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786 [2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18 [3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62 [4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229 [5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370 [6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376 [7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763 [8]: openshift#232 [9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12 [10]: openshift#239 [11]: openshift#239 (comment)

crawford added 3 commits August 1, 2018 16:34

build: allow update-codegen to pass along flags

b5e72f1

This is useful for debugging.

apis/cvo: update code generator comments

7101e34

It turns out that the order of the comments was incorrect and this mistake was silently ignored. This is precisely why you don't give meaning to comments.

main: print version in the logs

ed8a0e9

This is very useful for debugging.

crawford requested a review from abhinavdahiya August 1, 2018 23:34

openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 1, 2018

abhinavdahiya reviewed Aug 2, 2018

View reviewed changes

abhinavdahiya approved these changes Aug 2, 2018

View reviewed changes

crawford added 4 commits August 1, 2018 18:03

apis/cvo: introduce CVOConfig type

35c854b

This struct contains the runtime parameters to the CVO.

*: update generated code

e311065

cvo: refactor worker loops into their own files

6e21061

*: read runtime parameters from CVOConfig

ea678b1

Instead of hard-coding parameters and passing others as flags, periodically read this information from the CVOConfig.

crawford merged commit 16b1dd0 into openshift:master Aug 2, 2018

crawford deleted the config branch August 2, 2018 01:04

wking mentioned this pull request Aug 16, 2019

Bug 1741786: pkg/cvo: Drop ClusterVersion defaulting during bootstrap #238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

*: read runtime parameters from CVOConfig #2

*: read runtime parameters from CVOConfig #2

Uh oh!

crawford commented Aug 1, 2018

Uh oh!

abhinavdahiya Aug 2, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

*: read runtime parameters from CVOConfig #2

*: read runtime parameters from CVOConfig #2

Uh oh!

Conversation

crawford commented Aug 1, 2018

Uh oh!

abhinavdahiya Aug 2, 2018

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants