
Conversation

@wking
Member

@wking wking commented Aug 19, 2019

As an alternative to:

* removing the ClusterVersion defaulting, or
* moving the telemetry ID out into a Telemetry-specific config object,

this commit allows us to address the race between the installer-created ClusterVersion and the cluster-version-operator-generated default ClusterVersion by removing the former while still allowing an installer-generated Telemetry ID.

We currently need an installer-generated Telemetry ID to support things like OpenShift Dedicated (OSD), where we use the telemetry ID that the installer exposes in metadata.json to feed the cluster manager. There is concern that if Hive extracted the Telemetry ID from the running cluster (vs. feeding in an installer-chosen ID), this would race with the cluster's reported Telemetry and that, for example, Hive-launched OSD Telemetry might be erroneously attributed to a user-provisioned cluster.

I'm less clear on why we don't prefer pushing the Telemetry ID into a Telemetry-specific config. It seems like it would be less work than this commit (which has a lot of middleman plumbing). I'm not entirely clear on the API we're committed to supporting for ClusterVersion's spec.clusterID property. But this CVO plumbing was what we were able to form a good-enough-for-now consensus around, and we can come back later with a Telemetry-config approach if we can build consensus around that.

Also update the default channel from fast to stable-4.2; if we're going to rely on the CVO-generated default, we need it to have a channel that actually exists.

I've added the WIP prefix because there are two FIXMEs where I'm still working out the plumbing.
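
For concreteness, here's a minimal sketch of how a --telemetry-id flag could feed the clusterID of the CVO-generated default ClusterVersion. The flag name matches this PR, but the options struct and helper below are hypothetical; this is a sketch, not the actual diff.

    // Hypothetical sketch of the --telemetry-id plumbing: when the flag is set,
    // its value becomes the clusterID of the default ClusterVersion the CVO
    // would push; otherwise the CVO keeps minting a fresh UUID.
    package main

    import (
        "flag"
        "fmt"

        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/uuid"
    )

    type options struct {
        telemetryID string
    }

    func (o *options) addFlags(fs *flag.FlagSet) {
        fs.StringVar(&o.telemetryID, "telemetry-id", "", "optional cluster ID to seed into the default ClusterVersion")
    }

    func (o *options) defaultClusterVersion() *configv1.ClusterVersion {
        id := o.telemetryID
        if id == "" {
            id = string(uuid.NewUUID()) // no installer-provided ID; mint one
        }
        return &configv1.ClusterVersion{
            ObjectMeta: metav1.ObjectMeta{Name: "version"},
            Spec: configv1.ClusterVersionSpec{
                ClusterID: configv1.ClusterID(id),
                Channel:   "stable-4.2",
            },
        }
    }

    func main() {
        o := &options{}
        o.addFlags(flag.CommandLine)
        flag.Parse()
        fmt.Printf("default ClusterVersion would use clusterID %s\n", o.defaultClusterVersion().Spec.ClusterID)
    }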

wking added 2 commits August 19, 2019 13:55
As an alternative to:

* removing the ClusterVersion defaulting [1], or
* moving the telemetry ID out into a Telemetry-specific config object [2]

this commit allows us to address the race between the
installer-created ClusterVersion and the
cluster-version-operator-generated default ClusterVersion by removing
the former while still allowing an installer-generated Telemetry ID.

We currently need an installer-generated Telemetry ID to support
things like OpenShift Dedicated (OSD) [3], where we use the telemetry
ID that the installer exposes in metadata.json [4] to feed the cluster
manager [5].  There is concern that if Hive [6] extracted the
Telemetry ID from the running cluster (vs. feeding in an
installer-chosen ID), this would race with the cluster's reported
Telemetry and that, for example, Hive-launched OSD Telemetry might be
erroneously attributed to a user-provisioned cluster.

I'm less clear on why we don't prefer pushing the Telemetry ID into a
Telemetry-specific config.  It seems like it would be less work than
this commit (which has a lot of middleman plumbing).  I'm not entirely
clear on the API we're committed to supporting for ClusterVersion's
spec.clusterID property.  But this CVO plumbing was what we were able
to form a good-enough-for-now consensus around, and we can come back
later with a Telemetry-config approach if we can build consensus
around that.

[1]: openshift#238
[2]: openshift#238 (comment)
[3]: https://www.openshift.com/products/dedicated/
[4]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[5]: https://cloud.redhat.com/openshift/
[6]: https://github.com/openshift/hive/
The latter (stable-4.2) actually exists, while the former (fast) does
not.  Generated with:

  $ sed -i 's/"fast"/"stable-4.2"/g' $(git grep -l '"fast"')

Using a real channel sets the stage for the installer setting the new
--telemetry-id and removing its ClusterVersion in favor of the default
pushed by the cluster-version operator in getOrCreateClusterVersion.
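
For readers unfamiliar with getOrCreateClusterVersion, here is a trimmed-down sketch of the get-or-create pattern being referenced; the client interface is a hypothetical stand-in, not the CVO's actual code.

    // Sketch of the pattern: if no ClusterVersion exists, the CVO pushes its
    // compiled-in default.  During bootstrap that default can win the race
    // against the installer-provided manifest, which is the problem this PR
    // is working around.
    package sketch

    import (
        apierrors "k8s.io/apimachinery/pkg/api/errors"

        configv1 "github.com/openshift/api/config/v1"
    )

    // clusterVersionClient is a hypothetical stand-in for the real typed client.
    type clusterVersionClient interface {
        Get(name string) (*configv1.ClusterVersion, error)
        Create(cv *configv1.ClusterVersion) (*configv1.ClusterVersion, error)
    }

    func getOrCreateClusterVersion(c clusterVersionClient, def *configv1.ClusterVersion) (*configv1.ClusterVersion, error) {
        cv, err := c.Get(def.Name)
        if err == nil {
            return cv, nil // something (installer manifest or an earlier default) already exists
        }
        if !apierrors.IsNotFound(err) {
            return nil, err
        }
        return c.Create(def) // nothing exists yet: push the compiled-in default
    }
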
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 19, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1741786, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

WIP: Bug 1741786: cmd: Add --telemetry-id option

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 19, 2019
@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Details  Rerun command
ci/prow/unit             918c785  link     /test unit
ci/prow/e2e-aws-upgrade  918c785  link     /test e2e-aws-upgrade
ci/prow/e2e-aws          918c785  link     /test e2e-aws
ci/prow/images           918c785  link     /test images
ci/prow/integration      918c785  link     /test integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor

The clusterversion.spec.clusterID is the UUID for the cluster, and it is used for more than telemetry; for example, the Cincinnati stack uses it to identify one cluster from another. So I'm against the telemetry-id option.

Here's what I would rather see.

  1. The bootstrap-render finds the clusterversion object from the directory... and finds the clusterID that's required for the cluster (see the sketch after this list).
  2. Only the bootstrap cluster-version-operator pod is given the clusterID from the flag, to ensure the bootstrapping is correct.
  3. The in-cluster cluster-version-operator stays intact with the existing behavior.
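
Roughly, step 1 could look like the sketch below (the helper name and file handling are made up; this is not the actual render code):

    // findClusterID scans a bootstrap manifests directory for a ClusterVersion
    // object and returns its spec.clusterID.  Hypothetical helper, not the
    // actual bootstrap-render code.
    package sketch

    import (
        "fmt"
        "io/ioutil"
        "path/filepath"
        "strings"

        configv1 "github.com/openshift/api/config/v1"
        "sigs.k8s.io/yaml"
    )

    func findClusterID(manifestDir string) (configv1.ClusterID, error) {
        files, err := ioutil.ReadDir(manifestDir)
        if err != nil {
            return "", err
        }
        for _, f := range files {
            if f.IsDir() || (!strings.HasSuffix(f.Name(), ".yaml") && !strings.HasSuffix(f.Name(), ".yml")) {
                continue
            }
            data, err := ioutil.ReadFile(filepath.Join(manifestDir, f.Name()))
            if err != nil {
                return "", err
            }
            var cv configv1.ClusterVersion
            if err := yaml.Unmarshal(data, &cv); err != nil {
                continue // not every manifest parses as a ClusterVersion; skip it
            }
            if cv.Kind == "ClusterVersion" && cv.Spec.ClusterID != "" {
                return cv.Spec.ClusterID, nil
            }
        }
        return "", fmt.Errorf("no ClusterVersion with spec.clusterID found in %s", manifestDir)
    }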

@wking
Member Author

wking commented Aug 19, 2019

The clusterversion.spec.clusterID is the UUID for the cluster, and it is used for more than telemetry; for example, the Cincinnati stack uses it to identify one cluster from another.

Ah, I'd forgotten about Cincinnati. Are we aware of any other consumers?

  2. Only the bootstrap cluster-version-operator pod is given the clusterID from the flag, to ensure the bootstrapping is correct.

This does not allow us to recover from "I accidentally deleted my ClusterVersion", because the CVO would generate a fresh one with a fresh UUID. If we store the current cluster ID in memory, we can recover from that unless the user kills their CVO and removes the ClusterVersion while it's down. Do we care about blocking that hole? Plumbing the option through would certainly be easier if we leave it open.
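
Roughly, the in-memory fallback would look like this (field and method names are made up, not the CVO's real internals):

    // Hypothetical sketch: remember the last-seen clusterID so that, if the
    // ClusterVersion object is deleted while the CVO is running, the recreated
    // default keeps the same ID instead of minting a fresh UUID.
    package sketch

    import (
        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/uuid"
    )

    type operator struct {
        lastClusterID configv1.ClusterID // updated every time we observe a ClusterVersion
    }

    func (o *operator) observe(cv *configv1.ClusterVersion) {
        if cv.Spec.ClusterID != "" {
            o.lastClusterID = cv.Spec.ClusterID
        }
    }

    func (o *operator) defaultClusterVersion() *configv1.ClusterVersion {
        id := o.lastClusterID
        if id == "" {
            id = configv1.ClusterID(uuid.NewUUID()) // first boot: no prior ID to reuse
        }
        return &configv1.ClusterVersion{
            ObjectMeta: metav1.ObjectMeta{Name: "version"},
            Spec:       configv1.ClusterVersionSpec{ClusterID: id, Channel: "stable-4.2"},
        }
    }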

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
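
To make the approach above concrete, here is a minimal sketch of gating the compiled-in default off during bootstrap; the enableDefault field and trimmed-down client interface are assumptions, not the actual change in the linked commits.

    // Hypothetical sketch of "remove the default during bootstrap": the
    // bootstrap CVO stops pushing its compiled-in ClusterVersion, so the
    // installer-provided manifest pushed by cluster-bootstrap is the only
    // source, while the production CVO keeps the old defaulting behavior.
    package sketch

    import (
        apierrors "k8s.io/apimachinery/pkg/api/errors"

        configv1 "github.com/openshift/api/config/v1"
    )

    type clusterVersionClient interface {
        Get(name string) (*configv1.ClusterVersion, error)
        Create(cv *configv1.ClusterVersion) (*configv1.ClusterVersion, error)
    }

    type operator struct {
        enableDefault bool // false on the bootstrap CVO, true in-cluster
        client        clusterVersionClient
        defaultCV     *configv1.ClusterVersion
    }

    func (o *operator) getOrCreateClusterVersion() (*configv1.ClusterVersion, error) {
        cv, err := o.client.Get("version")
        if err == nil {
            return cv, nil
        }
        if apierrors.IsNotFound(err) && !o.enableDefault {
            // Bootstrap mode: report NotFound and let cluster-bootstrap push
            // the installer's ClusterVersion manifest instead.
            return nil, err
        }
        if apierrors.IsNotFound(err) {
            return o.client.Create(o.defaultCV) // production mode: old behavior
        }
        return nil, err
    }
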
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
@wking
Copy link
Member Author

wking commented Aug 21, 2019

Closing now that #238 has new legs.

@wking wking closed this Aug 21, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Aug 21, 2019