WIP: Bug 1741786: cmd: Add --telemetry-id option #239
As an alternative to:

* removing the ClusterVersion defaulting [1], or
* moving the telemetry ID out into a Telemetry-specific config object [2],

this commit addresses the race between the installer-created ClusterVersion and the cluster-version-operator-generated default ClusterVersion by removing the former while still allowing an installer-generated Telemetry ID.

We currently need an installer-generated Telemetry ID to support things like OpenShift Dedicated (OSD) [3], where we use the telemetry ID that the installer exposes in metadata.json [4] to feed the cluster manager [5]. There is concern that if Hive [6] extracted the Telemetry ID from the running cluster (vs. feeding in an installer-chosen ID), it would race with the cluster's reported Telemetry and that, for example, Hive-launched OSD Telemetry might be erroneously claimed to come from a user-provisioned cluster.

I'm less clear on why we don't prefer pushing the Telemetry ID into a Telemetry-specific config. It seems like it would be less work than this commit (which has a lot of middleman plumbing). I'm not entirely clear on the API we're committed to supporting for ClusterVersion's spec.clusterID property. But this CVO plumbing was what we were able to form a good-enough-for-now consensus around, and we can come back later with a Telemetry-config approach if we can build consensus around that.

Also update the default channel from fast to stable-4.2; if we're going to rely on the CVO-generated default, we need it to have a channel that actually exists. I've added the WIP subject because there are still two FIXMEs where I'm still working out the plumbing.

[1]: openshift#238
[2]: openshift#238 (comment)
[3]: https://www.openshift.com/products/dedicated/
[4]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[5]: https://cloud.redhat.com/openshift/
[6]: https://github.com/openshift/hive/
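The PR body doesn't show the plumbing itself, so here is a minimal sketch of what a `--telemetry-id` option could look like; the cobra wiring, flag help text, and print statement below are illustrative assumptions, not the actual patch:

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var telemetryID string

	cmd := &cobra.Command{
		Use: "cluster-version-operator",
		Run: func(cmd *cobra.Command, args []string) {
			// When set, the installer-chosen ID seeds the default
			// ClusterVersion's spec.clusterID instead of a random UUID,
			// so Telemetry and metadata.json agree from first boot.
			fmt.Println("seeding default ClusterVersion clusterID:", telemetryID)
		},
	}
	cmd.Flags().StringVar(&telemetryID, "telemetry-id", "",
		"installer-generated cluster ID to use as the default spec.clusterID")
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```

The idea is that the installer would pass the same UUID it records in metadata.json, keeping the CVO's default ClusterVersion and the cluster manager in agreement.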
This swaps the default channel from fast to stable-4.2; the latter actually exists, while the former does not. Generated with:

```console
$ sed -i 's/"fast"/"stable-4.2"/g' $(git grep -l '"fast"')
```

Using a real channel sets the stage for the installer setting the new --telemetry-id and removing its ClusterVersion in favor of the default pushed by the cluster-version operator in getOrCreateClusterVersion.
@wking: This pull request references Bugzilla bug 1741786, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@wking: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The clusterversion.spec.clusterID is the UUID for the cluster, and it is used for more things than telemetry; for example, the Cincinnati stack uses it to distinguish one cluster from another. So I'm against the telemetry-id option. Here's what I would rather see.
Ah, I'd forgotten about Cincinnati. Are we aware of any other consumers?
This does not allow us to recover from "I accidentally deleted my ClusterVersion", because the CVO would generate a fresh one with a fresh UUID. If we store the current cluster ID in memory, we can recover from that unless the user kills their CVO and removes the ClusterVersion while it's down. Do we care about blocking that hole? Plumbing the option through would certainly be easier if we leave it open.
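A minimal sketch of that in-memory fallback, assuming a periodic sync loop; the syncer type, names, and signatures are illustrative, not the CVO's actual code:

```go
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclientv1 "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// syncer caches the most recently observed ClusterVersion so a deleted
// object can be re-created with its original clusterID rather than a
// fresh UUID.
type syncer struct {
	client   configclientv1.ClusterVersionInterface
	lastSeen *configv1.ClusterVersion
}

func (s *syncer) sync(ctx context.Context, name string) error {
	cv, err := s.client.Get(ctx, name, metav1.GetOptions{})
	if err == nil {
		s.lastSeen = cv.DeepCopy() // remember the user's choices
		return nil
	}
	// A restarted CVO has an empty cache, so this only closes part of
	// the hole, as noted above.
	if !apierrors.IsNotFound(err) || s.lastSeen == nil {
		return err
	}
	// Re-push the cached object, preserving spec.clusterID and channel.
	restored := s.lastSeen.DeepCopy()
	restored.ResourceVersion = "" // must be empty when creating anew
	_, err = s.client.Create(ctx, restored, metav1.CreateOptions{})
	return err
}
```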
This avoids a race between:
* The cluster-version operator pushing its internally-generated
default and
* The cluster-bootstrap process pushing the installer-generated
ClusterVersion into the cluster [1]
Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).
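To make the race concrete, here is a hedged sketch of the get-or-create defaulting pattern being removed from bootstrap, modeled on the getOrCreateClusterVersion mentioned above; the client plumbing and exact fields are assumptions, not the CVO's code:

```go
package main

import (
	"context"

	"github.com/google/uuid"
	configv1 "github.com/openshift/api/config/v1"
	configclientv1 "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func getOrCreateClusterVersion(ctx context.Context, client configclientv1.ClusterVersionInterface, name string) (*configv1.ClusterVersion, error) {
	cv, err := client.Get(ctx, name, metav1.GetOptions{})
	if err == nil {
		return cv, nil // whoever wrote first wins; cluster-bootstrap won't overwrite it
	}
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
	// Not found yet: push a compiled-in default.  If this create beats
	// the installer's manifest, the cluster keeps the bogus 'fast'
	// channel and a random clusterID that doesn't match metadata.json.
	return client.Create(ctx, &configv1.ClusterVersion{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: configv1.ClusterVersionSpec{
			ClusterID: configv1.ClusterID(uuid.New().String()),
			Channel:   "fast",
		},
	}, metav1.CreateOptions{})
}
```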
By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.). And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4]. I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6]. The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.
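As a hedged illustration of that split (not the CVO's actual collector): series derived from the release payload can always be emitted, while history-derived series require the ClusterVersion object.

```go
package main

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
	"github.com/prometheus/client_golang/prometheus"
)

var clusterVersionDesc = prometheus.NewDesc(
	"cluster_version",
	"Reports the version of the cluster by type.",
	[]string{"type", "version"}, nil,
)

type versionCollector struct {
	payloadVersion string                   // known from the release payload itself
	clusterVersion *configv1.ClusterVersion // nil while the object is missing
}

func (c *versionCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- clusterVersionDesc
}

func (c *versionCollector) Collect(ch chan<- prometheus.Metric) {
	// type="current" needs only the payload, so it survives a missing
	// ClusterVersion object.
	ch <- prometheus.MustNewConstMetric(clusterVersionDesc,
		prometheus.GaugeValue, float64(time.Now().Unix()), "current", c.payloadVersion)
	if c.clusterVersion == nil {
		return // history-derived series (type="failure", etc.) are skipped
	}
	for _, h := range c.clusterVersion.Status.History {
		ch <- prometheus.MustNewConstMetric(clusterVersionDesc,
			prometheus.GaugeValue, float64(h.StartedTime.Unix()), "updating", h.Version)
	}
}
```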
While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally. I haven't done that in this
commit, because [8] is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:
* Who deletes their ClusterVersion?
* Even if someone wanted to delete their ClusterVersion, we'll want to
eventually make the CVO an admission gate for ClusterVersion
changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
2018-11-18, openshift#55). So (once we set up that admission gate), someone
determined to remove ClusterVersion would have to disable that
admission config before they could remove the ClusterVersion. That
seems like enough steps that you wouldn't blow ClusterVersion away
by accident.
* If you do successfully remove your ClusterVersion, you're going to
lose all the status and history it contained. If you scale down
your CVO and then remove your ClusterVersion, it's going to be hard
to recover that lost information. If you left the CVO running, we
could presumably cache the whole thing in memory and push it back
into the cluster. That would at least avoid the "compiled-in CVO
defaults differ from user's choices" issue. But we'd still be
fighting with the user over the presence of ClusterVersion, and I'd
rather not.
* Telemetry will continue pushing reports with the last-seen
clusterID, so unless you disabled Telemetry first, future
"ClusterVersion is missing" alerts (not added in this commit, see
earlier paragraph mentioning alerts) would make it out to let us
know that something was being silly.
* We're only removing defaulting from the bootstrap CVO. Once we get
on to the production CVO, we'll still have our previous
default-ClusterVersion-injection behavior.
An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10]. But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.
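A minimal sketch of that manifest-driven flow, with a hypothetical loadBootstrapClusterVersion helper; the point is that the bootstrap CVO reads the installer's rendered ClusterVersion from disk instead of inventing a default:

```go
package main

import (
	"os"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/yaml"
)

// loadBootstrapClusterVersion reads the installer-rendered ClusterVersion
// manifest, so the clusterID and channel are the installer's choices
// rather than compiled-in defaults.
func loadBootstrapClusterVersion(path string) (*configv1.ClusterVersion, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err // no manifest: surface the error, don't invent a default
	}
	cv := &configv1.ClusterVersion{}
	if err := yaml.Unmarshal(data, cv); err != nil {
		return nil, err
	}
	return cv, nil
}
```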
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
Closing now that #238 has new legs.