
Conversation

@wking
Member

@wking wking commented Aug 19, 2019

As an alternative to:

* removing the ClusterVersion defaulting, or
* moving the telemetry ID out into a Telemetry-specific config object,

this commit allows us to address the race between the installer-created ClusterVersion and the cluster-version-operator-generated default ClusterVersion by removing the former while still allowing an installer-generated Telemetry ID.

We currently need an installer-generated Telemetry ID to support things like OpenShift Dedicated (OSD), where we use the telemetry ID that the installer exposes in metadata.json to feed the cluster manager. There is concern that if Hive extracted the Telemetry ID from the running cluster (vs. feeding in an installer-chosen ID), this would race with the cluster's reported Telemetry and that, for example, Hive-launched OSD Telemetry might be erroneously attributed to a user-provisioned cluster.

I'm less clear on why we don't prefer pushing the Telemetry ID into a Telemetry-specific config. It seems like it would be less work than this commit (which has a lot of middleman plumbing). I'm not entirely clear on the API we're committed to supporting for ClusterVersion's spec.clusterID property. But this CVO plumbing was what we were able to form a good-enough-for-now consensus around, and we can come back later with a Telemetry-config approach if we can build consensus around that.

Also update the default channel from fast to stable-4.2; if we're going to rely on the CVO-generated default, we need it to have a channel that actually exists.

I've added the WIP prefix because there are two FIXMEs where I'm still working out the plumbing.
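
For concreteness, here's a minimal sketch of how a --telemetry-id flag could feed the clusterID of the CVO-generated default ClusterVersion. The flag name matches this PR, but the options struct and helper below are hypothetical; this is a sketch, not the actual diff.

    // Hypothetical sketch of the --telemetry-id plumbing: when the flag is set,
    // its value becomes the clusterID of the default ClusterVersion the CVO
    // would push; otherwise the CVO keeps minting a fresh UUID.
    package main

    import (
        "flag"
        "fmt"

        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/uuid"
    )

    type options struct {
        telemetryID string
    }

    func (o *options) addFlags(fs *flag.FlagSet) {
        fs.StringVar(&o.telemetryID, "telemetry-id", "", "optional cluster ID to seed into the default ClusterVersion")
    }

    func (o *options) defaultClusterVersion() *configv1.ClusterVersion {
        id := o.telemetryID
        if id == "" {
            id = string(uuid.NewUUID()) // no installer-provided ID; mint one
        }
        return &configv1.ClusterVersion{
            ObjectMeta: metav1.ObjectMeta{Name: "version"},
            Spec: configv1.ClusterVersionSpec{
                ClusterID: configv1.ClusterID(id),
                Channel:   "stable-4.2",
            },
        }
    }

    func main() {
        o := &options{}
        o.addFlags(flag.CommandLine)
        flag.Parse()
        fmt.Printf("default ClusterVersion would use clusterID %s\n", o.defaultClusterVersion().Spec.ClusterID)
    }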

wking added 2 commits August 19, 2019 13:55
As an alternative to:

* removing the ClusterVersion defaulting [1], or
* moving the telemetry ID out into a Telemetry-specific config object [2]

this commit allows us to address the race between the
installer-created ClusterVersion and the
cluster-version-operator-generated default ClusterVersion by removing
the former while still allowing an installer-generated Telemetry ID.

We currently need an installer-generated Telemetry ID to support
things like OpenShift Dedicated (OSD) [3], where we use the telemetry
ID that the installer exposes in metadata.json [4] to feed the cluster
manager [5].  There is concern that if Hive [6] extracted the
Telemetry ID from the running cluster (vs. feeding in an
installer-chosen ID), this would race with the cluster's reported
Telemetry and that, for example, Hive-launched OSD Telemetry might be
erroneously attributed to a user-provisioned cluster.

I'm less clear on why we don't prefer pushing the Telemetry ID into a
Telemetry-specific config.  It seems like it would be less work than
this commit (which has a lot of middleman plumbing).  I'm not entirely
clear on the API we're committed to supporting for ClusterVersion's
spec.clusterID property.  But this CVO plumbing was what we were able
to form a good-enough-for-now consensus around, and we can come back
later with a Telemetry-config approach if we can build consensus
around that.

[1]: openshift#238
[2]: openshift#238 (comment)
[3]: https://www.openshift.com/products/dedicated/
[4]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[5]: https://cloud.redhat.com/openshift/
[6]: https://github.com/openshift/hive/
The latter (stable-4.2) actually exists, while the former (fast) does
not.  Generated with:

  $ sed -i 's/"fast"/"stable-4.2"/g' $(git grep -l '"fast"')

Using a real channel sets the stage for the installer setting the new
--telemetry-id and removing its ClusterVersion in favor of the default
pushed by the cluster-version operator in getOrCreateClusterVersion.
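
For readers unfamiliar with getOrCreateClusterVersion, here is a trimmed-down sketch of the get-or-create pattern being referenced; the client interface is a hypothetical stand-in, not the CVO's actual code.

    // Sketch of the pattern: if no ClusterVersion exists, the CVO pushes its
    // compiled-in default.  During bootstrap that default can win the race
    // against the installer-provided manifest, which is the problem this PR
    // is working around.
    package sketch

    import (
        apierrors "k8s.io/apimachinery/pkg/api/errors"

        configv1 "github.com/openshift/api/config/v1"
    )

    // clusterVersionClient is a hypothetical stand-in for the real typed client.
    type clusterVersionClient interface {
        Get(name string) (*configv1.ClusterVersion, error)
        Create(cv *configv1.ClusterVersion) (*configv1.ClusterVersion, error)
    }

    func getOrCreateClusterVersion(c clusterVersionClient, def *configv1.ClusterVersion) (*configv1.ClusterVersion, error) {
        cv, err := c.Get(def.Name)
        if err == nil {
            return cv, nil // something (installer manifest or an earlier default) already exists
        }
        if !apierrors.IsNotFound(err) {
            return nil, err
        }
        return c.Create(def) // nothing exists yet: push the compiled-in default
    }
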
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 19, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1741786, which is invalid:

  • expected the bug to target the "4.2.0" release, but it targets "4.3.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

WIP: Bug 1741786: cmd: Add --telemetry-id option

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 19, 2019
@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun them all:

Test name                Commit   Details  Rerun command
ci/prow/unit             918c785  link     /test unit
ci/prow/e2e-aws-upgrade  918c785  link     /test e2e-aws-upgrade
ci/prow/e2e-aws          918c785  link     /test e2e-aws
ci/prow/images           918c785  link     /test images
ci/prow/integration      918c785  link     /test integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor

The clusterversion.spec.clusterID is the UUID for the cluster, and it is used for more than telemetry; for example, the Cincinnati stack uses it to identify one cluster from another. So I'm against the telemetry-id option.

Here's what I would rather see.

  1. The bootstrap-render finds the clusterversion object from the directory... and finds the clusterID that's required for the cluster (see the sketch after this list).
  2. Only the bootstrap cluster-version-operator pod is given the clusterID from the flag, to ensure the bootstrapping is correct.
  3. The in-cluster cluster-version-operator stays intact with the existing behavior.
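
Roughly, step 1 could look like the sketch below (the helper name and file handling are made up; this is not the actual render code):

    // findClusterID scans a bootstrap manifests directory for a ClusterVersion
    // object and returns its spec.clusterID.  Hypothetical helper, not the
    // actual bootstrap-render code.
    package sketch

    import (
        "fmt"
        "io/ioutil"
        "path/filepath"
        "strings"

        configv1 "github.com/openshift/api/config/v1"
        "sigs.k8s.io/yaml"
    )

    func findClusterID(manifestDir string) (configv1.ClusterID, error) {
        files, err := ioutil.ReadDir(manifestDir)
        if err != nil {
            return "", err
        }
        for _, f := range files {
            if f.IsDir() || (!strings.HasSuffix(f.Name(), ".yaml") && !strings.HasSuffix(f.Name(), ".yml")) {
                continue
            }
            data, err := ioutil.ReadFile(filepath.Join(manifestDir, f.Name()))
            if err != nil {
                return "", err
            }
            var cv configv1.ClusterVersion
            if err := yaml.Unmarshal(data, &cv); err != nil {
                continue // not every manifest parses as a ClusterVersion; skip it
            }
            if cv.Kind == "ClusterVersion" && cv.Spec.ClusterID != "" {
                return cv.Spec.ClusterID, nil
            }
        }
        return "", fmt.Errorf("no ClusterVersion with spec.clusterID found in %s", manifestDir)
    }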

@wking
Member Author

wking commented Aug 19, 2019

The clusterversion.spec.clusterID is the UUID for the cluster, and it is used for more than telemetry; for example, the Cincinnati stack uses it to identify one cluster from another.

Ah, I'd forgotten about Cincinnati. Are we aware of any other consumers?

  2. Only the bootstrap cluster-version-operator pod is given the clusterID from the flag, to ensure the bootstrapping is correct.

This does not allow us to recover from "I accidentally deleted my ClusterVersion", because the CVO would generate a fresh one with a fresh UUID. If we store the current cluster ID in memory, we can recover from that unless the user kills their CVO and removes the ClusterVersion while it's down. Do we care about blocking that hole? Plumbing the option through would certainly be easier if we leave it open.
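
Roughly, the in-memory fallback would look like this (field and method names are made up, not the CVO's real internals):

    // Hypothetical sketch: remember the last-seen clusterID so that, if the
    // ClusterVersion object is deleted while the CVO is running, the recreated
    // default keeps the same ID instead of minting a fresh UUID.
    package sketch

    import (
        configv1 "github.com/openshift/api/config/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/uuid"
    )

    type operator struct {
        lastClusterID configv1.ClusterID // updated every time we observe a ClusterVersion
    }

    func (o *operator) observe(cv *configv1.ClusterVersion) {
        if cv.Spec.ClusterID != "" {
            o.lastClusterID = cv.Spec.ClusterID
        }
    }

    func (o *operator) defaultClusterVersion() *configv1.ClusterVersion {
        id := o.lastClusterID
        if id == "" {
            id = configv1.ClusterID(uuid.NewUUID()) // first boot: no prior ID to reuse
        }
        return &configv1.ClusterVersion{
            ObjectMeta: metav1.ObjectMeta{Name: "version"},
            Spec:       configv1.ClusterVersionSpec{ClusterID: id, Channel: "stable-4.2"},
        }
    }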

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
This avoids a race between:

* The cluster-version operator pushing its internally-generated
  default and
* The cluster-bootstrap process pushing the installer-generated
  ClusterVersion into the cluster [1]

Because cluster-bootstrap does not override already-existing objects,
when the CVO's default won that race, the cluster would come up with a
bogus channel ('fast', instead of the installer's default 'stable-4.2'
etc.) and clusterID (causing the reported Telemetry to come in under a
different UUID than the one the installer provided in its
metadata.json output [2]).

By removing the default (during bootstrap), the CVO will continue
applying the current version and reporting some metrics like
cluster_version{type="current"}.  But it will not report metrics that
depend on the state stored in the ClusterVersion object (past
versions, cluster_version{type="failure"}, etc.).  And even the
metrics it does provide may not make it up to the Telemetry API
because the monitoring operator will not be able to pass the Telemeter
container the expected clusterID [3,4].  I'm not clear on how the
Telemeter config is updated when the ClusterVersion's clusterID *does*
change [5], but the monitoring operator is willing to continue without
it [6].  The monitoring operator only updates the Telemeter Deployment
to *set* (or update) the cluster ID [7]; it never clears an existing
cluster ID.

While we won't get Telemetry out of the cluster until the
cluster-bootstrap process (or somebody else) pushes the ClusterVersion
in, we can still alert on it locally.  I haven't done that in this
commit, because [8] is in flight touching this space.

The defaulting logic dates back to the early CVOConfig work in
ea678b1 (*: read runtime parameters from CVOConfig, 2018-08-01, openshift#2).
Clayton expressed concern about recovering from ClusterVersion
deletion without the default [9], but:

* Who deletes their ClusterVersion?

* Even if someone wanted to delete their ClusterVersion, we'll want to
  eventually make the CVO an admission gate for ClusterVersion
  changes, ab4d84a (cvo: Validate the CVO prior to acting on it,
  2018-11-18, openshift#55).  So (once we set up that admission gate), someone
  determined to remove ClusterVersion would have to disable that
  admission config before they could remove the ClusterVersion.  That
  seems like enough steps that you wouldn't blow ClusterVersion away
  by accident.

* If you do successfully remove your ClusterVersion, you're going to
  lose all the status and history it contained.  If you scale down
  your CVO and then remove your ClusterVersion, it's going to be hard
  to recover that lost information.  If you left the CVO running, we
  could presumably cache the whole thing in memory and push it back
  into the cluster.  That would at least avoid the "compiled-in CVO
  defaults differ from user's choices" issue.  But we'd still be
  fighting with the user over the presence of ClusterVersion, and I'd
  rather not.

* Telemetry will continue pushing reports with the last-seen
  clusterID, so unless you disabled Telemetry first, future
  "ClusterVersion is missing" alerts (not added in this commit, see
  earlier paragraph mentioning alerts) would make it out to let us
  know that something was being silly.

* We're only removing defaulting from the bootstrap CVO.  Once we get
  on to the production CVO, we'll still have our previous
  default-ClusterVersion-injection behavior.

An alternative approach we considered was passing a default cluster ID
in via a CVO command line option [10].  But Abhinav wants bootstrap
operators to be loading their config from on-disk manifests as much as
possible [11], and this commit effectively ensures we load
ClusterVersion from the installer-provided manifest during
bootstrapping.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741786
[2]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml#L62
[4]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/config.go#L217-L229
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L369-L370
[6]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/operator/operator.go#L376
[7]: https://github.com/openshift/cluster-monitoring-operator/blob/14b1093149217a6ab5e7603f19ff5449f1ec12fc/pkg/manifests/manifests.go#L1762-L1763
[8]: openshift#232
[9]: https://bugzilla.redhat.com/show_bug.cgi?id=1708697#c12
[10]: openshift#239
[11]: openshift#239 (comment)
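
To make the approach above concrete, here is a minimal sketch of gating the compiled-in default off during bootstrap; the enableDefault field and trimmed-down client interface are assumptions, not the actual change in the linked commits.

    // Hypothetical sketch of "remove the default during bootstrap": the
    // bootstrap CVO stops pushing its compiled-in ClusterVersion, so the
    // installer-provided manifest pushed by cluster-bootstrap is the only
    // source, while the production CVO keeps the old defaulting behavior.
    package sketch

    import (
        apierrors "k8s.io/apimachinery/pkg/api/errors"

        configv1 "github.com/openshift/api/config/v1"
    )

    type clusterVersionClient interface {
        Get(name string) (*configv1.ClusterVersion, error)
        Create(cv *configv1.ClusterVersion) (*configv1.ClusterVersion, error)
    }

    type operator struct {
        enableDefault bool // false on the bootstrap CVO, true in-cluster
        client        clusterVersionClient
        defaultCV     *configv1.ClusterVersion
    }

    func (o *operator) getOrCreateClusterVersion() (*configv1.ClusterVersion, error) {
        cv, err := o.client.Get("version")
        if err == nil {
            return cv, nil
        }
        if apierrors.IsNotFound(err) && !o.enableDefault {
            // Bootstrap mode: report NotFound and let cluster-bootstrap push
            // the installer's ClusterVersion manifest instead.
            return nil, err
        }
        if apierrors.IsNotFound(err) {
            return o.client.Create(o.defaultCV) // production mode: old behavior
        }
        return nil, err
    }
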
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 21, 2019
@wking
Copy link
Member Author

wking commented Aug 21, 2019

Closing now that #238 has new legs.

@wking wking closed this Aug 21, 2019
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Aug 21, 2019