-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 1741786: pkg/cvo: Drop ClusterVersion defaulting during bootstrap #238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1741786: pkg/cvo: Drop ClusterVersion defaulting during bootstrap #238
Conversation
|
@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: This pull request references a valid Bugzilla bug. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
eb16de6 to
db2b989
Compare
db2b989 to
26e83b2
Compare
|
Pulled the WIP, because 26e83b2 passes unit tests locally for me. We also have a few rounds of green e2e-aws(-upgrade). I dunno what's going on with the two |
c591707 to
0361562
Compare
|
All green :). @abhinavdahiya, @smarterclayton, please take a look. |
|
/hold Why are we creating cluster version during install and why do we have to? Your description alluded to it but didn’t lay out the current justifications. Originally CV was moving to not installer created |
|
To explore the counter, why does channel need to be set by installer? Metadata.json having it is a bit weird. The CVO shouldn’t be syncing anything out of the payload until it has written the CV, so the bug here seems to be with bootstrap using idempotent create for objects instead of idenpotent create or update? |
|
Also alternatively, perhaps the bootstrap process should be passing intiital cluster id and channel to cvo to create bootstrap deployment, |
Historically, this was used to override broken components, but we haven't needed to do that since openshift/installer#851. Although obviously we could have been managing that from this repo instead of managing it via the installer repo.
Yeah. In fact, ClusterVersion having it is a bit weird ;). If it's just for telemetry, shouldn't it just be in the Telemetry config? Is telemetry/monitoring represented by a custom resource? Nothing stands out to me in the API. Maybe configuration is via a ConfigMap? At the moment, Hive is using the telemetry from
I don't like multiple parties having opinions about cluster config specs. Components that have opinions about their own config could represent their defaults in
As in "but not via ClusterVersion"? I'm fine just leaving channel/upstream defaulting to the CVO. And if we adjust Telemetry ID injection, I'd rather move it to an in-cluster Telemetry custom resource. |
|
To the last questions, "no", it should be as "tell the bootstrap CVO
command how you want to generate its initial resources via arguments". Or
pass the default CV object to bootstrap CVO command. Or inject the CV
immediately after the CRD is created. There are lots of options for taking
the data the installer has and ensuring the CVO gets to handle the
appropriate ordering.
I think cluster-bootstrap is insufficiently intelligent (or the bootstrap
commands it invokes are insufficiently intelligent). Passing those flags
to CVO bootstrap seems the closest to actual intent.
…On Sun, Aug 18, 2019 at 10:02 PM W. Trevor King ***@***.***> wrote:
Why are we creating cluster version during install and why do we have to?
Historically, this was used to override broken components, but we haven't
needed to do that since openshift/installer#851
<openshift/installer#851>. Although obviously we
could have been managing that from this repo instead of managing it via the
installer repo.
Metadata.json having it is a bit weird.
Yeah. In fact, Cluster version having it is a bit weird ;). If it's just
for telemetry, shouldn't it just be in the Telemetry config
<https://github.com/openshift/cluster-monitoring-operator/blob/fbd0e89e8a4fc72a993b64fb80c79145c78f8b67/pkg/manifests/config.go#L125>?
Is telemetry/monitoring represented by a custom resource? Nothing stands
out to me in the API
<https://github.com/openshift/api/tree/a94e914914f4228d0bcba6fc8a22614c5f5e2dad/config/v1>.
Maybe configuration is via a ConfigMap
<https://github.com/openshift/cluster-monitoring-operator/blob/e64d3f1b595959ae9cd4388d7a4fa5d41b5f3a5a/pkg/operator/operator.go#L338>
?
At the moment, Hive is using the telemetry from metadata.json to hold
space for "coming OSD cluster". It could pull the Telemetry ID put of the
running cluster instead, but then OSD-ness would race cluster Telemetry
reporting vs. Hive's extraction. Until we get OSD-ness via the
cluster_installer metrics (openshift/installer#1890
<openshift/installer#1890> #213
<#213>), which
may be where we want it long-term (we'd still need a way to associate
Hive-reported admin creds with the appropriate Telemetry ID).
The CVO shouldn’t be syncing anything out of the payload until it has
written the CV, so the bug here seems to be with bootstrap using idempotent
create for objects instead of idenpotent create or update?
I don't like multiple parties having opinions about cluster config specs.
Components that have opinions about their own config could represent their
defaults in status, but should stay out of spec, which should be
admin-managed for these top-level configs (and installer-managed as a proxy
for the admin at install-time). I'd be fine with cluster-bootstrap
clobbering spec (repeatedly?) until all resource specs matched its
expectations, but that's a reasonably large change vs. its current behavior.
Also alternatively, perhaps the bootstrap process should be passing
intiital cluster id and channel to cvo to create bootstrap deployment...
As in "but not via ClusterVersion"? I'm fine just leaving channel/upstream
defaulting to the CVO. And if we adjust Telemetry ID injection, I'd rather
move it to an in-cluster Telemetry custom resource.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#238?email_source=notifications&email_token=AAI37J2HTHJA6V32SRXCPLDQFGTEPA5CNFSM4IMMANDKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4RHBTA#issuecomment-522350796>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAI37J7RX5IAFDFLC5YT4FDQFGTEPANCNFSM4IMMANDA>
.
|
As an alternative to: * removing the ClusterVersion defaulting [1], or * moving the telemetry ID out into a Telemetry-specific config object [2] this commit allows us to address the race between the installer-created ClusterVersion and the cluster-version-operator-generated default ClusterVersion by removing the former while still allowing an installer-generated Telemetry ID. We currently need an installer-generated Telemetry ID to support things like OpenShift Dedicated (OSD) [3], where we currently use the telemetry ID that the installer exposes in metadata.json [4] to feed the cluster manager [5]. There is concern that if Hive [6] extracted the Telemetry ID from the running cluster (vs. feeding in an installer-chosen ID), that this would race with the cluster's reported Telemetry and that, for example, Hive-launched OSD Telemetry might be erroneously claimed to be a user-provisioned cluster. I'm less clear on why we don't prefer pushing the Telemetry ID into a Telemetry-specific config. It seems like it would be less work than this commit (which has a lot of middleman plumbing). I'm not entirely clear on the API we're committed to supporting for ClusterVersion's spec.clusterID property. But this CVO plumbing was what we were able to form a good-enough-for-now consensus around, and we can come back later with a Telemetery-config approach if we can build consensus around that. [1]: openshift#238 [2]: openshift#238 (comment) [3]: https://www.openshift.com/products/dedicated/ [4]: https://github.com/openshift/installer/blob/4e204c5e509de1bd31113b0c0e73af1a35e52c0a/pkg/types/clustermetadata.go#L17-L18 [5]: https://cloud.redhat.com/openshift/ [6]: https://github.com/openshift/hive/
The cluster version ID is the UUID for the cluster. We use it to identify the cluster in telemetry on red hat side, also the cincinatti stacks needs a way to identify the cluster. So I think the ID definitely belongs to the clusterversion object. |
|
/cherrypick release-4.1 |
|
@wking: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/openshift-cherrypick-robot/cluster-version-operator\n ! [rejected] cherry-pick-238-to-release-4.1 -> cherry-pick-238-to-release-4.1 (non-fast-forward)\nerror: failed to push some refs to 'https://openshift-cherrypick-robot:[email protected]/openshift-cherrypick-robot/cluster-version-operator'\nhint: Updates were rejected because the tip of your current branch is behind\nhint: its remote counterpart. Integrate the remote changes (e.g.\nhint: 'git pull ...') before pushing again.\nhint: See the 'Note about fast-forwards' in 'git push --help' for details.\n", error: exit status 1 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Oops, we already had one :p. |
…to --wait-for-cluster-version We've been using --enable-default-cluster-version to avoid a cluster-bootstrap vs. cluster-version operator race since d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238). But [1] is considering passing component configuration to the CVO via ClusterVersion that would affect the manifests getting applied. So this commit pivots to a new option that blocks the whole sync cycle on the existence of a ClusterVersion resource. That way, we know we have up-to-date data before reconciling any manifests (although it might slow down the bootstrapping process a bit). [1]: openshift/enhancements#922
…to --wait-for-cluster-version We've been using --enable-default-cluster-version to avoid a cluster-bootstrap vs. cluster-version operator race since d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238). But [1] is considering passing component configuration to the CVO via ClusterVersion that would affect the manifests getting applied. So this commit pivots to a new option that blocks the whole sync cycle on the existence of a ClusterVersion resource. That way, we know we have up-to-date data before reconciling any manifests (although it might slow down the bootstrapping process a bit). [1]: openshift/enhancements#922
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
This avoids a race between:
Because cluster-bootstrap does not override already-existing objects, when the CVO's default won that race, the cluster would come up with a bogus
channel(fast, instead of the installer's defaultstable-4.2, etc.) andclusterID(causing the reported Telemetry to come in under a different UUID than the one the installer provided in itsmetadata.jsonoutput).By removing the default, the CVO will continue applying the current version and reporting some metrics like
cluster_version{type="current"}. But it will not report metrics that depend on the state stored in the ClusterVersion object (past versions,cluster_version{type="failure"}, etc.). And even the metrics it does provide may not make it up to the Telemetry API because the monitoring operator will not be able to pass the Telemeter container the expectedclusterID. I'm not clear on how the Telemeter config is updated when the ClusterVersion'sclusterIDdoes change because of this guard, but the monitoring operator is willing to continue without it. The monitoring operator only updates the Telemeter Deployment to set (or update) the cluster ID; it never clears an existing cluster ID.While we won't get Telemetry out of the cluster until the cluster-bootstrap process (or somebody else) pushes the ClusterVersion in, we can still alert on it locally. I haven't done that in this PR, because #232 is in flight touching this space.
The defaulting logic dates back to the early CVOConfig work in ea678b1 (#2). @smarterclayton expressed concern about recovering from ClusterVersion deletion without the default, but:
Who deletes their ClusterVersion?
Even if someone wanted to delete their ClusterVersion, we'll want to eventually make the CVO an admission gate for ClusterVersion changes (ab4d84a). So (once we set up that admission gate), someone determined to remove ClusterVersion would have to disable that admission config before they could remove the ClusterVersion. That seems like enough steps that you wouldn't blow ClusterVersion away by accident.
If you do successfully remove your ClusterVersion, you're going to lose all the status and history it contained. If you scale down your CVO and then remove your ClusterVersion, it's going to be hard to recover that lost information. If you left the CVO running, we could presumably cache the whole thing in memory and push it back into the cluster. That would at least avoid the "compiled-in CVO defaults differ from user's choices" issue. But we'd still be fighting with the user over the presence of ClusterVersion, and I'd rather not.
Telemetry will continue pushing reports with the last-seen clusterID, so unless you disabled Telemetry first, future "ClusterVersion is missing" alerts (not added in this commit, see earlier paragraph mentioning alerts) would make it out to let us know that something was being silly.
I'm pushing this up to get feedback on the approach (CC @abhinavdahiya). The WIP is because I haven't updated the unit test to handle the lack of defaulting.