-
Notifications
You must be signed in to change notification settings - Fork 584
OCPBUGS-48641: config/v1/types_cluster_version: Explain image and version both set #2158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-48641: config/v1/types_cluster_version: Explain image and version both set #2158
Conversation
|
Hello @wking! Some important instructions when contributing to openshift/api: |
|
@wking: This pull request references Jira Issue OCPBUGS-48641, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
config/v1/types_cluster_version.go
Outdated
| // Some of the fields are inter-related with restrictions and meanings described here. | ||
| // 1. image is specified, version is specified, architecture is specified. API validation error. | ||
| // 2. image is specified, version is specified, architecture is not specified. You should not do this. version is silently ignored and image is used. | ||
| // 2. image is specified, version is specified, architecture is not specified. The version metadata in the referenced image must match the specified version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the changes below about Version, is it more complete this way?
| // 2. image is specified, version is specified, architecture is not specified. The version metadata in the referenced image must match the specified version. | |
| // 2. image is specified, version is specified, architecture is not specified. image is used if the version metadata in the referenced image matches the specified version. API validation error otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lifted from Slack:
"""
I think "API validation error" in the Godocs is trying to describe the CEL kubebuilder:validation:XValidation:rule here, because those are enforced by the Kube API server's validation, and clients cannot push invalid combinations. For version, there's no CEL enforcement, it's just up to the CVO to decide how to handle the "image and version both set" situation, and report its thoughts in status.conditions
"""
:TIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What validations are made? Is it possible that they could be moved to CEL? Or does it need to introspect the image to validate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to introspect the image, checking the version in the release image's release-metadata file against the spec version string that the cluster admin said they expected. See OCPBUGS-48641's:
Verifying payload failed version="4.17.99" image="quay.io/openshift-release-dev/ocp-release@sha256:82aa2a914d4cd964deda28b99049abbd1415f96c0929667b0499dd968864a8dd" failure=release image version 4.17.13 does not match the expected upstream version 4.17.99
error message for an example, where the CVO looks inside the sha256:82aa2a9... release image, sees that the release-metadata file claims that release image is 4.17.13, and then complains that the release image's 4.17.13 diverges from the spec version's 4.17.99 expectation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, and is this validation via a webhook or part of the controller? The API validation error otherwise message seems odd here, especially if this is a controller based validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's via the controller (CVO). And yes, that's why I would rather use my current text, and not the API validation error text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would certainly be dropping the API validation error
| Architecture ClusterVersionArchitecture `json:"architecture"` | ||
|
|
||
| // version is a semantic version identifying the update version. | ||
| // version is ignored if image is specified and required if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poking a bit more deeply into where the nominal "ignored" came from, on 2022-11-08 I claimed it was ignored. I'm not sure what 2022-me was thinking there; possibly I was just focused on how the CVO looks up which image to use (and that logic doesn't run when image is explicitly set in spec), and I overlooked the sync-worker validation as it judges the requested desiredUpdate for ReleaseAccepted?
|
/lgtm |
965895d to
387fac3
Compare
387fac3 to
435a43a
Compare
|
/cc |
config/v1/types_cluster_version.go
Outdated
| // Some of the fields are inter-related with restrictions and meanings described here. | ||
| // 1. image is specified, version is specified, architecture is specified. API validation error. | ||
| // 2. image is specified, version is specified, architecture is not specified. You should not do this. version is silently ignored and image is used. | ||
| // 2. image is specified, version is specified, architecture is not specified. The version metadata in the referenced image must match the specified version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What validations are made? Is it possible that they could be moved to CEL? Or does it need to introspect the image to validate?
| // version is a semantic version identifying the update version. | ||
| // version is ignored if image is specified and required if | ||
| // architecture is specified. | ||
| // version is required if architecture is specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could add a CEL rule to validate this. We can test that it ratchets so that existing broken resources do not suddenly become broken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"version is ... required if architecture is specified" dates back to #1339, and seems orthogonal to the change I'm suggesting here. And actually, oc adm upgrade --to-multi-arch is setting both architecture and version, and I don't see a reason to block that; it's the same sanity-check of "yes, the image the cluster retrieved seems like the release the cluster admin was expecting" for folks where version numbers are more recognizable than image digests (everybody? Definitely me, anyway). Should I drop that unnecessary constraint from the docs in this pull request, or can I file a follow-up pull request dropping that constraint once this one merges? Or...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's orthogonal, but it's good to make compatible incremental change as we are touching areas of APIs.
What happens if version is missing when architecture is specified today? Ignoring CLI tooling that would set it, since folks can manipulate these resources themselves, would it cause CVO to return errors when it processes the object?
If so, adding a CEL rule as below would give more immediate feedback to a user, and is relatively free to us to implement. As of 4.18 this should ratchet itself, but we would need to test it.
// +kubebuilder:validation:XValidation:rule="!has(self.architecture) || has(self.version)",message="version if required when architecture is set"
A self ratcheting version
// +kubebuilder:validation:XValidation:rule="!has(self.architecture) || has(self.version) || has(oldSelf.architecture)",message="version if required when architecture is set"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if version is missing when architecture is specified today?
Looks like that's already guarded here. With a launch 4.17.12 aws Cluster Bot cluster:
$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate", "value": {"architecture": "Multi"}}]'
The ClusterVersion "version" is invalid: spec.desiredUpdate: Invalid value: "object": no such key: version evaluating rule: Version must be set if Architecture is setSo I can leave the version is required if architecture is specified docs in place here, and don't need to add additional CEL.
config/v1/types_cluster_version.go
Outdated
| // image should be used when the desired version does not exist in availableUpdates or history. | ||
| // When image is set, version is ignored. When image is set, version should be empty. | ||
| // When image is set, architecture cannot be specified. | ||
| // If both version and image are set, the version metadata in the referenced image must match the specified version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does version metadata actually mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ oc adm release info -o json quay.io/openshift-release-dev/ocp-release:4.17.12-x86_64 | jq -r .metadata.version
4.17.12Docs discussing the release-metadata file that holds that as part of the release image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an admin were unsure, would they be able to check this themselves from the docs on the API? Perhaps linking out to this doc would be a useful help for users of this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not much precedent to linking out to the enhancements repo for more details:
api$ git --no-pager grep github.com/openshift/enhancements/blob/
README.md:conventions](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#api),
machine/v1beta1/types_machine.go: // https://github.com/openshift/enhancements/blob/master/enhancements/machine-instance-lifecycle.mdHow about inlining something more here to make it clear that it's metadata extracted from the release image? Maybe "the version metadata extracted from the referenced image" would be sufficient? Or "the version extracted from the referenced image"? Or "the version string..."? Or...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the version extracted from the referenced image
Lets get it updated to this.
Do we assume then that the admin knows the semver version of the image just based on how they request an update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the version extracted from the referenced image
Lets get it updated to this.
Done, and also rebased onto the current tip, with 435a43a -> f2e7034.
Do we assume then that the admin knows the semver version of the image just based on how they request an update?
If they do, I'd expect them to set version to pull in the "is the incoming release the SemVer version you expected?" guard. For example, oc adm upgrade --to "${VERSION}" is setting both version and image here and in other, similar locations, in case the OpenShift Update Service used to populate ClusterVersion's status.availableUpdates or status.conditionalUpdates is falsely claiming the wrong version for a particular image pullspec.
The image property expects a by-digest pullspec (e.g. see openshift/oc#390's client-side tag-pullspec guard), so it's less clear in the "image set, version not set" case if the admin knows the SemVer release they're trying to move towards. But even if they do, unless they tell us via version, all we have to go on is their image, so there's no way I can see to enforce a "is the incoming release the SemVer version you expected?" guard, because we don't have the expected-version version property to check against.
Catching up with openshift/cluster-version-operator@9be6175c5f (pkg/cvo/sync_worker: Make expected/actual version mismatch fatal, 2020-08-09, openshift/cluster-version-operator#431), which uses the 'version' property as a sanity check for "is this pullspec the version I'm expecting?". This protects users from compromised or man-in-the-middled upstream update services who attempt downgrade and similar attacks by misrepresenting a recommended update. The text I'm adjusting landed in 354e2fb (config/v1/types_cluster_version: Add Architecture to DesiredUpdate, 2022-12-07, openshift#1339), but version-ignoring was never implemented, so nobody can be relying on that nominal behavior. And as the man-in-the-middle use case demonstrates, version-ignoring would be less safe than the version-match-enforcing behavior that the cluster-version operator has used since 2020. I edited types_cluster_version.go by hand, and then updated the other files with: $ hack/update-codegen-crds.sh $ hack/update-openapi.sh $ hack/update-swagger-docs.sh
435a43a to
f2e7034
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongkailiu, JoelSpeed, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@wking: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@wking: Jira Issue OCPBUGS-48641: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-48641 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Catching up with openshift/cluster-version-operator@9be6175c5f (openshift/cluster-version-operator#431), which uses the
versionproperty as a sanity check for "is this pullspec the version I'm expecting?". This protects users from compromised or man-in-the-middled upstream update services who attempt downgrade and similar attacks by misrepresenting a recommended update.The text I'm adjusting landed in 354e2fb (#1339), but version-ignoring was never implemented, so nobody can be relying on that nominal behavior. And as the man-in-the-middle use case demonstrates, version-ignoring would be less safe than the version-match-enforcing behavior that the cluster-version operator has used since 2020.