
Conversation

@wking (Member) commented Apr 20, 2021

And also fail on Deployments that set ProgressDeadlineExceeded. More background in the commit messages, but these two changes should address both points from rhbz#1951339.

wking added 2 commits April 20, 2021 10:32
Seen in a customer cluster [1]:

  $ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].status.conditions'
  [
    {
      "lastTransitionTime": "2021-04-12T22:04:41Z",
      "lastUpdateTime": "2021-04-12T22:04:41Z",
      "message": "Deployment has minimum availability.",
      "reason": "MinimumReplicasAvailable",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2021-04-19T17:33:30Z",
      "lastUpdateTime": "2021-04-19T17:33:30Z",
      "message": "ReplicaSet \"marketplace-operator-f7cc88d59\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ]

Previous touch for this logic was 0442094
(lib/resourcebuilder/apps: Only error on Deployment Available=False
*and* Progressing=False, 2020-08-06, openshift#430), where I said:

  Available=True, Progressing=False is the happy, steady state.

This commit tightens that back down a bit, to account for cases where
the Deployment controller claims the Deployment is still available
but has given up on rolling out a requested update [2].  We definitely
want to complain about Deployments like that, because admins might
want to investigate whatever is blocking the rollout, and possibly
delete or otherwise poke the Deployment so that we reconcile it again
and the Deployment controller gives the rollout another attempt.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1951339#c0
[2]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#failed-deployment
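
As a rough sketch of the check described above (a hedged illustration against the upstream k8s.io/api types; the CVO's actual lib/resourcebuilder/apps code differs in detail), the tightened logic amounts to failing on either a false Available condition or a Progressing condition whose controller has given up:

  package sketch

  import (
      "fmt"

      appsv1 "k8s.io/api/apps/v1"
      corev1 "k8s.io/api/core/v1"
  )

  // checkDeploymentHealth errors when the Deployment reports Available=False,
  // or when the controller has given up on a rollout (Progressing=False with
  // reason ProgressDeadlineExceeded), even if the old ReplicaSet keeps the
  // workload Available=True.
  func checkDeploymentHealth(d *appsv1.Deployment) error {
      for _, c := range d.Status.Conditions {
          switch c.Type {
          case appsv1.DeploymentAvailable:
              if c.Status == corev1.ConditionFalse {
                  return fmt.Errorf("deployment %s is not available: %s: %s", d.Name, c.Reason, c.Message)
              }
          case appsv1.DeploymentProgressing:
              if c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded" {
                  return fmt.Errorf("deployment %s has given up progressing: %s", d.Name, c.Message)
              }
          }
      }
      return nil
  }
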
Even if the manifest authors state no opinions, these are not
properties that we want to allow cluster admins to manipulate.  For
example, a Deployment in a customer cluster was recently stuck by an
injected reference to a non-existent secret [1]:

  $ yaml2json <namespaces/openshift-marketplace/apps/deployments.yaml | jq -r '.items[].spec.template.spec.containers[].envFrom[]'
  {
    "secretRef": {
      "name": "openshift-reg"
    }
  }
  $ yaml2json <namespaces/openshift-marketplace/pods/marketplace-operator-f7cc88d59-hhh75/marketplace-operator-f7cc88d59-hhh75.yaml | jq -r '.status.containerStatuses[].state'
  {
    "waiting": {
      "message": "secret \"openshift-reg\" not found",
      "reason": "CreateContainerConfigError"
    }
  }

The outgoing logic dates back to the beginning of reconciling these
properties in 14fab0b (add generic 2-way merge handler for random
types, 2018-09-27, openshift#26), and this commit's tightening follows a
number of earlier reconciliation tightenings such as 29b92d2
(lib/resourcemerge/core: Clear livenessProbe and readinessProbe if nil
in required, 2020-01-16, openshift#298).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1951339#c0
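
A hedged sketch of the tightened merge direction (the helper name and shape here are illustrative, not the CVO's exact resourcemerge code): the manifest's env and envFrom always win, even when they are empty, so the injected secretRef above gets cleared on the next reconcile:

  package sketch

  import (
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  // ensureContainerEnv copies env and envFrom from the manifest ("required")
  // onto the in-cluster container even when the manifest leaves them unset,
  // so injected entries like the "openshift-reg" secretRef above get cleared.
  func ensureContainerEnv(modified *bool, existing *corev1.Container, required corev1.Container) {
      if !reflect.DeepEqual(existing.Env, required.Env) {
          *modified = true
          existing.Env = required.Env
      }
      if !reflect.DeepEqual(existing.EnvFrom, required.EnvFrom) {
          *modified = true
          existing.EnvFrom = required.EnvFrom
      }
  }
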
@openshift-ci-robot (Contributor)

@wking: This pull request references Bugzilla bug 1951339, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

In response to this:

Bug 1951339: lib/resourcemerge/core: Clear env and envFrom if unset in manifest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 20, 2021
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2021
@jottofar (Contributor)

/retest

@jottofar (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2021
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit ed0250c into openshift:master Apr 22, 2021
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1951339 has been moved to the MODIFIED state.

In response to this:

Bug 1951339: lib/resourcemerge/core: Clear env and envFrom if unset in manifest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the always-reconcile-container-environment-variables branch April 23, 2021 23:44
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 14, 2021
Since this package was created in d9f6718 (lib: add lib for
applying objects, 2018-08-14, openshift#7), the volume(mount) merge logic has
required manifest entries to exist, but has allowed in-cluster entries
to persist without removal.  That hasn't been a problem until [1]:

1. In 4.3, the autoscaler asked for a ca-cert volume mount, based on
   the cluster-autoscaler-operator-ca config map.
2. In 4.4, the autoscaler dropped those manifest entries [2].
3. In 4.9, the autoscaler asked the CVO to remove the config map [3].

That led some born-in-4.3 clusters to have crashlooping autoscalers,
because the mount attempts kept failing on the missing config map.

We couldn't think of a plausible reason why cluster admins would want
to inject additional volume mounts in a CVO-managed pod configuration,
so this commit removes that ability and begins clearing away any
volume(mount) configuration that is not present in the reconciling
manifest.  Cluster administrators who do need to add additional mounts
in an emergency are free to use ClusterVersion's spec.overrides to
take control of a particular CVO-managed resource.

This joins a series of similar previous tightenings, including
02bb9ba (lib/resourcemerge/core: Clear env and envFrom if unset in
manifest, 2021-04-20, openshift#549) and ca299b8 (lib/resourcemerge: remove
ports which are no longer required, 2020-02-13, openshift#322).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2002834
[2]: openshift/cluster-autoscaler-operator@f08589d#diff-547486373183980619528df695869ed32b80c18383bc16b57a5ee931bf0edd39L89
[3]: openshift/cluster-autoscaler-operator@9a7b3be#diff-d0cf785e044c611986a4d9bdd65bb373c86f9eb1c97bd3f105062184342a872dR4
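
The same idea, sketched for the volume case this commit describes (illustrative names, not the exact CVO helpers): the manifest's volumes and volumeMounts replace whatever is in the cluster instead of only being merged in when present:

  package sketch

  import (
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  // ensureVolumes overwrites pod volumes and per-container volumeMounts from
  // the manifest ("required"), clearing stale in-cluster entries such as the
  // old ca-cert mount.
  func ensureVolumes(modified *bool, existing *corev1.PodSpec, required corev1.PodSpec) {
      if !reflect.DeepEqual(existing.Volumes, required.Volumes) {
          *modified = true
          existing.Volumes = required.Volumes
      }
      for i := range existing.Containers {
          for j := range required.Containers {
              if existing.Containers[i].Name != required.Containers[j].Name {
                  continue
              }
              if !reflect.DeepEqual(existing.Containers[i].VolumeMounts, required.Containers[j].VolumeMounts) {
                  *modified = true
                  existing.Containers[i].VolumeMounts = required.Containers[j].VolumeMounts
              }
          }
      }
  }

Admins who genuinely need extra mounts can still take control of a resource via ClusterVersion's spec.overrides, as the commit message notes.
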
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Sep 16, 2021
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Jul 25, 2022
to handle securityContext changes differently. Since d9f6718, if a
securityContext is not explicitly specified in the manifest, the
resource's securityContext will remain unchanged, and it will
continue to use the securityContext setting of the currently running
resource (if there is one). We're not sure why the logic was
originally developed in this manner, but this change joins
a series of similar previous tightenings, including
openshift@02bb9ba
(lib/resourcemerge/core: Clear env and envFrom if unset in
manifest, 2021-04-20, openshift#549) and
openshift@ca299b8
(lib/resourcemerge: remove ports which are no longer required,
2020-02-13, openshift#322).

Reconciliation of PodSpec and Container securityContext changes
has been updated such that the entire securityContext structure,
or any sub-field of it, will be cleared if not specified in the manifest.

For example, prior to this change assume Deployment machine-api-operator
is running on the cluster with the following:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

and during an upgrade the Deployment machine-api-operator no longer
specifies a securityContext. The resulting upgraded Deployment
machine-api-operator will still have the original securityContext:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

Similarly, there is no way to remove, or clear, a securityContext
field such as runAsUser. You can only modify it.

After this change the above scenario will correctly result in the
Deployment machine-api-operator not specifying securityContext
upon upgrade completion.

The changes apply to both the SecurityContext within a Container
and the PodSecurityContext within a PodSpec.
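
A hedged sketch of the whole-structure behaviour described above (the helper name is illustrative; the actual change lives in lib/resourcemerge): a securityContext that is absent from the manifest now clears the in-cluster one rather than leaving it untouched:

  package sketch

  import (
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  // ensurePodSecurityContext clears the in-cluster securityContext when the
  // manifest ("required") does not specify one, and otherwise overwrites it.
  func ensurePodSecurityContext(modified *bool, existing **corev1.PodSecurityContext, required *corev1.PodSecurityContext) {
      if required == nil {
          if *existing != nil {
              *modified = true
              *existing = nil
          }
          return
      }
      if !reflect.DeepEqual(*existing, required) {
          *modified = true
          *existing = required.DeepCopy()
      }
  }
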
jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Jul 26, 2022
to handle securityContext changes differently. Since d9f6718, if a
securityContext is not explicitly specified in the manifest, the
resource's securityContext will remain unchanged, and it will
continue to use the securityContext setting of the currently running
resource (if there is one). We're not sure why the logic was
originally developed in this manner, but this change joins
a series of similar previous tightenings, including
openshift@02bb9ba
(lib/resourcemerge/core: Clear env and envFrom if unset in
manifest, 2021-04-20, openshift#549) and
openshift@ca299b8
(lib/resourcemerge: remove ports which are no longer required,
2020-02-13, openshift#322).

Reconciliation has been changed such that the entire securityContext
structure, or any sub-field of it, will be cleared if not specified
in the manifest. This change affects Deployments, Jobs, and
DaemonSets. It affects the securityContext found in both a PodSpec
and a Container. Since the functions setInt64Ptr and setBoolPtr
have been changed, the impact is wider, also affecting ServiceAccounts,
the PodSpec fields ShareProcessNamespace and
TerminationGracePeriodSeconds, and the Job fields
ActiveDeadlineSeconds and ManualSelector.

For example, prior to this change assume Deployment machine-api-operator
is running on the cluster with the following:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

and during an upgrade the Deployment machine-api-operator no longer
specifies a securityContext. The resulting upgraded Deployment
machine-api-operator will still have the original securityContext:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

Similarly, there is no way to remove, or clear, a securityContext
field such as runAsUser. You can only modify it.

After this change the above scenario will correctly result in the
Deployment machine-api-operator not specifying securityContext
upon upgrade completion.

The changes apply to both the SecurityContext within a Container
and the PodSecurityContext within a PodSpec.
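
For the per-field helpers named above (setInt64Ptr and setBoolPtr are the names quoted in the commit message; the body below is a hedged sketch, not the exact implementation), the new behaviour is that a nil pointer in the manifest clears the in-cluster value instead of leaving it alone:

  package sketch

  // setInt64Ptr reconciles a *int64 field: a nil "required" now clears the
  // in-cluster value, while a non-nil "required" overwrites it when it differs.
  func setInt64Ptr(modified *bool, existing **int64, required *int64) {
      if required == nil {
          if *existing != nil {
              *modified = true
              *existing = nil
          }
          return
      }
      if *existing == nil || **existing != *required {
          *modified = true
          v := *required
          *existing = &v
      }
  }

Presumably setBoolPtr follows the same pattern for *bool fields.
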
petr-muller pushed a commit to petr-muller/cluster-version-operator that referenced this pull request Jun 30, 2023
petr-muller pushed a commit to petr-muller/cluster-version-operator that referenced this pull request Jul 10, 2023