Conversation

@petr-muller (Member) commented Jun 9, 2023

From OCPBUGS-5011:

After upgrading the cluster from 4.10 to 4.11, openshift-apiserver fails to match the new Pod Security Admission criteria.

Apiserver folks pinned this on CVO reconciliation behavior:

  • the openshift-apiserver-operator deployment/pod on 4.10 does not contain any definition of seccompProfile in the security context (it does on 4.11)
  • the Cluster Version Operator skips seccompProfile reconciliation when upgrading to 4.11
  • therefore the resulting deployment/pod after the upgrade will be missing seccompProfile, which causes the pod to fire PodSecurity violation alerts (the reconciliation gap is sketched below)
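
A minimal sketch of the missing reconciliation, in the spirit of the lib/resourcemerge helpers; the function name, signature, and main() here are hypothetical illustrations, not the CVO's actual code:

  package main

  import (
      "fmt"
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  // setSeccompProfilePtr mimics the pointer-field reconciliation this PR
  // backports; the real helpers live in lib/resourcemerge and may differ.
  // Before the fix, seccompProfile was never compared at all, so a 4.10
  // pod that lacked it kept lacking it even when the 4.11 manifest set it.
  func setSeccompProfilePtr(modified *bool, existing **corev1.SeccompProfile, required *corev1.SeccompProfile) {
      if !reflect.DeepEqual(*existing, required) {
          *existing = required
          *modified = true
      }
  }

  func main() {
      var running *corev1.SeccompProfile // 4.10 pod: no seccompProfile set
      required := &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault} // 4.11 manifest

      modified := false
      setSeccompProfilePtr(&modified, &running, required)
      fmt.Println(modified, running.Type) // true RuntimeDefault: the field is now reconciled
  }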

The CVO-side fix landed in 4.12 via #830 and was not backported to 4.11. Cherry-picking aae14d2 does not apply cleanly on release-4.11 without the changes from #804 (another 4.12 PR that changes SecurityContext reconciliation), so this PR backports both. The alternative, backporting only #830 with the necessary adjustments, is proposed in #947.

This PR resolves the problem identified by the apiserver team, but it does not fully resolve OCPBUGS-5011, for two reasons:

  • Even with the fixed CVO, there is a period during the update where we already run the 4.11 kube-apiserver (which brings the alert reporting the PSA violations) while we still have the 4.10 openshift-apiserver-operator ReplicaSet (which violates the PSA; this is only fixed in 4.11). So even with a fixed CVO, we will still trip the alert during the upgrade.
  • Because OCPBUGS-5011 was around for a long time, it became a honeypot for any report where the PodSecurityViolation alert fired, even ones unrelated to openshift-apiserver. The CVO change does not fix all culprits that trip this alert. I have not investigated the culprits unrelated to openshift-apiserver; some of them may be caused by the same gotcha as above, but they may also be separate problems.

So even if we merge this, the observable behavior of "we trip an alert during the update" will not disappear, which will disappoint everyone who linked a "my update tripped the alert" report to OCPBUGS-5011.

openshift-ci bot requested review from vrutkovs and wking June 9, 2023 10:51
@petr-muller changed the title from "lib/resourcemerge/core: Reconcile seccompProfile in ensurePodSecurityContext" to "WIP: lib/resourcemerge/core: Reconcile seccompProfile in ensurePodSecurityContext" Jun 9, 2023
openshift-ci bot added the do-not-merge/work-in-progress label Jun 9, 2023
@petr-muller (Member Author)

/hold

Possibly resolves OCPBUGS-5011 but needs more discussion and testing :)

openshift-ci bot added the do-not-merge/hold label Jun 9, 2023
@petr-muller (Member Author)

/test ci/prow/e2e-agnostic

openshift-ci bot (Contributor) commented Jun 29, 2023

@petr-muller: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-agnostic
  • /test e2e-agnostic-operator
  • /test e2e-agnostic-upgrade
  • /test gofmt
  • /test images
  • /test lint
  • /test unit

Use /test all to run all jobs.

In response to this:

/test ci/prow/e2e-agnostic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@petr-muller (Member Author)

/test e2e-agnostic

@petr-muller (Member Author)

/test ci/prow/e2e-agnostic

openshift-ci bot (Contributor) commented Jun 29, 2023

@petr-muller: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-agnostic
  • /test e2e-agnostic-operator
  • /test e2e-agnostic-upgrade
  • /test gofmt
  • /test images
  • /test lint
  • /test unit

Use /test all to run all jobs.

In response to this:

/test ci/prow/e2e-agnostic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@petr-muller (Member Author)

/test e2e-agnostic

@petr-muller force-pushed the ocpbugs-5011-backport-ocpbugs-575 branch from 4a1aedd to 7365c0b June 30, 2023 13:45
jottofar and others added 2 commits July 10, 2023 16:19

lib/resourcemerge: change SecurityContext reconcile

to handle securityContext changes differently. Since d9f6718, if a
securityContext is not explicitly specified in the manifest the
resource's securityContext will remain unchanged and it will
continue to use the securityContext setting of the currently running
resource (if there is one). We're not sure of the exact reason the
logic was originally developed in this manner, but this change joins
a series of similar previous tightenings, including
openshift@02bb9ba
(lib/resourcemerge/core: Clear env and envFrom if unset in
manifest, 2021-04-20, openshift#549) and
openshift@ca299b8
(lib/resourcemerge: remove ports which are no longer required,
2020-02-13, openshift#322).

Reconciliation has been changed such that the entire securityContext
structure, or any sub field of it, will be cleared if not specified
in the manifest. This change affects Deployments, Jobs, and
DaemonSets. It affects the securityContext found in both a PodSpec
and a Container. Since the functions setInt64Ptr and setBoolPtr
have been changed, the impact is wide, affecting ServiceAccounts, the
PodSpec fields ShareProcessNamespace and
TerminationGracePeriodSeconds, and the Job fields
ActiveDeadlineSeconds and ManualSelector. A sketch of the changed
helper shape follows.
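
As an illustration, one hypothetical post-change helper shape; the actual lib/resourcemerge code may differ in names and details:

  package resourcemerge

  // setInt64Ptr, sketched after the change: a nil value from the manifest
  // now clears the existing value, so a field dropped from a manifest
  // actually disappears from the running resource. Before, a nil
  // "required" was effectively ignored and stale values survived upgrades.
  func setInt64Ptr(modified *bool, existing **int64, required *int64) {
      if required == nil && *existing == nil {
          return
      }
      if required == nil || *existing == nil || **existing != *required {
          *existing = required
          *modified = true
      }
  }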

For example, prior to this change assume Deployment machine-api-operator
is running on the cluster with the following:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

and during an upgrade the Deployment machine-api-operator no longer
specifies a securityContext. The resulting upgraded Deployment
machine-api-operator will still have the original securityContext:

securityContext:
    runAsNonRoot: true
    runAsUser: 65534

Similarly, there is no way to remove, or clear, a securityContext
field such as runAsUser. You can only modify it.

After this change the above scenario will correctly result in the
Deployment machine-api-operator not specifying securityContext
upon upgrade completion.

The changes apply to both the SecurityContext within a Container
and the PodSecurityContext within a PodSpec.
lib/resourcemerge/core: Reconcile seccompProfile in ensurePodSecurityContext

This recently started breaking 4.10-to-4.11-to-4.12 updates [1] with
the first failing run [2] like:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-08-24T12:12:13Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-08-24T12:12:13Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.ci-2022-08-24-061508" image="registry.build03.ci.openshift.org/ci-op-2v4w1wbk/release@sha256:37a2aa2a46ed42eedf73cf0ee1ec0f9a37b822823518172c01ef6ad2ba3ffa12" architecture="amd64"
  2022-08-24T12:31:04Z Available=True : Done applying 4.11.1
  2022-08-24T14:07:56Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/package-server-manager has a replica failure FailedCreate: pods "package-server-manager-fbb7cb8b9-5zmwg" is forbidden: violates PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  2022-08-24T13:40:48Z Progressing=True WorkloadNotProgressing: Unable to apply 4.12.0-0.ci-2022-08-24-061508: the workload openshift-operator-lifecycle-manager/package-server-manager cannot roll out
  2022-08-24T12:36:55Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec

while the previous run [3] had:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-08-22T12:12:30Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-08-22T12:12:30Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.ci-2022-08-22-055706" image="registry.build03.ci.openshift.org/ci-op-5wrvyzw8/release@sha256:42cb26b6290927a1857dfc246336739cebd297c51d9af43f586ceaff9073e825" architecture="amd64"
  2022-08-22T12:36:17Z Available=True : Done applying 4.12.0-0.ci-2022-08-22-055706
  2022-08-22T14:09:42Z Failing=False :
  2022-08-22T14:44:58Z Progressing=False : Cluster version is 4.12.0-0.ci-2022-08-22-055706
  2022-08-22T12:40:12Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec

Jian Zhang also noticed the breakage in QE [4].

The breaking change was an OLM namespace touch:

  $ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760/artifacts/release/artifacts/release-images-latest
  $ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880/artifacts/release/artifacts/release-images-latest
  $ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
  $ diff -U0 <(curl -s "${REF_A}" | jq -r "${JQ}") <(curl -s "${REF_B}" | jq -r "${JQ}") | grep olm
  -operator-lifecycle-manager openshift/operator-framework-olm@fd42910
  -operator-registry openshift/operator-framework-olm@fd42910
  +operator-lifecycle-manager openshift/operator-framework-olm@f8c466a
  +operator-registry openshift/operator-framework-olm@f8c466a
  $ git clone https://github.com/openshift/operator-framework-olm
  $ cd operator-framework-olm
  $ git --no-pager log --oneline fd429109b22e8..f8c466aeea67 | grep namespace
  be84d80d (psa) restrict olm namespace + remove labels from openshift-operators ns
  $ git --no-pager show --oneline be84d80d -- manifests
  be84d80d (psa) restrict olm namespace + remove labels from openshift-operators ns
  diff --git a/manifests/0000_50_olm_00-namespace.yaml b/manifests/0000_50_olm_00-namespace.yaml
  index 8fffa527..168e8867 100644
  --- a/manifests/0000_50_olm_00-namespace.yaml
  +++ b/manifests/0000_50_olm_00-namespace.yaml
  @@ -3,6 +3,8 @@ kind: Namespace
   metadata:
     name: openshift-operator-lifecycle-manager
     labels:
  +    pod-security.kubernetes.io/enforce: restricted
  +    pod-security.kubernetes.io/enforce-version: "v1.24"
       openshift.io/scc: "anyuid"
       openshift.io/cluster-monitoring: "true"
     annotations:
  @@ -16,7 +18,7 @@ kind: Namespace
   metadata:
     name: openshift-operators
     labels:
  -    pod-security.kubernetes.io/enforce: baseline
  +    pod-security.kubernetes.io/enforce: privileged
       pod-security.kubernetes.io/enforce-version: "v1.24"
       openshift.io/scc: "anyuid"
     annotations:

which went out with [5], and will not be backported to 4.11.

The relevant seccompProfile change landed in 4.11 (but not in 4.10),
which is why born-in-4.11 clusters are not impacted by the change:

  $ git --no-pager grep seccompProfile: origin/release-4.11 -- manifests
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                              seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_06-psm-operator.deployment.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-collect-profiles.cronjob.yaml:            seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-olm-operator.deployment.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_08-catalog-operator.deployment.yaml:        seccompProfile:
  $ git --no-pager grep seccompProfile: origin/release-4.10 -- manifests
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                              seccompProfile:

The property itself is from Kubernetes 1.19:

  $ git clone https://github.com/kubernetes/api
  $ cd api
  $ git blame origin/release-1.19 -- core/v1/types.go | grep '^type .* struct\|SeccompProfile.*seccompProfile' | grep -B1 SeccompProfile
  20b270679a (Paulo Gomes             2020-06-24 21:37:49 +0100 3306)     SeccompProfile *SeccompProfile `json:"seccompProfile,omitempty" protobuf:"bytes,10,opt,name=seccompProfile"`
  20b270679a (Paulo Gomes             2020-06-24 21:37:49 +0100 5979)     SeccompProfile *SeccompProfile `json:"seccompProfile,omitempty" protobuf:"bytes,11,opt,name=seccompProfile"`
  $ git blame origin/release-1.18 -- core/v1/types.go | grep 'SeccompProfile.*seccompProfile'
  ...no hits...

So the fact that the CVO was not reconciling it would technically be a
bug since OpenShift 4.6, which shipped Kubernetes 1.19.  But auditing
the most recent named release in each minor branch:

  $ oc adm release extract --to 4.6 quay.io/openshift-release-dev/ocp-release:4.6.60-x86_64
  $ oc adm release extract --to 4.7 quay.io/openshift-release-dev/ocp-release:4.7.56-x86_64
  $ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.48-x86_64
  $ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.47-x86_64
  $ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.30-x86_64
  $ grep -lr seccompProfile: 4.*
  4.10/0000_50_cluster-monitoring-operator_00_0alertmanager-custom-resource-definition.yaml
  4.10/0000_50_cluster-monitoring-operator_00_0prometheus-custom-resource-definition.yaml
  4.10/0000_50_cluster-monitoring-operator_00_0thanosruler-custom-resource-definition.yaml
  4.10/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.7/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.8/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.9/0000_50_olm_00-clusterserviceversions.crd.yaml

And we have sound CRD reconciliation, so those pre-4.11 hits are
covered.  At the moment, the only manifests asking for seccompProfile
where we'd be worried about a lack of CVO reconciliation are from 4.11
and later.

In this commit, I'm adding seccompProfile reconciliation.  There are
quite a few other properties within (Pod)SecurityContext that the CVO
is still ignoring, and I'm unclear on why we have been using
per-property merging instead of DeepEqual (contrast sketched below).
But I'm going to punt on a broader move to DeepEqual for now, to make
the narrow fix that should unstick 4.10-to-4.11-to-4.12 updates.
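
For contrast, the broader DeepEqual approach being punted on might look something like this hypothetical sketch (not the code this PR adds):

  package resourcemerge

  import (
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  // ensurePodSecurityContextDeepEqual reconciles the whole struct at once,
  // instead of per-property helpers that must name each field explicitly
  // (and silently skip any field nobody has added yet, as happened with
  // seccompProfile before this fix).
  func ensurePodSecurityContextDeepEqual(modified *bool, existing **corev1.PodSecurityContext, required *corev1.PodSecurityContext) {
      if !reflect.DeepEqual(*existing, required) {
          *existing = required
          *modified = true
      }
  }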

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760
[4]: https://issues.redhat.com/browse/OCPBUGS-575
[5]: openshift/operator-framework-olm#367
@petr-muller force-pushed the ocpbugs-5011-backport-ocpbugs-575 branch from 7365c0b to c1c2109 July 10, 2023 14:19
@petr-muller changed the title from "WIP: lib/resourcemerge/core: Reconcile seccompProfile in ensurePodSecurityContext" to "OCPBUGS-5011: Backport SecurityContext reconciliation behavior" Jul 10, 2023
openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Jul 10, 2023
@openshift-ci-robot (Contributor)

@petr-muller: This pull request references Jira Issue OCPBUGS-5011, which is invalid:

  • expected the bug to target the "4.11.z" version, but no target version was set
  • expected Jira Issue OCPBUGS-5011 to depend on a bug targeting a version in 4.12.0, 4.12.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot removed the do-not-merge/work-in-progress label Jul 10, 2023


@petr-muller (Member Author)

/test unit

@petr-muller (Member Author)

/uncc @vrutkovs
/cc @LalatenduMohanty

openshift-ci bot requested a review from LalatenduMohanty and removed the request for vrutkovs July 10, 2023 15:31
@petr-muller (Member Author)

/retest

openshift-ci bot (Contributor) commented Jul 11, 2023

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@petr-muller (Member Author)

/jira refresh

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Jul 12, 2023
@openshift-ci-robot (Contributor)

@petr-muller: This pull request references Jira Issue OCPBUGS-5011, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.11.z) matches configured target version for branch (4.11.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-575 is in the state Closed (Done), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-575 targets the "4.12.0" version, which is one of the valid target versions: 4.12.0, 4.12.z
  • bug has dependents

Requesting review from QA contact:
/cc @gkarager

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested a review from gkarager July 12, 2023 13:54
@petr-muller (Member Author)

Based on yesterday's talk with @LalatenduMohanty, we prefer this PR over #947, so I closed that one and am unholding this one.

/hold cancel

openshift-ci bot removed the do-not-merge/hold label Jul 13, 2023
@wking (Member) left a comment
Backports look clean:

$ cherry-pick-diff origin/release-4.11..origin/pr/940 origin/release-4.12
aae14d2a -> c1c21091 lib/resourcemerge/core: Reconcile seccompProfile in ensurePodSecurityContext
619baaee -> 617e84d6 lib/resourcemerge: change SecurityContext reconcile

And the logic outlined in the pull request's topic post makes sense to me.

/lgtm

openshift-ci bot added the lgtm and approved labels Jul 17, 2023
@openshift-ci-robot (Contributor)

@petr-muller: This pull request references Jira Issue OCPBUGS-5011, which is valid.

6 validation(s) were run on this bug:
  • bug is open, matching expected state (open)
  • bug target version (4.11.z) matches configured target version for branch (4.11.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-575 is in the state Closed (Done), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-575 targets the "4.12.0" version, which is one of the valid target versions: 4.12.0, 4.12.z
  • bug has dependents

Requesting review from QA contact:
/cc @gkarager

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LalatenduMohanty (Member) left a comment

/label backport-risk-assessed

openshift-ci bot added the backport-risk-assessed label Jul 21, 2023
openshift-ci bot (Contributor) commented Jul 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty, wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jiajliu commented Jul 21, 2023

/label cherry-pick-approved

openshift-ci bot added the cherry-pick-approved label Jul 21, 2023
openshift-merge-robot merged commit 8966b29 into openshift:release-4.11 Jul 21, 2023
@openshift-ci-robot (Contributor)

@petr-muller: Jira Issue OCPBUGS-5011: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-5011 has been moved to the MODIFIED state.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.11.0-202311211130.p0.g8966b29.assembly.stream for distgit cluster-version-operator.
All builds following this will include this PR.
