
Conversation

@petr-muller (Member)

⚠️ this PR is mutually exclusive with #940 ⚠️


From OCPBUGS-5011:

After upgrading the cluster from 4.10 to 4.11, openshift-apiserver fails to match the new Pod Security Admission criteria.

Apiserver folks pinned this on CVO reconciliation behavior:

  • the openshift-apiserver-operator deployment/pod on 4.10 doesn't contain any definition of seccompProfile in the security context (it does on 4.11)
  • the Cluster Version Operator actually skips seccompProfile reconciliation when upgrading to 4.11
  • therefore the resulting deployment/pod after the upgrade will be missing seccompProfile, and this will cause the pod to fire PS violation alerts

The CVO only gained seccompProfile reconciliation with #830 in 4.12, and that change was not backported to 4.11. The cherry-pick of aae14d2 does not apply cleanly on release-4.11 without the changes from #804 (another 4.12 PR that changes SecurityContext reconciliation), so this PR adds the necessary adjustments, including changing the behavior expected by the backported tests so that it matches the CVO behavior after backporting only #830 without #804. The alternative, backporting both, is proposed in #940.
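To make the gap concrete, here is a minimal, self-contained Go sketch of the per-property SecurityContext merging style, assuming a simplified ensureSecurityContext helper; it is not the actual lib/resourcemerge code. Because the helper has no branch for SeccompProfile, the field requested by the 4.11 manifest is never copied onto a pod spec created on 4.10:

  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
  )

  // ensureSecurityContext copies selected fields from the manifest (required)
  // onto the in-cluster object (existing), one property at a time.
  func ensureSecurityContext(modified *bool, existing *corev1.SecurityContext, required corev1.SecurityContext) {
      if required.RunAsNonRoot != nil && (existing.RunAsNonRoot == nil || *existing.RunAsNonRoot != *required.RunAsNonRoot) {
          existing.RunAsNonRoot = required.RunAsNonRoot
          *modified = true
      }
      // No branch for required.SeccompProfile: the field is silently ignored,
      // which is the reconciliation gap described above.
  }

  func main() {
      runAsNonRoot := true
      required := corev1.SecurityContext{ // what the 4.11 manifest asks for
          RunAsNonRoot:   &runAsNonRoot,
          SeccompProfile: &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault},
      }
      existing := &corev1.SecurityContext{} // what a 4.10-born pod spec carries

      modified := false
      ensureSecurityContext(&modified, existing, required)
      // Prints modified=true, but existing.SeccompProfile is still nil.
      fmt.Printf("modified=%v seccompProfile=%v\n", modified, existing.SeccompProfile)
  }

Backporting #830 amounts to adding the missing SeccompProfile handling to helpers like this one.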

This PR resolves the problem identified by apiserver, but it does not fully resolve OCPBUGS-5011 for these two reasons:

  • Even with the fixed CVO, there is a period during the update when we already run the 4.11 kube-apiserver (which brings the alert that reports PSA violations) while we still have the 4.10 openshift-apiserver-operator ReplicaSet (which violates the PSA; it is only fixed in 4.11), so we will still trip the alert during the upgrade.
  • Because OCPBUGS-5011 was around for a long time, it became a honeypot for any report where the PodSecurityViolation alert fired, even ones unrelated to openshift-apiserver. The CVO change does not fix all culprits that trip this alert. I have not investigated the culprits unrelated to openshift-apiserver; it is possible that some of them are caused by the same gotcha as above, but it is also possible that they are separate problems.

So even if we merge this, the observable behavior of "we trip an alert during the update" will not disappear, disappointing all people who linked a "my update tripped the alert" report to OCPBUGS-5011.

wking and others added 3 commits July 10, 2023 16:52
…Context

This recently started breaking 4.10-to-4.11-to-4.12 updates [1] with
the first failing run [2] like:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-08-24T12:12:13Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-08-24T12:12:13Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.ci-2022-08-24-061508" image="registry.build03.ci.openshift.org/ci-op-2v4w1wbk/release@sha256:37a2aa2a46ed42eedf73cf0ee1ec0f9a37b822823518172c01ef6ad2ba3ffa12" architecture="amd64"
  2022-08-24T12:31:04Z Available=True : Done applying 4.11.1
  2022-08-24T14:07:56Z Failing=True WorkloadNotProgressing: deployment openshift-operator-lifecycle-manager/package-server-manager has a replica failure FailedCreate: pods "package-server-manager-fbb7cb8b9-5zmwg" is forbidden: violates PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  2022-08-24T13:40:48Z Progressing=True WorkloadNotProgressing: Unable to apply 4.12.0-0.ci-2022-08-24-061508: the workload openshift-operator-lifecycle-manager/package-server-manager cannot roll out
  2022-08-24T12:36:55Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec

while the previous run [3] had:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760/artifacts/e2e-aws-sdn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2022-08-22T12:12:30Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2022-08-22T12:12:30Z ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.12.0-0.ci-2022-08-22-055706" image="registry.build03.ci.openshift.org/ci-op-5wrvyzw8/release@sha256:42cb26b6290927a1857dfc246336739cebd297c51d9af43f586ceaff9073e825" architecture="amd64"
  2022-08-22T12:36:17Z Available=True : Done applying 4.12.0-0.ci-2022-08-22-055706
  2022-08-22T14:09:42Z Failing=False :
  2022-08-22T14:44:58Z Progressing=False : Cluster version is 4.12.0-0.ci-2022-08-22-055706
  2022-08-22T12:40:12Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec

Jian Zhang also noticed the breakage in QE [4].

The breaking change was an OLM namespace touch:

  $ REF_A=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760/artifacts/release/artifacts/release-images-latest
  $ REF_B=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880/artifacts/release/artifacts/release-images-latest
  $ JQ='[.spec.tags[] | .name + " " + .annotations["io.openshift.build.source-location"] + "/commit/" + .annotations["io.openshift.build.commit.id"]] | sort[]'
  $ diff -U0 <(curl -s "${REF_A}" | jq -r "${JQ}") <(curl -s "${REF_B}" | jq -r "${JQ}") | grep olm
  -operator-lifecycle-manager openshift/operator-framework-olm@fd42910
  -operator-registry openshift/operator-framework-olm@fd42910
  +operator-lifecycle-manager openshift/operator-framework-olm@f8c466a
  +operator-registry openshift/operator-framework-olm@f8c466a
  $ git clone https://github.com/openshift/operator-framework-olm
  $ cd operator-framework-olm
  $ git --no-pager log --oneline fd429109b22e8..f8c466aeea67 | grep namespace
  be84d80d (psa) restrict olm namespace + remove labels from openshift-operators ns
  $ git --no-pager show --oneline be84d80d -- manifests
  be84d80d (psa) restrict olm namespace + remove labels from openshift-operators ns
  diff --git a/manifests/0000_50_olm_00-namespace.yaml b/manifests/0000_50_olm_00-namespace.yaml
  index 8fffa527..168e8867 100644
  --- a/manifests/0000_50_olm_00-namespace.yaml
  +++ b/manifests/0000_50_olm_00-namespace.yaml
  @@ -3,6 +3,8 @@ kind: Namespace
   metadata:
     name: openshift-operator-lifecycle-manager
     labels:
  +    pod-security.kubernetes.io/enforce: restricted
  +    pod-security.kubernetes.io/enforce-version: "v1.24"
       openshift.io/scc: "anyuid"
       openshift.io/cluster-monitoring: "true"
     annotations:
  @@ -16,7 +18,7 @@ kind: Namespace
   metadata:
     name: openshift-operators
     labels:
  -    pod-security.kubernetes.io/enforce: baseline
  +    pod-security.kubernetes.io/enforce: privileged
       pod-security.kubernetes.io/enforce-version: "v1.24"
       openshift.io/scc: "anyuid"
     annotations:

which went out with [5], and will not be backported to 4.11.

The relevant seccompProfile change landed in 4.11 (but not in 4.10),
which is why born-in-4.11 clusters are not impacted by the change:

  $ git --no-pager grep seccompProfile: origin/release-4.11 -- manifests
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                              seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_06-psm-operator.deployment.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-collect-profiles.cronjob.yaml:            seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-olm-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_07-olm-operator.deployment.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_08-catalog-operator.deployment.ibm-cloud-managed.yaml:        seccompProfile:
  origin/release-4.11:manifests/0000_50_olm_08-catalog-operator.deployment.yaml:        seccompProfile:
  $ git --no-pager grep seccompProfile: origin/release-4.10 -- manifests
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                                    seccompProfile:
  origin/release-4.10:manifests/0000_50_olm_00-clusterserviceversions.crd.yaml:                                              seccompProfile:

The property itself is from Kubernetes 1.19:

  $ git clone https://github.com/kubernetes/api
  $ cd api
  $ git blame origin/release-1.19 -- core/v1/types.go | grep '^type .* struct\|SeccompProfile.*seccompProfile' | grep -B1 SeccompProfile
  20b270679a (Paulo Gomes             2020-06-24 21:37:49 +0100 3306)     SeccompProfile *SeccompProfile `json:"seccompProfile,omitempty" protobuf:"bytes,10,opt,name=seccompProfile"`
  20b270679a (Paulo Gomes             2020-06-24 21:37:49 +0100 5979)     SeccompProfile *SeccompProfile `json:"seccompProfile,omitempty" protobuf:"bytes,11,opt,name=seccompProfile"`
  $ git blame origin/release-1.18 -- core/v1/types.go | grep 'SeccompProfile.*seccompProfile'
  ...no hits...

So the fact that the CVO was not reconciling it has technically been a
bug since OpenShift 4.6, which shipped Kubernetes 1.19.  But auditing
the most recent named release in each minor branch:

  $ oc adm release extract --to 4.6 quay.io/openshift-release-dev/ocp-release:4.6.60-x86_64
  $ oc adm release extract --to 4.7 quay.io/openshift-release-dev/ocp-release:4.7.56-x86_64
  $ oc adm release extract --to 4.8 quay.io/openshift-release-dev/ocp-release:4.8.48-x86_64
  $ oc adm release extract --to 4.9 quay.io/openshift-release-dev/ocp-release:4.9.47-x86_64
  $ oc adm release extract --to 4.10 quay.io/openshift-release-dev/ocp-release:4.10.30-x86_64
  $ grep -lr seccompProfile: 4.*
  4.10/0000_50_cluster-monitoring-operator_00_0alertmanager-custom-resource-definition.yaml
  4.10/0000_50_cluster-monitoring-operator_00_0prometheus-custom-resource-definition.yaml
  4.10/0000_50_cluster-monitoring-operator_00_0thanosruler-custom-resource-definition.yaml
  4.10/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.7/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.8/0000_50_olm_00-clusterserviceversions.crd.yaml
  4.9/0000_50_olm_00-clusterserviceversions.crd.yaml

Those hits are all in CRD manifests, and we have sound CRD
reconciliation.  So at the moment, the only manifests asking for
seccompProfile where we'd be worried about a lack of CVO reconciliation
are in 4.11 and later.

In this commit, I'm adding seccompProfile reconciliation.  There are
quite a few other properties within (Pod)SecurityContext that the CVO
is still ignoring, and I'm unclear on why we have been using
per-property merging instead of DeepEqual.  But I'm going to punt on a
broader move to DeepEqual for now, to make the narrow fix that should
unstick 4.10-to-4.11-to-4.12 updates.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1562409104817786880
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-from-stable-4.10-e2e-aws-sdn-upgrade/1561684150702837760
[4]: https://issues.redhat.com/browse/OCPBUGS-575
[5]: openshift/operator-framework-olm#367
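As a rough illustration of the narrow fix described in the commit message above, and of the broader DeepEqual approach it punts on, here is a hedged Go sketch; the ensureSecurityContext helper is a simplified stand-in, not the actual aae14d2 diff:

  package main

  import (
      "fmt"
      "reflect"

      corev1 "k8s.io/api/core/v1"
  )

  func ensureSecurityContext(modified *bool, existing *corev1.SecurityContext, required corev1.SecurityContext) {
      // Narrow fix: keep the per-property merge style, but add a branch that
      // reconciles the one field we know is being dropped.
      if required.SeccompProfile != nil && !reflect.DeepEqual(existing.SeccompProfile, required.SeccompProfile) {
          existing.SeccompProfile = required.SeccompProfile
          *modified = true
      }

      // Broader alternative (punted on for now): replace the whole struct when
      // it differs, so every present and future field is reconciled at once.
      //
      // if !reflect.DeepEqual(*existing, required) {
      //     *existing = required
      //     *modified = true
      // }
  }

  func main() {
      required := corev1.SecurityContext{
          SeccompProfile: &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault},
      }
      existing := &corev1.SecurityContext{} // pod spec left over from 4.10

      modified := false
      ensureSecurityContext(&modified, existing, required)
      fmt.Printf("modified=%v type=%v\n", modified, existing.SeccompProfile.Type)
  }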
The backported change aae14d2 added a bunch of tests that also exercise
behavior changes previously introduced in 619baae, so these tests do not
pass without backporting those behavior changes as well.

Capture the actual behavior of backporting only aae14d2 in the tests so
that they pass.
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jul 10, 2023
@openshift-ci-robot (Contributor)

@petr-muller: This pull request references Jira Issue OCPBUGS-5011, which is invalid:

  • expected the bug to target the "4.11.z" version, but no target version was set
  • expected Jira Issue OCPBUGS-5011 to depend on a bug targeting a version in 4.12.0, 4.12.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 10, 2023
@openshift-ci openshift-ci bot requested review from vrutkovs and wking July 10, 2023 15:17
@openshift-ci (Contributor)

openshift-ci bot commented Jul 10, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: petr-muller
Once this PR has been reviewed and has the lgtm label, please assign wking for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller (Member Author)

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 10, 2023
@petr-muller (Member Author)

/uncc @vrutkovs
/cc @LalatenduMohanty

@openshift-ci openshift-ci bot requested review from LalatenduMohanty and removed request for vrutkovs July 10, 2023 15:31
@petr-muller (Member Author)

/retest

@openshift-ci (Contributor)

openshift-ci bot commented Jul 11, 2023

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


@petr-muller (Member Author)

Based on yesterday's discussion with @LalatenduMohanty, we prefer #940

/close

@openshift-ci openshift-ci bot closed this Jul 13, 2023
@openshift-ci (Contributor)

openshift-ci bot commented Jul 13, 2023

@petr-muller: Closed this PR.


@openshift-ci-robot (Contributor)

@petr-muller: This pull request references Jira Issue OCPBUGS-5011. The bug has been updated to no longer refer to the pull request using the external bug tracker.
