OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900
Conversation
@Miciah: This pull request references Jira Issue OCPBUGS-56281, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
fc08232 to bf853bf (force-push)
Job Failure Risk Analysis for sha: bf853bf
LGTM, @melvinjoseph86 PTAL

/lgtm

/retest
Job Failure Risk Analysis for sha: 1967dd2
1967dd2 to ab81b79 (force-push)
Job Failure Risk Analysis for sha: ab81b79
ab81b79 to 1dcc98a (force-push)
https://github.com/openshift/origin/compare/1967dd22c83963e780eb9953bc38da760e090dc8..1dcc98a3c2ec7c38dcee818e750e14ce57d70892 made these changes:
Before these changes, Also, comparing
Job Failure Risk Analysis for sha: 1dcc98a

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@abhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/101e8ee0-4acb-11f0-928a-4bd1c2be89d0-0
e2e.Failf("Failed to delete GatewayClass %q", gatewayClassName)
}
g.By("Deleting the OSSM Operator resources")
I'm curious why we don't use an owner reference for the Subscription. We could set the gatewayclass as the owner and let Kube do the cascading deletion.
Update: Deletion of a Subscription doesn't delete the CSV or CRDs. The CRD part is understandable: there can be some data loss. But the CSV is kinda interesting.
There are a few reasons not to put or rely on an owner reference on the subscription:
- You could create the subscription manually; we cannot assume that the operator created it.
- You could have multiple gatewayclasses with our controller name, and then it isn't clear how we would configure the owner references on the subscription. Would we add only the first gatewayclass with our controller name? Would we add all gatewayclasses with our controller name? If we added more than one owner reference, would we need to delete old owner references when the corresponding gatewayclasses were deleted? If we did delete stale owner references, would that prevent garbage collection, or would we always leave one non-stale reference to trigger garbage collection?
- I don't know for sure that OLM doesn't look at the owner reference. We would need to check this.
- I am not confident that an owner reference would cause the subscription to be deleted as the owner reference on the Istio CR didn't cause it to be deleted (see OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900 (comment)).
- Deleting the Istio CR only requires changing the test, it is more explicit than relying on garbage collection, and it is more obviously safe to backport.
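For concreteness, the cascading-deletion idea being debated would mean stamping something like the following onto the Subscription. This is an illustrative manifest, not code from this PR: the subscription name, namespace, channel, and source are placeholders, and, as noted above, it is unverified whether OLM tolerates such a reference.

```yaml
# Illustrative only: a Subscription owned by a cluster-scoped GatewayClass.
# The subscription name, namespace, channel, and source below are made up.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: servicemeshoperator3
  namespace: openshift-operators
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    name: openshift-default
    uid: 3f6ef6ed-9e6b-4821-9706-221ff0bca83e   # must match the live GatewayClass uid
spec:
  channel: stable
  name: servicemeshoperator3
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

Deleting the openshift-default GatewayClass would then rely on the garbage collector cascading to the Subscription, the same mechanism the Istio CR already uses via its own ownerReference.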
g.By("Deleting the Istio CR")
o.Expect(oc.AsAdmin().Run("delete").Args("--ignore-not-found=true", "istio", istioName).Execute()).Should(o.Succeed())
The Istio CR is supposed to be garbage collected, since its owner reference points to the gatewayclass.
The owner reference on the Istio CR didn't cause it to be deleted (see #29900 (comment)).
The owner reference on the Istio CR didn't cause it to be deleted
I didn't manage to reproduce this behavior. I saw that the Istio CR gets deleted after the GatewayClass:
$ oc get gc
NAME CONTROLLER ACCEPTED AGE
openshift-default openshift.io/gateway-controller/v1 True 4m12s
04:57:08 $ oc get istio
NAME REVISIONS READY IN USE ACTIVE REVISION STATUS VERSION AGE
openshift-gateway 1 1 0 openshift-gateway Healthy v1.24.3 4m18s
04:57:14 $ oc get istio openshift-gateway -o yaml | yq .metadata.ownerReferences[0]
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
name: openshift-default
uid: 3f6ef6ed-9e6b-4821-9706-221ff0bca83e
04:57:34 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
istiod-openshift-gateway-7b567bc8b4-z9972 1/1 Running 0 4m48s
router-default-76c4888886-fmtzq 1/1 Running 0 77m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
04:57:52 $ oc delete gc openshift-default
gatewayclass.gateway.networking.k8s.io "openshift-default" deleted
04:58:07 $ oc get istio
No resources found
04:58:14 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
router-default-76c4888886-fmtzq 1/1 Running 0 78m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
--ignore-not-found=true will prevent the delete from failing if GC has already deleted the object. I'll add a code comment that the delete might be superfluous but it's there just in case.
if err != nil && strings.Contains(err.Error(), "not found") {
	e2e.Logf("Subscription %q not found; retrying...", expectedSubscriptionName)
	return false, nil
}
I think that we should be consistent among all the polls we do in this block. I personally prefer how it's done for the OSSM deployment below:
if err != nil {
e2e.Logf("Failed to get OSSM operator deployment %q: %v; retrying...", deploymentOSSMName, err)
return false, nil
}
No assertions, just a retry for any error until the timeout is triggered. I think that some errors (not only "Not Found") can be temporary or intermittent.
I was trying to keep my changes more narrowly focused. All right, I can make the polling loop for the subscription retry on all errors.
1dcc98a to 38d8018 (force-push)
LGTM, holding off for @alebedev87's comments
Job Failure Risk Analysis for sha: 38d8018
The aggregated jobs each failed while building the tests-openshift.origin-amd64 image, with the error message "Error: Unable to find a match: python3-cinderclient" (missing RPM package). I'll retry in case it was a glitch with the Yum repository.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aad73360-4cbf-11f0-9efa-6a57a5235fed-0
This time, all the aggregated jobs failed to build the image with the error message "Error: Unable to find a match: realtime-tests rteval".
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c0124e40-4d46-11f0-9623-e230cc269dc8-0
Job Failure Risk Analysis for sha: 573f478
(2 similar comments)
/hold Revision 573f478 was retested 3 times: holding
Job Failure Risk Analysis for sha: 573f478
(1 similar comment)
ci/prow/e2e-aws-csi failed because the ELB for the API couldn't be provisioned: "TooManyLoadBalancers" suggests the CI account hit a quota.
/test e2e-aws-csi
e2e-openstack-ovn failed because the installer reported, "failed to provision control-plane machines within 15m0s". I wasn't able to find any further details in the CI artifacts.
/test e2e-openstack-ovn
/hold cancel
/hold Revision 573f478 was retested 3 times: holding
None of the failing jobs are required, so I don't understand the reason for the hold. The e2e-vsphere-ovn job has "unknown" in the "Required" column, so maybe that is confusing openshift-ci-robot?
/skip
e2e-vsphere-ovn failed on several tests:
I believe all the failures were caused by flaky tests, so I'm going to rerun the CI job.
/test e2e-vsphere-ovn
/hold cancel
@Miciah: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@Miciah: Jira Issue Verification Checks: Jira Issue OCPBUGS-56281 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 In response to this:
This reverts the inclusion of the olm dependency in openshift#29900

Signed-off-by: Todd Short <[email protected]>
/cherry-pick release-4.20
@jupierce: #29900 failed to apply on top of branch "release-4.20": In response to this:
Fix included in accepted release 4.21.0-0.nightly-2025-10-22-123727 |
gatewayapicontroller: Add checks for empty slices
Check whether the slice of parent resource references in an httproute's status is empty before indexing the slice.
Before this commit, the "Ensure HTTPRoute object is created" test sometimes panicked with "runtime error: index out of range [0] with length 0".
Similarly, check whether the slice of load-balancer ingress points in a service's status is empty before indexing it.
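The guard described above amounts to checking the slice's length before indexing it. A minimal self-contained sketch, where ParentRef and firstParentName are hypothetical stand-ins rather than the real Gateway API types:

```go
package main

import "fmt"

// ParentRef is a simplified stand-in for a parent resource reference in an
// httproute's status (hypothetical type for illustration).
type ParentRef struct{ Name string }

// firstParentName guards against an empty slice instead of indexing [0]
// blindly, which is what caused the "index out of range" panic.
func firstParentName(parents []ParentRef) (string, bool) {
	if len(parents) == 0 {
		return "", false // status not populated yet; the caller should retry
	}
	return parents[0].Name, true
}

func main() {
	_, ok := firstParentName(nil)
	fmt.Println(ok) // prints: false
	name, _ := firstParentName([]ParentRef{{Name: "openshift-default"}})
	fmt.Println(name) // prints: openshift-default
}
```

Returning a boolean lets the surrounding poll loop treat "not populated yet" as a retry condition rather than a panic.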
gatewayapicontroller: Clean up resources when done
Delete the gatewayclass and uninstall OSSM after all the Gateway API controller tests are done.
Before this change, the Gateway API controller tests left OSSM installed, including the subscription, CSV, installplan, bundled CRDs, RBAC resources, deployment, service, serviceaccount, etc., when the tests were finished. This clutter could cause problems for other tests, or for the same test if it was run again.
The new cleanup logic uses the OperatorsV1 client from github.com/operator-framework/operator-lifecycle-manager. Importing this package requires a replace stanza for openshift/api in go.mod.
This vendors github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa rather than the newest revision, in order to avoid additional problematic vendor bumps that the newest revision would bring in.
gatewayapicontroller: Always log errors
Add the error value to some log messages that were missing it.
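A replace stanza of the kind mentioned might look roughly like this in go.mod. This is an illustrative fragment only: the openshift/api pseudo-version shown is a placeholder, not the revision the PR actually pins.

```
// go.mod fragment (illustrative)
require github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa

// Placeholder pseudo-version: forces the openshift/api revision the rest of
// the dependency tree expects instead of the one OLM would otherwise pull in.
replace github.com/openshift/api => github.com/openshift/api v0.0.0-00010101000000-000000000000
```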