OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900
Conversation
@Miciah: This pull request references Jira Issue OCPBUGS-56281, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
fc08232 to bf853bf (force-push)
Job Failure Risk Analysis for sha: bf853bf
LGTM, @melvinjoseph86 PTAL

/lgtm

/retest
Job Failure Risk Analysis for sha: 1967dd2
1967dd2 to ab81b79 (force-push)
Job Failure Risk Analysis for sha: ab81b79
ab81b79 to 1dcc98a (force-push)
https://github.com/openshift/origin/compare/1967dd22c83963e780eb9953bc38da760e090dc8..1dcc98a3c2ec7c38dcee818e750e14ce57d70892 made these changes:
Before these changes, Also, comparing
Job Failure Risk Analysis for sha: 1dcc98a

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@abhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/101e8ee0-4acb-11f0-928a-4bd1c2be89d0-0
e2e.Failf("Failed to delete GatewayClass %q", gatewayClassName)
}
g.By("Deleting the OSSM Operator resources")
I'm curious why we don't use an owner reference for the Subscription. We could set the gatewayclass as the owner and let Kube do the cascading deletion.
Update: Deletion of a Subscription doesn't delete the CSV or CRDs. The CRD part is understandable: there can be some data loss. But the CSV is kinda interesting.
There are a few reasons not to put or rely on an owner reference on the subscription:
- You could create the subscription manually; we cannot assume that the operator created it.
- You could have multiple gatewayclasses with our controller name, and then it isn't clear how we would configure the owner references on the subscription. Would we add only the first gatewayclass with our controller name? Would we add all gatewayclasses with our controller name? If we added more than one owner reference, would we need to delete old owner references when the corresponding gatewayclasses were deleted? If we did delete stale owner references, would that prevent garbage collection, or would we always leave one non-stale reference to trigger garbage collection?
- I don't know for sure that OLM doesn't look at the owner reference. We would need to check this.
- I am not confident that an owner reference would cause the subscription to be deleted as the owner reference on the Istio CR didn't cause it to be deleted (see OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900 (comment)).
- Deleting the Istio CR only requires changing the test, it is more explicit than relying on garbage collection, and it is more obviously safe to backport.
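For concreteness, the cascading-deletion idea being debated would mean stamping something like the following onto the Subscription. This is an illustrative manifest, not code from this PR: the subscription name, namespace, channel, and source are placeholders, and, as noted above, it is unverified whether OLM tolerates such a reference.

```yaml
# Illustrative only: a Subscription owned by a cluster-scoped GatewayClass.
# The subscription name, namespace, channel, and source below are made up.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: servicemeshoperator3
  namespace: openshift-operators
  ownerReferences:
  - apiVersion: gateway.networking.k8s.io/v1
    kind: GatewayClass
    name: openshift-default
    uid: 3f6ef6ed-9e6b-4821-9706-221ff0bca83e   # must match the live GatewayClass uid
spec:
  channel: stable
  name: servicemeshoperator3
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

Deleting the openshift-default GatewayClass would then rely on the garbage collector cascading to the Subscription, the same mechanism the Istio CR already uses via its own ownerReference.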
g.By("Deleting the Istio CR")
o.Expect(oc.AsAdmin().Run("delete").Args("--ignore-not-found=true", "istio", istioName).Execute()).Should(o.Succeed())
The Istio CR is supposed to be garbage collected, since its owner reference points to the gatewayclass.
The owner reference on the Istio CR didn't cause it to be deleted (see #29900 (comment)).
The owner reference on the Istio CR didn't cause it to be deleted
I didn't manage to reproduce this behavior. I saw that the Istio CR gets deleted after the GatewayClass:
$ oc get gc
NAME CONTROLLER ACCEPTED AGE
openshift-default openshift.io/gateway-controller/v1 True 4m12s
04:57:08 $ oc get istio
NAME REVISIONS READY IN USE ACTIVE REVISION STATUS VERSION AGE
openshift-gateway 1 1 0 openshift-gateway Healthy v1.24.3 4m18s
04:57:14 $ oc get istio openshift-gateway -o yaml | yq .metadata.ownerReferences[0]
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
name: openshift-default
uid: 3f6ef6ed-9e6b-4821-9706-221ff0bca83e
04:57:34 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
istiod-openshift-gateway-7b567bc8b4-z9972 1/1 Running 0 4m48s
router-default-76c4888886-fmtzq 1/1 Running 0 77m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
04:57:52 $ oc delete gc openshift-default
gatewayclass.gateway.networking.k8s.io "openshift-default" deleted
04:58:07 $ oc get istio
No resources found
04:58:14 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
router-default-76c4888886-fmtzq 1/1 Running 0 78m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
--ignore-not-found=true will prevent the delete from failing if GC has already deleted the object. I'll add a code comment that the delete might be superfluous but it's there just in case.
if err != nil && strings.Contains(err.Error(), "not found") {
	e2e.Logf("Subscription %q not found; retrying...", expectedSubscriptionName)
	return false, nil
}
I think that we should be consistent among all the polls we do in this block. I personally prefer how it's done for the OSSM deployment below:
if err != nil {
e2e.Logf("Failed to get OSSM operator deployment %q: %v; retrying...", deploymentOSSMName, err)
return false, nil
}
No assertions, just a retry for any error until the timeout is triggered. I think that some errors (not only "Not Found") can be temporary or intermittent.
I was trying to keep my changes more narrowly focused. All right, I can make the polling loop for the subscription retry on all errors.
1dcc98a to 38d8018 (force-push)
LGTM, holding off for @alebedev87's comments
Job Failure Risk Analysis for sha: 38d8018
The aggregated jobs each failed while building the tests-openshift.origin-amd64 image, with the error message "Error: Unable to find a match: python3-cinderclient" (missing RPM package). I'll retry in case it was a glitch with the Yum repository.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aad73360-4cbf-11f0-9efa-6a57a5235fed-0
This time, all the aggregated jobs failed to build the image with the error message "Error: Unable to find a match: realtime-tests rteval".
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c0124e40-4d46-11f0-9623-e230cc269dc8-0
Job Failure Risk Analysis for sha: 573f478
(2 similar comments)
/hold Revision 573f478 was retested 3 times: holding
Job Failure Risk Analysis for sha: 573f478
(1 similar comment)
ci/prow/e2e-aws-csi failed because the ELB for the API couldn't be provisioned: "TooManyLoadBalancers" suggests the CI account hit a quota.
/test e2e-aws-csi
e2e-openstack-ovn failed because the installer reported, "failed to provision control-plane machines within 15m0s". I wasn't able to find any further details in the CI artifacts.
/test e2e-openstack-ovn
/hold cancel
/hold Revision 573f478 was retested 3 times: holding
None of the failing jobs are required, so I don't understand the reason for the hold. The e2e-vsphere-ovn job has "unknown" in the "Required" column, so maybe that is confusing openshift-ci-robot?
/skip
e2e-vsphere-ovn failed on several tests:
I believe all the failures were caused by flaky tests, so I'm going to rerun the CI job.
/test e2e-vsphere-ovn
/hold cancel
@Miciah: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@Miciah: Jira Issue Verification Checks: Jira Issue OCPBUGS-56281 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 In response to this:
This reverts the inclusion of the olm dependency in openshift#29900

Signed-off-by: Todd Short <[email protected]>
/cherry-pick release-4.20
@jupierce: #29900 failed to apply on top of branch "release-4.20": In response to this:
Fix included in accepted release 4.21.0-0.nightly-2025-10-22-123727 |
gatewayapicontroller: Add checks for empty slices
Check whether the slice of parent resource references in an httproute's status is empty before indexing the slice.
Before this commit, the "Ensure HTTPRoute object is created" test sometimes panicked with "runtime error: index out of range [0] with length 0".
Similarly, check whether the slice of load-balancer ingress points in a service's status is empty before indexing it.
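The guard described above amounts to checking the slice's length before indexing it. A minimal self-contained sketch, where ParentRef and firstParentName are hypothetical stand-ins rather than the real Gateway API types:

```go
package main

import "fmt"

// ParentRef is a simplified stand-in for a parent resource reference in an
// httproute's status (hypothetical type for illustration).
type ParentRef struct{ Name string }

// firstParentName guards against an empty slice instead of indexing [0]
// blindly, which is what caused the "index out of range" panic.
func firstParentName(parents []ParentRef) (string, bool) {
	if len(parents) == 0 {
		return "", false // status not populated yet; the caller should retry
	}
	return parents[0].Name, true
}

func main() {
	_, ok := firstParentName(nil)
	fmt.Println(ok) // prints: false
	name, _ := firstParentName([]ParentRef{{Name: "openshift-default"}})
	fmt.Println(name) // prints: openshift-default
}
```

Returning a boolean lets the surrounding poll loop treat "not populated yet" as a retry condition rather than a panic.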
gatewayapicontroller: Clean up resources when done
Delete the gatewayclass and uninstall OSSM after all the Gateway API controller tests are done.
Before this change, the Gateway API controller tests left OSSM installed, including the subscription, CSV, installplan, bundled CRDs, RBAC resources, deployment, service, serviceaccount, etc., when the tests were finished. This clutter could cause problems for other tests, or for the same test if it was run again.
The new cleanup logic uses the OperatorsV1 client from github.com/operator-framework/operator-lifecycle-manager. Importing this package requires a replace stanza for openshift/api in go.mod.
This vendors github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa rather than the newest revision, in order to avoid additional problematic vendor bumps that the newest revision would bring in.
gatewayapicontroller: Always log errors
Add the error value to some log messages that were missing it.
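A replace stanza of the kind mentioned might look roughly like this in go.mod. This is an illustrative fragment only: the openshift/api pseudo-version shown is a placeholder, not the revision the PR actually pins.

```
// go.mod fragment (illustrative)
require github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa

// Placeholder pseudo-version: forces the openshift/api revision the rest of
// the dependency tree expects instead of the one OLM would otherwise pull in.
replace github.com/openshift/api => github.com/openshift/api v0.0.0-00010101000000-000000000000
```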