OCPBUGS-46379: Kas bootstrap bin #5871
Conversation
@enxebre: This pull request references Jira Issue OCPBUGS-46379, which is valid. 3 validations were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/test e2e-aws

cc @wking

Force-pushed: 91f8d4d to 298beeb

/test e2e-aws

Force-pushed: 298beeb to df2bfa8

/test e2e-aws

Force-pushed: df2bfa8 to 64f52bc

/test e2e-aks
// we want to keep the process running during the lifecycle of the Pod.
// start a goroutine that will close the done channel when the context is done
done := make(chan struct{})
can you explain why we want to keep the process running?
Because the pod RestartPolicy is Always, this mimics current behaviour; I would defer deviating from it to a different change.
ack, any reason we are not adding this as an init container instead?
I'd guess it can't be an init container because it needs the actual API-server in another long-lived container in this same Pod to talk to. But couldn't we set a container-scoped restartPolicy: OnFailure for this container to get both:
- The ability to exit 0 when we were successfully reconciled and recover the resources the container process had been consuming. Until some future when management of these resources moves from "successfully reconciled once per 4.y.z release" to "actively watched and managed with some regularity", which would be nice, but is likely more than we want to bite off in a single pull request.
- Reporting via KubePodCrashLooping if the container has trouble, while the container continues to relaunch and retry. Not as direct as having the controlling CPO know why the container was having trouble, but at least there would be a sign of trouble visible in Kube at a higher level than "dip into the container's logs".
@wking this last comment led me to do a little bit of experimentation on a 4.19 ci cluster :)
- Tried changing the restart policy of a side container under .spec.containers, and failed admission with:
  spec.template.spec.containers[1].restartPolicy: Forbidden: may not be set for non-init containers
- Then tried moving the container under .spec.initContainers with a restartPolicy of OnFailure, and also failed admission with:
  spec.template.spec.initContainers[0].restartPolicy: Unsupported value: "OnFailure": supported values: "Always"
- Then tried changing the initContainer restartPolicy to Always, and the deployment was accepted. The init container ended up running as a side container (which was new to me :)). I could not see a difference, though, between the additional container under .spec.containers and the init container with restartPolicy as Always.
Bottom line, I think what we have here is fine.
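For reference, here is a minimal sketch, in Go using the k8s.io/api/core/v1 types, of the shape that admission accepted in the experiment above: an entry under .spec.initContainers with a container-level restartPolicy of Always (a native sidecar). The container name, image, and command are illustrative placeholders, not the actual PR values.

package example

import (
    corev1 "k8s.io/api/core/v1"
)

// restartAlways is the only container-level restartPolicy admission accepts
// for init containers; setting it turns the init container into a native
// sidecar that keeps running alongside the regular containers.
var restartAlways = corev1.ContainerRestartPolicyAlways

// kasBootstrapSidecar sketches the init-container variant discussed above.
// Name, image, and command are hypothetical placeholders.
func kasBootstrapSidecar() corev1.Container {
    return corev1.Container{
        Name:          "kas-bootstrap",
        Image:         "example.io/kas-bootstrap:latest",
        Command:       []string{"/usr/bin/kas-bootstrap"},
        RestartPolicy: &restartAlways,
    }
}

As noted above, though, the observed behaviour matched the plain .spec.containers entry, so the PR keeps the existing shape.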
Yes, the reason I didn't just set restart-on-failure for this container is that, afaik, individual containers can't supersede the Pod restartPolicy. Moving it to init as a side container would technically differ operationally from what we do at the moment, possibly causing more restarts while racing rendering, for no value. So I'd rather keep it as it is to keep changes scoped and defer any further change to different PRs. After this one we'll still need to move the apply logic to this binary.
I added a comment in the code to clarify the restart policies.
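For context, a minimal sketch (simplified and hypothetical, not the actual PR code) of the pattern the diff hunk above hints at: finish the one-shot bootstrap work, then block until the context is cancelled so the container keeps running for the Pod's lifetime instead of exiting and being restarted by the Always policy.

package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"
)

func main() {
    // Cancel the context on SIGTERM/SIGINT so the container exits cleanly
    // when the Pod is deleted.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    // ... run the one-shot bootstrap work here ...

    // The Pod restartPolicy is Always, so exiting now would only trigger a
    // restart; instead, start a goroutine that closes the done channel when
    // the context is done and block on it.
    done := make(chan struct{})
    go func() {
        <-ctx.Done()
        close(done)
    }()
    <-done
    log.Println("context cancelled, shutting down")
}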
don't we want to add this in cpov2 as well?
Yes, it's stated as a follow-up in the PR description. This is just the first deliverable, to keep the PR size small.
} else {
    for _, cvoVersion := range clusterVersion.Status.History {
        knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
        knownVersions.Insert(cvoVersion.Version)
standalone OCP currently doesn't do any garbage collection, so what you have now is fine as it stands. But once you hit your first Completed entry and insert that into knownVersions, you can break, because there shouldn't be anything left on the cluster that cares about those ancient releases anymore:
for _, cvoVersion := range clusterVersion.Status.History {
    knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
    knownVersions.Insert(cvoVersion.Version)
    if cvoVersion.State == configv1.CompletedUpdate {
        break
    }
}
Thanks, updated the logic and unit coverage to reflect this
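For illustration, a sketch of what the updated collection logic might look like after that change (a hypothetical helper; the actual PR code may be structured differently): the set is seeded with the desired version outside the loop, and the walk stops at the first Completed history entry.

package kasbootstrap

import (
    configv1 "github.com/openshift/api/config/v1"
    "k8s.io/apimachinery/pkg/util/sets"
)

// knownFeatureGateVersions collects the versions whose FeatureGate status
// entries should be kept: the desired version plus history entries up to and
// including the most recent Completed update. Older releases can be dropped.
func knownFeatureGateVersions(clusterVersion *configv1.ClusterVersion) sets.String {
    knownVersions := sets.NewString(clusterVersion.Status.Desired.Version)
    for _, cvoVersion := range clusterVersion.Status.History {
        knownVersions.Insert(cvoVersion.Version)
        if cvoVersion.State == configv1.CompletedUpdate {
            break
        }
    }
    return knownVersions
}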
kas-bootstrap/kas_boostrap.go (Outdated)
}

featureGate.Status.FeatureGates = desiredFeatureGates
if err := c.Status().Update(ctx, &featureGate); err != nil {
The FeatureGates type is pretty stable, so Update is likely safe. But just to be extra cautious, using a Patch instead of an Update with a payload that says exactly what you want to set is a good way to avoid clearing unrecognized new properties. For an example of me recovering from being bitten by this in code I maintain, see openshift/oc#1111.
Thanks, updated the logic to patch instead
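A minimal sketch, assuming a controller-runtime client as in the snippet above, of switching the status write from Update to a merge patch so that fields this binary doesn't know about are left untouched (hypothetical helper, not the actual PR code):

package kasbootstrap

import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// patchFeatureGateStatus sends only the diff for the fields we own, instead
// of overwriting the whole status subresource with an Update call.
func patchFeatureGateStatus(ctx context.Context, c client.Client, featureGate *configv1.FeatureGate, desired []configv1.FeatureGateDetails) error {
    original := featureGate.DeepCopy()
    featureGate.Status.FeatureGates = desired
    // client.MergeFrom computes a patch relative to the original object, so
    // unrecognized new properties on the server are not cleared.
    return c.Status().Patch(ctx, featureGate, client.MergeFrom(original))
}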
…te status: kas-bootstrap is a tool to run the pre-required actions for bootstrapping the kas during cluster creation (or upgrade). It will apply some CRDs rendered by the cluster-config-operator and update the featureGate CR status by appending the git FeatureGate status. It aims to alleviate the current bash scripts' fragility.
Force-pushed: d2eb26f to 9a6036b

Force-pushed: 9a6036b to 7a82544
/test e2e-kubevirt-aws-ovn-reduced

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"

/lgtm
@enxebre: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"
@enxebre: Overrode contexts on behalf of enxebre: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main, Red Hat Konflux / hypershift-operator-main-on-pull-request
/test verify
@enxebre: all tests passed! Full PR test history. Your PR dashboard.
Merged a1ef7b8 into openshift:main
@enxebre: Jira Issue OCPBUGS-46379: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-46379 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER] Distgit: hypershift
Thank you @enxebre. Are you all okay with or considering a cherry-pick of this fix to 4.18?
Follow-up for openshift#5871. It aims to alleviate the current bash scripts' fragility and cover existing and any upcoming changes to this logic with appropriate test coverage.
This was introduced in openshift#5871. It shouldn't be needed now that openshift#5937 is merged.
@enxebre do you still think that a backport is doable for these changes?
What this PR does / why we need it:
kas-bootstrap is a tool to run the pre-required actions for bootstrapping the kas during cluster creation (or upgrade).
It will apply some CRDs rendered by the cluster-config-operator and update the featureGate CR status by appending the git FeatureGate status.
It aims to alleviate the current bash scripts' fragility, fix the fact that we always replace instead of append to featureGate.status, and cover current and any upcoming changes to this logic with appropriate test coverage.
Follow up:
- Move the logic from kasContainerApplyBootstrap to kasContainerBootstrap and drop the former.
- Add it to cpov2.
Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story):
Fixes #OCPBUGS-46379
Checklist