OCPBUGS-46379: Kas bootstrap bin #5871

enxebre · 2025-03-19T22:21:03Z

What this PR does / why we need it:
kas-bootstrap is a tool that runs the prerequisite actions for bootstrapping the KAS during cluster creation (or upgrade).
It applies some CRDs rendered by the cluster-config-operator and updates the featureGate CR status by appending the git FeatureGate status.

It aims to reduce the fragility of the current bash scripts, fix the fact that we always replace featureGate.status instead of appending to it, and cover current and upcoming changes to this logic with appropriate tests.
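
To illustrate the replace-versus-append distinction described above, here is a minimal sketch. It is not the code from this PR: the `FeatureGateDetails` type is a simplified, hypothetical stand-in for a per-version status entry, and `appendFeatureGates` is an illustrative name.

```go
package main

import "fmt"

// FeatureGateDetails is a simplified, hypothetical stand-in for one
// per-version entry in featureGate.status.featureGates.
type FeatureGateDetails struct {
	Version string
	Enabled []string
}

// appendFeatureGates returns the existing entries plus the rendered one.
// An entry that already exists for the same version is swapped in place, so
// repeated runs stay idempotent; entries for other versions are preserved
// instead of being overwritten.
func appendFeatureGates(existing []FeatureGateDetails, rendered FeatureGateDetails) []FeatureGateDetails {
	out := make([]FeatureGateDetails, 0, len(existing)+1)
	replaced := false
	for _, e := range existing {
		if e.Version == rendered.Version {
			out = append(out, rendered)
			replaced = true
			continue
		}
		out = append(out, e)
	}
	if !replaced {
		out = append(out, rendered)
	}
	return out
}

func main() {
	existing := []FeatureGateDetails{{Version: "4.18.5", Enabled: []string{"GateA"}}}
	rendered := FeatureGateDetails{Version: "4.19.0", Enabled: []string{"GateA", "GateB"}}
	for _, fg := range appendFeatureGates(existing, rendered) {
		fmt.Println(fg.Version, fg.Enabled)
	}
}
```

With a plain replace, the 4.18.5 entry in this example would be lost as soon as the 4.19.0 entry is written.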

Follow up:
Move the logic from kasContainerApplyBootstrap to kasContainerBootstrap and drop the former.
Add it to cpov2

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes #OCPBUGS-46379

Checklist

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

openshift-ci-robot · 2025-03-19T22:21:09Z

@enxebre: This pull request references Jira Issue OCPBUGS-46379, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-03-19T22:21:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre · 2025-03-19T22:23:10Z

/test e2e-aws

enxebre · 2025-03-19T22:25:32Z

cc @wking

enxebre · 2025-03-19T22:47:06Z

/test e2e-aws

enxebre · 2025-03-20T07:57:26Z

/test e2e-aws

enxebre · 2025-03-20T14:50:17Z

/test e2e-aks

muraee · 2025-03-20T16:20:31Z

kas-bootstrap/kas_boostrap.go

+
+	// we want to keep the process running during the lifecycle of the Pod.
+	// start a goroutine that will close the done channel when the context is done
+	done := make(chan struct{})


can you explain why we want to keep the process running?

Because the Pod restartPolicy is Always. This mimics the current behaviour, and I would defer deviating from it to a different change.

ack, any reason we are not adding this as an init container instead?

I'd guess it can't be an init container because it needs the actual API-server in another long-lived container in this same Pod to talk to. But couldn't we set a container-scoped restartPolicy: OnFailure for this container to get both:

The ability to exit 0 when we were successfully reconciled and recover the resources the container process had been consuming. Until some future when management of these resources moves from "successfully reconciled once per 4.y.z release" to "actively watched and managed with some regularity", which would be nice, but is likely more than we want to bite off in a single pull request.

Reporting via KubePodCrashLooping if the container has trouble, while the container continues to relaunch and retry. Not as direct as having the controlling CPO know why the container was having trouble, but at least there would be a sign of trouble visible in Kube at a higher level than "dip into the container's logs".

@wking this last comment led me to do a little bit of experimentation on a 4.19 ci cluster :)

Tried changing the restart policy of a side container under .spec.containers, and failed admission with:
* spec.template.spec.containers[1].restartPolicy: Forbidden: may not be set for non-init containers

Then tried moving the container under .spec.initContainers with a restartPolicy of OnFailure, and also failed admission with:
* spec.template.spec.initContainers[0].restartPolicy: Unsupported value: "OnFailure": supported values: "Always"

Then tried changing the initContainer restartPolicy to Always and the deployment was accepted. The init container ended up running as a side container (which was new to me :)). I could not see a difference though between the additional container under .spec.containers or the init container with restartPolicy as Always.

Bottom line, I think what we have here is fine.

Yes, the reason I didn't just set restart-on-failure for this container is that, afaik, individual containers can't override the Pod restartPolicy. Moving it to an init container that runs as a sidecar would technically differ operationally from what we do at the moment, and could cause more restarts while racing rendering, for no value. So I'd rather keep it as it is to keep the changes scoped, and defer any further change to different PRs. After this one we'll still need to move the apply logic to this binary.
I added a comment in the code to clarify the restart policies.

muraee · 2025-03-20T16:20:58Z

don't we want to add this in cpov2 as well?

enxebre · 2025-03-20T16:32:22Z

don't we want to add this in cpov2 as well?

Yes, it's stated as a follow-up in the PR description. This is just the first deliverable, to keep the PR size small.

wking · 2025-03-20T23:17:49Z

kas-bootstrap/kas_boostrap.go

+	} else {
+		for _, cvoVersion := range clusterVersion.Status.History {
+			knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
+			knownVersions.Insert(cvoVersion.Version)


standalone OCP currently doesn't do any garbage collection, so what you have now is fine as it stands. But once you hit your first Completed entry and insert that into knownVersions, you can break, because there shouldn't be anything left on the cluster that cares about those ancient releases anymore:

for _, cvoVersion := range clusterVersion.Status.History {
	knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
	knownVersions.Insert(cvoVersion.Version)
	if cvoVersion.State == configv1.CompletedUpdate {
		break
	}
}

Thanks, updated the logic and unit coverage to reflect this
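
A self-contained sketch of the suggestion above, for illustration only. It assumes the history is ordered newest first and uses a hypothetical trimmed-down `UpdateHistory` type rather than the real `configv1` types; unlike the quoted snippet, it creates the set once, before the loop.

```go
package main

import "fmt"

// UpdateHistory is a hypothetical, trimmed-down stand-in for an entry in
// clusterVersion.status.history (assumed to be ordered newest first).
type UpdateHistory struct {
	Version string
	State   string
}

// completedUpdate mirrors the value of configv1.CompletedUpdate.
const completedUpdate = "Completed"

// knownVersions collects the desired version plus every history entry up to
// and including the first Completed one. Older entries are skipped because
// nothing on the cluster should still depend on those releases.
func knownVersions(desired string, history []UpdateHistory) map[string]struct{} {
	known := map[string]struct{}{desired: {}}
	for _, h := range history {
		known[h.Version] = struct{}{}
		if h.State == completedUpdate {
			break
		}
	}
	return known
}

func main() {
	history := []UpdateHistory{
		{Version: "4.19.1", State: "Partial"},
		{Version: "4.19.0", State: completedUpdate},
		{Version: "4.18.5", State: completedUpdate},
	}
	known := knownVersions("4.19.1", history)
	_, has418 := known["4.18.5"]
	fmt.Println(len(known), has418) // prints: 2 false
}
```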

wking · 2025-03-20T23:22:09Z

kas-bootstrap/kas_boostrap.go

+	}
+
+	featureGate.Status.FeatureGates = desiredFeatureGates
+	if err := c.Status().Update(ctx, &featureGate); err != nil {


The FeatureGates type is pretty stable, so Update is likely safe. But just to be extra cautious, using a Patch instead of an Update with a payload that says exactly what you want to set is a good way to avoid clearing unrecognized new properties. For an example of me recovering from being bitten by this in code I maintain, see openshift/oc#1111.

Thanks, updated the logic to patch instead
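
As a rough illustration of why a patch is safer here, the sketch below builds a JSON merge patch that names only `status.featureGates`, so any other status properties (including ones this client does not know about) are left untouched by the request. The types are simplified, hypothetical stand-ins and `featureGateStatusPatch` is an illustrative name; this is not the code that landed in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FeatureGateAttributes and FeatureGateDetails are simplified, hypothetical
// stand-ins for the per-version feature gate status entries.
type FeatureGateAttributes struct {
	Name string `json:"name"`
}

type FeatureGateDetails struct {
	Version  string                  `json:"version"`
	Enabled  []FeatureGateAttributes `json:"enabled,omitempty"`
	Disabled []FeatureGateAttributes `json:"disabled,omitempty"`
}

// featureGateStatusPatch returns a JSON merge patch body that sets only
// status.featureGates. Merge patch replaces lists wholesale, so gates must
// hold the full desired list, not just the new entry.
func featureGateStatusPatch(gates []FeatureGateDetails) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{
			"featureGates": gates,
		},
	}
	return json.Marshal(patch)
}

func main() {
	data, err := featureGateStatusPatch([]FeatureGateDetails{
		{Version: "4.19.0", Enabled: []FeatureGateAttributes{{Name: "GateA"}}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

With controller-runtime, bytes like these could presumably be sent with `c.Status().Patch(ctx, &featureGate, client.RawPatch(types.MergePatchType, data))`; that wiring is an assumption for illustration, not a description of what the PR does.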

…te status kas-bootstrap is a tool to run the pre-required actions for bootstraping the kas during cluster creation (or upgrade). It will apply some CRDs rendered by the cluster-config-operator and update the featureGate CR status by appending the git FeatureGate status. It aims to alleviate the current bash scripts fragility

enxebre · 2025-03-25T17:27:39Z

/test e2e-kubevirt-aws-ovn-reduced

enxebre · 2025-03-25T17:29:15Z

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"
/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"

csrwng · 2025-03-25T17:29:19Z

/lgtm

openshift-ci · 2025-03-25T17:33:42Z

@enxebre: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main
Red Hat Konflux / hypershift-operator-main-on-pull-request

Only the following failed contexts/checkruns were expected:

ci/prow/e2e-aks
ci/prow/e2e-aks-4-18
ci/prow/e2e-aws
ci/prow/e2e-aws-4-18
ci/prow/e2e-aws-upgrade-hypershift-operator
ci/prow/e2e-kubevirt-aws-ovn-reduced
ci/prow/images
ci/prow/okd-scos-e2e-aws-ovn
ci/prow/security
ci/prow/unit
ci/prow/verify
pull-ci-openshift-hypershift-main-e2e-aks
pull-ci-openshift-hypershift-main-e2e-aks-4-18
pull-ci-openshift-hypershift-main-e2e-aws
pull-ci-openshift-hypershift-main-e2e-aws-4-18
pull-ci-openshift-hypershift-main-e2e-aws-upgrade-hypershift-operator
pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
pull-ci-openshift-hypershift-main-images
pull-ci-openshift-hypershift-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-hypershift-main-security
pull-ci-openshift-hypershift-main-unit
pull-ci-openshift-hypershift-main-verify
tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci-robot · 2025-03-25T18:53:15Z

/retest-required

Remaining retests: 0 against base HEAD 7be9e13 and 2 for PR HEAD 7a82544 in total

openshift-ci-robot · 2025-03-26T00:33:23Z

/retest-required

Remaining retests: 0 against base HEAD 7be9e13 and 2 for PR HEAD 7a82544 in total

enxebre · 2025-03-26T08:04:20Z

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"
/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"

openshift-ci · 2025-03-26T08:04:50Z

@enxebre: Overrode contexts on behalf of enxebre: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main, Red Hat Konflux / hypershift-operator-main-on-pull-request


enxebre · 2025-03-26T16:21:46Z

/test verify

openshift-ci-robot · 2025-03-27T00:32:03Z

/retest-required

Remaining retests: 0 against base HEAD 879dadb and 2 for PR HEAD 7a82544 in total

openshift-ci · 2025-03-27T03:31:38Z

@enxebre: all tests passed!

Full PR test history. Your PR dashboard.


openshift-ci-robot · 2025-03-27T03:34:22Z

@enxebre: Jira Issue OCPBUGS-46379: All pull requests linked via external trackers have merged:

openshift/hypershift#5871

Jira Issue OCPBUGS-46379 has been moved to the MODIFIED state.


openshift-bot · 2025-03-27T06:29:13Z

[ART PR BUILD NOTIFIER]

Distgit: hypershift
This PR has been included in build ose-hypershift-container-v4.20.0-202503270540.p0.ga1ef7b8.assembly.stream.el9.
All builds following this will include this PR.

rtheis · 2025-03-28T19:43:40Z

Thank you @enxebre. Are you all okay with or considering a cherry-pick of this fix to 4.18?

Follow up for openshift#5871 It aims to alleviate the current bash scripts fragility and cover existing and any upcoming changes to this logic with appropriate test coverage

enxebre · 2025-03-31T13:13:32Z

Thank you @enxebre. Are you all okay with or considering a cherry-pick of this fix to 4.18?

I think we can backport this and #5937 together

This was introduced here openshift#5871 It shouldn't be needed now openshift#5937 is merged

rtheis · 2025-04-30T11:07:29Z

@enxebre do you still think that a backport is doable for these changes?

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 19, 2025

openshift-ci bot added the do-not-merge/needs-area label Mar 19, 2025

enxebre mentioned this pull request Mar 19, 2025

OCPBUGS-46379: Let KASContainerApplyBootstrap append to status.featureGates #5570

Closed

4 tasks

openshift-ci bot requested review from hasueki and rtheis March 19, 2025 22:21

openshift-ci bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/testing Indicates the PR includes changes for e2e testing labels Mar 19, 2025

openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Mar 19, 2025

enxebre marked this pull request as draft March 19, 2025 22:21

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025

enxebre force-pushed the kas-bootstrap-bin branch from 91f8d4d to 298beeb Compare March 19, 2025 22:28

enxebre force-pushed the kas-bootstrap-bin branch from 298beeb to df2bfa8 Compare March 20, 2025 07:56

enxebre marked this pull request as ready for review March 20, 2025 10:45

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2025

openshift-ci bot requested a review from csrwng March 20, 2025 10:46

enxebre force-pushed the kas-bootstrap-bin branch from df2bfa8 to 64f52bc Compare March 20, 2025 11:10

muraee reviewed Mar 20, 2025

wking reviewed Mar 20, 2025

enxebre force-pushed the kas-bootstrap-bin branch from d2eb26f to 9a6036b Compare March 25, 2025 09:20

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2025

Let KAS Deployment to run the new kas-bootstrap binary

7a82544

enxebre force-pushed the kas-bootstrap-bin branch from 9a6036b to 7a82544 Compare March 25, 2025 11:36

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2025

openshift-merge-bot bot merged commit a1ef7b8 into openshift:main Mar 27, 2025
16 of 18 checks passed

enxebre mentioned this pull request Mar 31, 2025

CNTRLPLANE-378: Move bootstrap apply bash into kas-bootstrap binary #5937

Merged

4 tasks

enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 1, 2025

Remove kas from EnsureNoCrashingPods exceptions

30d2990

This was introduced here openshift#5871 It shouldn't be needed now openshift#5937 is merged

This was referenced Apr 1, 2025

NO-JIRA: Remove kas from EnsureNoCrashingPods exceptions #5946

Merged

CNTRLPLANE-378: Run kas-bootstrap binary for cpov2 #5947

Merged

enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 9, 2025

Remove kas from EnsureNoCrashingPods exceptions

cc7edbb

This was introduced here openshift#5871 It shouldn't be needed now openshift#5937 is merged

enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 21, 2025

Remove kas from EnsureNoCrashingPods exceptions

b597610

This was introduced here openshift#5871 It shouldn't be needed now openshift#5937 is merged
