@enxebre enxebre commented Mar 19, 2025

What this PR does / why we need it:
kas-bootstrap is a tool that runs the prerequisite actions for bootstrapping the kas during cluster creation (or upgrade).
It applies some CRDs rendered by the cluster-config-operator and updates the featureGate CR status by appending the git FeatureGate status.

It aims to alleviate the fragility of the current bash scripts, fix the fact that we always replace featureGate.status instead of appending to it, and cover current and upcoming changes to this logic with appropriate tests.
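
To make the "append instead of replace" behaviour concrete, here is a minimal sketch. It assumes a simplified, hypothetical `FeatureGateDetails` type and an `appendFeatureGates` helper that are not taken from this PR; the real status type carries more fields, and the actual implementation may differ.

```go
package main

import "fmt"

// FeatureGateDetails is a simplified, hypothetical stand-in for one
// per-version entry in featureGate.status.
type FeatureGateDetails struct {
	Version string
	Enabled []string
}

// appendFeatureGates keeps the existing entries and adds the rendered entry.
// If an entry for the same version already exists it is refreshed in place,
// so the list never grows duplicates and older versions are not wiped.
func appendFeatureGates(existing []FeatureGateDetails, rendered FeatureGateDetails) []FeatureGateDetails {
	out := make([]FeatureGateDetails, 0, len(existing)+1)
	refreshed := false
	for _, e := range existing {
		if e.Version == rendered.Version {
			out = append(out, rendered)
			refreshed = true
			continue
		}
		out = append(out, e)
	}
	if !refreshed {
		out = append(out, rendered)
	}
	return out
}

func main() {
	existing := []FeatureGateDetails{{Version: "4.18.1", Enabled: []string{"A"}}}
	merged := appendFeatureGates(existing, FeatureGateDetails{Version: "4.19.0", Enabled: []string{"A", "B"}})
	for _, fg := range merged {
		fmt.Println(fg.Version, fg.Enabled)
	}
}
```

A plain replace would leave only the 4.19.0 entry here; the append keeps the 4.18.1 entry alongside it.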

Follow up:
Move the logic from kasContainerApplyBootstrap to kasContainerBootstrap and drop the former.
Add it to cpov2

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes #OCPBUGS-46379

Checklist

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 19, 2025
@openshift-ci-robot

@enxebre: This pull request references Jira Issue OCPBUGS-46379, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested review from hasueki and rtheis March 19, 2025 22:21
@openshift-ci openshift-ci bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/testing Indicates the PR includes changes for e2e testing labels Mar 19, 2025
Contributor

openshift-ci bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Mar 19, 2025
@enxebre enxebre marked this pull request as draft March 19, 2025 22:21
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025
Member Author

enxebre commented Mar 19, 2025

/test e2e-aws

Member Author

enxebre commented Mar 19, 2025

cc @wking

@enxebre enxebre force-pushed the kas-bootstrap-bin branch from 91f8d4d to 298beeb Compare March 19, 2025 22:28
Member Author

enxebre commented Mar 19, 2025

/test e2e-aws

@enxebre enxebre force-pushed the kas-bootstrap-bin branch from 298beeb to df2bfa8 Compare March 20, 2025 07:56
Member Author

enxebre commented Mar 20, 2025

/test e2e-aws

@enxebre enxebre marked this pull request as ready for review March 20, 2025 10:45
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2025
@openshift-ci openshift-ci bot requested a review from csrwng March 20, 2025 10:46
@enxebre enxebre force-pushed the kas-bootstrap-bin branch from df2bfa8 to 64f52bc Compare March 20, 2025 11:10
Member Author

enxebre commented Mar 20, 2025

/test e2e-aks


// we want to keep the process running during the lifecycle of the Pod.
// start a goroutine that will close the done channel when the context is done
done := make(chan struct{})

Contributor

can you explain why we want to keep the process running?

Member Author

because the pod RestartPolicy is Always, this mimics current behaviour and I would defer deviating from it to a different change

Contributor

ack, any reason we are not adding this as an init container instead?

Member

@wking wking Mar 20, 2025

I'd guess it can't be an init container because it needs the actual API-server in another long-lived container in this same Pod to talk to. But couldn't we set a container-scoped restartPolicy: OnFailure for this container to get both:

  • The ability to exit 0 when we were successfully reconciled and recover the resources the container process had been consuming. Until some future when management of these resources moves from "successfully reconciled once per 4.y.z release" to "actively watched and managed with some regularity", which would be nice, but is likely more than we want to bite off in a single pull request.
  • Reporting via KubePodCrashLooping if the container has trouble, while the container continues to relaunch and retry. Not as direct as having the controlling CPO know why the container was having trouble, but at least there would be a sign of trouble visible in Kube at a higher level than "dip into the container's logs".

Contributor

@wking this last comment led me to do a little bit of experimentation on a 4.19 ci cluster :)

  • Tried changing the restart policy of a side container under .spec.containers, and failed admission with:
    * spec.template.spec.containers[1].restartPolicy: Forbidden: may not be set for non-init containers

  • Then tried moving the container under .spec.initContainers with a restartPolicy of OnFailure, and also failed admission with:
    * spec.template.spec.initContainers[0].restartPolicy: Unsupported value: "OnFailure": supported values: "Always"

  • Then tried changing the initContainer restartPolicy to Always and the deployment was accepted. The init container ended up running as a side container (which was new to me :)). I could not see a difference though between the additional container under .spec.containers or the init container with restartPolicy as Always.

Bottom line, I think what we have here is fine.

Member Author

yes, the reason I didn't just set restart on failure for this container is that, afaik, individual containers can't supersede the Pod restartPolicy. Moving it to init as a side container would technically differ operationally from what we do at the moment, and could cause more restarts while racing rendering, for no value. So I'd rather keep it as it is to keep changes scoped and defer any further change to different PRs. After this one we'll still need to move the apply logic to this binary.
I added a comment in code to clarify about the restart policies.

Contributor

muraee commented Mar 20, 2025

don't we want to add this in cpov2 as well?

Member Author

enxebre commented Mar 20, 2025

don't we want to add this in cpov2 as well?

Yes, it's stated as a follow-up in the PR description. This is just the first deliverable, to keep the PR size small.

} else {
for _, cvoVersion := range clusterVersion.Status.History {
knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
knownVersions.Insert(cvoVersion.Version)

Member

standalone OCP currently doesn't do any garbage collection, so what you have now is fine as it stands. But once you hit your first Completed entry and insert that into knownVersions, you can break, because there shouldn't be anything left on the cluster that cares about those ancient releases anymore:

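// Initialize the set once, before the loop, so earlier entries are not discarded.
knownVersions = sets.NewString(clusterVersion.Status.Desired.Version)
for _, cvoVersion := range clusterVersion.Status.History {
	knownVersions.Insert(cvoVersion.Version)
	if cvoVersion.State == configv1.CompletedUpdate {
		// Nothing older than the first Completed entry is still relevant.
		break
	}
}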

Member Author

Thanks, updated the logic and unit coverage to reflect this
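
A self-contained sketch of the history walk suggested above, for anyone who wants to run it. The `UpdateState` and `UpdateHistory` types are simplified local stand-ins for the `configv1` types, `knownVersions` is a hypothetical helper name, and a plain map replaces the `sets` package. It relies on ClusterVersion history being ordered newest first.

```go
package main

import (
	"fmt"
	"sort"
)

// UpdateState and UpdateHistory are simplified stand-ins for the configv1 types.
type UpdateState string

const (
	CompletedUpdate UpdateState = "Completed"
	PartialUpdate   UpdateState = "Partial"
)

type UpdateHistory struct {
	State   UpdateState
	Version string
}

// knownVersions collects the desired version plus every history entry up to
// and including the first Completed one. History is newest first, so anything
// after that entry belongs to releases the cluster has fully moved past.
func knownVersions(desired string, history []UpdateHistory) map[string]struct{} {
	known := map[string]struct{}{desired: {}}
	for _, h := range history {
		known[h.Version] = struct{}{}
		if h.State == CompletedUpdate {
			break
		}
	}
	return known
}

func main() {
	history := []UpdateHistory{
		{State: PartialUpdate, Version: "4.19.1"},
		{State: CompletedUpdate, Version: "4.19.0"},
		{State: CompletedUpdate, Version: "4.18.5"},
	}
	known := knownVersions("4.19.2", history)

	versions := make([]string, 0, len(known))
	for v := range known {
		versions = append(versions, v)
	}
	sort.Strings(versions)
	fmt.Println(versions) // prints [4.19.0 4.19.1 4.19.2]
}
```

The 4.18.5 entry is skipped because the loop stops at the first Completed entry (4.19.0).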

}

featureGate.Status.FeatureGates = desiredFeatureGates
if err := c.Status().Update(ctx, &featureGate); err != nil {

Member

The FeatureGates type is pretty stable, so Update is likely safe. But just to be extra cautious, using a Patch instead of an Update with a payload that says exactly what you want to set is a good way to avoid clearing unrecognized new properties. For an example of me recovering from being bitten by this in code I maintain, see openshift/oc#1111.

Member Author

Thanks, updated the logic to patch instead
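
The Update-versus-Patch concern can be illustrated with the standard library alone. In this sketch the `featureGate` struct models a client built against an older schema, `futureField` is an invented property the client does not know about, and `applyMergePatch` is a simplified merge in the spirit of JSON Merge Patch (RFC 7386); all of these names are assumptions for illustration. With controller-runtime, the equivalent call would be along the lines of `c.Status().Patch(ctx, &featureGate, client.MergeFrom(original))`, though whether the PR uses exactly that form is not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// featureGate models a client that only knows about status.featureGates.
type featureGate struct {
	Status struct {
		FeatureGates []string `json:"featureGates"`
	} `json:"status"`
}

// fullUpdateBody round-trips the server object through the typed struct, as a
// full Update would. Fields the struct does not know about are dropped.
func fullUpdateBody(server []byte, gates []string) ([]byte, error) {
	var fg featureGate
	if err := json.Unmarshal(server, &fg); err != nil {
		return nil, err
	}
	fg.Status.FeatureGates = gates
	return json.Marshal(fg)
}

// mergePatchBody names only the field being set.
func mergePatchBody(gates []string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"status": map[string]any{"featureGates": gates},
	})
}

// applyMergePatch is a simplified merge: nested objects merge recursively,
// null deletes a key, and any other value replaces the target.
func applyMergePatch(doc, patch map[string]any) map[string]any {
	for k, v := range patch {
		if v == nil {
			delete(doc, k)
			continue
		}
		pv, pok := v.(map[string]any)
		dv, dok := doc[k].(map[string]any)
		if pok && dok {
			doc[k] = applyMergePatch(dv, pv)
			continue
		}
		doc[k] = v
	}
	return doc
}

// patchedObject applies a merge patch to the server's copy of the object.
func patchedObject(server, patch []byte) ([]byte, error) {
	var doc, p map[string]any
	if err := json.Unmarshal(server, &doc); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(patch, &p); err != nil {
		return nil, err
	}
	return json.Marshal(applyMergePatch(doc, p))
}

func main() {
	server := []byte(`{"status":{"featureGates":["A"],"futureField":"keep-me"}}`)
	gates := []string{"A", "B"}

	upd, err := fullUpdateBody(server, gates)
	if err != nil {
		panic(err)
	}
	fmt.Println("update body:", string(upd))

	patch, err := mergePatchBody(gates)
	if err != nil {
		panic(err)
	}
	merged, err := patchedObject(server, patch)
	if err != nil {
		panic(err)
	}
	fmt.Println("after patch:", string(merged))
}
```

The update body no longer contains `futureField`, while the patched object keeps it, which is the failure mode the review comment warns about.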

…te status

kas-bootstrap is a tool that runs the prerequisite actions for bootstrapping the kas during cluster creation (or upgrade).
It applies some CRDs rendered by the cluster-config-operator and updates the featureGate CR status by appending the git FeatureGate status.

It aims to alleviate the fragility of the current bash scripts.
@enxebre enxebre force-pushed the kas-bootstrap-bin branch from d2eb26f to 9a6036b Compare March 25, 2025 09:20
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2025
@enxebre enxebre force-pushed the kas-bootstrap-bin branch from 9a6036b to 7a82544 Compare March 25, 2025 11:36
Member Author

enxebre commented Mar 25, 2025

/test e2e-kubevirt-aws-ovn-reduced

Member Author

enxebre commented Mar 25, 2025

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"
/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"

Contributor

csrwng commented Mar 25, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2025
Contributor

openshift-ci bot commented Mar 25, 2025

@enxebre: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main
  • Red Hat Konflux / hypershift-operator-main-on-pull-request

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-aks
  • ci/prow/e2e-aks-4-18
  • ci/prow/e2e-aws
  • ci/prow/e2e-aws-4-18
  • ci/prow/e2e-aws-upgrade-hypershift-operator
  • ci/prow/e2e-kubevirt-aws-ovn-reduced
  • ci/prow/images
  • ci/prow/okd-scos-e2e-aws-ovn
  • ci/prow/security
  • ci/prow/unit
  • ci/prow/verify
  • pull-ci-openshift-hypershift-main-e2e-aks
  • pull-ci-openshift-hypershift-main-e2e-aks-4-18
  • pull-ci-openshift-hypershift-main-e2e-aws
  • pull-ci-openshift-hypershift-main-e2e-aws-4-18
  • pull-ci-openshift-hypershift-main-e2e-aws-upgrade-hypershift-operator
  • pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
  • pull-ci-openshift-hypershift-main-images
  • pull-ci-openshift-hypershift-main-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-hypershift-main-security
  • pull-ci-openshift-hypershift-main-unit
  • pull-ci-openshift-hypershift-main-verify
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 7be9e13 and 2 for PR HEAD 7a82544 in total

Member Author

enxebre commented Mar 26, 2025

/override "Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main"
/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"

Contributor

openshift-ci bot commented Mar 26, 2025

@enxebre: Overrode contexts on behalf of enxebre: Red Hat Konflux / hypershift-operator-main-enterprise-contract / hypershift-operator-main, Red Hat Konflux / hypershift-operator-main-on-pull-request

Member Author

enxebre commented Mar 26, 2025

/test verify

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 879dadb and 2 for PR HEAD 7a82544 in total

Contributor

openshift-ci bot commented Mar 27, 2025

@enxebre: all tests passed!

@openshift-merge-bot openshift-merge-bot bot merged commit a1ef7b8 into openshift:main Mar 27, 2025
16 of 18 checks passed
@openshift-ci-robot

@enxebre: Jira Issue OCPBUGS-46379: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-46379 has been moved to the MODIFIED state.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: hypershift
This PR has been included in build ose-hypershift-container-v4.20.0-202503270540.p0.ga1ef7b8.assembly.stream.el9.
All builds following this will include this PR.

Contributor

rtheis commented Mar 28, 2025

Thank you @enxebre. Are you all okay with or considering a cherry-pick of this fix to 4.18?

enxebre added a commit to enxebre/hypershift that referenced this pull request Mar 31, 2025
Follow up for openshift#5871
It aims to alleviate the current bash scripts fragility and cover existing and any upcoming changes to this logic with appropriate test coverage
Member Author

enxebre commented Mar 31, 2025

Thank you @enxebre. Are you all okay with or considering a cherry-pick of this fix to 4.18?

I think we can backport this and #5937 together

enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 1, 2025
This was introduced here openshift#5871
It shouldn't be needed now openshift#5937 is merged
enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 9, 2025
This was introduced here openshift#5871
It shouldn't be needed now openshift#5937 is merged
enxebre added a commit to enxebre/hypershift that referenced this pull request Apr 21, 2025
This was introduced here openshift#5871
It shouldn't be needed now openshift#5937 is merged
Contributor

rtheis commented Apr 30, 2025

@enxebre do you still think that a backport is doable for these changes?

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

7 participants