Skip to content

Conversation

@clobrano
Copy link
Contributor

This change configures the podman-etcd resource to allow restarts in case of a start failure by setting . This is a prerequisite for the resource-agent to attempt restarts, improving the resilience of the etcd cluster. The actual restart logic is handled by the resource-agent itself.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Nov 17, 2025

@clobrano: This pull request references OCPEDGE-2231 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

Details

In response to this:

This change configures the podman-etcd resource to allow restarts in case of a start failure by setting . This is a prerequisite for the resource-agent to attempt restarts, improving the resilience of the etcd cluster. The actual restart logic is handled by the resource-agent itself.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 17, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 17, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 17, 2025
@clobrano clobrano changed the title OCPEDGE-2231: feat: Allow podman-etcd resource-agent to restart on start failure OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure Nov 17, 2025
@coderabbitai
Copy link

coderabbitai bot commented Nov 17, 2025

Walkthrough

Added a migration-threshold=5 meta attribute to etcd and kubelet PCS resource creation commands; added a PCS property command to set start-failure-is-fatal=false after cluster start. No control flow or error-handling changes.

Changes

Cohort / File(s) Summary
PCS resource configuration updates
pkg/tnf/pkg/pcs/cluster.go, pkg/tnf/pkg/pcs/etcd.go
Appended migration-threshold=5 meta attribute to resource creation commands. In cluster.go, added /usr/sbin/pcs property set start-failure-is-fatal=false executed after cluster start.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

✨ Finishing touches
  • 📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2025
@clobrano clobrano force-pushed the enhancement/podman-etcd-resource-create-with-start-failure-is-fatal-false branch 4 times, most recently from a3a171f to db85da4 Compare November 21, 2025 13:25
@clobrano clobrano marked this pull request as ready for review November 28, 2025 09:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 28, 2025
@openshift-ci openshift-ci bot requested review from jaypoulz and slintes November 28, 2025 09:57
@clobrano
Copy link
Contributor Author

clobrano commented Dec 2, 2025

/retest-required

Copy link

@fonta-rh fonta-rh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Great addition! This change is very, very important, specially during the cluster bootstrap process, where there are a lot of transitions happening. It gives us a lot more leeway in handling race conditions that don't end up in fatal situations, just minor deviations from the default timeline.

"/usr/sbin/pcs cluster enable --all",
"/usr/sbin/pcs cluster sync",
"/usr/sbin/pcs cluster reload corosync",
"/usr/sbin/pcs property set start-failure-is-fatal=false",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a global property, would we want to move this line before the "resource create......"? So that it's in effect before the resource even starts up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, it makes sense 👍

// Note: the kubelet service needs to be disabled when using systemd agent
// Done by after-setup jobs on both nodes
"/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true",
"/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true migration-threshold=5",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment explaining why 5 attempts (IE: we can't set infinity because it might get stuck in an endless loop but we want to give enough time for transient issues to smooth over)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll add the comment

This change configures the TNF cluster to allow restarts in case of a
start failure by setting the attribute `start-failure-is-fatal=false`.

This is a prerequisite for the resource-agents to attempt restarts upon
failures during their start action.
@clobrano clobrano force-pushed the enhancement/podman-etcd-resource-create-with-start-failure-is-fatal-false branch from db85da4 to 7b43e41 Compare January 9, 2026 15:40
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/cluster.go (1)

36-37: LGTM! Clear explanation of migration-threshold.

The addition of migration-threshold=5 with an explanatory comment is a good practice. Setting a finite threshold prevents indefinite restart attempts that could impact cluster stability.

💡 Optional: Clarify post-threshold behavior

Consider briefly noting what happens after 5 failures (e.g., "resource stops attempting restarts"):

-// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. This provides a safe limit on retries.
+// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. After 5 failures, the resource stops attempting restarts.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between db85da4 and 7b43e41.

📒 Files selected for processing (2)
  • pkg/tnf/pkg/pcs/cluster.go
  • pkg/tnf/pkg/pcs/etcd.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pkg/pcs/etcd.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/tnf/pkg/pcs/cluster.go
🔇 Additional comments (1)
pkg/tnf/pkg/pcs/cluster.go (1)

33-33: Cluster-wide property correctly applies to etcd resource.

The start-failure-is-fatal=false property set via pcs property set is a cluster-level property that applies to all resources in the cluster, including the etcd resource. Placement is correct: the property is set after cluster start but before resource creation, ensuring all subsequently created resources (both kubelet in this file and etcd in etcd.go) inherit this setting.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 9, 2026

@clobrano: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 7b43e41 link true /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants