OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513

clobrano · 2025-11-17T15:14:40Z

This change configures the podman-etcd resource to allow restarts in case of a start failure by setting . This is a prerequisite for the resource-agent to attempt restarts, improving the resilience of the etcd cluster. The actual restart logic is handled by the resource-agent itself.

openshift-ci-robot · 2025-11-17T15:14:44Z

@clobrano: This pull request references OCPEDGE-2231 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

Details

In response to this:

This change configures the podman-etcd resource to allow restarts in case of a start failure by setting . This is a prerequisite for the resource-agent to attempt restarts, improving the resilience of the etcd cluster. The actual restart logic is handled by the resource-agent itself.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-11-17T15:14:44Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2025-11-17T15:15:01Z

Walkthrough

Added a migration-threshold=5 meta attribute to etcd and kubelet PCS resource creation commands; added a PCS property command to set start-failure-is-fatal=false after cluster start. No control flow or error-handling changes.

Changes

Cohort / File(s)	Summary
PCS resource configuration updates `pkg/tnf/pkg/pcs/cluster.go`, `pkg/tnf/pkg/pcs/etcd.go`	Appended `migration-threshold=5` meta attribute to resource creation commands. In `cluster.go`, added `/usr/sbin/pcs property set start-failure-is-fatal=false` executed after cluster start.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

✨ Finishing touches

📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

clobrano · 2025-12-02T09:47:29Z

/retest-required

fonta-rh

/lgtm
Great addition! This change is very, very important, specially during the cluster bootstrap process, where there are a lot of transitions happening. It gives us a lot more leeway in handling race conditions that don't end up in fatal situations, just minor deviations from the default timeline.

fonta-rh · 2026-01-09T10:52:20Z

pkg/tnf/pkg/pcs/cluster.go

 		"/usr/sbin/pcs cluster enable --all",
 		"/usr/sbin/pcs cluster sync",
 		"/usr/sbin/pcs cluster reload corosync",
+		"/usr/sbin/pcs property set start-failure-is-fatal=false",


Since it's a global property, would we want to move this line before the "resource create......"? So that it's in effect before the resource even starts up

Ah, yes, it makes sense 👍

fonta-rh · 2026-01-09T10:56:10Z

pkg/tnf/pkg/pcs/cluster.go

 		// Note: the kubelet service needs to be disabled when using systemd agent
 		// Done by after-setup jobs on both nodes
-		"/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true",
+		"/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true migration-threshold=5",


I would add a comment explaining why 5 attempts (IE: we can't set infinity because it might get stuck in an endless loop but we want to give enough time for transient issues to smooth over)

Good point, I'll add the comment

This change configures the TNF cluster to allow restarts in case of a start failure by setting the attribute `start-failure-is-fatal=false`. This is a prerequisite for the resource-agents to attempt restarts upon failures during their start action.

openshift-ci · 2026-01-09T15:40:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/tnf/OWNERS~~ [clobrano,fonta-rh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/tnf/pkg/pcs/cluster.go (1)
36-37: LGTM! Clear explanation of migration-threshold.

The addition of migration-threshold=5 with an explanatory comment is a good practice. Setting a finite threshold prevents indefinite restart attempts that could impact cluster stability.
💡 Optional: Clarify post-threshold behavior

Consider briefly noting what happens after 5 failures (e.g., "resource stops attempting restarts"):
-// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. This provides a safe limit on retries.
+// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. After 5 failures, the resource stops attempting restarts.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between db85da4 and 7b43e41.

📒 Files selected for processing (2)

pkg/tnf/pkg/pcs/cluster.go
pkg/tnf/pkg/pcs/etcd.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/tnf/pkg/pcs/etcd.go

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

pkg/tnf/pkg/pcs/cluster.go

🔇 Additional comments (1)

pkg/tnf/pkg/pcs/cluster.go (1)

33-33: Cluster-wide property correctly applies to etcd resource.

The start-failure-is-fatal=false property set via pcs property set is a cluster-level property that applies to all resources in the cluster, including the etcd resource. Placement is correct: the property is set after cluster start but before resource creation, ensuring all subsequently created resources (both kubelet in this file and etcd in etcd.go) inherit this setting.

openshift-ci · 2026-01-09T19:26:18Z

@clobrano: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-single-node	`7b43e41`	link	true	`/test e2e-aws-ovn-single-node`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 17, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 17, 2025

clobrano changed the title ~~OCPEDGE-2231: feat: Allow podman-etcd resource-agent to restart on start failure~~ OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure Nov 17, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2025

clobrano force-pushed the enhancement/podman-etcd-resource-create-with-start-failure-is-fatal-false branch 4 times, most recently from a3a171f to db85da4 Compare November 21, 2025 13:25

clobrano marked this pull request as ready for review November 28, 2025 09:55

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 28, 2025

openshift-ci bot requested review from jaypoulz and slintes November 28, 2025 09:57

fonta-rh approved these changes Jan 9, 2026

View reviewed changes

clobrano force-pushed the enhancement/podman-etcd-resource-create-with-start-failure-is-fatal-false branch from db85da4 to 7b43e41 Compare January 9, 2026 15:40

coderabbitai bot reviewed Jan 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513

OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513

clobrano commented Nov 17, 2025

Uh oh!

openshift-ci-robot commented Nov 17, 2025 •

edited by openshift-ci bot

Loading

Uh oh!

openshift-ci bot commented Nov 17, 2025

Uh oh!

coderabbitai bot commented Nov 17, 2025 •

edited

Loading

Uh oh!

clobrano commented Dec 2, 2025

Uh oh!

fonta-rh left a comment •

edited

Loading

Uh oh!

fonta-rh Jan 9, 2026

Uh oh!

clobrano Jan 9, 2026

Uh oh!

fonta-rh Jan 9, 2026

Uh oh!

clobrano Jan 9, 2026

Uh oh!

openshift-ci bot commented Jan 9, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

openshift-ci bot commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513

Are you sure you want to change the base?

OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513

Conversation

clobrano commented Nov 17, 2025

Uh oh!

openshift-ci-robot commented Nov 17, 2025 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Nov 17, 2025

Uh oh!

coderabbitai bot commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

clobrano commented Dec 2, 2025

Uh oh!

fonta-rh left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fonta-rh Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

clobrano Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

fonta-rh Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

clobrano Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Jan 9, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

openshift-ci-robot commented Nov 17, 2025 •

edited by openshift-ci bot

Loading

coderabbitai bot commented Nov 17, 2025 •

edited

Loading

fonta-rh left a comment •

edited

Loading