OCPEDGE-2231: [TNF] feat: Allow podman-etcd resource-agent to restart on start failure #1513
base: main
@clobrano: This pull request references OCPEDGE-2231, which is a valid Jira issue.
Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.
Skipping CI for Draft Pull Request. |
WalkthroughAdded a Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
a3a171f to
db85da4
Compare
|
/retest-required |
/lgtm
Great addition! This change is very important, especially during the cluster bootstrap process, where a lot of transitions are happening. It gives us much more leeway in handling race conditions that don't end in fatal situations, just minor deviations from the default timeline.
pkg/tnf/pkg/pcs/cluster.go
Outdated
  "/usr/sbin/pcs cluster enable --all",
  "/usr/sbin/pcs cluster sync",
  "/usr/sbin/pcs cluster reload corosync",
+ "/usr/sbin/pcs property set start-failure-is-fatal=false",
Since it's a global property, should we move this line before the "resource create......" commands, so that it's in effect before the resource even starts up?
Ah, yes, it makes sense 👍
  // Note: the kubelet service needs to be disabled when using systemd agent
  // Done by after-setup jobs on both nodes
- "/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true",
+ "/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true migration-threshold=5",
I would add a comment explaining why 5 attempts (i.e., we can't set infinity because it might get stuck in an endless loop, but we want to give enough time for transient issues to smooth over).
Good point, I'll add the comment
This change configures the TNF cluster to allow restarts in case of a start failure by setting the attribute `start-failure-is-fatal=false`. This is a prerequisite for the resource-agents to attempt restarts upon failures during their start action.
db85da4 to
7b43e41
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: clobrano, fonta-rh
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/tnf/pkg/pcs/cluster.go (1)
36-37: LGTM! Clear explanation of migration-threshold.
The addition of `migration-threshold=5` with an explanatory comment is a good practice. Setting a finite threshold prevents indefinite restart attempts that could impact cluster stability.
💡 Optional: Clarify post-threshold behavior
Consider briefly noting what happens after 5 failures (e.g., "resource stops attempting restarts"):
-// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. This provides a safe limit on retries.
+// Note: Setting `migration-threshold=5` to prevent endless restart loops caused by the infinite default. After 5 failures, the resource stops attempting restarts.
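The interplay between `start-failure-is-fatal` and `migration-threshold` discussed in this thread can be illustrated with a small Go model. This is only a sketch under simplifying assumptions: `nodePolicy` and `mayRetryOnNode` are hypothetical names invented for illustration, the model ignores failure timeouts and manual cleanup, and it is not how Pacemaker's scheduler is implemented.

```go
package main

import "fmt"

// nodePolicy is a hypothetical, simplified model of the two settings
// discussed in this PR. It is NOT Pacemaker's scheduler.
type nodePolicy struct {
	startFailureIsFatal bool
	// migrationThreshold is the number of failures after which the
	// resource is no longer started on the node. In this model, 0
	// stands for "no limit" (assumption made for illustration).
	migrationThreshold int
}

// mayRetryOnNode reports whether, in this simplified model, another start
// attempt on the same node is still allowed after startFailures failures.
func (p nodePolicy) mayRetryOnNode(startFailures int) bool {
	if startFailures == 0 {
		return true // nothing has failed yet
	}
	if p.startFailureIsFatal {
		return false // a single start failure rules the node out
	}
	if p.migrationThreshold == 0 {
		return true // no limit: retries could go on forever
	}
	return startFailures < p.migrationThreshold
}

func main() {
	tnf := nodePolicy{startFailureIsFatal: false, migrationThreshold: 5}
	for failures := 0; failures <= 5; failures++ {
		fmt.Printf("failures=%d retry=%t\n", failures, tnf.mayRetryOnNode(failures))
	}
}
```

In this model the combination chosen in the PR allows retries after one to four failed starts and stops at the fifth, which is the "enough time for transient issues, but no endless loop" trade-off the reviewers describe.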
📒 Files selected for processing (2)
pkg/tnf/pkg/pcs/cluster.go
pkg/tnf/pkg/pcs/etcd.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/pkg/pcs/etcd.go
🔇 Additional comments (1)
pkg/tnf/pkg/pcs/cluster.go (1)
33-33: Cluster-wide property correctly applies to etcd resource.
The `start-failure-is-fatal=false` property set via `pcs property set` is a cluster-level property that applies to all resources in the cluster, including the etcd resource. Placement is correct: the property is set after cluster start but before resource creation, ensuring all subsequently created resources (both kubelet in this file and etcd in etcd.go) inherit this setting.
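The ordering requirement described above (property set before any resource is created) could be guarded with a small check like the following Go sketch. The command strings are taken from the diff in this PR, but `clusterCommands` lists only an illustrative subset in an assumed order, and `propertyPrecedesResources` is a hypothetical helper that does not exist in the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterCommands returns an illustrative subset of the pcs commands shown
// in this PR. The exact list and order in cluster.go are assumptions here.
func clusterCommands() []string {
	return []string{
		"/usr/sbin/pcs cluster enable --all",
		"/usr/sbin/pcs property set start-failure-is-fatal=false",
		"/usr/sbin/pcs resource create kubelet systemd:kubelet clone meta interleave=true migration-threshold=5",
	}
}

// propertyPrecedesResources (hypothetical helper) reports whether the given
// cluster property is set before the first "pcs resource create" command.
func propertyPrecedesResources(cmds []string, property string) bool {
	propIdx, firstCreate := -1, -1
	for i, c := range cmds {
		if propIdx < 0 && strings.Contains(c, "pcs property set "+property) {
			propIdx = i
		}
		if firstCreate < 0 && strings.Contains(c, "pcs resource create ") {
			firstCreate = i
		}
	}
	if propIdx < 0 {
		return false // the property is never set
	}
	return firstCreate < 0 || propIdx < firstCreate
}

func main() {
	ok := propertyPrecedesResources(clusterCommands(), "start-failure-is-fatal=false")
	fmt.Println("property set before resource creation:", ok)
}
```

A check of this kind would fit naturally in a unit test, so that a later reordering of the command list does not silently move the property after resource creation.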
|
@clobrano: The following test failed, say
This change configures the podman-etcd resource to allow restarts in case of a start failure by setting . This is a prerequisite for the resource-agent to attempt restarts, improving the resilience of the etcd cluster. The actual restart logic is handled by the resource-agent itself.