OCPEDGE-1880: TNF: handle node replacements #1523
base: main
Conversation
Commits (all Signed-off-by: Marc Sluiter <[email protected]>):
- For better overview, no actual changes yet
- …
- …
- …
- …
- … by Carlo
@slintes: This pull request references OCPEDGE-1880, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Refactors Two Node Fencing (TNF): adds job/controller orchestration with concurrency controls, node-handling and event-driven starter logic, new job utilities and tests, updates PCS/fencing and kubelet interactions, and introduces an update-setup runner and related CLI subcommand.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes. Areas needing extra attention:
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/after-setup/runner.go (1)
1-1: Package declaration doesn't match directory name.
The file at `pkg/tnf/after-setup/runner.go` declares `package auth`, but it should declare `package aftersetup` (or similar) to match its directory. Currently, there is also a separate `pkg/tnf/auth/runner.go` that declares `package auth`. Having both files declare the same package name in different directories creates confusion and violates Go naming conventions. Rename the package in after-setup/runner.go to match its directory.
🧹 Nitpick comments (13)
pkg/tnf/pkg/etcd/etcd.go (1)
86-91: Use `%d` for integer format verbs.
`CurrentRevision` and `LatestAvailableRevision` are `int32` values, but `%q` (quoted string) is used in the log messages. This will still work but produces unnecessarily quoted output.

```diff
 if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
-	klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is running the latest etcd revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
 } else {
-	klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is not running the latest etcd revision yet, expected %d, got %d", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
 	allUpdated = false
 }
```

pkg/tnf/pkg/jobs/utils_test.go (2)
317-350: Reactor ordering may cause the test to succeed for unintended reasons.
The `deleted` flag is set on the first `Get` call, but `DeleteAndWait` calls `Get` before `Delete` to retrieve the job UID. This means `deleted` becomes true before deletion actually occurs. The test passes because the fake client's default `Get` returns the job on the first call (before the reactor sets `deleted = true`), but this is fragile.
Consider tracking both the initial Get and post-delete Get separately:

```diff
-	// Make Get return NotFound after deletion
-	deleted := false
-	client.PrependReactor("get", "jobs", func(action k8stesting.Action) (handled bool, ret runtime.Object, err error) {
-		if deleted {
-			return true, nil, apierrors.NewNotFound(batchv1.Resource("jobs"), "test-job")
-		}
-		// First get after deletion initiates the wait loop
-		deleted = true
-		return false, nil, nil
-	})
+	// Track Get calls - DeleteAndWait does: Get (for UID) -> Delete -> poll Get
+	getCallCount := 0
+	client.PrependReactor("get", "jobs", func(action k8stesting.Action) (handled bool, ret runtime.Object, err error) {
+		getCallCount++
+		if getCallCount > 2 {
+			// After initial Get and one poll iteration, return NotFound
+			return true, nil, apierrors.NewNotFound(batchv1.Resource("jobs"), "test-job")
+		}
+		return false, nil, nil
+	})
```
385-407: Test will wait for the full 1-minute timeout.
The "Job deletion times out" test will block for up to 1 minute (the hardcoded timeout in `DeleteAndWait`) since the reactor always returns the job. This may slow down CI runs significantly.
Consider whether this timeout scenario is worth the wait time, or if the implementation could accept a timeout parameter for testability.
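If the faster-test option is preferred, one low-impact shape is to hoist the hardcoded timeout into a package-level value that tests can shorten. A minimal sketch under that assumption (the variable name and helper shape are illustrative, not the PR's actual API):

```go
package jobs

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteWaitTimeout is package-level so a test can temporarily shorten it.
var deleteWaitTimeout = 1 * time.Minute

// deleteAndWaitSketch deletes a job and polls until it is gone, honoring the
// configurable timeout above instead of a hardcoded minute.
func deleteAndWaitSketch(ctx context.Context, client kubernetes.Interface, name, ns string) error {
	if err := client.BatchV1().Jobs(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollUntilContextTimeout(ctx, time.Second, deleteWaitTimeout, true,
		func(ctx context.Context) (bool, error) {
			_, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
			return apierrors.IsNotFound(err), nil
		})
}
```

A timeout-path test could then set `deleteWaitTimeout = 200 * time.Millisecond` (restoring it afterwards) instead of blocking CI for the full minute.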
pkg/tnf/pkg/pcs/auth.go (1)
28-34: Consider using file I/O instead of a shell command for token file creation.
Using `echo` via shell to write the token file works, but writing directly with Go's `os.WriteFile` would be more robust and eliminate any shell escaping concerns, even though `%q` should handle the ClusterID safely.

```go
import "os"

// Direct file write instead of shell command
if err := os.WriteFile(TokenPath, []byte(tokenValue), 0600); err != nil {
	return false, fmt.Errorf("failed to create token file: %w", err)
}
```

pkg/tnf/update-setup/runner.go (3)
63-69: Consider propagating cluster status check error details.
The cluster status check returns `nil` when the cluster isn't running on this node, which is correct for the workflow. However, the error from `exec.Execute` is logged but silently ignored. If the error indicates something other than "cluster not running" (e.g., command not found, permission denied), this could mask underlying issues.
162-164: Hardcoded sleep for stabilization is a code smell.
The 10-second `time.Sleep` with the comment "without this the etcd start on the new node fails for some reason..." suggests a race condition that isn't fully understood. Consider replacing it with a proper condition check or at least making it configurable via a constant.

```diff
+const stabilizationDelay = 10 * time.Second
+
 // wait a bit for things to settle
-// without this the etcd start on the new node fails for some reason...
-time.Sleep(10 * time.Second)
+time.Sleep(stabilizationDelay)
```
180-190: Helper function signature has unconventional parameter order.
The `ctx` parameter is conventionally placed first in Go function signatures. This aids readability and follows the pattern used elsewhere in this codebase (e.g., `exec.Execute(ctx, command)`).

```diff
-func runCommands(commands []string, ctx context.Context) error {
+func runCommands(ctx context.Context, commands []string) error {
```

Then update the call sites at lines 121, 141, and 172 accordingly.
pkg/tnf/pkg/jobs/tnf_test.go (1)
135-137: Time-based synchronization in tests can cause flakiness.
The 100ms sleep to "give the goroutine a moment to start" is fragile. Consider using a synchronization primitive or polling with a short timeout instead.
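For example, instead of sleeping, the test could poll until the observable condition holds. A hedged sketch that would slot into the existing test; `controllerIsRunning` is a hypothetical helper for probing the shared controller state:

```go
// Requires "k8s.io/apimachinery/pkg/util/wait".
// Poll for the condition the sleep was papering over, with a short interval
// and a bounded timeout, so the test is fast when the goroutine starts quickly
// and still fails loudly if it never does.
err := wait.PollUntilContextTimeout(ctx, 10*time.Millisecond, 2*time.Second, true,
	func(ctx context.Context) (bool, error) {
		// controllerIsRunning is a hypothetical accessor over the
		// runningControllers map; the real test may need a different probe.
		return controllerIsRunning(jobName), nil
	})
if err != nil {
	t.Fatalf("controller did not start in time: %v", err)
}
```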
pkg/tnf/operator/nodehandler.go (2)
119-128: Clarify retry behavior for edge cases.
Returning `nil` for unsupported node counts (>2 or <2) exits without retry, which is intentional per the comments. However, the >2 case logs at Error level but doesn't propagate an error, which could mask configuration issues.
Consider returning an error for the >2 nodes case to surface it as a degraded condition, or downgrade the log level to match the non-error return:

```diff
 if len(nodeList) > 2 {
-	klog.Errorf("found more than 2 control plane nodes (%d), unsupported use case, no further steps are taken for now", len(nodeList))
+	klog.Warningf("found more than 2 control plane nodes (%d), unsupported use case, no further steps are taken for now", len(nodeList))
 	// don't retry
 	return nil
 }
```
302-311: Hard-coded timeout differs from configurable timeouts used elsewhere.
`waitForTnfAfterSetupJobsCompletion` uses a hard-coded 20-minute timeout, while other similar waits use constants from the `tools` package (e.g., `tools.AfterSetupJobCompletedTimeout`).
For consistency and maintainability:

```diff
 func waitForTnfAfterSetupJobsCompletion(ctx context.Context, kubeClient kubernetes.Interface, nodeList []*corev1.Node) error {
 	for _, node := range nodeList {
 		jobName := tools.JobTypeAfterSetup.GetJobName(&node.Name)
 		klog.Infof("Waiting for after-setup job %s to complete", jobName)
-		if err := jobs.WaitForCompletion(ctx, kubeClient, jobName, operatorclient.TargetNamespace, 20*time.Minute); err != nil {
+		if err := jobs.WaitForCompletion(ctx, kubeClient, jobName, operatorclient.TargetNamespace, tools.AfterSetupJobCompletedTimeout); err != nil {
 			return fmt.Errorf("failed to wait for after-setup job %s to complete: %w", jobName, err)
 		}
 	}
 	return nil
 }
```

pkg/tnf/pkg/jobs/tnf.go (1)
66-75: Direct index access to container and command arrays could panic.
Lines 72-73 assume `Containers[0]` exists and `Command` has at least 2 elements. If the job template is malformed or modified, this will panic.
Add defensive checks:

```diff
 func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
 	if nodeName != nil {
 		job.Spec.Template.Spec.NodeName = *nodeName
 	}
 	job.SetName(jobType.GetJobName(nodeName))
 	job.Labels["app.kubernetes.io/name"] = jobType.GetNameLabelValue()
+	if len(job.Spec.Template.Spec.Containers) == 0 {
+		return fmt.Errorf("job template has no containers")
+	}
+	if len(job.Spec.Template.Spec.Containers[0].Command) < 2 {
+		return fmt.Errorf("job template container command has fewer than 2 elements")
+	}
 	job.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
 	job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
 	return nil
 }},...,
```

pkg/tnf/pkg/jobs/utils.go (2)
28-50: Simplify redundant timeout handling.
The timeout is applied twice: once via `context.WithTimeout` and again in `PollUntilContextTimeout`. The inner context is sufficient.
Consider simplifying:

```diff
 func waitWithConditionFunc(ctx context.Context, kubeClient kubernetes.Interface, jobName string, jobNamespace string, timeout time.Duration, conditionFunc func(job batchv1.Job) bool) error {
-	timeoutCtx, cancel := context.WithTimeout(ctx, timeout)
-	defer cancel()
-
-	return wait.PollUntilContextTimeout(timeoutCtx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
+	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
```

Alternatively, if the double-timeout is intentional for edge cases, add a comment explaining why.
75-79: Potential silent failure during deletion polling.
Line 78 ignores non-NotFound errors from the Get call. If there's a persistent API error (e.g., a network issue), the poll will incorrectly succeed when `err != nil && !IsNotFound(err)`.
Handle non-NotFound errors explicitly:

```diff
 return wait.PollUntilContextTimeout(ctx, 5*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
 	newJob, err := kubeClient.BatchV1().Jobs(jobNamespace).Get(ctx, jobName, v1.GetOptions{})
-	// job might be recreated already, check UID
-	return apierrors.IsNotFound(err) || newJob.GetUID() != oldJobUID, nil
+	if apierrors.IsNotFound(err) {
+		return true, nil
+	}
+	if err != nil {
+		klog.Warningf("error checking job %s deletion status: %v", jobName, err)
+		return false, nil // retry on transient errors
+	}
+	// job might be recreated already, check UID
+	return newJob.GetUID() != oldJobUID, nil
 })
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (28)
- bindata/tnfdeployment/job.yaml (2 hunks)
- cmd/tnf-setup-runner/main.go (3 hunks)
- pkg/tnf/after-setup/runner.go (3 hunks)
- pkg/tnf/auth/runner.go (2 hunks)
- pkg/tnf/fencing/runner.go (3 hunks)
- pkg/tnf/operator/nodehandler.go (1 hunks)
- pkg/tnf/operator/nodehandler_test.go (1 hunks)
- pkg/tnf/operator/starter.go (3 hunks)
- pkg/tnf/operator/starter_test.go (6 hunks)
- pkg/tnf/pkg/config/cluster.go (4 hunks)
- pkg/tnf/pkg/etcd/etcd.go (2 hunks)
- pkg/tnf/pkg/jobs/jobcontroller.go (3 hunks)
- pkg/tnf/pkg/jobs/tnf.go (1 hunks)
- pkg/tnf/pkg/jobs/tnf_test.go (1 hunks)
- pkg/tnf/pkg/jobs/utils.go (1 hunks)
- pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
- pkg/tnf/pkg/kubelet/kubelet.go (1 hunks)
- pkg/tnf/pkg/pcs/auth.go (1 hunks)
- pkg/tnf/pkg/pcs/cluster.go (1 hunks)
- pkg/tnf/pkg/pcs/etcd.go (1 hunks)
- pkg/tnf/pkg/pcs/fencing.go (4 hunks)
- pkg/tnf/pkg/pcs/fencing_test.go (5 hunks)
- pkg/tnf/pkg/tools/conditions.go (0 hunks)
- pkg/tnf/pkg/tools/jobs.go (3 hunks)
- pkg/tnf/pkg/tools/nodes.go (1 hunks)
- pkg/tnf/pkg/tools/nodes_test.go (1 hunks)
- pkg/tnf/setup/runner.go (3 hunks)
- pkg/tnf/update-setup/runner.go (1 hunks)
💤 Files with no reviewable changes (1)
- pkg/tnf/pkg/tools/conditions.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/tnf/pkg/pcs/etcd.go
- pkg/tnf/setup/runner.go
- bindata/tnfdeployment/job.yaml
- pkg/tnf/pkg/etcd/etcd.go
- pkg/tnf/pkg/tools/nodes.go
- pkg/tnf/pkg/pcs/cluster.go
- pkg/tnf/auth/runner.go
- pkg/tnf/pkg/pcs/fencing.go
- cmd/tnf-setup-runner/main.go
- pkg/tnf/pkg/jobs/utils_test.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/pkg/kubelet/kubelet.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/pkg/jobs/utils.go
- pkg/tnf/pkg/tools/jobs.go
- pkg/tnf/after-setup/runner.go
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/pkg/pcs/fencing_test.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/jobs/jobcontroller.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/pkg/config/cluster.go
- pkg/tnf/pkg/pcs/auth.go
- pkg/tnf/pkg/tools/nodes_test.go
- pkg/tnf/operator/starter.go
- pkg/tnf/fencing/runner.go
🧬 Code graph analysis (13)
- pkg/tnf/setup/runner.go (3)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeAuth (26-26)
  - pkg/tnf/pkg/jobs/utils.go: IsConditionTrue (94-96)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
- pkg/tnf/pkg/etcd/etcd.go (2)
  - pkg/etcdcli/interfaces.go: Status (45-47)
  - pkg/operator/ceohelpers/common.go: CurrentRevision (220-238)
- pkg/tnf/auth/runner.go (1)
  - pkg/tnf/pkg/pcs/auth.go: Authenticate (20-56)
- cmd/tnf-setup-runner/main.go (2)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeUpdateSetup (30-30)
  - pkg/tnf/update-setup/runner.go: RunTnfUpdateSetup (25-178)
- pkg/tnf/pkg/jobs/utils_test.go (1)
  - pkg/tnf/pkg/jobs/utils.go: IsComplete (86-88), IsFailed (90-92), WaitForCompletion (23-26), DeleteAndWait (53-80)
- pkg/tnf/update-setup/runner.go (4)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfig (23-25)
  - pkg/tnf/pkg/etcd/etcd.go: WaitForUpdatedRevision (69-98)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
- pkg/tnf/pkg/kubelet/kubelet.go (1)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
- pkg/tnf/pkg/jobs/tnf_test.go (4)
  - pkg/testutils/testutils.go: StaticPodOperatorStatus (211-222)
  - pkg/operator/operatorclient/interfaces.go: TargetNamespace (7-7)
  - pkg/tnf/pkg/jobs/tnf.go: RunTNFJobController (37-78), RestartJobOrRunController (80-136)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30)
- pkg/tnf/pkg/jobs/jobcontroller.go (1)
  - pkg/tnf/pkg/jobs/utils.go: IsComplete (86-88), IsFailed (90-92)
- pkg/tnf/pkg/config/cluster.go (1)
  - pkg/tnf/pkg/tools/nodes.go: GetNodeIPForPacemaker (24-43)
- pkg/tnf/pkg/tools/nodes_test.go (1)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19), GetNodeIPForPacemaker (24-43)
- pkg/tnf/operator/starter.go (5)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19)
  - pkg/tnf/pkg/tools/secrets.go: IsFencingSecret (32-34)
  - pkg/tnf/pkg/jobs/tnf.go: RestartJobOrRunController (80-136)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeFencing (29-29)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30)
- pkg/tnf/fencing/runner.go (4)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeSetup (27-27)
  - pkg/tnf/pkg/jobs/utils.go: IsConditionTrue (94-96)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfigIgnoreMissingNode (28-30)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
🔇 Additional comments (52)
pkg/tnf/pkg/pcs/etcd.go (1)
25-26: LGTM! Adding `migration-threshold=5` is a sensible configuration to allow transient failures before Pacemaker migrates the etcd resource. This aligns with the similar change for the kubelet resource in `cluster.go`.

bindata/tnfdeployment/job.yaml (2)
6-6: LGTM! The component label improves resource filtering and aligns with Kubernetes labeling conventions.
28-32: LGTM! Correct use of the Downward API to inject the node name. This enables node-aware operations within TNF setup jobs.
pkg/tnf/pkg/pcs/cluster.go (2)
35-35: LGTM! Consistent with the `migration-threshold=5` added to the etcd resource.
39-39: LGTM! Setting `start-failure-is-fatal=false` improves cluster resilience by allowing Pacemaker to retry resource starts after transient failures.

pkg/tnf/pkg/tools/nodes.go (2)
12-19: LGTM! Clean implementation that correctly handles all cases including a missing NodeReady condition.
24-43: LGTM with a minor observation. The implementation correctly prioritizes internal IPs and validates them with `net.ParseIP`. The fallback to the first address ensures resilience when no valid internal IP exists. Note: if an invalid internal IP is present (e.g., a malformed string), it's silently skipped and may fall back to the first address (which could be a hostname or external IP). The tests cover this case, so this appears intentional for robustness.
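For readers following along, a condensed sketch of the selection order described here; this illustrates the documented behavior, not the exact code from the PR:

```go
package tools

import (
	"fmt"
	"net"

	corev1 "k8s.io/api/core/v1"
)

// getPacemakerIP prefers a parseable InternalIP and otherwise falls back to
// the first listed address, mirroring the behavior described in the review.
func getPacemakerIP(node *corev1.Node) (string, error) {
	if len(node.Status.Addresses) == 0 {
		return "", fmt.Errorf("node %s has no addresses", node.Name)
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP && net.ParseIP(addr.Address) != nil {
			return addr.Address, nil
		}
	}
	// No valid internal IP found: fall back to the first address.
	return node.Status.Addresses[0].Address, nil
}
```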
pkg/tnf/pkg/tools/nodes_test.go (2)
11-135: LGTM! Comprehensive table-driven tests covering all meaningful scenarios for `IsNodeReady`.
137-354: Excellent test coverage. The tests thoroughly cover IPv4/IPv6, fallback behavior, error cases, and the IPv4-mapped IPv6 normalization edge case.
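The table-driven style praised here looks roughly like the sketch below, assuming an `IsNodeReady(*corev1.Node) bool` helper; the real tests cover far more cases:

```go
package tools

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

func TestIsNodeReadySketch(t *testing.T) {
	tests := []struct {
		name string
		node *corev1.Node
		want bool
	}{
		{
			name: "ready condition true",
			node: &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionTrue},
			}}},
			want: true,
		},
		{
			name: "no ready condition",
			node: &corev1.Node{},
			want: false,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := IsNodeReady(tt.node); got != tt.want {
				t.Errorf("IsNodeReady() = %v, want %v", got, tt.want)
			}
		})
	}
}
```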
pkg/tnf/pkg/etcd/etcd.go (2)
49-66: LGTM! Clean refactoring that modularizes the revision wait logic. The status update happens only after successful revision synchronization, which is correct.
68-98: LGTM! The new exported `WaitForUpdatedRevision` function enables reuse across the TNF orchestration flow. The polling parameters (10s interval, 10min timeout, immediate=false) are reasonable for allowing CEO time to create revisions.
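For context, the polling shape described here corresponds to something like the sketch below; the real `WaitForUpdatedRevision` signature and condition logic live in `pkg/tnf/pkg/etcd/etcd.go` and may differ:

```go
package etcd

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForUpdatedRevisionSketch polls a caller-supplied check every 10s for up
// to 10 minutes, without an immediate first tick (immediate=false), giving CEO
// time to create and roll out a new revision before the first evaluation.
func waitForUpdatedRevisionSketch(ctx context.Context, allNodesUpdated func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, false, allNodesUpdated)
}
```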
pkg/tnf/pkg/config/cluster.go (1)
12-20: Clarify the single-node contract when using `ignoreMissingNode`. The new `getClusterConfig` allows a single master node when `ignoreMissingNode` is true, leaving `NodeName2`/`NodeIP2` empty. That's reasonable, but it makes the API contract asymmetric with `GetClusterConfig`. Please ensure all callers of `GetClusterConfigIgnoreMissingNode` explicitly handle the "only Node1 populated" case (especially where both node names/IPs are used to build pcs commands) and consider documenting this behavior in the function comment for future maintainers. Also applies to: 22-32, 44-49, 55-71.
pkg/tnf/pkg/kubelet/kubelet.go (1)
11-15: Good centralization of kubelet disabling logic. Wrapping the `systemctl disable kubelet` call in `kubelet.Disable` improves reuse and testability while relying on the existing `exec.Execute` plumbing and logging. Looks solid.

pkg/tnf/pkg/jobs/jobcontroller.go (1)
246-267: Consistent job predicates via shared helpers. Switching to `IsComplete`/`IsFailed` centralizes job state evaluation and keeps the Available/Progressing/degraded logic consistent with other TNF job flows. The usage here looks correct. Also applies to: 271-294, 305-310.
pkg/tnf/setup/runner.go (1)
22-25: Stricter auth job completion check and updated fencing invocation look sound. Requiring exactly two auth jobs and using `jobs.IsConditionTrue(..., JobComplete)` should make the setup phase more robust against partial or stale jobs, with the existing timeout guarding against hangs. Passing `[]string{cfg.NodeName1, cfg.NodeName2}` into `ConfigureFencing` matches the new API and keeps node ordering explicit for `pcmk_delay_base` handling. No issues from a correctness/maintainability perspective. Also applies to: 57-81, 102-104.
cmd/tnf-setup-runner/main.go (1)
23-24: Update-setup command is wired consistently with existing TNF subcommands. The new `NewUpdateSetupCommand` mirrors the existing auth/setup/after-setup/fencing commands (same error handling and naming via `JobTypeUpdateSetup.GetSubCommand()`), so the CLI surface stays consistent. Looks good. Also applies to: 58-63, 119-130.
pkg/tnf/pkg/tools/jobs.go (1)
14-19: Job type and timeout extensions are safe and consistent. Adding `JobTypeUpdateSetup` at the end of the enum preserves existing values, and wiring it to "update-setup" via `GetSubCommand` matches the established pattern. The new `AfterSetupJobCompletedTimeout` follows the existing timeout scheme. No concerns. Also applies to: 25-31, 33-45.
pkg/tnf/auth/runner.go (1)
13-14: PCS auth centralization and improved error logging look correct. Delegating to `pcs.Authenticate` and logging failures for both cluster config retrieval and authentication simplifies this runner and makes failures more observable. Continuing to use `GetClusterConfig` (not the ignore-missing variant) is consistent with pcs authentication requiring both nodes. Ignoring the boolean return from `Authenticate` is fine as long as it remains a simple success flag, which it currently is. Also applies to: 50-61.
pkg/tnf/pkg/pcs/fencing_test.go (1)
275-277: Updated fencing command expectations match the new option and timeout behavior. The revised `want` strings correctly assert inclusion of `pcmk_delay_base` and `ssl_insecure` (with appropriate defaults) and the increased `--wait=120` for the affected cases, which should guard against regressions in the updated stonith command construction. Also applies to: 292-293, 326-327, 344-345, 366-367.
pkg/tnf/pkg/jobs/utils_test.go (2)
18-84: LGTM! Well-structured table-driven tests for `IsComplete` with good coverage of edge cases, including empty conditions and the presence of only the opposite condition type.
86-152: LGTM! Symmetric test coverage for `IsFailed` matching the `IsComplete` tests. Good consistency.

pkg/tnf/pkg/pcs/auth.go (2)
19-56: LGTM on the overall authentication flow. The function properly retrieves the cluster ID, creates a token file with restricted permissions, and executes the PCS authentication sequence with appropriate error handling at each step.
49-53: The command string uses node names and IPs from the Kubernetes API, which enforces naming constraints preventing shell metacharacters. Node names and IP addresses originate from the Kubernetes API (via `node.Name` and `node.Status.Addresses`), not from user input. Kubernetes enforces DNS-1123 naming rules that restrict node names to alphanumeric characters, hyphens, and periods, and IP addresses are validated as proper IP format. These constraints prevent shell injection, making command injection infeasible in this context. Likely an incorrect or invalid review comment.
pkg/tnf/after-setup/runner.go (2)
51-68: LGTM on the setup job completion check. The logic correctly validates exactly one setup job exists and checks completion using the centralized `jobs.IsConditionTrue` utility. Good refactoring to use shared utilities.
76-81: Good refactoring to use the dedicated kubelet package. Switching from direct `exec` to `kubelet.Disable(ctx)` improves maintainability and encapsulates the kubelet management logic.

pkg/tnf/operator/nodehandler_test.go (3)
34-158: Comprehensive test coverage for node handling scenarios. The test table covers all critical paths: insufficient/excess nodes, node readiness combinations, successful flows with and without existing jobs, and error propagation. Well-structured test design.
232-265: Good use of function variable mocking with proper cleanup. The mock injection pattern using package-level function variables with deferred restoration ensures tests don't leak state. This is a clean approach for testing functions with external dependencies.
302-344: LGTM on helper functions. The helper functions are minimal and focused, creating test fixtures with just the required fields for the test scenarios.
pkg/tnf/pkg/pcs/fencing.go (3)
71-74: Good defensive check for empty node names. Skipping empty entries prevents errors when processing partial node lists during replacement scenarios.
85-91: Good differentiation of delay values between first and subsequent nodes. Applying different `pcmk_delay_base` values prevents fencing races. The comment explains the rationale clearly.
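As an illustration of the staggering idea: the first fencing device gets a longer `pcmk_delay_base` than the rest so that both nodes cannot fence each other at the same instant. The concrete values (10s/1s) come from this review's later comments; the helper names and option map shape below are assumptions, not the PR's code:

```go
// pcmkDelayBaseFor staggers fencing delays: a longer delay for the first
// device, a short one for the others, so simultaneous mutual fencing is avoided.
func pcmkDelayBaseFor(index int) string {
	if index == 0 {
		return "10s"
	}
	return "1s"
}

// buildFencingDeviceOptions skips empty node names (e.g. a not-yet-replaced
// node) and assigns the staggered delay per remaining device.
func buildFencingDeviceOptions(nodeNames []string) []map[string]string {
	options := make([]map[string]string, 0, len(nodeNames))
	for i, name := range nodeNames {
		if name == "" {
			continue // placeholder for a node that is not present (yet)
		}
		options = append(options, map[string]string{
			"node":            name,
			"pcmk_delay_base": pcmkDelayBaseFor(i),
		})
	}
	return options
}
```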
242-244: Good explicit handling of `ssl_insecure` in both cases. Always setting `ssl_insecure` explicitly (to "1" or "0") ensures a consistent configuration state rather than relying on defaults.

pkg/tnf/update-setup/runner.go (1)
25-53: Client initialization and signal handling look good. The setup correctly uses the in-cluster config, creates the necessary clients, and properly wires up SIGTERM/SIGINT handling with context cancellation. The dynamic informers are started and synced before use.
pkg/tnf/fencing/runner.go (2)
52-73: Setup job polling logic is correct and well-structured. The renamed `setupJobs` variable improves clarity. The use of `jobs.IsConditionTrue` aligns with the refactored utility location. The logging improvements provide better debugging context.
84-92: Good use of `GetClusterConfigIgnoreMissingNode` for node replacement scenarios. Using the ignore-missing-node variant correctly handles transient states during node replacements where a node might not yet be registered. The explicit node names slice passed to `ConfigureFencing` aligns with the updated function signature.

pkg/tnf/operator/starter.go (5)
70-87: Node add handler correctly filters by readiness before processing. The goroutine dispatch prevents blocking the informer, and the readiness check avoids processing nodes that aren't ready yet. This is a good pattern for event-driven handling.
88-105: Update handler correctly detects ready-state transitions. Only triggering on the `!oldReady && newReady` transition prevents redundant processing while catching the important case of a node becoming ready.
219-239: Secret data change detection logic is thorough. The byte-level comparison of secret data prevents unnecessary job restarts when non-data metadata changes. The early return when `!changed` is efficient.
246-250: RestartJobOrRunController consolidates job restart logic well. Using the centralized restart mechanism with proper locking and timeout handling improves reliability over the previous inline logic.
106-118: Concurrent node deletion calls are properly serialized via mutex. The `handleNodesWithRetry` function is protected by `handleNodesMutex`, ensuring only one execution runs at a time regardless of goroutine count. Rapid node churn will trigger multiple goroutines, but they serialize at the mutex. The retry logic uses exponential backoff (5s initial, 2x factor, 2min cap, ~10 minutes total), and errors are properly surfaced by degrading the operator status. The design is deliberate and safe.

pkg/tnf/pkg/jobs/tnf_test.go (3)
29-153: TestRunTNFJobController provides good coverage of controller lifecycle. The test cases cover:
- Controllers with/without node names
- Controller deduplication when already running
- Different job types running concurrently
- Different nodes for same job type
The global state reset between tests is correctly implemented.
155-291: TestRestartJobOrRunController covers key scenarios effectively. Test cases properly cover:
- Job non-existence (controller only)
- Job exists and completes (wait + delete + controller)
- API errors propagation
- Timeout scenarios
- Delete failures
The reactor-based mocking for delete/get behavior is well-structured.
374-485: Parallel execution test validates locking semantics. The test correctly verifies that concurrent calls result in only one delete operation, validating the job-level locking mechanism works as designed.
pkg/tnf/operator/starter_test.go (3)
35-35: Import addition for `corev1listers` is necessary for the signature changes. The import supports the new NodeLister parameter type used throughout the tests.
278-321: Mock function signature updates align with production code changes. The updated `setupMockHandleNodes` return type correctly reflects the new `handleNodesFunc` signature using `NodeLister` instead of a node slice. This allows the production code to lazily list nodes when needed rather than requiring an upfront snapshot.
328-352: Test scaffolding correctly creates NodeLister from informer. Creating the `controlPlaneNodeLister` from the informer's indexer mirrors production behavior. The updated call to `handleNodesWithRetry` with the new parameter order is correct.
30-50: Well-structured concurrency control and retry configuration.The mutex protection and testable function variables are good patterns for this type of orchestration code. The backoff configuration with ~10 minutes total retry time is reasonable for infrastructure operations.
52-102: Solid retry logic with proper operator status updates.The exponential backoff pattern with status condition updates on both success and failure paths is appropriate for operator-level error handling. Good use of
v1helpers.UpdateConditionFnfor status management.
168-204: Controller startup sequence looks correct.The sequence of waiting for informer sync, bootstrap completion, then starting job controllers is appropriate. Starting controllers in a loop per node for Auth/AfterSetup and then Setup/Fencing globally follows the expected TNF topology.
pkg/tnf/pkg/jobs/tnf.go (2)
25-35: Global state tracking is appropriate for this use case.The dual-map approach with separate mutexes for running controllers and restart locks provides proper isolation. This enables both duplicate prevention and serialized restarts.
80-136: RestartJobOrRunController implements correct serialization pattern.The per-job locking with lazy initialization properly serializes restarts for the same job while allowing parallel operations on different jobs. The wait-then-delete pattern enables controller-based recreation.
pkg/tnf/pkg/jobs/utils.go (2)
52-80: DeleteAndWait has good UID-based deletion verification. The pattern of capturing the old UID and checking for either NotFound or a different UID handles the recreation race condition correctly. The 1-minute hardcoded timeout is reasonable for deletion operations.
82-105: Clean and reusable condition checking utilities. The `IsStopped`, `IsComplete`, `IsFailed`, and condition helper functions provide a clear API for job status evaluation. Good separation of concerns.
Signed-off-by: Marc Sluiter <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/update-setup/runner.go (1)
146-160: Inconsistent error handling for member removal. Lines 149-150 log the error and continue, while line 157 returns the error. If the member list command fails, the code skips removal entirely (acceptable cleanup behavior), but if removal fails after finding a member, it returns the error. Consider adding a brief comment explaining this asymmetry, or making the behavior consistent.
🧹 Nitpick comments (1)
pkg/tnf/pkg/jobs/tnf.go (1)
31-35: Consider cleanup for `restartJobLocks` to prevent unbounded growth. The `restartJobLocks` map stores per-job mutexes that are never removed. For long-running operators with many job restarts, this could lead to unbounded memory growth. Consider implementing cleanup or using a bounded cache.
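One possible cleanup strategy, sketched under the assumption that a reference-counted entry can safely be dropped once no caller holds it; names are illustrative and a bounded cache would be an equally valid alternative:

```go
package jobs

import "sync"

// refLock pairs a mutex with a reference count so the entry can be removed
// from the map once the last user releases it.
type refLock struct {
	mu   sync.Mutex
	refs int
}

var (
	locksMu sync.Mutex
	locks   = map[string]*refLock{}
)

// acquire returns the (lazily created) lock for a job and takes it.
func acquire(jobName string) *refLock {
	locksMu.Lock()
	l, ok := locks[jobName]
	if !ok {
		l = &refLock{}
		locks[jobName] = l
	}
	l.refs++
	locksMu.Unlock()
	l.mu.Lock()
	return l
}

// release unlocks the job lock and deletes the map entry when unused,
// keeping the map bounded by the number of in-flight restarts.
func release(jobName string, l *refLock) {
	l.mu.Unlock()
	locksMu.Lock()
	l.refs--
	if l.refs == 0 {
		delete(locks, jobName) // safe: nobody else holds a reference
	}
	locksMu.Unlock()
}
```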
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (9)
- pkg/tnf/after-setup/runner.go (4 hunks)
- pkg/tnf/operator/nodehandler.go (1 hunks)
- pkg/tnf/pkg/jobs/tnf.go (1 hunks)
- pkg/tnf/pkg/jobs/utils.go (1 hunks)
- pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
- pkg/tnf/pkg/pcs/fencing.go (4 hunks)
- pkg/tnf/pkg/pcs/fencing_test.go (6 hunks)
- pkg/tnf/pkg/tools/jobs.go (3 hunks)
- pkg/tnf/update-setup/runner.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pkg/jobs/utils_test.go
- pkg/tnf/pkg/tools/jobs.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/jobs/utils.go
- pkg/tnf/pkg/pcs/fencing_test.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/after-setup/runner.go
- pkg/tnf/pkg/pcs/fencing.go
🧬 Code graph analysis (2)
- pkg/tnf/operator/nodehandler.go (8)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19)
  - pkg/tnf/pkg/jobs/tnf.go: RunTNFJobController (37-86), RestartJobOrRunController (88-144)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeAuth (27-27), JobTypeAfterSetup (29-29), JobTypeSetup (28-28), JobTypeFencing (30-30), AuthJobCompletedTimeout (15-15), JobTypeUpdateSetup (31-31), SetupJobCompletedTimeout (16-16), AfterSetupJobCompletedTimeout (17-17), AllCompletedTimeout (18-18)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30), AllConditions (31-31)
  - pkg/operator/ceohelpers/external_etcd_status.go: IsEtcdRunningInCluster (62-80)
  - pkg/operator/bootstrapteardown/waitforceo.go: WaitForEtcdBootstrap (17-32)
  - pkg/tnf/pkg/jobs/utils.go: WaitForCompletion (23-26)
  - pkg/operator/operatorclient/interfaces.go: TargetNamespace (7-7)
- pkg/tnf/update-setup/runner.go (4)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfig (23-25)
  - pkg/tnf/pkg/etcd/etcd.go: WaitForUpdatedRevision (69-98)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
🔇 Additional comments (26)
pkg/tnf/pkg/jobs/tnf.go (1)
77-85: Good fix: controller cleanup on exit. The defer block now properly removes the controller key from `runningControllers` when the goroutine exits, addressing the previous concern about controllers being marked running but never cleared.

pkg/tnf/after-setup/runner.go (2)
50-70: LGTM: Clean refactoring to use centralized job utilities. The setup job polling logic correctly uses `jobs.IsConditionTrue` for condition checking, and the variable renaming to `setupJobs` improves clarity.
76-81: LGTM: Good abstraction for kubelet management. Using `kubelet.Disable(ctx)` instead of direct shell execution improves testability and maintainability.

pkg/tnf/pkg/jobs/utils.go (2)
28-47: LGTM: Well-designed polling with resilience to transient failures. The design choice to ignore errors (including NotFound) during polling is appropriate for handling job deletion/recreation cycles. The clear comments explain the rationale.
49-84: LGTM: Robust deletion handling with UID tracking. The `DeleteAndWait` function correctly handles the case where a job might be recreated during deletion by comparing UIDs. This prevents races between controllers.
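The UID-tracking idea condensed into a sketch; the real `DeleteAndWait` in `pkg/tnf/pkg/jobs/utils.go` is authoritative, this only outlines the capture-delete-poll shape:

```go
package jobs

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWaitForUIDChange remembers the UID before deleting, then treats
// either NotFound or a different UID as "the old job is gone", which tolerates
// a controller recreating the job while we wait.
func deleteAndWaitForUIDChange(ctx context.Context, client kubernetes.Interface, name, ns string) error {
	oldJob, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // nothing to delete
	} else if err != nil {
		return err
	}
	oldUID := oldJob.GetUID()

	if err := client.BatchV1().Jobs(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	return wait.PollUntilContextTimeout(ctx, 5*time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			j, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil
			}
			if err != nil {
				return false, nil // retry transient errors
			}
			return j.GetUID() != oldUID, nil
		})
}
```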
pkg/tnf/pkg/pcs/fencing_test.go (1)
254-378: LGTM: Comprehensive test coverage for fencing command generation. The test cases thoroughly cover the updated stonith command generation, including:
- `pcmk_delay_base` with empty and explicit values
- `ssl_insecure` with "0" and "1" values
- IPv6 address handling
- Both create and update scenarios
- Increased wait timeout (120s)
pkg/tnf/pkg/pcs/fencing.go (2)
236-248: LGTM: Format string arguments are now correctly aligned. The `getStonithCommand` function now correctly includes `fc.FencingDeviceOptions[PcmkDelayBase]` as the argument for the `pcmk_delay_base=%q` format specifier. The `ssl_insecure` handling also ensures a value is always set ("1" or "0"), and the wait timeout is increased to 120s.
64-93: LGTM: Dynamic node handling with proper delay assignment. The refactoring to accept `[]string` nodeNames and skip empty entries provides flexibility for node replacement scenarios. The staggered `pcmk_delay_base` assignment (10s for the first device, 1s for others) helps prevent simultaneous fencing races.

pkg/tnf/update-setup/runner.go (9)
1-23: LGTM! Imports are well-organized and include all necessary dependencies for the update-setup workflow.
25-54: LGTM! Client initialization, signal handling, and informer synchronization follow established patterns in the codebase.
55-69: LGTM! Environment validation and cluster status check provide appropriate guards. The early exit when the cluster is not running is correct behavior for this workflow.
71-89: LGTM! Configuration loading and node role determination logic is clear and includes a helpful error message when the current node is not found in the cluster config.
91-103: LGTM! Offline node detection follows the established pattern for pcs interaction. The early exit when no offline node is found is appropriate.
107-124: LGTM! The wait for the etcd revision update before cluster changes is a good safeguard. The node removal/addition sequence is appropriate for the replacement scenario.
126-144: LGTM! Fencing configuration is correctly sequenced before etcd resource updates. The 300-second wait timeout on the etcd resource update is reasonable for cluster stabilization.
162-176: LGTM! The stabilization delay is pragmatic given the documented timing issue with the etcd start. The final cluster enable/start sequence completes the update workflow correctly.
180-190: LGTM! Clean helper function with appropriate logging and early return on error.
pkg/tnf/operator/nodehandler.go (9)
1-28: LGTM! Imports are comprehensive and well-organized for the node handling orchestration logic.
30-50: LGTM! Package-level mutex and function hooks for testability are well-designed. The backoff configuration provides reasonable retry behavior with approximately 10 minutes total retry time.
52-102: LGTM! Retry mechanism with exponential backoff is well-implemented. The mutex ensures single execution, and operator status updates provide proper observability for both success and failure cases.
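As a reference for the pattern being praised, a minimal sketch of mutex-serialized handling with exponential backoff. The backoff values mirror those quoted elsewhere in this review (5s initial, 2x factor, 2min cap); the function names and exact shapes in `nodehandler.go` may differ:

```go
package operator

import (
	"context"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// handleNodesMutex serializes concurrent informer-driven invocations.
var handleNodesMutex sync.Mutex

// nodeBackoff approximates ~10 minutes of total retry time.
var nodeBackoff = wait.Backoff{
	Duration: 5 * time.Second, // initial delay
	Factor:   2,               // exponential growth
	Cap:      2 * time.Minute, // per-attempt delay cap
	Steps:    10,              // number of attempts
}

// handleNodesWithRetrySketch runs the supplied handler under the mutex and
// retries any error with the backoff above.
func handleNodesWithRetrySketch(ctx context.Context, handle func(ctx context.Context) error) error {
	handleNodesMutex.Lock()
	defer handleNodesMutex.Unlock()

	return retry.OnError(nodeBackoff, func(error) bool { return true }, func() error {
		return handle(ctx)
	})
}
```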
168-204: LGTM! Job controller startup sequence is well-structured. The informer sync before the bootstrap check is correct, and waiting for AfterSetup completion prevents race conditions with subsequent update operations.
206-223: LGTM! Bootstrap wait logic correctly handles both in-cluster and bootstrap scenarios. Creating a new client config for `WaitForEtcdBootstrap` aligns with its API requirements.
225-287: LGTM! Update setup workflow is well-structured with proper phasing: Auth → UpdateSetup → AfterSetup. The parallel start followed by sequential wait pattern is efficient, and error messages include node context for debugging.
289-300: LGTM! Clean implementation using a label selector to detect prior TNF setup.
302-311: LGTM! Completion wait helper is clean with appropriate logging and error context.
119-138: The event-driven architecture already re-triggers `handleNodesWithRetry` on node changes. Returning `nil` for transient states (<2 nodes, >2 nodes, nodes not ready) is correct. The recovery mechanism is not exponential backoff; it is event-driven re-invocation via the node informer handlers (AddFunc when a node is added, UpdateFunc when a node transitions to ready, DeleteFunc when a node is deleted). Each handler spawns a goroutine to re-invoke `handleNodesWithRetry`, ensuring timely retry when conditions change.
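A condensed sketch of that event-driven re-triggering (handler wiring only; `isReady` and `handleNodes` stand in for the real helpers in `starter.go`):

```go
package operator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerNodeHandlersSketch wires node add/ready-transition/delete events to
// re-run the (mutex-serialized) node handling in a separate goroutine so the
// informer is never blocked.
func registerNodeHandlersSketch(ctx context.Context, nodeInformer cache.SharedIndexInformer,
	isReady func(*corev1.Node) bool, handleNodes func(ctx context.Context)) {

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok && isReady(node) {
				go handleNodes(ctx)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, ok1 := oldObj.(*corev1.Node)
			newNode, ok2 := newObj.(*corev1.Node)
			// only react to the not-ready -> ready transition
			if ok1 && ok2 && !isReady(oldNode) && isReady(newNode) {
				go handleNodes(ctx)
			}
		},
		DeleteFunc: func(obj interface{}) {
			go handleNodes(ctx)
		},
	})
}
```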
@slintes: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
	}

	// wait a bit for things to settle
	// without this the etcd start on the new node fails for some reason...
```
This is possibly a timing issue with the etcd revision not being present on the node yet. It may be fixed by the auto-retries we're adding to podman-etcd in the latest revision.
```go
	if !bothReady {
		return nil
	}
```
I'm having trouble understanding the logic here. We log that we're waiting for the node to be ready, but by returning nil, we're preventing handleNodeWithRetry from retrying. Is this the intended behavior?
There is no need to retry here; a condition change will retrigger the complete node handling, see https://github.com/openshift/cluster-etcd-operator/pull/1523/changes#diff-976f22fb9d9391708f396ddc48bc75f432af312729e8ec93307185f4029e9cdbR88-R105
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: clobrano, slintes. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Summary
Refactors and extends TNF (Two Node Fencing) node handling to improve reliability during node lifecycle events, particularly node replacements and failures. Extracts logic into dedicated modules with comprehensive retry mechanisms and proper error handling.
Key Changes
Architecture & Reliability:
Node Replacement Support:
Job Management: