OCPEDGE-1880: TNF: handle node replacements #1523
base: main
Conversation
Commits (all Signed-off-by: Marc Sluiter <[email protected]>):
- For better overview, no actual changes yet
- …
- …
- …
- …
- … by Carlo
@slintes: This pull request references OCPEDGE-1880, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Refactors Two Node Fencing (TNF): adds job/controller orchestration with concurrency controls, node-handling and event-driven starter logic, new job utilities and tests, updates PCS/fencing and kubelet interactions, and introduces an update-setup runner and related CLI subcommand.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes. Areas needing extra attention:
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/after-setup/runner.go (1)
1-1: Package declaration doesn't match directory name.
The file at `pkg/tnf/after-setup/runner.go` declares `package auth`, but it should declare `package aftersetup` (or similar) to match its directory. Currently, there is also a separate `pkg/tnf/auth/runner.go` that declares `package auth`. Having both files declare the same package name in different directories creates confusion and violates Go naming conventions. Rename the package in after-setup/runner.go to match its directory.
🧹 Nitpick comments (13)
pkg/tnf/pkg/etcd/etcd.go (1)
86-91: Use `%d` for integer format verbs.
`CurrentRevision` and `LatestAvailableRevision` are `int32` values, but `%q` (quoted string) is used in the log messages. This will still work but produces unnecessarily quoted output.

```diff
 if nodeStatus.CurrentRevision == status.LatestAvailableRevision {
-	klog.Infof("node %q is running the latest etcd revision %q", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is running the latest etcd revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
 } else {
-	klog.Infof("node %q is not running the latest etcd revision yet, expected %q, got %q", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
+	klog.Infof("node %q is not running the latest etcd revision yet, expected %d, got %d", nodeStatus.NodeName, status.LatestAvailableRevision, nodeStatus.CurrentRevision)
 	allUpdated = false
 }
```

pkg/tnf/pkg/jobs/utils_test.go (2)
317-350: Reactor ordering may cause the test to succeed for unintended reasons.
The `deleted` flag is set on the first `Get` call, but `DeleteAndWait` calls `Get` before `Delete` to retrieve the job UID. This means `deleted` becomes true before deletion actually occurs. The test passes because the fake client's default `Get` returns the job on the first call (before the reactor sets `deleted = true`), but this is fragile.
Consider tracking both the initial Get and post-delete Get separately:

```diff
-	// Make Get return NotFound after deletion
-	deleted := false
-	client.PrependReactor("get", "jobs", func(action k8stesting.Action) (handled bool, ret runtime.Object, err error) {
-		if deleted {
-			return true, nil, apierrors.NewNotFound(batchv1.Resource("jobs"), "test-job")
-		}
-		// First get after deletion initiates the wait loop
-		deleted = true
-		return false, nil, nil
-	})
+	// Track Get calls - DeleteAndWait does: Get (for UID) -> Delete -> poll Get
+	getCallCount := 0
+	client.PrependReactor("get", "jobs", func(action k8stesting.Action) (handled bool, ret runtime.Object, err error) {
+		getCallCount++
+		if getCallCount > 2 {
+			// After initial Get and one poll iteration, return NotFound
+			return true, nil, apierrors.NewNotFound(batchv1.Resource("jobs"), "test-job")
+		}
+		return false, nil, nil
+	})
```
385-407: Test will wait for the full 1-minute timeout.
The "Job deletion times out" test will block for up to 1 minute (the hardcoded timeout in `DeleteAndWait`) since the reactor always returns the job. This may slow down CI runs significantly.
Consider whether this timeout scenario is worth the wait time, or if the implementation could accept a timeout parameter for testability.
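If the faster-test option is preferred, one low-impact shape is to hoist the hardcoded timeout into a package-level value that tests can shorten. A minimal sketch under that assumption (the variable name and helper shape are illustrative, not the PR's actual API):

```go
package jobs

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteWaitTimeout is package-level so a test can temporarily shorten it.
var deleteWaitTimeout = 1 * time.Minute

// deleteAndWaitSketch deletes a job and polls until it is gone, honoring the
// configurable timeout above instead of a hardcoded minute.
func deleteAndWaitSketch(ctx context.Context, client kubernetes.Interface, name, ns string) error {
	if err := client.BatchV1().Jobs(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollUntilContextTimeout(ctx, time.Second, deleteWaitTimeout, true,
		func(ctx context.Context) (bool, error) {
			_, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
			return apierrors.IsNotFound(err), nil
		})
}
```

A timeout-path test could then set `deleteWaitTimeout = 200 * time.Millisecond` (restoring it afterwards) instead of blocking CI for the full minute.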
pkg/tnf/pkg/pcs/auth.go (1)
28-34: Consider using file I/O instead of a shell command for token file creation.
Using `echo` via shell to write the token file works, but writing directly with Go's `os.WriteFile` would be more robust and eliminate any shell escaping concerns, even though `%q` should handle the ClusterID safely.

```go
import "os"

// Direct file write instead of shell command
if err := os.WriteFile(TokenPath, []byte(tokenValue), 0600); err != nil {
	return false, fmt.Errorf("failed to create token file: %w", err)
}
```

pkg/tnf/update-setup/runner.go (3)
63-69: Consider propagating cluster status check error details.
The cluster status check returns `nil` when the cluster isn't running on this node, which is correct for the workflow. However, the error from `exec.Execute` is logged but silently ignored. If the error indicates something other than "cluster not running" (e.g., command not found, permission denied), this could mask underlying issues.
162-164: Hardcoded sleep for stabilization is a code smell.
The 10-second `time.Sleep` with the comment "without this the etcd start on the new node fails for some reason..." suggests a race condition that isn't fully understood. Consider replacing it with a proper condition check or at least making it configurable via a constant.

```diff
+const stabilizationDelay = 10 * time.Second
+
 // wait a bit for things to settle
-// without this the etcd start on the new node fails for some reason...
-time.Sleep(10 * time.Second)
+time.Sleep(stabilizationDelay)
```
180-190: Helper function signature has unconventional parameter order.
The `ctx` parameter is conventionally placed first in Go function signatures. This aids readability and follows the pattern used elsewhere in this codebase (e.g., `exec.Execute(ctx, command)`).

```diff
-func runCommands(commands []string, ctx context.Context) error {
+func runCommands(ctx context.Context, commands []string) error {
```

Then update the call sites at lines 121, 141, and 172 accordingly.
pkg/tnf/pkg/jobs/tnf_test.go (1)
135-137: Time-based synchronization in tests can cause flakiness.
The 100ms sleep to "give the goroutine a moment to start" is fragile. Consider using a synchronization primitive or polling with a short timeout instead.
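For example, instead of sleeping, the test could poll until the observable condition holds. A hedged sketch that would slot into the existing test; `controllerIsRunning` is a hypothetical helper for probing the shared controller state:

```go
// Requires "k8s.io/apimachinery/pkg/util/wait".
// Poll for the condition the sleep was papering over, with a short interval
// and a bounded timeout, so the test is fast when the goroutine starts quickly
// and still fails loudly if it never does.
err := wait.PollUntilContextTimeout(ctx, 10*time.Millisecond, 2*time.Second, true,
	func(ctx context.Context) (bool, error) {
		// controllerIsRunning is a hypothetical accessor over the
		// runningControllers map; the real test may need a different probe.
		return controllerIsRunning(jobName), nil
	})
if err != nil {
	t.Fatalf("controller did not start in time: %v", err)
}
```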
pkg/tnf/operator/nodehandler.go (2)
119-128: Clarify retry behavior for edge cases.
Returning `nil` for unsupported node counts (>2 or <2) exits without retry, which is intentional per the comments. However, the >2 case logs at Error level but doesn't propagate an error, which could mask configuration issues.
Consider returning an error for the >2 nodes case to surface it as a degraded condition, or downgrade the log level to match the non-error return:

```diff
 if len(nodeList) > 2 {
-	klog.Errorf("found more than 2 control plane nodes (%d), unsupported use case, no further steps are taken for now", len(nodeList))
+	klog.Warningf("found more than 2 control plane nodes (%d), unsupported use case, no further steps are taken for now", len(nodeList))
 	// don't retry
 	return nil
 }
```
302-311: Hard-coded timeout differs from configurable timeouts used elsewhere.
`waitForTnfAfterSetupJobsCompletion` uses a hard-coded 20-minute timeout, while other similar waits use constants from the `tools` package (e.g., `tools.AfterSetupJobCompletedTimeout`).
For consistency and maintainability:

```diff
 func waitForTnfAfterSetupJobsCompletion(ctx context.Context, kubeClient kubernetes.Interface, nodeList []*corev1.Node) error {
 	for _, node := range nodeList {
 		jobName := tools.JobTypeAfterSetup.GetJobName(&node.Name)
 		klog.Infof("Waiting for after-setup job %s to complete", jobName)
-		if err := jobs.WaitForCompletion(ctx, kubeClient, jobName, operatorclient.TargetNamespace, 20*time.Minute); err != nil {
+		if err := jobs.WaitForCompletion(ctx, kubeClient, jobName, operatorclient.TargetNamespace, tools.AfterSetupJobCompletedTimeout); err != nil {
 			return fmt.Errorf("failed to wait for after-setup job %s to complete: %w", jobName, err)
 		}
 	}
 	return nil
 }
```

pkg/tnf/pkg/jobs/tnf.go (1)
66-75: Direct index access to container and command arrays could panic.
Lines 72-73 assume `Containers[0]` exists and `Command` has at least 2 elements. If the job template is malformed or modified, this will panic.
Add defensive checks:

```diff
 func(_ *operatorv1.OperatorSpec, job *batchv1.Job) error {
 	if nodeName != nil {
 		job.Spec.Template.Spec.NodeName = *nodeName
 	}
 	job.SetName(jobType.GetJobName(nodeName))
 	job.Labels["app.kubernetes.io/name"] = jobType.GetNameLabelValue()
+	if len(job.Spec.Template.Spec.Containers) == 0 {
+		return fmt.Errorf("job template has no containers")
+	}
+	if len(job.Spec.Template.Spec.Containers[0].Command) < 2 {
+		return fmt.Errorf("job template container command has fewer than 2 elements")
+	}
 	job.Spec.Template.Spec.Containers[0].Image = os.Getenv("OPERATOR_IMAGE")
 	job.Spec.Template.Spec.Containers[0].Command[1] = jobType.GetSubCommand()
 	return nil
 }},...,
```

pkg/tnf/pkg/jobs/utils.go (2)
28-50: Simplify redundant timeout handling.
The timeout is applied twice: once via `context.WithTimeout` and again in `PollUntilContextTimeout`. The inner context is sufficient.
Consider simplifying:

```diff
 func waitWithConditionFunc(ctx context.Context, kubeClient kubernetes.Interface, jobName string, jobNamespace string, timeout time.Duration, conditionFunc func(job batchv1.Job) bool) error {
-	timeoutCtx, cancel := context.WithTimeout(ctx, timeout)
-	defer cancel()
-
-	return wait.PollUntilContextTimeout(timeoutCtx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
+	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
```

Alternatively, if the double-timeout is intentional for edge cases, add a comment explaining why.
75-79: Potential silent failure during deletion polling.
Line 78 ignores non-NotFound errors from the Get call. If there's a persistent API error (e.g., a network issue), the poll will incorrectly succeed when `err != nil && !IsNotFound(err)`.
Handle non-NotFound errors explicitly:

```diff
 return wait.PollUntilContextTimeout(ctx, 5*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
 	newJob, err := kubeClient.BatchV1().Jobs(jobNamespace).Get(ctx, jobName, v1.GetOptions{})
-	// job might be recreated already, check UID
-	return apierrors.IsNotFound(err) || newJob.GetUID() != oldJobUID, nil
+	if apierrors.IsNotFound(err) {
+		return true, nil
+	}
+	if err != nil {
+		klog.Warningf("error checking job %s deletion status: %v", jobName, err)
+		return false, nil // retry on transient errors
+	}
+	// job might be recreated already, check UID
+	return newJob.GetUID() != oldJobUID, nil
 })
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (28)
- bindata/tnfdeployment/job.yaml (2 hunks)
- cmd/tnf-setup-runner/main.go (3 hunks)
- pkg/tnf/after-setup/runner.go (3 hunks)
- pkg/tnf/auth/runner.go (2 hunks)
- pkg/tnf/fencing/runner.go (3 hunks)
- pkg/tnf/operator/nodehandler.go (1 hunks)
- pkg/tnf/operator/nodehandler_test.go (1 hunks)
- pkg/tnf/operator/starter.go (3 hunks)
- pkg/tnf/operator/starter_test.go (6 hunks)
- pkg/tnf/pkg/config/cluster.go (4 hunks)
- pkg/tnf/pkg/etcd/etcd.go (2 hunks)
- pkg/tnf/pkg/jobs/jobcontroller.go (3 hunks)
- pkg/tnf/pkg/jobs/tnf.go (1 hunks)
- pkg/tnf/pkg/jobs/tnf_test.go (1 hunks)
- pkg/tnf/pkg/jobs/utils.go (1 hunks)
- pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
- pkg/tnf/pkg/kubelet/kubelet.go (1 hunks)
- pkg/tnf/pkg/pcs/auth.go (1 hunks)
- pkg/tnf/pkg/pcs/cluster.go (1 hunks)
- pkg/tnf/pkg/pcs/etcd.go (1 hunks)
- pkg/tnf/pkg/pcs/fencing.go (4 hunks)
- pkg/tnf/pkg/pcs/fencing_test.go (5 hunks)
- pkg/tnf/pkg/tools/conditions.go (0 hunks)
- pkg/tnf/pkg/tools/jobs.go (3 hunks)
- pkg/tnf/pkg/tools/nodes.go (1 hunks)
- pkg/tnf/pkg/tools/nodes_test.go (1 hunks)
- pkg/tnf/setup/runner.go (3 hunks)
- pkg/tnf/update-setup/runner.go (1 hunks)
💤 Files with no reviewable changes (1)
- pkg/tnf/pkg/tools/conditions.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/tnf/pkg/pcs/etcd.go
- pkg/tnf/setup/runner.go
- bindata/tnfdeployment/job.yaml
- pkg/tnf/pkg/etcd/etcd.go
- pkg/tnf/pkg/tools/nodes.go
- pkg/tnf/pkg/pcs/cluster.go
- pkg/tnf/auth/runner.go
- pkg/tnf/pkg/pcs/fencing.go
- cmd/tnf-setup-runner/main.go
- pkg/tnf/pkg/jobs/utils_test.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/pkg/kubelet/kubelet.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/pkg/jobs/utils.go
- pkg/tnf/pkg/tools/jobs.go
- pkg/tnf/after-setup/runner.go
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/pkg/pcs/fencing_test.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/jobs/jobcontroller.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/pkg/config/cluster.go
- pkg/tnf/pkg/pcs/auth.go
- pkg/tnf/pkg/tools/nodes_test.go
- pkg/tnf/operator/starter.go
- pkg/tnf/fencing/runner.go
🧬 Code graph analysis (13)
- pkg/tnf/setup/runner.go (3)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeAuth (26-26)
  - pkg/tnf/pkg/jobs/utils.go: IsConditionTrue (94-96)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
- pkg/tnf/pkg/etcd/etcd.go (2)
  - pkg/etcdcli/interfaces.go: Status (45-47)
  - pkg/operator/ceohelpers/common.go: CurrentRevision (220-238)
- pkg/tnf/auth/runner.go (1)
  - pkg/tnf/pkg/pcs/auth.go: Authenticate (20-56)
- cmd/tnf-setup-runner/main.go (2)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeUpdateSetup (30-30)
  - pkg/tnf/update-setup/runner.go: RunTnfUpdateSetup (25-178)
- pkg/tnf/pkg/jobs/utils_test.go (1)
  - pkg/tnf/pkg/jobs/utils.go: IsComplete (86-88), IsFailed (90-92), WaitForCompletion (23-26), DeleteAndWait (53-80)
- pkg/tnf/update-setup/runner.go (4)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfig (23-25)
  - pkg/tnf/pkg/etcd/etcd.go: WaitForUpdatedRevision (69-98)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
- pkg/tnf/pkg/kubelet/kubelet.go (1)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
- pkg/tnf/pkg/jobs/tnf_test.go (4)
  - pkg/testutils/testutils.go: StaticPodOperatorStatus (211-222)
  - pkg/operator/operatorclient/interfaces.go: TargetNamespace (7-7)
  - pkg/tnf/pkg/jobs/tnf.go: RunTNFJobController (37-78), RestartJobOrRunController (80-136)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30)
- pkg/tnf/pkg/jobs/jobcontroller.go (1)
  - pkg/tnf/pkg/jobs/utils.go: IsComplete (86-88), IsFailed (90-92)
- pkg/tnf/pkg/config/cluster.go (1)
  - pkg/tnf/pkg/tools/nodes.go: GetNodeIPForPacemaker (24-43)
- pkg/tnf/pkg/tools/nodes_test.go (1)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19), GetNodeIPForPacemaker (24-43)
- pkg/tnf/operator/starter.go (5)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19)
  - pkg/tnf/pkg/tools/secrets.go: IsFencingSecret (32-34)
  - pkg/tnf/pkg/jobs/tnf.go: RestartJobOrRunController (80-136)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeFencing (29-29)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30)
- pkg/tnf/fencing/runner.go (4)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeSetup (27-27)
  - pkg/tnf/pkg/jobs/utils.go: IsConditionTrue (94-96)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfigIgnoreMissingNode (28-30)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
🔇 Additional comments (52)
pkg/tnf/pkg/pcs/etcd.go (1)
25-26: LGTM! Adding `migration-threshold=5` is a sensible configuration to allow transient failures before Pacemaker migrates the etcd resource. This aligns with the similar change for the kubelet resource in `cluster.go`.

bindata/tnfdeployment/job.yaml (2)
6-6: LGTM! The component label improves resource filtering and aligns with Kubernetes labeling conventions.
28-32: LGTM! Correct use of the Downward API to inject the node name. This enables node-aware operations within TNF setup jobs.
pkg/tnf/pkg/pcs/cluster.go (2)
35-35: LGTM! Consistent with the `migration-threshold=5` added to the etcd resource.
39-39: LGTM! Setting `start-failure-is-fatal=false` improves cluster resilience by allowing Pacemaker to retry resource starts after transient failures.

pkg/tnf/pkg/tools/nodes.go (2)
12-19: LGTM! Clean implementation that correctly handles all cases including a missing NodeReady condition.
24-43: LGTM with a minor observation. The implementation correctly prioritizes internal IPs and validates them with `net.ParseIP`. The fallback to the first address ensures resilience when no valid internal IP exists. Note: if an invalid internal IP is present (e.g., a malformed string), it's silently skipped and may fall back to the first address (which could be a hostname or external IP). The tests cover this case, so this appears intentional for robustness.
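For readers following along, a condensed sketch of the selection order described here; this illustrates the documented behavior, not the exact code from the PR:

```go
package tools

import (
	"fmt"
	"net"

	corev1 "k8s.io/api/core/v1"
)

// getPacemakerIP prefers a parseable InternalIP and otherwise falls back to
// the first listed address, mirroring the behavior described in the review.
func getPacemakerIP(node *corev1.Node) (string, error) {
	if len(node.Status.Addresses) == 0 {
		return "", fmt.Errorf("node %s has no addresses", node.Name)
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP && net.ParseIP(addr.Address) != nil {
			return addr.Address, nil
		}
	}
	// No valid internal IP found: fall back to the first address.
	return node.Status.Addresses[0].Address, nil
}
```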
pkg/tnf/pkg/tools/nodes_test.go (2)
11-135: LGTM! Comprehensive table-driven tests covering all meaningful scenarios for `IsNodeReady`.
137-354: Excellent test coverage. The tests thoroughly cover IPv4/IPv6, fallback behavior, error cases, and the IPv4-mapped IPv6 normalization edge case.
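The table-driven style praised here looks roughly like the sketch below, assuming an `IsNodeReady(*corev1.Node) bool` helper; the real tests cover far more cases:

```go
package tools

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

func TestIsNodeReadySketch(t *testing.T) {
	tests := []struct {
		name string
		node *corev1.Node
		want bool
	}{
		{
			name: "ready condition true",
			node: &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionTrue},
			}}},
			want: true,
		},
		{
			name: "no ready condition",
			node: &corev1.Node{},
			want: false,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			if got := IsNodeReady(tt.node); got != tt.want {
				t.Errorf("IsNodeReady() = %v, want %v", got, tt.want)
			}
		})
	}
}
```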
pkg/tnf/pkg/etcd/etcd.go (2)
49-66: LGTM! Clean refactoring that modularizes the revision wait logic. The status update happens only after successful revision synchronization, which is correct.
68-98: LGTM! The new exported `WaitForUpdatedRevision` function enables reuse across the TNF orchestration flow. The polling parameters (10s interval, 10min timeout, immediate=false) are reasonable for allowing CEO time to create revisions.
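For context, the polling shape described here corresponds to something like the sketch below; the real `WaitForUpdatedRevision` signature and condition logic live in `pkg/tnf/pkg/etcd/etcd.go` and may differ:

```go
package etcd

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForUpdatedRevisionSketch polls a caller-supplied check every 10s for up
// to 10 minutes, without an immediate first tick (immediate=false), giving CEO
// time to create and roll out a new revision before the first evaluation.
func waitForUpdatedRevisionSketch(ctx context.Context, allNodesUpdated func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, false, allNodesUpdated)
}
```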
pkg/tnf/pkg/config/cluster.go (1)
12-20: Clarify the single-node contract when using `ignoreMissingNode`. The new `getClusterConfig` allows a single master node when `ignoreMissingNode` is true, leaving `NodeName2`/`NodeIP2` empty. That's reasonable, but it makes the API contract asymmetric with `GetClusterConfig`. Please ensure all callers of `GetClusterConfigIgnoreMissingNode` explicitly handle the "only Node1 populated" case (especially where both node names/IPs are used to build pcs commands) and consider documenting this behavior in the function comment for future maintainers. Also applies to: 22-32, 44-49, 55-71.
pkg/tnf/pkg/kubelet/kubelet.go (1)
11-15: Good centralization of kubelet disabling logic. Wrapping the `systemctl disable kubelet` call in `kubelet.Disable` improves reuse and testability while relying on the existing `exec.Execute` plumbing and logging. Looks solid.

pkg/tnf/pkg/jobs/jobcontroller.go (1)
246-267: Consistent job predicates via shared helpers. Switching to `IsComplete`/`IsFailed` centralizes job state evaluation and keeps the Available/Progressing/degraded logic consistent with other TNF job flows. The usage here looks correct. Also applies to: 271-294, 305-310.
pkg/tnf/setup/runner.go (1)
22-25: Stricter auth job completion check and updated fencing invocation look sound. Requiring exactly two auth jobs and using `jobs.IsConditionTrue(..., JobComplete)` should make the setup phase more robust against partial or stale jobs, with the existing timeout guarding against hangs. Passing `[]string{cfg.NodeName1, cfg.NodeName2}` into `ConfigureFencing` matches the new API and keeps node ordering explicit for `pcmk_delay_base` handling. No issues from a correctness/maintainability perspective. Also applies to: 57-81, 102-104.
cmd/tnf-setup-runner/main.go (1)
23-24: Update-setup command is wired consistently with existing TNF subcommands. The new `NewUpdateSetupCommand` mirrors the existing auth/setup/after-setup/fencing commands (same error handling and naming via `JobTypeUpdateSetup.GetSubCommand()`), so the CLI surface stays consistent. Looks good. Also applies to: 58-63, 119-130.
pkg/tnf/pkg/tools/jobs.go (1)
14-19: Job type and timeout extensions are safe and consistent. Adding `JobTypeUpdateSetup` at the end of the enum preserves existing values, and wiring it to "update-setup" via `GetSubCommand` matches the established pattern. The new `AfterSetupJobCompletedTimeout` follows the existing timeout scheme. No concerns. Also applies to: 25-31, 33-45.
pkg/tnf/auth/runner.go (1)
13-14: PCS auth centralization and improved error logging look correct. Delegating to `pcs.Authenticate` and logging failures for both cluster config retrieval and authentication simplifies this runner and makes failures more observable. Continuing to use `GetClusterConfig` (not the ignore-missing variant) is consistent with pcs authentication requiring both nodes. Ignoring the boolean return from `Authenticate` is fine as long as it remains a simple success flag, which it currently is. Also applies to: 50-61.
pkg/tnf/pkg/pcs/fencing_test.go (1)
275-277: Updated fencing command expectations match the new option and timeout behavior. The revised `want` strings correctly assert inclusion of `pcmk_delay_base` and `ssl_insecure` (with appropriate defaults) and the increased `--wait=120` for the affected cases, which should guard against regressions in the updated stonith command construction. Also applies to: 292-293, 326-327, 344-345, 366-367.
pkg/tnf/pkg/jobs/utils_test.go (2)
18-84: LGTM! Well-structured table-driven tests for `IsComplete` with good coverage of edge cases, including empty conditions and the presence of only the opposite condition type.
86-152: LGTM! Symmetric test coverage for `IsFailed` matching the `IsComplete` tests. Good consistency.

pkg/tnf/pkg/pcs/auth.go (2)
19-56: LGTM on the overall authentication flow. The function properly retrieves the cluster ID, creates a token file with restricted permissions, and executes the PCS authentication sequence with appropriate error handling at each step.
49-53: The command string uses node names and IPs from the Kubernetes API, which enforces naming constraints preventing shell metacharacters. Node names and IP addresses originate from the Kubernetes API (via `node.Name` and `node.Status.Addresses`), not from user input. Kubernetes enforces DNS-1123 naming rules that restrict node names to alphanumeric characters, hyphens, and periods, and IP addresses are validated as proper IP format. These constraints prevent shell injection, making command injection infeasible in this context. Likely an incorrect or invalid review comment.
pkg/tnf/after-setup/runner.go (2)
51-68: LGTM on the setup job completion check. The logic correctly validates exactly one setup job exists and checks completion using the centralized `jobs.IsConditionTrue` utility. Good refactoring to use shared utilities.
76-81: Good refactoring to use the dedicated kubelet package. Switching from direct `exec` to `kubelet.Disable(ctx)` improves maintainability and encapsulates the kubelet management logic.

pkg/tnf/operator/nodehandler_test.go (3)
34-158: Comprehensive test coverage for node handling scenarios. The test table covers all critical paths: insufficient/excess nodes, node readiness combinations, successful flows with and without existing jobs, and error propagation. Well-structured test design.
232-265: Good use of function variable mocking with proper cleanup. The mock injection pattern using package-level function variables with deferred restoration ensures tests don't leak state. This is a clean approach for testing functions with external dependencies.
302-344: LGTM on helper functions. The helper functions are minimal and focused, creating test fixtures with just the required fields for the test scenarios.
pkg/tnf/pkg/pcs/fencing.go (3)
71-74: Good defensive check for empty node names. Skipping empty entries prevents errors when processing partial node lists during replacement scenarios.
85-91: Good differentiation of delay values between first and subsequent nodes. Applying different `pcmk_delay_base` values prevents fencing races. The comment explains the rationale clearly.
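As an illustration of the staggering idea: the first fencing device gets a longer `pcmk_delay_base` than the rest so that both nodes cannot fence each other at the same instant. The concrete values (10s/1s) come from this review's later comments; the helper names and option map shape below are assumptions, not the PR's code:

```go
// pcmkDelayBaseFor staggers fencing delays: a longer delay for the first
// device, a short one for the others, so simultaneous mutual fencing is avoided.
func pcmkDelayBaseFor(index int) string {
	if index == 0 {
		return "10s"
	}
	return "1s"
}

// buildFencingDeviceOptions skips empty node names (e.g. a not-yet-replaced
// node) and assigns the staggered delay per remaining device.
func buildFencingDeviceOptions(nodeNames []string) []map[string]string {
	options := make([]map[string]string, 0, len(nodeNames))
	for i, name := range nodeNames {
		if name == "" {
			continue // placeholder for a node that is not present (yet)
		}
		options = append(options, map[string]string{
			"node":            name,
			"pcmk_delay_base": pcmkDelayBaseFor(i),
		})
	}
	return options
}
```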
242-244: Good explicit handling of `ssl_insecure` in both cases. Always setting `ssl_insecure` explicitly (to "1" or "0") ensures a consistent configuration state rather than relying on defaults.

pkg/tnf/update-setup/runner.go (1)
25-53: Client initialization and signal handling look good. The setup correctly uses the in-cluster config, creates the necessary clients, and properly wires up SIGTERM/SIGINT handling with context cancellation. The dynamic informers are started and synced before use.
pkg/tnf/fencing/runner.go (2)
52-73: Setup job polling logic is correct and well-structured. The renamed `setupJobs` variable improves clarity. The use of `jobs.IsConditionTrue` aligns with the refactored utility location. The logging improvements provide better debugging context.
84-92: Good use of `GetClusterConfigIgnoreMissingNode` for node replacement scenarios. Using the ignore-missing-node variant correctly handles transient states during node replacements where a node might not yet be registered. The explicit node names slice passed to `ConfigureFencing` aligns with the updated function signature.

pkg/tnf/operator/starter.go (5)
70-87: Node add handler correctly filters by readiness before processing. The goroutine dispatch prevents blocking the informer, and the readiness check avoids processing nodes that aren't ready yet. This is a good pattern for event-driven handling.
88-105: Update handler correctly detects ready-state transitions. Only triggering on the `!oldReady && newReady` transition prevents redundant processing while catching the important case of a node becoming ready.
219-239: Secret data change detection logic is thorough. The byte-level comparison of secret data prevents unnecessary job restarts when non-data metadata changes. The early return when `!changed` is efficient.
246-250: RestartJobOrRunController consolidates job restart logic well. Using the centralized restart mechanism with proper locking and timeout handling improves reliability over the previous inline logic.
106-118: Concurrent node deletion calls are properly serialized via mutex. The `handleNodesWithRetry` function is protected by `handleNodesMutex`, ensuring only one execution runs at a time regardless of goroutine count. Rapid node churn will trigger multiple goroutines, but they serialize at the mutex. The retry logic uses exponential backoff (5s initial, 2x factor, 2min cap, ~10 minutes total), and errors are properly surfaced by degrading the operator status. The design is deliberate and safe.

pkg/tnf/pkg/jobs/tnf_test.go (3)
29-153: TestRunTNFJobController provides good coverage of controller lifecycle. The test cases cover:
- Controllers with/without node names
- Controller deduplication when already running
- Different job types running concurrently
- Different nodes for same job type
The global state reset between tests is correctly implemented.
155-291: TestRestartJobOrRunController covers key scenarios effectively. Test cases properly cover:
- Job non-existence (controller only)
- Job exists and completes (wait + delete + controller)
- API errors propagation
- Timeout scenarios
- Delete failures
The reactor-based mocking for delete/get behavior is well-structured.
374-485: Parallel execution test validates locking semantics. The test correctly verifies that concurrent calls result in only one delete operation, validating the job-level locking mechanism works as designed.
pkg/tnf/operator/starter_test.go (3)
35-35: Import addition for `corev1listers` is necessary for the signature changes. The import supports the new NodeLister parameter type used throughout the tests.
278-321: Mock function signature updates align with production code changes. The updated `setupMockHandleNodes` return type correctly reflects the new `handleNodesFunc` signature using `NodeLister` instead of a node slice. This allows the production code to lazily list nodes when needed rather than requiring an upfront snapshot.
328-352: Test scaffolding correctly creates NodeLister from informer. Creating the `controlPlaneNodeLister` from the informer's indexer mirrors production behavior. The updated call to `handleNodesWithRetry` with the new parameter order is correct.
30-50: Well-structured concurrency control and retry configuration.The mutex protection and testable function variables are good patterns for this type of orchestration code. The backoff configuration with ~10 minutes total retry time is reasonable for infrastructure operations.
52-102: Solid retry logic with proper operator status updates.The exponential backoff pattern with status condition updates on both success and failure paths is appropriate for operator-level error handling. Good use of
v1helpers.UpdateConditionFnfor status management.
168-204: Controller startup sequence looks correct.The sequence of waiting for informer sync, bootstrap completion, then starting job controllers is appropriate. Starting controllers in a loop per node for Auth/AfterSetup and then Setup/Fencing globally follows the expected TNF topology.
pkg/tnf/pkg/jobs/tnf.go (2)
25-35: Global state tracking is appropriate for this use case.The dual-map approach with separate mutexes for running controllers and restart locks provides proper isolation. This enables both duplicate prevention and serialized restarts.
80-136: RestartJobOrRunController implements correct serialization pattern.The per-job locking with lazy initialization properly serializes restarts for the same job while allowing parallel operations on different jobs. The wait-then-delete pattern enables controller-based recreation.
pkg/tnf/pkg/jobs/utils.go (2)
52-80: DeleteAndWait has good UID-based deletion verification. The pattern of capturing the old UID and checking for either NotFound or a different UID handles the recreation race condition correctly. The 1-minute hardcoded timeout is reasonable for deletion operations.
82-105: Clean and reusable condition checking utilities. The `IsStopped`, `IsComplete`, `IsFailed`, and condition helper functions provide a clear API for job status evaluation. Good separation of concerns.
Signed-off-by: Marc Sluiter <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/tnf/update-setup/runner.go (1)
146-160: Inconsistent error handling for member removal. Lines 149-150 log the error and continue, while line 157 returns the error. If the member list command fails, the code skips removal entirely (acceptable cleanup behavior), but if removal fails after finding a member, it returns the error. Consider adding a brief comment explaining this asymmetry, or making the behavior consistent.
🧹 Nitpick comments (1)
pkg/tnf/pkg/jobs/tnf.go (1)
31-35: Consider cleanup for `restartJobLocks` to prevent unbounded growth. The `restartJobLocks` map stores per-job mutexes that are never removed. For long-running operators with many job restarts, this could lead to unbounded memory growth. Consider implementing cleanup or using a bounded cache.
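One possible cleanup strategy, sketched under the assumption that a reference-counted entry can safely be dropped once no caller holds it; names are illustrative and a bounded cache would be an equally valid alternative:

```go
package jobs

import "sync"

// refLock pairs a mutex with a reference count so the entry can be removed
// from the map once the last user releases it.
type refLock struct {
	mu   sync.Mutex
	refs int
}

var (
	locksMu sync.Mutex
	locks   = map[string]*refLock{}
)

// acquire returns the (lazily created) lock for a job and takes it.
func acquire(jobName string) *refLock {
	locksMu.Lock()
	l, ok := locks[jobName]
	if !ok {
		l = &refLock{}
		locks[jobName] = l
	}
	l.refs++
	locksMu.Unlock()
	l.mu.Lock()
	return l
}

// release unlocks the job lock and deletes the map entry when unused,
// keeping the map bounded by the number of in-flight restarts.
func release(jobName string, l *refLock) {
	l.mu.Unlock()
	locksMu.Lock()
	l.refs--
	if l.refs == 0 {
		delete(locks, jobName) // safe: nobody else holds a reference
	}
	locksMu.Unlock()
}
```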
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (9)
- pkg/tnf/after-setup/runner.go (4 hunks)
- pkg/tnf/operator/nodehandler.go (1 hunks)
- pkg/tnf/pkg/jobs/tnf.go (1 hunks)
- pkg/tnf/pkg/jobs/utils.go (1 hunks)
- pkg/tnf/pkg/jobs/utils_test.go (1 hunks)
- pkg/tnf/pkg/pcs/fencing.go (4 hunks)
- pkg/tnf/pkg/pcs/fencing_test.go (6 hunks)
- pkg/tnf/pkg/tools/jobs.go (3 hunks)
- pkg/tnf/update-setup/runner.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/tnf/pkg/jobs/utils_test.go
- pkg/tnf/pkg/tools/jobs.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/jobs/utils.go
- pkg/tnf/pkg/pcs/fencing_test.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/after-setup/runner.go
- pkg/tnf/pkg/pcs/fencing.go
🧬 Code graph analysis (2)
- pkg/tnf/operator/nodehandler.go (8)
  - pkg/tnf/pkg/tools/nodes.go: IsNodeReady (12-19)
  - pkg/tnf/pkg/jobs/tnf.go: RunTNFJobController (37-86), RestartJobOrRunController (88-144)
  - pkg/tnf/pkg/tools/jobs.go: JobTypeAuth (27-27), JobTypeAfterSetup (29-29), JobTypeSetup (28-28), JobTypeFencing (30-30), AuthJobCompletedTimeout (15-15), JobTypeUpdateSetup (31-31), SetupJobCompletedTimeout (16-16), AfterSetupJobCompletedTimeout (17-17), AllCompletedTimeout (18-18)
  - pkg/tnf/pkg/jobs/jobcontroller.go: DefaultConditions (30-30), AllConditions (31-31)
  - pkg/operator/ceohelpers/external_etcd_status.go: IsEtcdRunningInCluster (62-80)
  - pkg/operator/bootstrapteardown/waitforceo.go: WaitForEtcdBootstrap (17-32)
  - pkg/tnf/pkg/jobs/utils.go: WaitForCompletion (23-26)
  - pkg/operator/operatorclient/interfaces.go: TargetNamespace (7-7)
- pkg/tnf/update-setup/runner.go (4)
  - pkg/tnf/pkg/exec/exec.go: Execute (14-47)
  - pkg/tnf/pkg/config/cluster.go: GetClusterConfig (23-25)
  - pkg/tnf/pkg/etcd/etcd.go: WaitForUpdatedRevision (69-98)
  - pkg/tnf/pkg/pcs/fencing.go: ConfigureFencing (64-144)
🔇 Additional comments (26)
pkg/tnf/pkg/jobs/tnf.go (1)
77-85: Good fix: controller cleanup on exit. The defer block now properly removes the controller key from `runningControllers` when the goroutine exits, addressing the previous concern about controllers being marked running but never cleared.

pkg/tnf/after-setup/runner.go (2)
50-70: LGTM: Clean refactoring to use centralized job utilities. The setup job polling logic correctly uses `jobs.IsConditionTrue` for condition checking, and the variable renaming to `setupJobs` improves clarity.
76-81: LGTM: Good abstraction for kubelet management. Using `kubelet.Disable(ctx)` instead of direct shell execution improves testability and maintainability.

pkg/tnf/pkg/jobs/utils.go (2)
28-47: LGTM: Well-designed polling with resilience to transient failures. The design choice to ignore errors (including NotFound) during polling is appropriate for handling job deletion/recreation cycles. The clear comments explain the rationale.
49-84: LGTM: Robust deletion handling with UID tracking. The `DeleteAndWait` function correctly handles the case where a job might be recreated during deletion by comparing UIDs. This prevents races between controllers.
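The UID-tracking idea condensed into a sketch; the real `DeleteAndWait` in `pkg/tnf/pkg/jobs/utils.go` is authoritative, this only outlines the capture-delete-poll shape:

```go
package jobs

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWaitForUIDChange remembers the UID before deleting, then treats
// either NotFound or a different UID as "the old job is gone", which tolerates
// a controller recreating the job while we wait.
func deleteAndWaitForUIDChange(ctx context.Context, client kubernetes.Interface, name, ns string) error {
	oldJob, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // nothing to delete
	} else if err != nil {
		return err
	}
	oldUID := oldJob.GetUID()

	if err := client.BatchV1().Jobs(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	return wait.PollUntilContextTimeout(ctx, 5*time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			j, err := client.BatchV1().Jobs(ns).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil
			}
			if err != nil {
				return false, nil // retry transient errors
			}
			return j.GetUID() != oldUID, nil
		})
}
```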
pkg/tnf/pkg/pcs/fencing_test.go (1)
254-378: LGTM: Comprehensive test coverage for fencing command generation. The test cases thoroughly cover the updated stonith command generation, including:
- `pcmk_delay_base` with empty and explicit values
- `ssl_insecure` with "0" and "1" values
- IPv6 address handling
- Both create and update scenarios
- Increased wait timeout (120s)
pkg/tnf/pkg/pcs/fencing.go (2)
236-248: LGTM: Format string arguments are now correctly aligned. The `getStonithCommand` function now correctly includes `fc.FencingDeviceOptions[PcmkDelayBase]` as the argument for the `pcmk_delay_base=%q` format specifier. The `ssl_insecure` handling also ensures a value is always set ("1" or "0"), and the wait timeout is increased to 120s.
64-93: LGTM: Dynamic node handling with proper delay assignment. The refactoring to accept `[]string` nodeNames and skip empty entries provides flexibility for node replacement scenarios. The staggered `pcmk_delay_base` assignment (10s for the first device, 1s for others) helps prevent simultaneous fencing races.

pkg/tnf/update-setup/runner.go (9)
1-23: LGTM! Imports are well-organized and include all necessary dependencies for the update-setup workflow.
25-54: LGTM! Client initialization, signal handling, and informer synchronization follow established patterns in the codebase.
55-69: LGTM! Environment validation and cluster status check provide appropriate guards. The early exit when the cluster is not running is correct behavior for this workflow.
71-89: LGTM! Configuration loading and node role determination logic is clear and includes a helpful error message when the current node is not found in the cluster config.
91-103: LGTM! Offline node detection follows the established pattern for pcs interaction. The early exit when no offline node is found is appropriate.
107-124: LGTM! The wait for the etcd revision update before cluster changes is a good safeguard. The node removal/addition sequence is appropriate for the replacement scenario.
126-144: LGTM! Fencing configuration is correctly sequenced before etcd resource updates. The 300-second wait timeout on the etcd resource update is reasonable for cluster stabilization.
162-176: LGTM! The stabilization delay is pragmatic given the documented timing issue with the etcd start. The final cluster enable/start sequence completes the update workflow correctly.
180-190: LGTM! Clean helper function with appropriate logging and early return on error.
pkg/tnf/operator/nodehandler.go (9)
1-28: LGTM! Imports are comprehensive and well-organized for the node handling orchestration logic.
30-50: LGTM! Package-level mutex and function hooks for testability are well-designed. The backoff configuration provides reasonable retry behavior with approximately 10 minutes total retry time.
52-102: LGTM! Retry mechanism with exponential backoff is well-implemented. The mutex ensures single execution, and operator status updates provide proper observability for both success and failure cases.
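As a reference for the pattern being praised, a minimal sketch of mutex-serialized handling with exponential backoff. The backoff values mirror those quoted elsewhere in this review (5s initial, 2x factor, 2min cap); the function names and exact shapes in `nodehandler.go` may differ:

```go
package operator

import (
	"context"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// handleNodesMutex serializes concurrent informer-driven invocations.
var handleNodesMutex sync.Mutex

// nodeBackoff approximates ~10 minutes of total retry time.
var nodeBackoff = wait.Backoff{
	Duration: 5 * time.Second, // initial delay
	Factor:   2,               // exponential growth
	Cap:      2 * time.Minute, // per-attempt delay cap
	Steps:    10,              // number of attempts
}

// handleNodesWithRetrySketch runs the supplied handler under the mutex and
// retries any error with the backoff above.
func handleNodesWithRetrySketch(ctx context.Context, handle func(ctx context.Context) error) error {
	handleNodesMutex.Lock()
	defer handleNodesMutex.Unlock()

	return retry.OnError(nodeBackoff, func(error) bool { return true }, func() error {
		return handle(ctx)
	})
}
```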
168-204: LGTM! Job controller startup sequence is well-structured. The informer sync before the bootstrap check is correct, and waiting for AfterSetup completion prevents race conditions with subsequent update operations.
206-223: LGTM! Bootstrap wait logic correctly handles both in-cluster and bootstrap scenarios. Creating a new client config for `WaitForEtcdBootstrap` aligns with its API requirements.
225-287: LGTM! Update setup workflow is well-structured with proper phasing: Auth → UpdateSetup → AfterSetup. The parallel start followed by sequential wait pattern is efficient, and error messages include node context for debugging.
289-300: LGTM! Clean implementation using a label selector to detect prior TNF setup.
302-311: LGTM! Completion wait helper is clean with appropriate logging and error context.
119-138: The event-driven architecture already re-triggers `handleNodesWithRetry` on node changes. Returning `nil` for transient states (<2 nodes, >2 nodes, nodes not ready) is correct. The recovery mechanism is not exponential backoff; it is event-driven re-invocation via the node informer handlers (AddFunc when a node is added, UpdateFunc when a node transitions to ready, DeleteFunc when a node is deleted). Each handler spawns a goroutine to re-invoke `handleNodesWithRetry`, ensuring timely retry when conditions change.
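A condensed sketch of that event-driven re-triggering (handler wiring only; `isReady` and `handleNodes` stand in for the real helpers in `starter.go`):

```go
package operator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerNodeHandlersSketch wires node add/ready-transition/delete events to
// re-run the (mutex-serialized) node handling in a separate goroutine so the
// informer is never blocked.
func registerNodeHandlersSketch(ctx context.Context, nodeInformer cache.SharedIndexInformer,
	isReady func(*corev1.Node) bool, handleNodes func(ctx context.Context)) {

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok && isReady(node) {
				go handleNodes(ctx)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, ok1 := oldObj.(*corev1.Node)
			newNode, ok2 := newObj.(*corev1.Node)
			// only react to the not-ready -> ready transition
			if ok1 && ok2 && !isReady(oldNode) && isReady(newNode) {
				go handleNodes(ctx)
			}
		},
		DeleteFunc: func(obj interface{}) {
			go handleNodes(ctx)
		},
	})
}
```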
@slintes: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
	}

	// wait a bit for things to settle
	// without this the etcd start on the new node fails for some reason...
```
This is possibly a timing issue with the etcd revision not being present on the node yet. It may be fixed by the auto-retries we're adding to podman-etcd in the latest revision.
```go
	if !bothReady {
		return nil
	}
```
I'm having trouble understanding the logic here. We log that we're waiting for the node to be ready, but by returning nil, we're preventing handleNodeWithRetry from retrying. Is this the intended behavior?
There is no need to retry here; a condition change will retrigger the complete node handling, see https://github.com/openshift/cluster-etcd-operator/pull/1523/changes#diff-976f22fb9d9391708f396ddc48bc75f432af312729e8ec93307185f4029e9cdbR88-R105
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: clobrano, slintes. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Summary
Refactors and extends TNF (Two Node Fencing) node handling to improve reliability during node lifecycle events, particularly node replacements and failures. Extracts logic into dedicated modules with comprehensive retry mechanisms and proper error handling.
Key Changes
Architecture & Reliability:
Node Replacement Support:
Job Management: