OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown #30519
base: main
Conversation
Pipeline controller notification. For optional jobs, comment. This repository is configured in: automatic mode.
@clobrano: This pull request references OCPEDGE-1788 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/174335e0-c951-11f0-8750-599f1dc187cb-0
Force-pushed from 8376fbb to 000f99c (Compare)
Scheduling required tests:
Force-pushed from 000f99c to abaf1be (Compare)
Scheduling required tests:
Force-pushed from abaf1be to 9292848 (Compare)
Scheduling required tests:
fonta-rh left a comment
This is really clean! I added a couple of nits, but nothing that affects functionality.
state, err := services.GetVMState(d.vm, &c.HypervisorConfig, c.HypervisorKnownHostsPath)
if err != nil {
	fmt.Fprintf(g.GinkgoWriter, "Warning: cleanup failed to check VM '%s' state: %v\n", d.vm, err)
	fmt.Fprintf(g.GinkgoWriter, "Trying to start VM '%s' anyway\n", d.vm)
After inspecting, I understand why we start the machine anyway (virsh handles the edge cases). To make the intention clearer, I would suggest the following: here, instead of logging "Trying to start VM ... anyway", log "Marking VM %s as shutdown", which naturally follows "failed to check VM %s state", and then add another log line between 599 and 600 that does say "Trying to start VM %s" (see the sketch below).
}
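A rough sketch of the suggested logging, using only the names from the quoted diff (the comment about virsh behavior reflects the reviewer's reading, not code quoted from the PR):

state, err := services.GetVMState(d.vm, &c.HypervisorConfig, c.HypervisorKnownHostsPath)
if err != nil {
	fmt.Fprintf(g.GinkgoWriter, "Warning: cleanup failed to check VM '%s' state: %v\n", d.vm, err)
	// Treat the VM as shut down; virsh tolerates a start request for a VM that is already running.
	fmt.Fprintf(g.GinkgoWriter, "Marking VM '%s' as shutdown\n", d.vm)
}
fmt.Fprintf(g.GinkgoWriter, "Trying to start VM '%s'\n", d.vm)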
// findClusterOperatorCondition finds a condition in ClusterOperator status
func findClusterOperatorCondition(conditions []v1.ClusterOperatorStatusCondition, conditionType v1.ClusterStatusConditionType) *v1.ClusterOperatorStatusCondition {
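	// A likely body for this helper (reconstructed for context; not quoted from the diff):
	for i := range conditions {
		if conditions[i].Type == conditionType {
			return &conditions[i]
		}
	}
	return nil
}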
Is this function in both files? Can't we reuse the one in common (or vice versa)?
Good catch! This slipped through my rebase
it's not even used anymore in tnf_recovery.go
(It was Claude who caught it, I can't take the credit for that 👯‍♂️)
Force-pushed from 9292848 to e0c1e72 (Compare)
Scheduling required tests:
Force-pushed from e0c1e72 to fc5a9cc (Compare)
Scheduling required tests:
/retest-required
Last payload job looks like it hit an issue where etcd didn't recover, so we may want to see if that's resolved in the latest runs.
jaypoulz left a comment
Nice improvements to the common utility and service functions. I'd just want to see the output from a payload job running these.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/725f6d80-d154-11f0-9458-8ef479ed1ca7-0
A few concerns I do have:
- We seem to be missing the annotation that declares that we require the HypervisorSSH config for the cold boot tests.
- Because these tests should be run in serial, we should also annotate them with [Serial]
See:
var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Slow][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF", func() {
I would reserve [Slow] for tests that take more than 20 minutes.
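A sketch of how the new cold boot suite header could carry those annotations (the describe text and trailing name here are illustrative, not the PR's actual string):

var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF cold boot", func() {
	// cold boot recovery test cases go here
})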
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clobrano, fonta-rh, jaypoulz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bb33b1c0-d1ad-11f0-9c78-aeea81f4f3be-0
Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:
- Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, which prevents waiting to gracefully shut down the second node later. The double UGNS scenario is already covered by existing tests.

This change also skips the etcd recovery tests when the cluster is unhealthy. Previously, these tests would fail if the cluster was not healthy before the test started. That is problematic because the tests are designed to validate recovery from intentional disruptions, not to debug pre-existing cluster issues.
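A minimal sketch of the skip-when-unhealthy guard, assuming a hypothetical isClusterHealthy helper (the PR's actual check may differ):

g.BeforeEach(func() {
	// Skip rather than fail: these tests validate recovery from intentional
	// disruptions, not pre-existing cluster problems.
	healthy, err := isClusterHealthy(ctx, oc) // hypothetical helper
	o.Expect(err).NotTo(o.HaveOccurred())
	if !healthy {
		g.Skip("cluster is unhealthy before the test started; skipping etcd recovery test")
	}
})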
Force-pushed from fc5a9cc to d810c36 (Compare)
Scheduling required tests:
@clobrano: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f34b4040-d1e0-11f0-8fdb-30c402463c9b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6cd7a600-d4d5-11f0-8da3-7302111ebe8f-0