@clobrano
Contributor

Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:

  • Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
  • Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
  • Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters, an ungracefully shut down node is quickly recovered, preventing the ability to wait and gracefully shut down the second node later. The double UGNS scenario is already covered by existing tests.
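The three shutdown combinations above can be written down as a small table. This is only a sketch to make the ordering explicit: the type names, the `WaitForPrevious` flag, and the node names are hypothetical and are not taken from the PR's test code.

```go
package main

import "fmt"

// ShutdownKind is a hypothetical label for how a node is taken down.
type ShutdownKind string

const (
	Graceful   ShutdownKind = "GNS"  // graceful node shutdown
	Ungraceful ShutdownKind = "UGNS" // ungraceful node shutdown
)

// Step describes one node shutdown inside a scenario.
type Step struct {
	Node string
	Kind ShutdownKind
	// WaitForPrevious is true when this step starts only after the
	// previous shutdown has completed (sequential), false when the
	// shutdowns are issued together (simultaneous).
	WaitForPrevious bool
}

// Scenario is one path that ends with both nodes down, followed by a cold boot.
type Scenario struct {
	Name  string
	Steps []Step
}

// coldBootScenarios lists the three combinations named in the description.
func coldBootScenarios(first, second string) []Scenario {
	return []Scenario{
		{Name: "double GNS", Steps: []Step{
			{Node: first, Kind: Graceful},
			{Node: second, Kind: Graceful},
		}},
		{Name: "sequential GNS", Steps: []Step{
			{Node: first, Kind: Graceful},
			{Node: second, Kind: Graceful, WaitForPrevious: true},
		}},
		{Name: "mixed GNS/UGNS", Steps: []Step{
			{Node: first, Kind: Graceful},
			{Node: second, Kind: Ungraceful, WaitForPrevious: true},
		}},
	}
}

func main() {
	// Node names are placeholders for illustration.
	for _, s := range coldBootScenarios("master-0", "master-1") {
		fmt.Printf("cold boot from %s:\n", s.Name)
		for _, st := range s.Steps {
			fmt.Printf("  %s shutdown of %s (after previous: %v)\n", st.Kind, st.Node, st.WaitForPrevious)
		}
		fmt.Println("  restart both nodes, then verify etcd recovers")
	}
}
```

The inverse mixed case (UGNS first) is deliberately absent from the table, for the reason given in the note above.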

@openshift-ci-robot

Pipeline controller notification
This repository is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 24, 2025
@openshift-ci-robot

openshift-ci-robot commented Nov 24, 2025

@clobrano: This pull request references OCPEDGE-1788 which is a valid jira issue.

@clobrano
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Nov 24, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/174335e0-c951-11f0-8750-599f1dc187cb-0

@openshift-ci openshift-ci bot requested review from eggfoobar and qJkee November 24, 2025 16:17
@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from 8376fbb to 000f99c Compare November 24, 2025 16:44
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from 000f99c to abaf1be Compare November 25, 2025 09:26
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from abaf1be to 9292848 Compare November 25, 2025 16:36
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@fonta-rh fonta-rh left a comment

This is really clean!!! I added a couple nits, but nothing that affects functionality

state, err := services.GetVMState(d.vm, &c.HypervisorConfig, c.HypervisorKnownHostsPath)
if err != nil {
	fmt.Fprintf(g.GinkgoWriter, "Warning: cleanup failed to check VM '%s' state: %v\n", d.vm, err)
	fmt.Fprintf(g.GinkgoWriter, "Trying to start VM '%s' anyway\n", d.vm)

After inspecting the code, I understand why we start the machine anyway (virsh handles the edge cases). To make the intention clearer, I would suggest the following: here, instead of logging "Trying to start VM ... anyway", log "Marking VM %s as shutdown", which follows naturally from "failed to check VM %s state"; then add another log line between lines 599 and 600 that says "Trying to start VM %s".

}

// findClusterOperatorCondition finds a condition in ClusterOperator status
func findClusterOperatorCondition(conditions []v1.ClusterOperatorStatusCondition, conditionType v1.ClusterStatusConditionType) *v1.ClusterOperatorStatusCondition {

Is this function in both files? Can't we reuse the one in common (or vice versa)?

Contributor Author

Good catch! This slipped through my rebase

Contributor Author

It's not even used anymore in tnf_recovery.go.
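For reference, a single shared copy of such a helper might look like the sketch below. The two types are simplified local stand-ins for the `v1` types in the signature quoted above, so the example compiles on its own; the body is an assumption about what the helper does, not the code from the PR.

```go
package main

import "fmt"

// Simplified stand-ins for the v1 condition types (assumption: only the
// fields needed for the lookup are modeled here).
type ClusterStatusConditionType string

type ClusterOperatorStatusCondition struct {
	Type   ClusterStatusConditionType
	Status string
}

// findClusterOperatorCondition returns a pointer to the first condition of
// the given type, or nil when no such condition is present.
func findClusterOperatorCondition(conditions []ClusterOperatorStatusCondition, conditionType ClusterStatusConditionType) *ClusterOperatorStatusCondition {
	for i := range conditions {
		if conditions[i].Type == conditionType {
			// Index into the slice so the caller gets the stored element,
			// not a copy of the loop variable.
			return &conditions[i]
		}
	}
	return nil
}

func main() {
	conds := []ClusterOperatorStatusCondition{
		{Type: "Available", Status: "True"},
		{Type: "Degraded", Status: "False"},
	}
	if c := findClusterOperatorCondition(conds, "Degraded"); c != nil {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```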

(It was Claude who caught it, I can't take the credit for that 👯‍♂️ )

@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from 9292848 to e0c1e72 Compare November 28, 2025 16:11
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from e0c1e72 to fc5a9cc Compare December 1, 2025 15:05
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@clobrano
Contributor Author

clobrano commented Dec 1, 2025

/retest-required

@jaypoulz
Contributor

jaypoulz commented Dec 4, 2025

Last payload job looks like it hit an issue where etcd didn't recover, so we may want to see if that's resolved in the latest runs.

Contributor

@jaypoulz jaypoulz left a comment

Nice improvements to the common utility and service functions. I'd just want to see the output from a payload job running these.

@jaypoulz
Contributor

jaypoulz commented Dec 4, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Dec 4, 2025

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/725f6d80-d154-11f0-9458-8ef479ed1ca7-0

Contributor

@jaypoulz jaypoulz left a comment

A few concerns I do have:

  1. We seem to be missing the annotation that declares that we require the HypervisorSSH config for the cold boot tests.
  2. Because these tests should be run in serial, we should also annotate them with [Serial]

See

var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Slow][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF", func() {
for the exact annotations.

I would reserve [Slow] for tests that take more than 20 minutes.
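A quick way to spot the missing annotations is to check a test name for the required tags. The helper below is a hypothetical sketch, not part of the test suite; the two required tags are the ones named in the concerns above, and the second test name in `main` is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredAnnotations lists the tags the review asks for on the cold boot tests.
var requiredAnnotations = []string{"[Serial]", "[Requires:HypervisorSSHConfig]"}

// missingAnnotations returns the required tags that do not appear in testName.
func missingAnnotations(testName string) []string {
	var missing []string
	for _, a := range requiredAnnotations {
		if !strings.Contains(testName, a) {
			missing = append(missing, a)
		}
	}
	return missing
}

func main() {
	names := []string{
		"[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Slow][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF",
		"[sig-etcd][Disruptive] hypothetical cold boot test", // placeholder name
	}
	for _, n := range names {
		fmt.Printf("%q missing: %v\n", n, missingAnnotations(n))
	}
}
```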

@openshift-ci
Contributor

openshift-ci bot commented Dec 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clobrano, fonta-rh, jaypoulz

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 4, 2025
@clobrano
Contributor Author

clobrano commented Dec 5, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bb33b1c0-d1ad-11f0-9c78-aeea81f4f3be-0

Add three new test cases to validate etcd cluster recovery from cold
boot scenarios reached through different graceful/ungraceful shutdown
combinations:

- Cold boot from double GNS: both nodes gracefully shut down
  simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then
  second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down,
  surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested
because in TNF clusters, an ungracefully shut down node is quickly
recovered, preventing the ability to wait and gracefully shut down the
second node later. The double UGNS scenario is already covered by
existing tests.

This change also skips the etcd recovery tests when the cluster is
unhealthy. Previously, the etcd recovery tests would fail if the cluster
was not healthy before the test started. This is problematic because
these tests are designed to validate recovery from intentional
disruptions, not to debug pre-existing cluster issues.
@clobrano clobrano force-pushed the enhancement/tnf-e2e-cold-boot-from-mixed-gns-ungns-shutdowns branch from fc5a9cc to d810c36 Compare December 5, 2025 13:07
@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2025

@clobrano: all tests passed!

@clobrano
Contributor Author

clobrano commented Dec 5, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Dec 5, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f34b4040-d1e0-11f0-8fdb-30c402463c9b-0

@clobrano
Contributor Author

clobrano commented Dec 9, 2025

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Dec 9, 2025

@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6cd7a600-d4d5-11f0-8da3-7302111ebe8f-0

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants