OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown #30519
base: main
Conversation
Pipeline controller notification. For optional jobs, comment. This repository is configured in: automatic mode.
@clobrano: This pull request references OCPEDGE-1788 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/174335e0-c951-11f0-8750-599f1dc187cb-0
Force-pushed from 8376fbb to 000f99c (Compare)
Scheduling required tests:
Force-pushed from 000f99c to abaf1be (Compare)
Scheduling required tests:
Force-pushed from abaf1be to 9292848 (Compare)
Scheduling required tests:
fonta-rh left a comment
This is really clean! I added a couple of nits, but nothing that affects functionality.
state, err := services.GetVMState(d.vm, &c.HypervisorConfig, c.HypervisorKnownHostsPath)
if err != nil {
	fmt.Fprintf(g.GinkgoWriter, "Warning: cleanup failed to check VM '%s' state: %v\n", d.vm, err)
	fmt.Fprintf(g.GinkgoWriter, "Trying to start VM '%s' anyway\n", d.vm)
After inspecting, I understand why we start the machine anyway (virsh handles the edge cases). To make the intention clearer, I would suggest the following: here, instead of logging "Trying to start VM ... anyway", log "Marking VM %s as shutdown", which naturally follows "failed to check VM %s state", and then add another log line between 599 and 600 that does say "Trying to start VM %s" (see the sketch below).
}
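A rough sketch of the suggested logging, using only the names from the quoted diff (the comment about virsh behavior reflects the reviewer's reading, not code quoted from the PR):

state, err := services.GetVMState(d.vm, &c.HypervisorConfig, c.HypervisorKnownHostsPath)
if err != nil {
	fmt.Fprintf(g.GinkgoWriter, "Warning: cleanup failed to check VM '%s' state: %v\n", d.vm, err)
	// Treat the VM as shut down; virsh tolerates a start request for a VM that is already running.
	fmt.Fprintf(g.GinkgoWriter, "Marking VM '%s' as shutdown\n", d.vm)
}
fmt.Fprintf(g.GinkgoWriter, "Trying to start VM '%s'\n", d.vm)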
// findClusterOperatorCondition finds a condition in ClusterOperator status
func findClusterOperatorCondition(conditions []v1.ClusterOperatorStatusCondition, conditionType v1.ClusterStatusConditionType) *v1.ClusterOperatorStatusCondition {
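	// A likely body for this helper (reconstructed for context; not quoted from the diff):
	for i := range conditions {
		if conditions[i].Type == conditionType {
			return &conditions[i]
		}
	}
	return nil
}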
Is this function in both files? Can't we reuse the one in common (or vice versa)?
Good catch! This slipped through my rebase
it's not even used anymore in tnf_recovery.go
(It was Claude who caught it, I can't take the credit for that 👯‍♂️)
Force-pushed from 9292848 to e0c1e72 (Compare)
Scheduling required tests:
Force-pushed from e0c1e72 to fc5a9cc (Compare)
Scheduling required tests:
/retest-required
Last payload job looks like it hit an issue where etcd didn't recover, so we may want to see if that's resolved in the latest runs.
jaypoulz left a comment
Nice improvements to the common utility and service functions. I'd just want to see the output from a payload job running these.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/725f6d80-d154-11f0-9458-8ef479ed1ca7-0
A few concerns I do have:
- We seem to be missing the annotation that declares that we require the HypervisorSSH config for the cold boot tests.
- Because these tests should be run in serial, we should also annotate them with [Serial]
See:
var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Slow][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF", func() {
I would reserve [Slow] for tests that take more than 20 minutes.
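A sketch of how the new cold boot suite header could carry those annotations (the describe text and trailing name here are illustrative, not the PR's actual string):

var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica][Suite:openshift/two-node][Serial][Disruptive][Requires:HypervisorSSHConfig] TNF cold boot", func() {
	// cold boot recovery test cases go here
})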
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: clobrano, fonta-rh, jaypoulz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bb33b1c0-d1ad-11f0-9c78-aeea81f4f3be-0
Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:
- Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down, surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS first node, then GNS second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, which prevents waiting to gracefully shut down the second node later. The double UGNS scenario is already covered by existing tests.

This change also skips the etcd recovery tests when the cluster is unhealthy. Previously, these tests would fail if the cluster was not healthy before the test started. That is problematic because the tests are designed to validate recovery from intentional disruptions, not to debug pre-existing cluster issues.
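A minimal sketch of the skip-when-unhealthy guard, assuming a hypothetical isClusterHealthy helper (the PR's actual check may differ):

g.BeforeEach(func() {
	// Skip rather than fail: these tests validate recovery from intentional
	// disruptions, not pre-existing cluster problems.
	healthy, err := isClusterHealthy(ctx, oc) // hypothetical helper
	o.Expect(err).NotTo(o.HaveOccurred())
	if !healthy {
		g.Skip("cluster is unhealthy before the test started; skipping etcd recovery test")
	}
})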
Force-pushed from fc5a9cc to d810c36 (Compare)
Scheduling required tests:
@clobrano: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f34b4040-d1e0-11f0-8fdb-30c402463c9b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6cd7a600-d4d5-11f0-8da3-7302111ebe8f-0