OCPEDGE-1788: TNF add etcd cold boot recovery tests from graceful node shutdown #30404
Conversation
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery
@jaypoulz: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c4ec1d0-ae9a-11f0-95ed-ad8d5e8a115f-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/df063eb0-b31c-11f0-9588-ce4096893980-0
Rebasing this to get #30385
/hold
Force-pushed from c653990 to 9083969
Add three new test cases to validate etcd cluster recovery from cold boot scenarios reached through different graceful/ungraceful shutdown combinations:
- Cold boot from double GNS: both nodes gracefully shut down simultaneously, then both restart (full cluster cold boot)
- Cold boot from sequential GNS: first node gracefully shut down, then the second node gracefully shut down, then both restart
- Cold boot from mixed GNS/UGNS: first node gracefully shut down, the surviving node then ungracefully shut down, then both restart

Note: The inverse case (UGNS on the first node, then GNS on the second) is not tested because in TNF clusters an ungracefully shut down node is recovered quickly, which makes it impossible to wait and then gracefully shut down the second node. The double UGNS scenario is already covered by existing tests.
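A rough sketch of how one of these scenarios could be structured in Ginkgo is shown below; every helper here (gracefulShutdown, startNode, waitForEtcdRecovery) and the node names are hypothetical placeholders for illustration, not the PR's actual utilities.

```go
package tnfrecovery

import (
	"context"
	"time"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

// Hypothetical helpers standing in for the PR's real node/etcd utilities.
func gracefulShutdown(ctx context.Context, node string) error        { return nil }
func startNode(ctx context.Context, node string) error               { return nil }
func waitForEtcdRecovery(ctx context.Context, d time.Duration) error { return nil }

var _ = g.Describe("[sig-etcd] TNF cold boot recovery (sketch)", func() {
	g.It("recovers after sequential graceful shutdown of both nodes", func(ctx context.Context) {
		// Gracefully shut down the first node, then the surviving one.
		o.Expect(gracefulShutdown(ctx, "master-0")).To(o.Succeed())
		o.Expect(gracefulShutdown(ctx, "master-1")).To(o.Succeed())

		// Cold boot: bring both nodes back and wait for etcd to regain quorum.
		o.Expect(startNode(ctx, "master-0")).To(o.Succeed())
		o.Expect(startNode(ctx, "master-1")).To(o.Succeed())
		o.Expect(waitForEtcdRecovery(ctx, 30*time.Minute)).To(o.Succeed())
	})
})
```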
Force-pushed from 9083969 to b6e1384
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92aaaa10-b4d4-11f0-8ae3-147c7322b463-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f41854d0-b503-11f0-9c7a-52e63142ca96-0
Change BeforeEach health checks to skip tests instead of failing them when the cluster is not in a healthy state at the start of the test.

Previously, the etcd recovery tests would fail if the cluster was not healthy before the test started. This is problematic because these tests are designed to validate recovery from intentional disruptions, not to debug pre-existing cluster issues.

Changes:
- Extract health validation functions to common.go for reusability
- Add skipIfClusterIsNotHealthy() to consolidate all health checks
- Implement internal retry logic in health check functions with timeouts
- Add ensureEtcdHasTwoVotingMembers() to validate membership state
- Skip tests early if the cluster is degraded, pods aren't running, or members are unhealthy

This ensures tests only run when the cluster is in a known-good state, reducing false failures due to pre-existing issues while maintaining test coverage for actual recovery scenarios.
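A minimal sketch of the skip-early pattern described in this commit message, reusing the helper names it mentions; the signatures and bodies below are assumptions for illustration, not the PR's actual code:

```go
package tnfrecovery

import (
	"context"

	g "github.com/onsi/ginkgo/v2"
)

// Hypothetical health checks; the PR's real versions poll with internal
// timeouts and inspect the etcd ClusterOperator, pods, and member list.
func ensureEtcdOperatorIsNotDegraded(ctx context.Context) error { return nil }
func ensureEtcdPodsAreRunning(ctx context.Context) error        { return nil }
func ensureEtcdHasTwoVotingMembers(ctx context.Context) error   { return nil }

// skipIfClusterIsNotHealthy skips the current spec instead of failing it when
// the cluster is not in a known-good state before the disruption starts.
func skipIfClusterIsNotHealthy(ctx context.Context) {
	checks := []func(context.Context) error{
		ensureEtcdOperatorIsNotDegraded,
		ensureEtcdPodsAreRunning,
		ensureEtcdHasTwoVotingMembers,
	}
	for _, check := range checks {
		if err := check(ctx); err != nil {
			g.Skip("cluster not healthy before test: " + err.Error())
		}
	}
}
```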
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0181f70-b5be-11f0-9b3d-143ce616f56d-0
payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/878d1590-b629-11f0-97f1-f0b4b63689fd-0
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: f918765
New tests seen in this PR at sha: f918765
@clobrano: This pull request references OCPEDGE-1788, which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Also unified timeouts for the initial checks and improved logging
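For illustration only, the "unified timeouts" could be expressed as a single shared pair of constants; the names and values below are assumptions, not taken from the PR:

```go
package tnfrecovery

import "time"

// Illustrative only: one timeout/poll-interval pair shared by all of the
// initial health checks, replacing separate per-check values.
const (
	initialHealthCheckTimeout = 5 * time.Minute
	initialHealthPollInterval = 10 * time.Second
)
```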
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a1c1ae0-b893-11f0-8047-50d9133ea1bb-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4a74ab30-bbda-11f0-80bd-b8a7622b18f9-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8e70d3c0-bbe1-11f0-9f1c-e6c898636467-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d9d0c600-be29-11f0-8d69-5a1123152b7a-0
Failed for OCPEDGE-2213
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/83eac3a0-be4f-11f0-94b8-fb61590621b1-0
The cluster failed to deploy
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@clobrano: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dd9fada0-bf00-11f0-8514-620ca41f9ab2-0
Force-pushed from cfa18ac to 90b1da6
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: clobrano
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I'm not sure if this is the type of comment you're looking for at this point in the life of the PR, but this block seems suspect to me. Available and Degraded are independent, and we need to look at both at the same time to consider the operator healthy. This way of writing the logic seems clearer and also reports the Degraded status alongside the Not Available status:
// Check if etcd operator is healthy (Available and not Degraded)
available := findClusterOperatorCondition(co.Status.Conditions, v1.OperatorAvailable)
degraded := findClusterOperatorCondition(co.Status.Conditions, v1.OperatorDegraded)
if (available != nil && available.Status == v1.ConditionTrue) &&
	(degraded == nil || degraded.Status != v1.ConditionTrue) {
	framework.Logf("SUCCESS: Cluster operator is healthy")
	return nil
}

// Not healthy - report why
var reasons []string
if available == nil {
	reasons = append(reasons, "Available condition not found")
} else if available.Status != v1.ConditionTrue {
	reasons = append(reasons, fmt.Sprintf("not Available: %s", available.Message))
}
if degraded != nil && degraded.Status == v1.ConditionTrue {
	reasons = append(reasons, fmt.Sprintf("Degraded: %s", degraded.Message))
}
return fmt.Errorf("ClusterOperator is unhealthy: %s", strings.Join(reasons, "; "))
I don't see them as independent 🤔
available := findClusterOperatorCondition(co.Status.Conditions, v1.OperatorAvailable)
if available == nil {
	err = fmt.Errorf("ClusterOperator Available condition not found") // available == nil ==> err is set, exit from if/else branch
} else if available.Status != v1.ConditionTrue { // here we are sure that (available != nil)
	err = fmt.Errorf("ClusterOperator is not Available: %s", available.Message) // available.Status != v1.ConditionTrue ==> err is set, exit from if/else branch
} else { // here we are sure that (available != nil && available.Status == v1.ConditionTrue)
	// Check if etcd operator is not Degraded
	degraded := findClusterOperatorCondition(co.Status.Conditions, v1.OperatorDegraded)
	if degraded != nil && degraded.Status == v1.ConditionTrue {
		err = fmt.Errorf("ClusterOperator is Degraded: %s", degraded.Message) // degraded here, err is set, exit from if/else branch
	} else {
		framework.Logf("SUCCESS: Cluster operator is healthy") // here we are sure that (available.Status == v1.ConditionTrue && degraded.Status != v1.ConditionTrue)
		return nil
	}
}
And we return the error message when the context expires:
select {
case <-ctx.Done():
	return err
default:
}
time.Sleep(pollInterval)
Maybe I've misunderstood the possibilities matrix. Can't the cluster be Available and Degraded at the same time? Doesn't that happen, for example, when we have 1/2 etcd members but kept quorum (say, a new cluster where the second member hasn't joined yet)?
We discussed this in chat. I understand that the nested if/else might seem complex to read, but reading complex boolean conditions can be just as hard. We agreed to keep the code as is.
Yeah, I agree. We might make it marginally more readable, but it's not worth redoing it when we know it works. The change is not as simple as I thought 😓
@clobrano: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Replaced by #30519

PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.