OCPEDGE-1484: [TNF] kubelet disruption test #30290
base: main
Conversation
Job Failure Risk Analysis for sha: 7d6c813
Job Failure Risk Analysis for sha: 15714f2
Showing 20 of 23 jobs analysis
Job Failure Risk Analysis for sha: eab19dc
Job Failure Risk Analysis for sha: 99e3ea0
clobrano left a comment:
I left some comments. I also noticed that there are quite a few unused functions that should be cleaned up or integrated.
    return fmt.Errorf("failed to list cluster operators: %v", err)
}

for _, operator := range clusterOperators.Items {
I think I have a library function that does this in the disruption test shared libraries.
https://github.com/openshift/origin/pull/30298/files#diff-4ed6c69fb287c32c3a2fd0f6a6f1e31eca45ed3267feb03d4e466a3ee4c7f276R1274-R1275
I think this approach is fine, but I would recommend collecting the unhealthy things into slices and then returning an error that lists all of the unavailable and degraded operators if those slices are non-empty.
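A minimal sketch of that pattern, assuming configv1 is github.com/openshift/api/config/v1 and clusterOperators is the list obtained above:

var unavailable, degraded []string
for _, operator := range clusterOperators.Items {
    for _, cond := range operator.Status.Conditions {
        // Collect rather than fail fast, so the final error lists every unhealthy operator.
        if cond.Type == configv1.OperatorAvailable && cond.Status == configv1.ConditionFalse {
            unavailable = append(unavailable, operator.Name)
        }
        if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
            degraded = append(degraded, operator.Name)
        }
    }
}
if len(unavailable) > 0 || len(degraded) > 0 {
    return fmt.Errorf("unhealthy cluster operators, unavailable: %v, degraded: %v", unavailable, degraded)
}
return nil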
The latest change does collect the unhealthy things now. An example:
[FAILED] Timed out after 60.001s.
etcd cluster operator should be healthy before starting test
Unexpected error:
<*errors.errorString | 0xc0017b9740>:
etcd ClusterOperator is not Available: &{Available False 2025-10-13 10:42:14 -0400 EDT EtcdMembers_NoQuorum EtcdMembersAvailable: 1 of 2 members are available, NAME-PENDING-192.168.111.20 has not started}
{
s: "etcd ClusterOperator is not Available: &{Available False 2025-10-13 10:42:14 -0400 EDT EtcdMembers_NoQuorum EtcdMembersAvailable: 1 of 2 members are available, NAME-PENDING-192.168.111.20 has not started}",
}
occurred
In [BeforeEach] at: github.com/openshift/origin/test/extended/two_node/tnf_kubelet_disruption.go:42 @ 10/13/25 10:49:59.218
------------------------------
I think this meets your request above. Please verify.
Looks good. Consider moving this to common.go.
Job Failure Risk Analysis for sha: aa19408
Job Failure Risk Analysis for sha: 19be558
Job Failure Risk Analysis for sha: d8aad48
clobrano left a comment:
I left a couple of comments.
Job Failure Risk Analysis for sha: 967520e
Job Failure Risk Analysis for sha: 4ce24e9
jaypoulz left a comment:
Minor notes. I'm good to merge as is, once it's rebased and passes tests.
Force-pushed 9e30ab7 to be91544
Force-pushed be91544 to 60333c2
Risk analysis has seen new tests most likely introduced by this PR.
New Test Risks for sha: 612eb9d
New tests seen in this PR at sha: 612eb9d
/retest-required
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dhensel-rh, jaypoulz. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@dhensel-rh: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1764d5f0-d118-11f0-9f02-aa3e2e13ee2f-0
Scheduling required tests:
eggfoobar left a comment:
Just took an initial walk through; looking good so far, just some code duplication we can tighten up a bit. I'll give it another look Monday.
High notes:
- Let's use the utils.GetNodes method
- Let's keep the state of the nodes in the tests
- Let's use the openshift-etcd namespace to run the debug pods for now, or create our own privileged namespace
    return false
}

// HasNodeRebooted checks if a node has rebooted by comparing its current BootID with a previous snapshot.
Seems like nothing changed here, let's move this back to avoid a diff.
If HasNodeRebooted is removed here, it will negatively impact
https://github.com/openshift/origin/blob/main/test/extended/two_node/tnf_recovery.go#L454-L458
Not sure why this is showing up like this; it is already checked in.
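For reference, the check being discussed reduces to comparing the node's reported kernel boot ID against a snapshot taken before the disruption; a rough sketch (the snapshot map is illustrative, not the repo's exact signature):

// hasNodeRebooted reports whether a node's kernel boot ID differs from an earlier snapshot.
// bootIDSnapshot is a hypothetical map of node name -> BootID captured before the disruption.
func hasNodeRebooted(node *corev1.Node, bootIDSnapshot map[string]string) bool {
    previous, ok := bootIDSnapshot[node.Name]
    if !ok {
        return false // no baseline recorded for this node
    }
    // Status.NodeInfo.BootID changes on every kernel boot.
    return node.Status.NodeInfo.BootID != previous
}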
const (
    AllNodes                  = ""                                      // No label filter for GetNodes
    LabelNodeRoleControlPlane = "node-role.kubernetes.io/control-plane" // Control plane node label
    LabelNodeRoleMaster       = "node-role.kubernetes.io/master"        // Legacy master node label
Let's remove this; at this point all labels should use control-plane, and I don't see where this is being used.
    return utils.LogEtcdClusterStatus(oc, "BeforeEach validation")
}, etcdOperatorIsHealthyTimeout, pollInterval).ShouldNot(o.HaveOccurred(), "etcd cluster should be fully healthy before starting test")

nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
Let's use the helper function utils.GetNodes(oc, utils.AllNodes) here and everywhere we do node queries to simplify things
Actually, let's avoid running this query in the BeforeEach statement; just run it in the test itself, and we avoid any possible mistakes that can arise from sharing the node list between Ginkgo node runs. This should also help, given that the node state changes enough that we want to capture fresh state at test run time.
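A rough sketch of fetching the nodes inside the It block itself, using the helper suggested above (the exact return type of utils.GetNodes is assumed here):

g.It("should recover from single node kubelet service disruption", func() {
    // Query the nodes at test run time instead of sharing a list from BeforeEach,
    // so each spec sees the current node state.
    nodes, err := utils.GetNodes(oc, utils.AllNodes) // assumed to return a slice of corev1.Node
    o.Expect(err).NotTo(o.HaveOccurred(), "expected to list nodes")
    o.Expect(len(nodes)).To(o.Equal(2), "expected exactly two nodes in a two-node cluster")
    // ... disruption steps operate on nodes[0] and nodes[1] ...
})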
g.By("Ensuring both nodes are healthy before starting kubelet disruption test")
for _, node := range nodes {
    o.Eventually(func() bool {
We don't need to make any call out for the individual nodes here; we already have the node objects at this point, so we can just check the node Ready condition:
for _, node := range nodes {
    if ready := nodeutil.IsNodeReady(node); !ready {
        o.Expect(ready).Should(o.BeTrue(), fmt.Sprintf("Node %s should be ready before kubelet disruption", node.Name))
    }
}

defer g.GinkgoRecover()
var (
    oc = util.NewCLIWithoutNamespace("").AsAdmin()
Here let's use util.NewCLIWithoutNamespace("").SetNamespace("openshift-etcd").AsAdmin() to avoid permission issues and to use the namespace where the TNF pods run.
Scheduling required tests:
Scheduling required tests:
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@dhensel-rh: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d8b78e30-d46c-11f0-90b8-6c3725405946-0
Scheduling required tests:
eggfoobar left a comment:
So far so good. I've got a few suggestions and some logging changes: we have a bit too much logging repeating the same information, which is going to make it noisy to debug.
Let's remove the duplicated information in the return statements, and remove the success and start logs at the beginning of the helper functions, since the call site already seems to log that the helper function is being called.
For example, change this:
if err != nil {
    framework.Logf("Failed to check resource status: %v, output: %s", err, output)
    return false, fmt.Errorf("failed to check resource status: %v", err)
}
to this:
if err != nil {
    return false, fmt.Errorf("failed to check resource status, output: %s error: %v", output, err)
}

g.AfterEach(func() {
    // Cleanup: Remove any resource bans that may have been created during the test
    // This ensures the device under test is in the same state the test started in
    nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
Let's use the util function GetNodes(oc, AllNodes).
// This ensures the device under test is in the same state the test started in
nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
if err != nil {
    framework.Logf("Warning: Failed to retrieve nodes during cleanup: %v", err)
What happens if this fails to clean up? Would that be considered a failed state?
    return
}

if len(nodeList.Items) == 2 {
If we don't get two nodes, would that be a failure?
})

g.It("Should recover from single node kubelet service disruption", func() {
    nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
Let's use the util function here: utils.GetNodes(oc, utils.AllNodes).
})

g.It("Should properly stop kubelet service and verify automatic restart on target node", func() {
    nodeList, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
Same thing here about the util function
const (
    AllNodes                  = ""                                      // No label filter for GetNodes
    LabelNodeRoleControlPlane = "node-role.kubernetes.io/control-plane" // Control plane node label
    LabelNodeRoleControlPlane = "node-role.kubernetes.io/control-plane" // Control plane node label // Legacy master node label
Let's remove this comment about legacy; it seems like a leftover from the previous line.
//
// err := AddConstraint(oc, "master-0", "kubelet-clone", "master-1")
func AddConstraint(oc *exutil.CLI, nodeName string, resourceName string, targetNode string) error {
    framework.Logf("Banning resource %s from running on %s (temporary ban for testing)", resourceName, targetNode)
Let's remove this; the call site should make mention of it. Helper functions should be kept clean of log statements; logging opinions are typically best left to the caller.
    "--", "chroot", "/host", "bash", "-c", cmd).Output()

if err != nil {
    framework.Logf("Failed to ban resource: %v, output: %s", err, output)
Same thing here; the error should describe why it failed, and the caller can decide how to log it.
    return fmt.Errorf("failed to ban resource: %v", err)
}

framework.Logf("Successfully banned resource %s from %s", resourceName, targetNode)
Same thing here, let's nix this; it just ends up causing noise in the logs.
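Putting those three comments together, the ban helper's body would end up roughly like this (a sketch; the pcs command is illustrative, and only the error path survives, with all logging left to the caller):

func AddConstraint(oc *exutil.CLI, nodeName string, resourceName string, targetNode string) error {
    // Illustrative pacemaker command; the real cmd comes from the existing helper.
    cmd := fmt.Sprintf("pcs constraint location %s avoids %s", resourceName, targetNode)
    output, err := oc.AsAdmin().Run("debug").Args("node/"+nodeName,
        "--", "chroot", "/host", "bash", "-c", cmd).Output()
    if err != nil {
        // The error carries the full context; the call site decides whether and how to log it.
        return fmt.Errorf("failed to ban resource %s from %s, output: %s, error: %v", resourceName, targetNode, output, err)
    }
    return nil
}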
}

// isPodReady checks if a pod is ready based on its conditions
func isPodReady(pod *corev1.Pod) bool {
Let's remove this and use podutils.IsPodReady(pod) from the "k8s.io/kubectl/pkg/util/podutils" package.
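A small sketch of what a call site looks like with the upstream helper (the wrapper function here is hypothetical, purely for illustration):

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/kubectl/pkg/util/podutils"
)

// allPodsReady shows the replacement: the local isPodReady helper is dropped
// in favor of podutils.IsPodReady from kubectl's util package.
func allPodsReady(pods []corev1.Pod) bool {
    for i := range pods {
        if !podutils.IsPodReady(&pods[i]) {
            return false
        }
    }
    return true
}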
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@dhensel-rh: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/278ceed0-d60a-11f0-964c-0b149fa05ac5-0
Scheduling required tests:
@dhensel-rh: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@dhensel-rh: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/27bc3cf0-d69e-11f0-9dd1-df35834b10ac-0
Scheduling required tests:
eggfoobar left a comment:
Just a small naming change and an observation on the kubelet restart. Also, @jaypoulz, we might need some guidance on these tests as they relate to the openshift/two-node suite; we are currently grabbing parallel, serial, disruptive, and degraded tests in that bucket, and I think we might need to re-evaluate that.
origin/pkg/testsuites/standard_suites.go
Lines 423 to 432 in c86e3cd
{
    Name: "openshift/two-node",
    Description: templates.LongDesc(`
        This test suite runs tests to validate two-node.
    `),
    Qualifiers: []string{
        `name.contains("[Suite:openshift/two-node") || name.contains("[OCPFeatureGate:DualReplica]") || name.contains("[OCPFeatureGate:HighlyAvailableArbiter]")`,
    },
    TestTimeout: 60 * time.Minute,
},
    }
})

g.It("Should recover from single node kubelet service disruption", func() {
Let's lowercase these Shoulds; don't forget that this word is in the middle of the test title.
"Should" -> "should"
}, kubeletRestoreTimeout, pollInterval).ShouldNot(o.HaveOccurred(), "Essential cluster operators should be available after kubelet resource ban removal")
})

g.It("Should properly stop kubelet service and verify automatic restart on target node", func() {
Same here
"Should" -> "should"
g.By("Waiting for kubelet service to automatically restart on target node")
o.Eventually(func() bool {
    return utils.IsServiceRunning(oc, targetNode.Name, "kubelet")
I'm not sure this will work; once you stop the kubelet with systemd, does it start back up? If the kubelet is not up, I don't think you'll be able to run a debug pod on that node; you'll need to use the SSH framework that Jeremy implemented in order to check and restart the kubelet.
Ideally you would want to crash the kubelet to test this out, so that the always-restart stanza of kubelet.service kicks in. https://github.com/openshift/machine-config-operator/blob/main/templates/master/01-master-kubelet/_base/units/kubelet.service.yaml#L48-L49
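To spell out the distinction: systemd never restarts a unit after an explicit systemctl stop, regardless of its Restart= setting, but a Restart=always-style stanza does kick in when the process dies unexpectedly. A hedged sketch of issuing such a crash with the same debug-pod pattern used elsewhere in this PR (the exact command and invocation are illustrative; as noted above, an SSH-based path may be needed once kubelet is actually down):

// Crash kubelet (SIGKILL) rather than stopping it cleanly, so the unit's restart
// policy applies. Command and invocation are illustrative only.
crashCmd := "pkill -KILL -x kubelet"
out, err := oc.AsAdmin().Run("debug").Args("node/"+targetNode.Name,
    "--", "chroot", "/host", "bash", "-c", crashCmd).Output()
// The exec may be cut short as kubelet goes down; treat the result as informational.
framework.Logf("kubelet crash command issued on %s (output: %q, err: %v)", targetNode.Name, out, err)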
Scheduling required tests:
These tests check what happens when kubelet becomes unavailable.