ETCD-674: E2E to vertically scale up and down when kubelet is not running on a node and when an unhealthy member is present #29236
Conversation
/test e2e-aws-ovn-etcd-scaling

Force-pushed 7e1de9a to ee92b18

/test e2e-aws-ovn-etcd-scaling

Force-pushed ee92b18 to 78840ba

/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 78840ba
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed 106d577 to 7c1093b

/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 7c1093b
Force-pushed 7c1093b to da81f2a
/test e2e-aws-ovn-etcd-scaling
```go
		return false, nil
	}
	// ...
	return podReadyCondition.Status == corev1.ConditionFalse, nil
```
You have to be a bit careful here: kubelet is the only one updating the pod status, so if you shut it down this condition may never become true. I would just fire and forget this pod and wait for the node to become NotReady.
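For illustration, a minimal sketch of the fire-and-forget idea, polling the node's Ready condition instead of the pod's; the helper name, interval, and timeout below are assumptions, not code from this PR:

```go
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeNotReady (hypothetical helper): kubelet is the only writer
// of pod conditions, so instead of waiting on the pod, poll the node's
// Ready condition until it is no longer True.
func waitForNodeNotReady(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				return false, nil // tolerate transient API errors while polling
			}
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady {
					// a NotReady node reports False or Unknown here
					return cond.Status != corev1.ConditionTrue, nil
				}
			}
			return false, nil
		})
}
```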
```go
// step 1: stop the kubelet on a node
framework.Logf("Stopping the kubelet on the node %s", etcdTargetNode.Name)
err = scalingtestinglibrary.StopKubelet(ctx, oc.AdminKubeClient(), *etcdTargetNode)
```
Avoid the pointer deref here; just pass the node down into the function and error out if it is nil.
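Sketched against the call above, one shape for that change (the stop logic itself is elided; only the signature and the nil guard are the point):

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// StopKubelet variant that takes the pointer directly and fails fast on
// nil, so callers never have to dereference.
func StopKubelet(ctx context.Context, client kubernetes.Interface, node *corev1.Node) error {
	if node == nil {
		return fmt.Errorf("StopKubelet: target node is nil")
	}
	framework.Logf("Stopping the kubelet on the node %s", node.Name)
	// ... existing stop logic unchanged ...
	return nil
}
```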
```go
// step 2: delete the machine on which kubelet is stopped to trigger the CPMSO to create a new one to replace it
machineToDelete, err := scalingtestinglibrary.NodeNameToMachineName(ctx, kubeClient, machineClient, etcdTargetNode.Name)
err = errors.Wrapf(err, "failed to get the machine name for the NotReady node: %s", etcdTargetNode.Name)
o.Expect(err).ToNot(o.HaveOccurred())
```
I see why you need the helper. How about you choose the machine you want to stop kubelet on first, and then just get the node via the status reference? That should save you a ton of code.
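A sketch of that inversion, assuming machineClient is the typed MachineV1beta1 client and that control-plane machines carry the standard master role label (both are assumptions about the surrounding test code):

```go
// Pick the control-plane machine first; its node then comes straight
// from machine.Status.NodeRef, so the NodeNameToMachineName helper and
// its error handling drop out entirely.
machines, err := machineClient.Machines("openshift-machine-api").List(ctx, metav1.ListOptions{
	LabelSelector: "machine.openshift.io/cluster-api-machine-role=master",
})
o.Expect(err).ToNot(o.HaveOccurred())
o.Expect(machines.Items).ToNot(o.BeEmpty())

targetMachine := machines.Items[0]
o.Expect(targetMachine.Status.NodeRef).ToNot(o.BeNil())

// the same object yields both the machine to delete and the node to stop kubelet on
machineToDelete := targetMachine.Name
etcdTargetNodeName := targetMachine.Status.NodeRef.Name
```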
```go
err = errors.Wrap(err, "scale-down: timed out waiting for APIServer pods to stabilize on the same revision")
o.Expect(err).ToNot(o.HaveOccurred())

// step 5: verify member and machine counts go back down to 3
```
Love those assertions below; maybe extract them into a separate function? Could there be some reuse in other tests?
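One possible shape for that extraction; countMembers and countMasterMachines below stand in for whatever counting helpers the scaling test library already has, and the intervals are made up:

```go
// ensureMemberAndMachineCount waits until both the etcd member count and
// the control-plane machine count settle at the expected value, so any
// scaling test can assert the steady state with a single call.
func ensureMemberAndMachineCount(ctx context.Context, expected int) error {
	return wait.PollUntilContextTimeout(ctx, 30*time.Second, 20*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			members, err := countMembers(ctx) // assumed helper
			if err != nil {
				return false, nil // retry on transient errors
			}
			machines, err := countMasterMachines(ctx) // assumed helper
			if err != nil {
				return false, nil
			}
			return members == expected && machines == expected, nil
		})
}
```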
Job Failure Risk Analysis for sha: da81f2a
/test e2e-aws-ovn-etcd-scaling

/test e2e-aws-ovn-etcd-scaling

/test e2e-azure-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 1f87a9e
Force-pushed 1f87a9e to 4001331
New changes are detected. LGTM label has been removed.
/test e2e-aws-ovn-etcd-scaling
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn, tjungblu

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest-required
Job Failure Risk Analysis for sha: 4001331
… node and when an unhealthy member is present

Signed-off-by: jubittajohn <[email protected]>
Force-pushed 4001331 to f9506e0
/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: f9506e0
/test e2e-aws-ovn-etcd-scaling

/test e2e-azure-ovn-etcd-scaling
Job Failure Risk Analysis for sha: f9506e0
/test e2e-aws-ovn-etcd-scaling
Risk analysis has seen new tests most likely introduced by this PR.

New Test Risks for sha: f9506e0
New tests seen in this PR at sha: f9506e0
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jubittajohn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR adds two vertical scaling test scenarios: one where an etcd member is unhealthy, and one where the kubelet is not running on a node.

The first test validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS is disabled so that the scale-down-first ordering can be observed.

The second test validates that deleting the machine hosting the node where the kubelet is stopped does not get stuck when CPMS is enabled, covering the case in this bug: https://issues.redhat.com/browse/OCPBUGS-17199. CPMS must be active for this scenario.
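As a rough outline of the kubelet-down scenario, the names follow the snippets quoted in the review threads above; the Ginkgo wiring is a sketch, not the exact test code, and ensureMemberAndMachineCount is the hypothetical helper sketched earlier:

```go
g.It("recovers when kubelet is stopped on a member's node", func() {
	// step 1: stop the kubelet on the chosen control-plane node
	err := scalingtestinglibrary.StopKubelet(ctx, oc.AdminKubeClient(), etcdTargetNode)
	o.Expect(err).ToNot(o.HaveOccurred())

	// step 2: delete the backing machine; with CPMS active, the CPMSO
	// should replace it rather than the deletion getting stuck
	err = machineClient.Machines("openshift-machine-api").Delete(ctx, machineToDelete, metav1.DeleteOptions{})
	o.Expect(err).ToNot(o.HaveOccurred())

	// step 3: wait for etcd membership and machine counts to return to 3
	err = ensureMemberAndMachineCount(ctx, 3)
	o.Expect(err).ToNot(o.HaveOccurred())
})
```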