ETCD-674: E2E to vertically scale up and down when kubelet is not running on a node and when an unhealthy member is present #29236
Conversation
/test e2e-aws-ovn-etcd-scaling

Force-pushed 7e1de9a to ee92b18

/test e2e-aws-ovn-etcd-scaling

Force-pushed ee92b18 to 78840ba

/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 78840ba
@jubittajohn: This pull request references ETCD-674 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed 106d577 to 7c1093b

/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 7c1093b
Force-pushed 7c1093b to da81f2a
/test e2e-aws-ovn-etcd-scaling
```go
		return false, nil
	}
	// ...
	return podReadyCondition.Status == corev1.ConditionFalse, nil
```
You have to be a bit careful here: kubelet is the only one updating the pod status, so if you shut it down this condition may never become true. I would just fire and forget this pod and wait for the node to become NotReady.
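For illustration, a minimal sketch of the fire-and-forget idea, polling the node's Ready condition instead of the pod's; the helper name, interval, and timeout below are assumptions, not code from this PR:

```go
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeNotReady (hypothetical helper): kubelet is the only writer
// of pod conditions, so instead of waiting on the pod, poll the node's
// Ready condition until it is no longer True.
func waitForNodeNotReady(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				return false, nil // tolerate transient API errors while polling
			}
			for _, cond := range node.Status.Conditions {
				if cond.Type == corev1.NodeReady {
					// a NotReady node reports False or Unknown here
					return cond.Status != corev1.ConditionTrue, nil
				}
			}
			return false, nil
		})
}
```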
```go
// step 1: stop the kubelet on a node
framework.Logf("Stopping the kubelet on the node %s", etcdTargetNode.Name)
err = scalingtestinglibrary.StopKubelet(ctx, oc.AdminKubeClient(), *etcdTargetNode)
```
Avoid the pointer deref here; just pass the node down into the function and error out if it is nil.
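Sketched against the call above, one shape for that change (the stop logic itself is elided; only the signature and the nil guard are the point):

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// StopKubelet variant that takes the pointer directly and fails fast on
// nil, so callers never have to dereference.
func StopKubelet(ctx context.Context, client kubernetes.Interface, node *corev1.Node) error {
	if node == nil {
		return fmt.Errorf("StopKubelet: target node is nil")
	}
	framework.Logf("Stopping the kubelet on the node %s", node.Name)
	// ... existing stop logic unchanged ...
	return nil
}
```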
```go
// step 2: delete the machine on which kubelet is stopped to trigger the CPMSO to create a new one to replace it
machineToDelete, err := scalingtestinglibrary.NodeNameToMachineName(ctx, kubeClient, machineClient, etcdTargetNode.Name)
err = errors.Wrapf(err, "failed to get the machine name for the NotReady node: %s", etcdTargetNode.Name)
o.Expect(err).ToNot(o.HaveOccurred())
```
I see why you need the helper. How about you choose the machine you want to stop kubelet on first, and then just get the node via the status reference? That should save you a ton of code.
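A sketch of that inversion, assuming machineClient is the typed MachineV1beta1 client and that control-plane machines carry the standard master role label (both are assumptions about the surrounding test code):

```go
// Pick the control-plane machine first; its node then comes straight
// from machine.Status.NodeRef, so the NodeNameToMachineName helper and
// its error handling drop out entirely.
machines, err := machineClient.Machines("openshift-machine-api").List(ctx, metav1.ListOptions{
	LabelSelector: "machine.openshift.io/cluster-api-machine-role=master",
})
o.Expect(err).ToNot(o.HaveOccurred())
o.Expect(machines.Items).ToNot(o.BeEmpty())

targetMachine := machines.Items[0]
o.Expect(targetMachine.Status.NodeRef).ToNot(o.BeNil())

// the same object yields both the machine to delete and the node to stop kubelet on
machineToDelete := targetMachine.Name
etcdTargetNodeName := targetMachine.Status.NodeRef.Name
```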
```go
err = errors.Wrap(err, "scale-down: timed out waiting for APIServer pods to stabilize on the same revision")
o.Expect(err).ToNot(o.HaveOccurred())

// step 5: verify member and machine counts go back down to 3
```
Love those assertions below; maybe extract them into a separate function? Could there be some reuse in other tests?
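One possible shape for that extraction; countMembers and countMasterMachines below stand in for whatever counting helpers the scaling test library already has, and the intervals are made up:

```go
// ensureMemberAndMachineCount waits until both the etcd member count and
// the control-plane machine count settle at the expected value, so any
// scaling test can assert the steady state with a single call.
func ensureMemberAndMachineCount(ctx context.Context, expected int) error {
	return wait.PollUntilContextTimeout(ctx, 30*time.Second, 20*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			members, err := countMembers(ctx) // assumed helper
			if err != nil {
				return false, nil // retry on transient errors
			}
			machines, err := countMasterMachines(ctx) // assumed helper
			if err != nil {
				return false, nil
			}
			return members == expected && machines == expected, nil
		})
}
```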
Job Failure Risk Analysis for sha: da81f2a
/test e2e-aws-ovn-etcd-scaling

/test e2e-aws-ovn-etcd-scaling

/test e2e-azure-ovn-etcd-scaling
Job Failure Risk Analysis for sha: 1f87a9e
Force-pushed 1f87a9e to 4001331
New changes are detected. LGTM label has been removed.
/test e2e-aws-ovn-etcd-scaling
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn, tjungblu

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest-required
Job Failure Risk Analysis for sha: 4001331
… node and when an unhealthy member is present

Signed-off-by: jubittajohn <[email protected]>
Force-pushed 4001331 to f9506e0
/test e2e-aws-ovn-etcd-scaling
Job Failure Risk Analysis for sha: f9506e0
/test e2e-aws-ovn-etcd-scaling

/test e2e-azure-ovn-etcd-scaling
Job Failure Risk Analysis for sha: f9506e0
/test e2e-aws-ovn-etcd-scaling
Risk analysis has seen new tests most likely introduced by this PR.

New Test Risks for sha: f9506e0
New tests seen in this PR at sha: f9506e0
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jubittajohn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR adds two vertical scaling test scenarios: one where an etcd member is unhealthy, and one where the kubelet is not running on a node.

The first test validates that scale-down happens before scale-up when the deleted member is unhealthy. CPMS is disabled so that the scale-down-first ordering can be observed.

The second test validates that deleting the machine hosting the node where the kubelet is stopped does not get stuck when CPMS is enabled, covering the case in this bug: https://issues.redhat.com/browse/OCPBUGS-17199. CPMS must be active for this scenario.
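As a rough outline of the kubelet-down scenario, the names follow the snippets quoted in the review threads above; the Ginkgo wiring is a sketch, not the exact test code, and ensureMemberAndMachineCount is the hypothetical helper sketched earlier:

```go
g.It("recovers when kubelet is stopped on a member's node", func() {
	// step 1: stop the kubelet on the chosen control-plane node
	err := scalingtestinglibrary.StopKubelet(ctx, oc.AdminKubeClient(), etcdTargetNode)
	o.Expect(err).ToNot(o.HaveOccurred())

	// step 2: delete the backing machine; with CPMS active, the CPMSO
	// should replace it rather than the deletion getting stuck
	err = machineClient.Machines("openshift-machine-api").Delete(ctx, machineToDelete, metav1.DeleteOptions{})
	o.Expect(err).ToNot(o.HaveOccurred())

	// step 3: wait for etcd membership and machine counts to return to 3
	err = ensureMemberAndMachineCount(ctx, 3)
	o.Expect(err).ToNot(o.HaveOccurred())
})
```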