Conversation

@smarterclayton (Contributor) commented Dec 8, 2020

In order to better debug failures with the machine health check,
clarify the test debugging logic and simplify the checks in place.

On GCP this exposes https://bugzilla.redhat.com/show_bug.cgi?id=1905709

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2020
@hexfusion (Contributor)

/test e2e-aws-disruptive

/assign @marun

@hexfusion (Contributor)

/test e2e-aws-fips

@hexfusion (Contributor)

LGTM thanks @smarterclayton

ch := make(chan struct{})
go func(node *corev1.Node) {
	defer close(ch)
	if _, err := ssh("sudo -i systemctl poweroff --force --force", node); err != nil {
@marun (Contributor) Dec 9, 2020
err will always be non-nil on successful execution of this command (because the server closes the SSH connection), so I'm not sure there's any value in checking it. In #25707 I've added a new shutdownNode function that calls e2essh.SSH directly so that the output can be checked for an indication that shutdown was initiated: the string "Powering off" appearing in stderr.
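A minimal sketch of the check described above. The helper name is hypothetical; the real change in #25707 uses e2essh.SSH from the Kubernetes e2e framework. The point is that success is signaled by "Powering off" in the captured stderr, not by the error value:

```go
package main

import (
	"fmt"
	"strings"
)

// shutdownInitiated is a hypothetical helper: it reports whether captured
// stderr indicates the node began powering off. The error returned by the
// ssh call is ignored because the server closing the connection makes it
// non-nil even when shutdown succeeded.
func shutdownInitiated(stderr string) bool {
	return strings.Contains(stderr, "Powering off")
}

func main() {
	// Simulated stderr from `systemctl poweroff --force --force` over SSH.
	fmt.Println(shutdownInitiated("Connection closed by remote host.\nPowering off.\n")) // true
	fmt.Println(shutdownInitiated(""))                                                   // false: connection cut before any output arrived
}
```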

@smarterclayton (Contributor, author)
I don't know that there's any guarantee that "Powering off" would be returned on all clouds; it was not on GCP at the time. So I don't think even that check is sufficient.

@marun (Contributor) Dec 9, 2020
How would it not be returned? And why would it be a function of a particular cloud? It's RHCOS returning "Powering off" before initiating termination.

@smarterclayton (Contributor, author)
It's a function of how the load balancers work. GCP often leaves dangling connections, and since the Go SSH client is limited (it can't implement idle timeouts correctly) you can't assume you'll ever receive another packet, so the SSH call hangs forever. Unlike AWS, GCP cuts the proxy connection before you receive the "Powering off" packet.

@marun (Contributor) commented Dec 9, 2020

@smarterclayton Maybe rebase on #25707 to get related cleanup for free?

@marun (Contributor) commented Dec 9, 2020

I imagine you're testing with clusterbot, but in case it's handy I've proposed adding e2e-gcp-disruptive: openshift/release#14191

@smarterclayton (Contributor, author)

This also reproduces bug 1905709 on AWS.

@smarterclayton (Contributor, author)

/retest

@smarterclayton (Contributor, author)

/test e2e-gcp-disruptive

@marun (Contributor) commented Dec 10, 2020

FYI, the quorum restore test is broken due to the addition of a mandatory dependency on the API when taking a backup, intended to ensure that only healthy revisions of static pods are backed up. A fix is pending: openshift/cluster-etcd-operator#509

@smarterclayton (Contributor, author)

/test e2e-gcp-disruptive

@openshift-merge-robot

@smarterclayton: The following tests failed, say /retest to rerun all failed tests:

Test name                     Commit    Rerun command
ci/prow/e2e-aws-disruptive    8dd97ea   /test e2e-aws-disruptive
ci/prow/e2e-gcp-disruptive    b46ada1   /test e2e-gcp-disruptive


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot
@smarterclayton: PR needs rebase.


@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 17, 2020
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 17, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci bot commented May 17, 2021

@smarterclayton: PR needs rebase.


@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 17, 2021
@openshift-ci bot commented May 17, 2021

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close


@openshift-ci openshift-ci bot closed this May 17, 2021