test: Simplify the worker machine health check test for clarity #25750
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from a7c1f9c to 8dd97ea.
/test e2e-aws-disruptive
/assign @marun
/test e2e-aws-fips
LGTM, thanks @smarterclayton
```go
ch := make(chan struct{})
go func(node *corev1.Node) {
	defer close(ch)
	if _, err := ssh("sudo -i systemctl poweroff --force --force", node); err != nil {
		// The ssh connection is usually severed by the poweroff itself.
		framework.Logf("error running poweroff on node %q: %v", node.Name, err)
	}
}(node)
```
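For context, the test presumably waits on `ch` before asserting on node state; a minimal sketch of that wait, with the timeout value invented for illustration:

```go
select {
case <-ch:
	// The poweroff command returned (or its connection dropped).
case <-time.After(5 * time.Minute): // illustrative timeout, not from this PR
	framework.Failf("timed out waiting for poweroff to be issued on node %q", node.Name)
}
```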
err will always be non-nil on successful execution of this command (due to the ssh connection being closed by the server), so I'm not sure there's any value in checking it. In #25707 I've added a new shutdownNode function that calls e2essh.SSH directly so that the output can be checked for an indication that shutdown was initiated: the string `Powering off` appearing in stderr.
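A minimal sketch of that stderr check, assuming the e2essh package from k8s.io/kubernetes/test/e2e/framework/ssh; the actual shutdownNode in #25707 may differ in signature and error handling:

```go
import (
	"fmt"
	"strings"

	e2essh "k8s.io/kubernetes/test/e2e/framework/ssh"
)

// shutdownNode issues a forced poweroff and treats "Powering off" in stderr
// as confirmation that shutdown was initiated. Sketch only; not the merged code.
func shutdownNode(host, provider string) error {
	result, err := e2essh.SSH("sudo -i systemctl poweroff --force --force", host, provider)
	// err alone is not meaningful here: the server closes the connection as
	// it powers off, so even a successful shutdown usually returns an error.
	if strings.Contains(result.Stderr, "Powering off") {
		return nil
	}
	return fmt.Errorf("no indication that shutdown was initiated (ssh error: %v)", err)
}
```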
I don't know that there's even a guarantee that `Powering off` would get returned on all clouds. It was not on GCP at the time. So I think even that isn't sufficient.
How would it not be returned? And why would it be a function of a particular cloud? It's RHCOS that returns `Powering off` before initiating termination.
It's a function of how the load balancers work. GCP often leaves dangling connections, and since the Go SSH library is limited (it can't do idle timeouts correctly), you can't assume you'll ever get another packet (so the SSH call hangs forever). GCP cuts the proxy connection before you get the "Powering off" packet, unlike AWS.
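One way to defend against that hang is a client-side deadline around the SSH call; a hedged sketch, again assuming the e2essh package (this pattern is illustrative, not what the PR does):

```go
import (
	"fmt"
	"time"

	e2essh "k8s.io/kubernetes/test/e2e/framework/ssh"
)

// sshWithTimeout runs cmd over SSH but gives up after timeout, since a
// dangling load-balancer connection can otherwise block forever.
func sshWithTimeout(cmd, host, provider string, timeout time.Duration) (e2essh.Result, error) {
	type outcome struct {
		result e2essh.Result
		err    error
	}
	ch := make(chan outcome, 1)
	go func() {
		r, err := e2essh.SSH(cmd, host, provider)
		ch <- outcome{r, err}
	}()
	select {
	case o := <-ch:
		return o.result, o.err
	case <-time.After(timeout):
		// The goroutine leaks until the connection eventually dies,
		// which is tolerable in a test context.
		return e2essh.Result{}, fmt.Errorf("ssh %q to %s timed out after %s", cmd, host, timeout)
	}
}
```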
@smarterclayton Maybe rebase on #25707 to get related cleanup for free?
I imagine you're testing with clusterbot, but in case it's handy I've proposed adding
In order to better debug failures with the machine health check, clarify the test debugging logic and simplify the checks in place.
Force-pushed from 8dd97ea to b46ada1.
This also reproduces bug 1905709 on AWS.
/retest
/test e2e-gcp-disruptive
FYI: the quorum restore test is broken due to the addition of a mandatory dependency on the API when taking a backup, intended to ensure that only healthy revisions of static pods are backed up. A fix is pending: openshift/cluster-etcd-operator#509
/test e2e-gcp-disruptive
@smarterclayton: The following tests failed, say /retest to rerun all failed tests.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@smarterclayton: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.

/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen.

/close
@openshift-bot: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In order to better debug failures with the machine health check,
clarify the test debugging logic and simplify the checks in place.
On GCP this exposes https://bugzilla.redhat.com/show_bug.cgi?id=1905709