Bug 1886160: Add test of documented backup/restore procedure #25723
Conversation
/test e2e-aws-disruptive
1 similar comment
/test e2e-aws-disruptive
/retest
/test e2e-aws-disruptive
3 similar comments
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/test e2e-aws-disruptive
test/extended/dr/machine_recover.go
Outdated
 for _, target := range targets {
 	framework.Logf("Forcing shutdown of node %s", target.Name)
-	_, err = ssh("sudo -i systemctl poweroff --force --force", target)
+	execOnNodeOrFail(target, "sudo -i systemctl poweroff --force --force")
Retrying this operation (as seen in the logs below) is not correct: because you are shutting down the node, the helper is going to keep retrying, and it shouldn't have to:
Dec 8 06:18:56.307: INFO: Forcing shutdown of node ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:18:56.307: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:18:56.307: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
Dec 8 06:19:15.497: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:15.497: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:15.498: INFO: ssh [email protected]:22: stderr: "Powering off.\n"
Dec 8 06:19:15.498: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:20.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:20.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:30.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:30.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:40.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:40.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:50.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:50.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:00.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:00.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:10.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:10.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:15.899: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:15.899: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: exit code: 0
STEP: Disruption complete; stopping async validations
Yeah, I noticed that. It's not from this test, though, as per the leading comment it's from #25707.
As per a comment on your PR, I've added special handling for shutdown initiation in #25707 to account for the shutdown command always terminating with an error on success due to the connection being closed by the server.
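A minimal sketch of that kind of handling, assuming a hypothetical `sshToNode` helper in place of the framework's SSH wrapper (this is illustrative, not the actual code from #25707):

```go
package dr

import (
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// sshToNode is a hypothetical stand-in for the e2e framework's SSH helper;
// assume it runs cmd on the named node and returns combined output.
func sshToNode(cmd, nodeName string) (string, error) { return "", nil }

// shutdownNode issues a single forced poweroff and does not retry: the node
// drops the SSH connection while powering off, so a transport error is the
// expected success signal rather than a failure to retry.
func shutdownNode(nodeName string) error {
	_, err := sshToNode("sudo -i systemctl poweroff --force --force", nodeName)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, io.EOF) || strings.Contains(err.Error(), "connection"):
		// Assumed heuristic: a dropped connection means the poweroff took
		// effect before the SSH session could exit cleanly.
		log.Printf("node %s dropped the connection; assuming poweroff succeeded", nodeName)
		return nil
	default:
		return fmt.Errorf("poweroff of node %s failed: %w", nodeName, err)
	}
}
```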
I'm going to make some cleanups to the recovery test while I'm debugging the machine API SLO violation. That will conflict with parts of this, but it cleans up the behavior to be more consistent.
/test e2e-aws-disruptive
1 similar comment
/test e2e-aws-disruptive
Skipping machine recovery test to try to get signal out of the other 2.
/test e2e-aws-disruptive
6 similar comments
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/test e2e-aws-disruptive
/retest
/test e2e-aws-disruptive
1 similar comment
/test e2e-aws-disruptive
/test e2e-aws-disruptive
1 similar comment
/test e2e-aws-disruptive
/test e2e-gcp-disruptive
/test e2e-gcp-disruptive
/retest
@marun: This pull request references Bugzilla bug 1886160, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
waitForAPIServer(oc.AdminKubeClient(), recoveryNode)

// Recovery 10,11,12
forceOperandRedeployment(oc.AdminOperatorClient().OperatorV1())
This is disappointing to have to perform. Is there a reason this is not automatic?
I'm just replicating the documented procedure. Is it not in keeping with the current state of the code?
cc: @hexfusion
As per slack conversation, this requirement is due to the current state of the static pod controller. Changing that is not in scope for this PR.
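For readers following along, a hedged sketch of what forcing operand redeployment involves per the documented restore procedure: bumping spec.forceRedeploymentReason on each static pod operator's "cluster" resource. The helper shape is illustrative, not the test's actual implementation:

```go
package dr

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"

	operatorv1client "github.com/openshift/client-go/operator/clientset/versioned/typed/operator/v1"
)

// forceOperandRedeployment mirrors recovery steps 10-12: setting
// spec.forceRedeploymentReason on each operator's "cluster" resource makes
// the static pod controller roll out a fresh revision of its operand.
func forceOperandRedeployment(client operatorv1client.OperatorV1Interface) error {
	reason := fmt.Sprintf("recovery-%s", time.Now().Format(time.RFC3339))
	patch := []byte(fmt.Sprintf(`{"spec":{"forceRedeploymentReason":%q}}`, reason))
	ctx := context.TODO()
	opts := metav1.PatchOptions{}

	if _, err := client.Etcds().Patch(ctx, "cluster", types.MergePatchType, patch, opts); err != nil {
		return err
	}
	if _, err := client.KubeAPIServers().Patch(ctx, "cluster", types.MergePatchType, patch, opts); err != nil {
		return err
	}
	if _, err := client.KubeControllerManagers().Patch(ctx, "cluster", types.MergePatchType, patch, opts); err != nil {
		return err
	}
	_, err := client.KubeSchedulers().Patch(ctx, "cluster", types.MergePatchType, patch, opts)
	return err
}
```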
framework.Logf("Selecting node %q as the recovery host", recoveryNode.Name)

// Recovery 2
g.By("Verifying that all masters are reachable via ssh")
I think 4's description is insufficient in the docs: not only must the machines be stopped, but if they are lost they need to be incapable of being turned back on. It is the administrator's responsibility to ensure the machines do not start once step 4 has been completed.
I would say the docs for 4 should say, instead of:
- Stop the static pods on all other control plane nodes.
be
- Ensure all control plane nodes EXCEPT the recovery host are terminated and cannot be restarted, or have their processes permanently stopped
Failure to ensure that only the recovery host is running instances of the control plane software will result in data corruption and workload disruption. You MUST ensure the non-functional hosts cannot start control plane software by following these instructions.
And probably a discussion of why this is the case (we're ensuring that no older quorum can form).
We might want to simply break the non-recovery hosts into two classes in the description of 4: those that are known functional, and those that are unrecoverable / known faulty. The known-functional hosts can be stopped (this allows us to potentially recover with them if the current host fails). The known-faulty hosts need to be prevented from ever running again.
/approve
hexfusion
left a comment
A few minor things to consider; overall it looks great, very clear and concise. Un-hold if no changes are proposed.
/lgtm
/hold
| g.By(fmt.Sprintf("Waiting for etcd static pod to exit on node %q", master.Name)) | ||
| // Look for 'etcd ' (with trailing space) to be missing to | ||
| // differentiate from pods like etcd-operator. | ||
| sudoExecOnNodeOrFail(master, "crictl ps | grep 'etcd ' | wc -l | grep -q 0") |
Nit: technically we are waiting for the etcd process to exit; other processes could be running in the pod.
Done
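A sketch of what such a wait can look like, polling until crictl no longer reports a container named exactly "etcd" (runOnNode is a hypothetical stand-in for the test's sudoExecOnNodeOrFail-style SSH helper):

```go
package dr

import (
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// runOnNode is a hypothetical stand-in for the test's SSH helper;
// assume it runs cmd on the named node and returns stdout.
func runOnNode(nodeName, cmd string) (string, error) { return "", nil }

// waitForEtcdContainerExit polls until crictl reports no running container
// named exactly "etcd"; the --name filter is a regex, so anchoring with
// ^etcd$ avoids matching containers like etcd-operator.
func waitForEtcdContainerExit(nodeName string) error {
	return wait.Poll(10*time.Second, 5*time.Minute, func() (bool, error) {
		out, err := runOnNode(nodeName, "crictl ps -q --name '^etcd$'")
		if err != nil {
			return false, nil // the node may be briefly unreachable; keep polling
		}
		return strings.TrimSpace(out) == "", nil
	})
}
```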
test/extended/dr/backup_restore.go
Outdated
if master.Name == recoveryNode.Name {
	continue
}
g.By(fmt.Sprintf("Waiting for kube-apiserver static pod to exit on node %q", master.Name))
same
Done
// kube-apiserver, kube-controller-manager, kube-scheduler
// operands. This is a necessary part of restoring a cluster from
// backup.
Minor: I think this was meant to be dropped, as you have it with the func below.
Updated to reflect intention of documenting file-level purpose.
| g.By(fmt.Sprintf("Verifying that the etcd container is running on recovery node %q", node.Name)) | ||
| // Look for 'etcd ' (with trailing space) to differentiate from pods | ||
| // like etcd-operator. | ||
| sudoExecOnNodeOrFail(node, "crictl ps | grep -q 'etcd '") |
Minor: you can do this directly with the client:
crictl ps -q --name ^etcd$
nvm, you're looking for non-zero output.
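In case it helps, a short sketch of the client-side check being suggested, where non-empty output from the anchored --name filter means an etcd container is running (this assumes the runOnNode stub and imports from the earlier sketch, plus fmt and strings):

```go
// verifyEtcdRunning is the inverse check used on the recovery node:
// "crictl ps -q --name '^etcd$'" prints a container ID only when a container
// named exactly "etcd" is running, so non-empty output means success.
func verifyEtcdRunning(nodeName string) error {
	out, err := runOnNode(nodeName, "crictl ps -q --name '^etcd$'")
	if err != nil {
		return err
	}
	if strings.TrimSpace(out) == "" {
		return fmt.Errorf("no running etcd container on node %s", nodeName)
	}
	return nil
}
```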
Well, you have a merge conflict now anyway. /lgtm cancel
/retest
/retest
/test e2e-aws-disruptive
very nice sir, /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hexfusion, marun, smarterclayton
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
/retest
/retest
Please review the full test history for this PR and help us cut down flakes.
@marun: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged:
These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh. Bugzilla bug 1886160 has not been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.7
@marun: new pull request created: #26110
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.