Bug 1886160: Add test of documented backup/restore procedure #25723

marun · 2020-12-01T06:50:49Z

marun · 2020-12-05T01:16:22Z

/test e2e-aws-disruptive

marun · 2020-12-07T23:19:45Z

/test e2e-aws-disruptive

marun · 2020-12-08T01:28:37Z

/retest

marun · 2020-12-08T02:42:05Z

/test e2e-aws-disruptive

marun · 2020-12-08T03:36:44Z

/test e2e-aws-disruptive

marun · 2020-12-08T03:54:21Z

/test e2e-aws-disruptive

marun · 2020-12-08T16:56:07Z

/test e2e-aws-disruptive

smarterclayton · 2020-12-08T20:26:48Z

test/extended/dr/machine_recover.go

 				for _, target := range targets {
 					framework.Logf("Forcing shutdown of node %s", target.Name)
-					_, err = ssh("sudo -i systemctl poweroff --force --force", target)
+					execOnNodeOrFail(target, "sudo -i systemctl poweroff --force --force")


Retrying this operation (as seen from the logs) is not correct - because you are shutting down the node it's going to keep retrying, and it shouldn't have to:

Dec 8 06:18:56.307: INFO: Forcing shutdown of node ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:18:56.307: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:18:56.307: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
Dec 8 06:19:15.497: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:15.497: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:15.498: INFO: ssh [email protected]:22: stderr: "Powering off.\n"
Dec 8 06:19:15.498: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:20.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:20.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:25.894: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:30.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:30.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:35.813: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:40.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:40.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:45.940: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:19:50.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:19:50.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:19:55.922: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:00.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:00.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:05.852: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:10.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:10.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:15.899: INFO: ssh [email protected]:22: exit code: 0
Dec 8 06:20:15.899: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec 8 06:20:15.899: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: command: sudo -i systemctl poweroff --force --force
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: stdout: ""
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: stderr: ""
Dec 8 06:20:21.216: INFO: ssh [email protected]:22: exit code: 0
STEP: Disruption complete; stopping async validations

Yeah, I noticed that. It's not from this test, though; as per the leading comment, it's from #25707.

As per a comment on your PR, I've added special handling for shutdown initiation in #25707 to account for the shutdown command always terminating with an error on success due to the connection being closed by the server.

smarterclayton · 2020-12-08T21:17:18Z

I'm going to make some cleanups to the recovery test while I'm debugging the machine api SLO violation. That will conflict with parts of this, but cleans up the behavior to be more consistent.

marun · 2020-12-09T03:12:13Z

/test e2e-aws-disruptive

marun · 2020-12-09T06:16:52Z

/test e2e-aws-disruptive

marun · 2020-12-09T06:17:11Z

Skipping machine recovery test to try to get signal out of the other 2.

marun · 2020-12-09T17:31:59Z

/test e2e-aws-disruptive

marun · 2020-12-09T17:35:55Z

/test e2e-aws-disruptive

marun · 2020-12-09T21:33:37Z

/test e2e-aws-disruptive

marun · 2020-12-10T03:42:59Z

/test e2e-aws-disruptive

marun · 2020-12-10T16:06:02Z

/test e2e-aws-disruptive

marun · 2020-12-10T17:49:10Z

/test e2e-aws-disruptive

marun · 2020-12-10T21:43:31Z

/test e2e-aws-disruptive

marun · 2020-12-10T22:14:06Z

/retest

marun · 2020-12-10T23:13:27Z

/test e2e-aws-disruptive

marun · 2020-12-11T00:54:19Z

/test e2e-aws-disruptive

marun · 2020-12-11T04:09:58Z

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

marun · 2020-12-11T14:31:13Z

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

marun · 2020-12-11T15:27:34Z

/test e2e-gcp-disruptive

marun · 2020-12-11T18:54:14Z

/test e2e-gcp-disruptive
/test e2e-aws-disruptive

marun · 2020-12-13T19:35:21Z

/retest

openshift-ci-robot · 2021-02-01T16:26:15Z

@marun: This pull request references Bugzilla bug 1886160, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.7.0) matches configured target release for branch (4.7.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Details

In response to this:

Bug 1886160: Add test of documented backup/restore procedure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

smarterclayton · 2021-02-01T18:26:01Z

test/extended/dr/backup_restore.go

+			waitForAPIServer(oc.AdminKubeClient(), recoveryNode)
+
+			// Recovery 10,11,12
+			forceOperandRedeployment(oc.AdminOperatorClient().OperatorV1())


This is disappointing to have to perform. Is there a reason this is not automatic?

I'm just replicating the documented procedure. Is it not in keeping with the current state of the code?

cc: @hexfusion

As per a Slack conversation, this requirement is due to the current state of the static pod controller. Changing that is not in scope for this PR.

smarterclayton · 2021-02-01T18:33:34Z

test/extended/dr/backup_restore.go

+		framework.Logf("Selecting node %q as the recovery host", recoveryNode.Name)
+
+		// Recovery 2
+		g.By("Verifying that all masters are reachable via ssh")


I think 4's description is insufficient in the docs - not only must the machines be stopped, but if they are lost they need to be incapable of being turned back on. It is the administrator's responsibility to ensure the machines do not start once step 4 has been completed.

I would say the docs for 4, instead of:

Stop the static pods on all other control plane nodes.

should say:

Ensure all control plane nodes EXCEPT the recovery host are terminated and cannot be restarted, or have their processes permanently stopped
Failure to ensure that only the recovery host is running instances of the control plane software will result in data corruption and workload disruption. You MUST ensure the non-functional hosts cannot start control plane software by following these instructions.

And probably a discussion of why this is the case (we're ensuring that no older quorum can form).

We might want to simply break the non-recovery hosts into two classes in the description of 4: those that are known functional, and those that are unrecoverable / known faulty. The known functional can be stopped (which allows us to potentially recover with them if the current host fails). The known faulty must be prevented from running again.
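The two-class split proposed above can be sketched as a small decision function. This is only an illustration of the suggested doc wording; the type names, action strings, and host names are hypothetical and do not come from the test code or the docs.

```go
package main

import "fmt"

// HostState is a hypothetical classification of a control plane host.
type HostState int

const (
	Functional HostState = iota // known functional: may be stopped and kept as a fallback
	Faulty                      // unrecoverable or known faulty
)

// Action is the step-4 handling proposed for a host.
type Action string

const (
	UseForRecovery Action = "keep running; restore is performed here"
	StopStaticPods Action = "stop control plane static pods"
	PreventRestart Action = "terminate and ensure it cannot be restarted"
)

// planStep4 maps every host to the action suggested in the review comment.
// Only the recovery host may keep running control plane software, so that
// no older quorum can form.
func planStep4(recoveryHost string, hosts map[string]HostState) map[string]Action {
	plan := make(map[string]Action, len(hosts))
	for name, state := range hosts {
		switch {
		case name == recoveryHost:
			plan[name] = UseForRecovery
		case state == Faulty:
			plan[name] = PreventRestart
		default:
			plan[name] = StopStaticPods
		}
	}
	return plan
}

func main() {
	plan := planStep4("master-0", map[string]HostState{
		"master-0": Functional,
		"master-1": Functional,
		"master-2": Faulty,
	})
	for _, name := range []string{"master-0", "master-1", "master-2"} {
		fmt.Printf("%s: %s\n", name, plan[name])
	}
}
```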

smarterclayton · 2021-02-05T20:01:44Z

/approve

hexfusion

A few minor things to consider, but overall this looks great: very clear and concise. Unhold if no changes are proposed.

/lgtm
/hold

hexfusion · 2021-02-05T21:01:26Z

test/extended/dr/backup_restore.go

+				g.By(fmt.Sprintf("Waiting for etcd static pod to exit on node %q", master.Name))
+				// Look for 'etcd ' (with trailing space) to be missing to
+				// differentiate from pods like etcd-operator.
+				sudoExecOnNodeOrFail(master, "crictl ps | grep 'etcd ' | wc -l | grep -q 0")


nit: technically we are waiting for the etcd process to exit. Other processes could be running in the pod.

hexfusion · 2021-02-05T21:01:55Z

test/extended/dr/backup_restore.go

+				if master.Name == recoveryNode.Name {
+					continue
+				}
+				g.By(fmt.Sprintf("Waiting for kube-apiserver static pod to exit on node %q", master.Name))


hexfusion · 2021-02-05T21:06:39Z

test/extended/dr/force_redeploy.go

+// kube-apiserver, kube-controller-manager, kube-scheduler
+// operands. This is a necessary part of restoring a cluster from
+// backup.
+


minor: I think this was meant to be dropped as you have it with the func below

Updated to reflect intention of documenting file-level purpose.

hexfusion · 2021-02-05T21:41:16Z

test/extended/dr/backup_restore.go

+	g.By(fmt.Sprintf("Verifying that the etcd container is running on recovery node %q", node.Name))
+	// Look for 'etcd ' (with trailing space) to differentiate from pods
+	// like etcd-operator.
+	sudoExecOnNodeOrFail(node, "crictl ps | grep -q 'etcd '")


minor: you can do this directly with the client

crictl ps -q --name ^etcd$

nvm, you're looking for non-zero
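As a rough sketch of the suggestion above: `crictl ps -q --name <regex>` prints only the IDs of matching running containers, so an anchored name filter avoids the `grep 'etcd '` trailing-space trick, and "running" reduces to "output is non-empty". The helper names below are hypothetical and are not part of the test code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// crictlPsByName builds a crictl command that lists only the IDs of running
// containers whose name matches exactly. The regex is anchored so that
// "etcd" does not also match names such as "etcd-operator".
func crictlPsByName(name string) string {
	return fmt.Sprintf("crictl ps -q --name '^%s$'", regexp.QuoteMeta(name))
}

// runningContainerIDs splits the quiet-mode output into container IDs.
// An empty result means no matching container is running.
func runningContainerIDs(output string) []string {
	return strings.Fields(output)
}

func main() {
	fmt.Println(crictlPsByName("etcd"))
	// Illustrative outputs only; the ID is made up.
	fmt.Println(len(runningContainerIDs("3f2a9c1b7d\n")) > 0) // container is running
	fmt.Println(len(runningContainerIDs("")) == 0)            // container has exited
}
```

The same parsed result serves both checks in the test: non-empty on the recovery node, empty on the other masters once the static pod has been stopped.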

hexfusion · 2021-02-05T21:53:55Z

well, you have a merge conflict now anyways.

/lgtm cancel

smarterclayton · 2021-02-06T23:13:50Z

/retest
/test e2e-aws-serial

smarterclayton · 2021-02-07T18:45:22Z

/retest

hexfusion · 2021-02-08T16:42:41Z

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

hexfusion · 2021-02-08T19:37:22Z

very nice sir,

/lgtm

openshift-ci-robot · 2021-02-08T19:37:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, marun, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hexfusion · 2021-02-08T19:38:21Z

/hold cancel

marun · 2021-02-09T00:27:03Z

/retest

openshift-bot · 2021-02-09T00:53:41Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2021-02-09T02:27:41Z

@marun: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

openshift/origin#25774 is open

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1886160 has not been moved to the MODIFIED state.


marun · 2021-04-28T02:10:17Z

/cherry-pick release-4.7

openshift-cherrypick-robot · 2021-04-28T02:13:11Z

@marun: new pull request created: #26110


openshift-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 1, 2020

openshift-ci-robot requested review from deads2k and smarterclayton December 1, 2020 06:53

smarterclayton reviewed Dec 8, 2020


smarterclayton reviewed Feb 1, 2021


openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Feb 5, 2021

hexfusion reviewed Feb 5, 2021


openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 5, 2021

openshift-ci-robot assigned hexfusion Feb 5, 2021

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Feb 5, 2021

openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Feb 5, 2021

marun added 3 commits February 5, 2021 14:31

dr: Add test of documented backup and restore procedure

b953586

dr: Skip quorum restore test pending the fix in #25774

b929ae1

Update test annotations

1fd327e

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Feb 8, 2021

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 8, 2021

openshift-merge-robot merged commit 9ccef40 into openshift:master Feb 9, 2021

openshift-cherrypick-robot mentioned this pull request Apr 28, 2021

[release-4.7] Bug 1947705: Add test of documented backup/restore procedure #26110

Merged

marun mentioned this pull request Jul 29, 2021

Bug 1988176: [release-4.6] Add test of documented backup/restore procedure #26356

Merged
