Skip to content

Conversation

@marun
Copy link
Contributor

@marun marun commented Dec 1, 2020

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2020
@marun
Copy link
Contributor Author

marun commented Dec 5, 2020

/test e2e-aws-disruptive

1 similar comment
@marun
Copy link
Contributor Author

marun commented Dec 7, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 8, 2020

/retest

@marun
Copy link
Contributor Author

marun commented Dec 8, 2020

/test e2e-aws-disruptive

3 similar comments
@marun
Copy link
Contributor Author

marun commented Dec 8, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 8, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 8, 2020

/test e2e-aws-disruptive

for _, target := range targets {
framework.Logf("Forcing shutdown of node %s", target.Name)
_, err = ssh("sudo -i systemctl poweroff --force --force", target)
execOnNodeOrFail(target, "sudo -i systemctl poweroff --force --force")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying this operation (as seen from the logs) is not correct - because you are shutting down the node it's going to keep retrying, and it shouldn't have to:

Dec  8 06:18:56.307: INFO: Forcing shutdown of node ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:18:56.307: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:18:56.307: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
Dec  8 06:19:15.497: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:19:15.497: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:19:15.498: INFO: ssh [email protected]:22: stderr:    "Powering off.\n"
Dec  8 06:19:15.498: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:19:20.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:19:20.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:19:25.894: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:19:25.894: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:19:25.894: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:19:25.894: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:19:30.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:19:30.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:19:35.813: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:19:35.813: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:19:35.813: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:19:35.813: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:19:40.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:19:40.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:19:45.940: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:19:45.940: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:19:45.940: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:19:45.940: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:19:50.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:19:50.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:19:55.922: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:19:55.922: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:19:55.922: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:19:55.922: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:20:00.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:20:00.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:20:05.852: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:20:05.852: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:20:05.852: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:20:05.852: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:20:10.498: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:20:10.498: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:20:15.899: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:20:15.899: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:20:15.899: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:20:15.899: INFO: ssh [email protected]:22: exit code: 0
Dec  8 06:20:15.899: INFO: Getting external IP address for ip-10-0-155-158.us-west-1.compute.internal
Dec  8 06:20:15.899: INFO: SSH "sudo -i systemctl poweroff --force --force" on ip-10-0-155-158.us-west-1.compute.internal(10.0.155.158:22)
error dialing core@ae845ee70baa5457cb71e56356b4f460-523924702.us-west-1.elb.amazonaws.com:22: 'ssh: handshake failed: EOF', retrying
Dec  8 06:20:21.216: INFO: ssh [email protected]:22: command:   sudo -i systemctl poweroff --force --force
Dec  8 06:20:21.216: INFO: ssh [email protected]:22: stdout:    ""
Dec  8 06:20:21.216: INFO: ssh [email protected]:22: stderr:    ""
Dec  8 06:20:21.216: INFO: ssh [email protected]:22: exit code: 0
STEP: Disruption complete; stopping async validations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I noticed that. It's not from this test, though, as per the leading comment it's from #25707.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per a comment on your PR, I've added special handling for shutdown initiation in #25707 to account for the shutdown command always terminating with an error on success due to the connection being closed by the server.

@smarterclayton
Copy link
Contributor

I'm going to make some cleanups to the recovery test while I'm debugging the machine api SLO violation. That will conflict with parts of this, but cleans up the behavior to be more consistent.

@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

/test e2e-aws-disruptive

1 similar comment
@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

Skipping machine recovery test to try to get signal out of the other 2.

@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

/test e2e-aws-disruptive

6 similar comments
@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 9, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/retest

@marun
Copy link
Contributor Author

marun commented Dec 10, 2020

/test e2e-aws-disruptive

1 similar comment
@marun
Copy link
Contributor Author

marun commented Dec 11, 2020

/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 11, 2020

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

1 similar comment
@marun
Copy link
Contributor Author

marun commented Dec 11, 2020

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 11, 2020

/test e2e-gcp-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 11, 2020

/test e2e-gcp-disruptive
/test e2e-aws-disruptive

@marun
Copy link
Contributor Author

marun commented Dec 13, 2020

/retest

@openshift-ci-robot
Copy link

@marun: This pull request references Bugzilla bug 1886160, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1886160: Add test of documented backup/restore procedure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

waitForAPIServer(oc.AdminKubeClient(), recoveryNode)

// Recovery 10,11,12
forceOperandRedeployment(oc.AdminOperatorClient().OperatorV1())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is disappointing to have to perform. Is there a reason this is not automatic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just replicating the documented procedure. Is it not in keeping with the current state of the code?

cc: @hexfusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per slack conversation, this requirement is due to the current state of the static pod controller. Changing that is not in scope for this PR.

framework.Logf("Selecting node %q as the recovery host", recoveryNode.Name)

// Recovery 2
g.By("Verifying that all masters are reachable via ssh")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 4's description is insufficient in the docs - not only must the machines be stopped, but if they are lost they need to be incapable of being turned back on. It is the administrators responsibility to ensure the machines do not start once step 4 has been completed.

I would say the docs for 4 should say, instead of :

  1. Stop the static pods on all other control plane nodes.

be

  1. Ensure all control plane nodes EXCEPT the recovery host are terminated and cannot be restarted, or have their processes permanently stopped
    Failure to ensure that only the recovery host is running instances of the control plane software will result in data corruption and workload disruption. You MUST ensure the non-functional hosts cannot start control plane software by following these instructions.

And probably a discussion of why this is the case (we're ensuring that no older quorum can form).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to simply break the non-recovery hosts into two classes in the description of 4: those that are known functional, and those that are unrecoverable / known faulty. The known functional can be stopped (allows us to potentially recover with them if the current host fails). The known faulty need to be prevented from becoming running again.

@smarterclayton
Copy link
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2021
Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few minor things to consider overall looks great very clear and concise. un hold if no changes are proposed.

/lgtm
/hold

g.By(fmt.Sprintf("Waiting for etcd static pod to exit on node %q", master.Name))
// Look for 'etcd ' (with trailing space) to be missing to
// differentiate from pods like etcd-operator.
sudoExecOnNodeOrFail(master, "crictl ps | grep 'etcd ' | wc -l | grep -q 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: technically we are waiting for the etcd process to exit. other processes could be running in the pod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if master.Name == recoveryNode.Name {
continue
}
g.By(fmt.Sprintf("Waiting for kube-apiserver static pod to exit on node %q", master.Name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// kube-apiserver, kube-controller-manager, kube-scheduler
// operands. This is a necessary part of restoring a cluster from
// backup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I think this was meant to be dropped as you have it with the func below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to reflect intention of documenting file-level purpose.

g.By(fmt.Sprintf("Verifying that the etcd container is running on recovery node %q", node.Name))
// Look for 'etcd ' (with trailing space) to differentiate from pods
// like etcd-operator.
sudoExecOnNodeOrFail(node, "crictl ps | grep -q 'etcd '")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor you can do this directly with the client

cricrl ps -q --name ^etcd$

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nmd your looking for non zero

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2021
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2021
@hexfusion
Copy link
Contributor

hexfusion commented Feb 5, 2021

well, you have a merge conflict now anyways.

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2021
@smarterclayton
Copy link
Contributor

/retest
/test e2e-aws-serial

@smarterclayton
Copy link
Contributor

/retest

@hexfusion
Copy link
Contributor

/test e2e-aws-disruptive
/test e2e-gcp-disruptive

@hexfusion
Copy link
Contributor

very nice sir,

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, marun, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Copy link
Contributor

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2021
@marun
Copy link
Contributor Author

marun commented Feb 9, 2021

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 9ccef40 into openshift:master Feb 9, 2021
@openshift-ci-robot
Copy link

@marun: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1886160 has not been moved to the MODIFIED state.

Details

In response to this:

Bug 1886160: Add test of documented backup/restore procedure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marun
Copy link
Contributor Author

marun commented Apr 28, 2021

/cherry-pick release-4.7

@openshift-cherrypick-robot

@marun: new pull request created: #26110

Details

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants