@hexfusion
Contributor

@hexfusion hexfusion commented May 7, 2019

Currently, etcd has two CAs: EtcdCA and EtcdSigner. This PR removes the deprecated EtcdCA and promotes EtcdSigner to be the sole signer for etcd server TLS assets. To maintain backward compatibility, we will keep the old etcd-client Secret naming for the ApiServer to consume.

This is important because we currently do not store the etcd CA key on the cluster, which makes disaster recovery very complicated. With the key on the cluster, etcd server certs can be regenerated.
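
To illustrate why having the signer key available matters, here is a minimal Go sketch, not the installer's actual code. It assumes a throwaway self-signed signer; the helper names (`newCA`, `issueServerCert`, `verifiesAgainst`) and the host name `etcd-0.example.test` are hypothetical. As long as the signer's certificate and key are both at hand, a replacement server cert can be issued at any time and it still chains to the same CA.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed signer certificate and its key.
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-etcd-signer"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// issueServerCert signs a fresh server certificate for dnsName with the CA key.
// Without caKey this step is impossible, which is the disaster-recovery problem.
func issueServerCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, dnsName string, serial int64) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: dnsName},
		DNSNames:     []string{dnsName},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verifiesAgainst reports whether cert chains to ca and is valid for dnsName.
func verifiesAgainst(cert, ca *x509.Certificate, dnsName string) bool {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{Roots: roots, DNSName: dnsName})
	return err == nil
}

func main() {
	const host = "etcd-0.example.test" // hypothetical host name
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	// Issue a cert, then "regenerate" it later with the same signer.
	first, _, err := issueServerCert(ca, caKey, host, 2)
	if err != nil {
		panic(err)
	}
	regenerated, _, err := issueServerCert(ca, caKey, host, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("original verifies:", verifiesAgainst(first, ca, host))
	fmt.Println("regenerated verifies:", verifiesAgainst(regenerated, ca, host))
}
```

Clients that trust the signer's certificate accept both the original and the regenerated cert, so nothing on the client side has to change.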

/cc @deads2k @wking @abhinavdahiya

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 7, 2019
@hexfusion hexfusion force-pushed the remove_etcd_ca branch 2 times, most recently from ee10d76 to 300857d Compare May 7, 2019 17:47
@richm

richm commented May 7, 2019

@eparis eparis changed the title *: remove deprecated EtcdCA and promote EtcdSigner Bug 1707573: *: remove deprecated EtcdCA and promote EtcdSigner May 7, 2019
@wking
Member

wking commented May 7, 2019

From the bootstrap gather:

$ wget https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1720/pull-ci-openshift-installer-master-e2e-aws/5706/artifacts/e2e-aws/installer/bootstrap-logs.tar
$ tar xf bootstrap-logs.tar
$ tail -n3 control-plane/ip-10-0-137-34.ec2.internal/containers/etcd-member.log
2019-05-07 20:05:04.609226 I | embed: rejected connection from "10.0.137.34:53114" (error "remote error: tls: bad certificate", ServerName "etcd-0.ci-op-l9xy353n-1d3f3.origin-ci-int-aws.dev.rhcloud.com")
2019-05-07 20:07:11.852718 I | mvcc: store.index: compact 10800
2019-05-07 20:07:11.858636 I | mvcc: finished scheduled compaction at 10800 (took 4.682041ms)

So something seems broken here.

@hexfusion
Contributor Author

hexfusion commented May 7, 2019

> So something seems broken here.

@abhinavdahiya found that we were base64-encoding the ConfigMap data:

--- a/pkg/asset/manifests/operators.go
+++ b/pkg/asset/manifests/operators.go
@@ -164,7 +164,7 @@ func (m *Manifests) generateBootKubeManifests(dependencies asset.Parents) []*ass
 
        templateData := &bootkubeTemplateData{
                CVOClusterID:               clusterID.UUID,
-               EtcdCaBundle:               base64.StdEncoding.EncodeToString(etcdCABundle.Cert()),
+               EtcdCaBundle:               string(etcdCABundle.Cert()),

@abhinavdahiya
Contributor

/lgtm

This should be okay to merge if it goes green. :)

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 7, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, hexfusion

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2019
@hexfusion
Contributor Author

fail [github.com/openshift/origin/test/extended/machines/machines.go:98]: May 7 21:09:24.210: Machine resources missing for nodes: host-10-0-0-9

/test e2e-openstack

@abhinavdahiya
Contributor

e2e-aws: known flake :(

Failing tests:

[Feature:Builds][Conformance] oc new-app  should succeed with a --name of 58 characters [Suite:openshift/conformance/parallel/minimal]

/retest

@abhinavdahiya
Contributor

e2e-aws-upgrade is failing because the load balancer is not getting set up:

May  7 21:16:00.443: INFO: Got error testing for reachability of http://ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com:80/echo?msg=hello: Get http://ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com:80/echo?msg=hello: dial tcp: lookup ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com on 10.142.15.200:53: no such host
May  7 21:16:02.443: INFO: Got error testing for reachability of http://ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com:80/echo?msg=hello: Get http://ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com:80/echo?msg=hello: dial tcp: lookup ae83e7799710c11e9bc730aa36987d0d-113645526.us-east-1.elb.amazonaws.com on 10.142.15.200:53: no such host
STEP: continuously hitting the pod through the service's LoadBalancer

/retest

@hexfusion hexfusion deleted the remove_etcd_ca branch May 7, 2019 23:28
6 participants