Conversation

@sosiouxme

No description provided.
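The PR itself carries no description, but promotion PRs against cincinnati-graph-data generally just append the new release to the relevant channel file. A minimal sketch of what the diff here presumably looks like, assuming the usual channels/candidate-4.4.yaml layout and showing only the versions discussed below:

$ cat channels/candidate-4.4.yaml
name: candidate-4.4
versions:
# ... earlier 4.3.z releases and 4.4 candidates elided ...
- 4.3.5
- 4.4.0-rc.0
- 4.4.0-rc.1  # the release this PR proposes to add (assumed from the discussion below)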

@wking commented Mar 13, 2020

No need for a new 4.3 release, because we already have 4.3.5 in this channel, and:

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.4.0-rc.1-x86_64 | grep Upgrades
  Upgrades: 4.3.5, 4.4.0-rc.0

The single AWS rc.0 -> rc.1 update passed. Looking at the 4.3.5 -> rc.1 jobs, three AWS jobs passed, one failed on setup (throttling), and one had brief (<2m) unreachable-during-disruption issues. None of those are candidate-promotion blockers.

I'll launch some CI jobs on GCP and Azure...
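For the record, the releases already being served in candidate-4.4 can be double-checked against the policy engine directly; a sketch, assuming the public api.openshift.com graph endpoint and jq on the path:

$ curl -sH 'Accept: application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.4' | jq -r '.nodes[].version' | sort -V

The .edges array in the same response holds index pairs into .nodes, so it also shows which update paths are currently being served.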

@wking commented Mar 14, 2020

New results:

  • 4.3.5 -> 4.4.0-rc.1 GCP failed with some unreachable-during-disruption issues, all less than 4m, which we don't consider edge-blocking.
  • 4.3.5 -> 4.4.0-rc.1 GCP failed with Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-rc.1: 12% complete. Needs more investigation.
  • 4.3.5 -> 4.4.0-rc.1 Azure failed with failed to acquire lease: status 503 Service Unavailable, which is pre-update, so no impact on edge stability.
  • 4.3.5 -> 4.4.0-rc.1 Azure failed with Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-rc.1: 18% complete. Needs more investigation.
  • 4.4.0-rc.0 -> 4.4.0-rc.1 GCP failed with failed to acquire lease: status 503 Service Unavailable.
  • 4.4.0-rc.0 -> 4.4.0-rc.1 Azure had some unreachable during disruption, all less than 2m, and was counted as a success.

Launching replacements for the two Boskos 503s...

@wking commented Mar 14, 2020

The 4.3.5 -> 4.4.0-rc.1 GCP timeout job has:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/443/artifacts/e2e-gcp-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f1ffcdcbd684afd61ff1874b47e1c61a0f7adab93b7a21123a1e29b041d3dabf/namespaces/openshift-cluster-version/pods/cluster-version-operator-5fffc549d9-shbf9/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'Running sync.*in state\|Result of work' | tail -n4
2020-03-13T22:46:45.34504345Z I0313 22:46:45.345010       1 task_graph.go:596] Result of work: [Cluster operator etcd is reporting a failure: EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists]
2020-03-13T22:49:55.347779833Z I0313 22:49:55.347709       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:6fa3e6520d6668737d29a68ef7d7189642b07dba9b17511316210f336e9492b0 (force=true) on generation 2 in state Updating at attempt 8
2020-03-13T22:55:40.40044101Z I0313 22:55:40.399557       1 task_graph.go:596] Result of work: [Cluster operator etcd is reporting a failure: EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists]
2020-03-13T22:58:57.58632797Z I0313 22:58:57.586215       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:6fa3e6520d6668737d29a68ef7d7189642b07dba9b17511316210f336e9492b0 (force=true) on generation 2 in state Updating at attempt 9

The Azure job has:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/92/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-69d0187ff513f67a3dbe37c843e318a59ff27699f3c38dcd2d79df13bc176def/namespaces/openshift-cluster-version/pods/cluster-version-operator-5fffc549d9-kmq5k/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'Running sync.*in state\|Result of work' | tail -n3
2020-03-13T23:01:42.3198065Z I0313 23:01:42.319768       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 346:
2020-03-13T23:04:40.0548178Z I0313 23:04:40.054748       1 sync_worker.go:471] Running sync registry.svc.ci.openshift.org/ocp/release@sha256:6fa3e6520d6668737d29a68ef7d7189642b07dba9b17511316210f336e9492b0 (force=true) on generation 2 in state Updating at attempt 9
2020-03-13T23:10:25.1064177Z I0313 23:10:25.106409       1 task_graph.go:596] Result of work: [Cluster operator kube-controller-manager is reporting a failure: NodeInstallerDegraded: 1 nodes are failing on revision 393:

I'll hunt around for 4.3 -> 4.4 bugs mentioning EtcdMemberIPMigratorDegraded from etcd or NodeInstallerDegraded from kube-controller-manager.
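For anyone reproducing this, the Degraded conditions the CVO keeps retrying over can be read straight off the ClusterOperators; a sketch, assuming access to a live cluster (the same YAML also sits under cluster-scoped-resources/config.openshift.io/clusteroperators/ in the must-gather):

$ oc get clusteroperators etcd kube-controller-manager
$ oc get clusteroperator etcd -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'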

@wking commented Mar 14, 2020

EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists is rhbz#1811706. I've asked the etcd team for an impact statement.

@wking commented Mar 14, 2020

Created rhbz#1813512 for the NodeInstallerDegraded.

@wking commented Mar 14, 2020

4.3.5 -> 4.4.0-rc.1 Azure died in setup with ReferencedResourceNotProvisioned, rhbz#1813513. I've launched a replacement.

@wking commented Mar 14, 2020

4.4.0-rc.0 -> 4.4.0-rc.1 GCP had some unreachable during disruption, all less than 2m, and was counted as a success.

@wking commented Mar 14, 2020

4.3.5 -> 4.4.0-rc.1 Azure failed with Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.4.0-rc.1: 12% complete, hitting the EtcdMemberIPMigratorDegraded: etcdserver: Peer URLs already exists issue from rhbz#1811706. We may want to get that triaged before we add rc.1 to candidate-4.4. Or should we block the 4.3 -> 4.4.0-rc.1 edges instead?

@wking commented Mar 16, 2020

My EtcdMemberIPMigratorDegraded bug was closed as a dup of rhbz#1812584, which is VERIFIED today. So I'd guess the next RC will have the fix. I'm agnostic about whether we pull 4.3 -> 4.4 edges for the current 4.4 RCs from candidate-4.4 or not.

@wking commented Mar 17, 2020

I haven't mentioned the NodeInstallerDegraded job in the past few comments, but preliminary noises from @tnozicka make it sound like a potential upgrade blocker as well. It's still not clear whether it's common enough to call for excluding rc.1 from candidate-4.4.

@eparis commented Mar 17, 2020

What do we think about merging this, then pulling the edges, with a note referencing 1812584, when the next RC comes along?

@sdodson commented Mar 18, 2020

What do we think about merging this, then pulling the edges, with a note referencing 1812584, when the next RC comes along?

That's fine, I was hoping that there'd be a new RC first thing this morning but there isn't.

@LalatenduMohanty commented Mar 18, 2020

Right, we should be fine to merge this, as we know what the issues are around this RC.
/lgtm

@openshift-ci-robot added the lgtm label Mar 18, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LalatenduMohanty, sosiouxme
To complete the pull request process, please assign smarterclayton
You can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LalatenduMohanty commented Mar 18, 2020

/hold as I want to put up a PR blocking the edges to 4.4 first

These are the UpgradeBlockers

@openshift-ci-robot added the do-not-merge/hold label Mar 18, 2020
@LalatenduMohanty commented Mar 18, 2020

Created #123

Once we merge #123 we can remove the hold and merge this PR
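An edge-blocking entry of the kind #123 presumably adds is just a small YAML file naming the target release and a regex over the source releases; a sketch, assuming the current cincinnati-graph-data blocked-edges convention (file name hypothetical):

$ cat blocked-edges/4.4.0-rc.1-etcd-peer-urls.yaml
to: 4.4.0-rc.1
from: 4\.3\..*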

@wking commented Mar 18, 2020

/hold

rc.1 may be impacted by #125. Let's keep it out of channels until we know for sure. See also the bugs linked from #123.

@wking commented Mar 18, 2020

Tombstoned in #127.

/close

@openshift-ci-robot

@wking: Closed this PR.

In response to this:

Tombstoned in #127.

/close


@LalatenduMohanty

/reopen
We do not hold candidate-channel PRs over upgrade blockers.

@openshift-ci-robot

@LalatenduMohanty: Reopened this PR.

In response to this:

/reopen
We do not hold candidate-channel PRs over upgrade blockers.


@LalatenduMohanty

@sosiouxme Can you reopen the PR please?

@wking commented Mar 19, 2020

/close

Getting handled in #127 (although that PR is now adding rc.1 to candidate, with a different motivation).

@openshift-ci-robot

@wking: Closed this PR.

In response to this:

/close

Getting handled in #127 (although that PR is now adding rc.1 to candidate, with a different motivation).
