candidate-4.3: add 4.3.3 #57

thiagoalessio · 2020-02-19T19:07:54Z

No description provided.

wking · 2020-02-19T19:11:36Z

publish failed with:

error: failed to load configuration: Response from configresolver == 504 (Gateway Timeout)

Smells like a CI flake.

/retest

wking · 2020-02-19T20:54:54Z

Waiting on some CI update jobs to complete...

eparis · 2020-02-20T00:40:33Z

/approve
/hold

wking · 2020-02-20T19:50:32Z

Summarizing * -> 4.3.3 CI:

Each baked in update source had at least one green CI run.
.* -> 4.3.3 shuffles through aws, gcp, and azure,mirror and all of them had at least some passing. But there are no successful azure,mirror on 4.3 -> 4.3.3 and no successful gcp on 4.2.20 -> 4.3.3. Kicking off some more jobs now.

4.2.20 -> 4.3.3 has failures for:

AWS failed with:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 19 21:07:31.731: Service was unreachable during upgrade for at least 1m29s:

That sounds a bit like rhbz#1801885 and rhbz#1802246 (both "API was unreachable..."), but I don't see an existing bug for "Service was unreachable...".

AWS failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 21:09:59.119: API was unreachable during upgrade for at least 2m47s:

That's the two bugs linked above.

AWS failed in setup with:
```
...listing hosted zones: Throttling: Rate exceeded...
```
That's rhbz#1767936, and since it's a setup-time flake, it's not a 4.3.3 or update issue.

GCP failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 20:51:44.664: API was unreachable during upgrade for at least 2m32s:

That is discussed above. But for GCP, it might also be a new symptom of rhbz#1793635. Same result in this other GCP job.

GCP failed with:
```
Pools did not complete upgrade: timed out waiting for the condition
```
Mentioned on Azure in rhbz#1768262, but not sure if that's what's going on here.

4.3.1 -> 4.3.3: AWS, Azure, and GCP all failed with:

the following tags from the release could not be imported to stable-initial after five minutes...

which is a CI-infra flake.

4.3.2 -> 4.3.3 has failures for:

Azure failed with platform-throttling at setup-time:

level=error msg="Error: Error waiting for Azure Storage Account \"clusterypay8\" to be created: Future#WaitForCompletion: the number of retries has been exceeded: StatusCode=429 -- Original Error: Code=\"TooManyRequests\" Message=\"The request is being throttled as the limit has been reached for operation type - Read. For more information, see - https://aka.ms/srpthrottlinglimits\""

Azure failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 00:01:36.132: API was unreachable during upgrade for at least 2m45s:

which is discussed above.

wking · 2020-02-20T20:48:36Z

I'm going to tentatively hang the "...was unreachable during upgrade for at least..." failures on "Missing CNI default network" and rhbz#1802246.

wking · 2020-02-21T00:04:58Z

Checking in on the replacement jobs I launched earlier:

4.2.20 -> 4.3.3:

GCP failed with:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 20 21:33:12.461: Service was unreachable during upgrade for at least 15m34s:

GCP failed with:

fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 20 21:33:06.040: Service was unreachable during upgrade for at least 18m5s:

GCP failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:26:33.317: API was unreachable during upgrade for at least 2m24s:

4.3.0 -> 4.3.3:

Azure failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:43:49.222: API was unreachable during upgrade for at least 3m2s:

Azure failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:57:57.456: API was unreachable during upgrade for at least 2m37s:

4.3.1 -> 4.3.3:

Azure failed with:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 22:08:17.206: API was unreachable during upgrade for at least 2m32s:

This is not filling me with confidence in this release ;).

wking · 2020-02-25T17:16:10Z

OK, checking the 4.3 -> 4.3.3 update failures again: they're all on Azure, which struggles with slow disks (rhbz#1798785). Looking for that in these jobs:

$ # 4.3.0 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/64/build-log.txt | grep -c 'etcdserver: leader changed'
5
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/65/build-log.txt | grep -c 'etcdserver: leader changed'
17
$ # 4.3.1 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/66/build-log.txt | grep -c 'etcdserver: leader changed'
14
$ # 4.3.2 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/61/build-log.txt | grep -c 'etcdserver: leader changed'
16

Those numbers aren't higher than normal for Azure, based on the existing examples in the bug.
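
For anyone repeating this check over more jobs, the same count can be wrapped in a loop. This is only a minimal sketch: the function names `count_leader_changes` and `report_jobs` are hypothetical, and the base URL and search string are copied from the commands above.

```shell
#!/bin/sh
# Base URL copied from the curl commands above.
base=https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade

# count_leader_changes (hypothetical helper): read a build log on stdin and
# print the number of lines that mention an etcd leader change.
# grep -c exits non-zero when the count is 0, so "|| true" keeps that from
# looking like a failure to the caller.
count_leader_changes() {
  grep -c 'etcdserver: leader changed' || true
}

# report_jobs (hypothetical helper): fetch each job's build log and print
# "<job-number> <count>", one job per line.
report_jobs() {
  for job in "$@"; do
    printf '%s %s\n' "$job" "$(curl -s "$base/$job/build-log.txt" | count_leader_changes)"
  done
}

# Usage (needs network access): report_jobs 64 65 66 61
```

The count is still only a rough proxy for slow disks, so it needs the same comparison against the examples in the bug.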

wking · 2020-02-25T19:32:39Z

Other bugs turned up during review:

- Turns out the "invalid memory address or nil pointer dereference" panics out of the openshift-kube-scheduler-operator pods weren't actually fixed in 4.3.0, rhbz#1774212. This is not a regression; the claimed fix had not landed in 4.3 yet.
- "killing connection/stream because serving request timed out and response had been started" panics out of kube-apiserver pods, rhbz#1807192. Not clear yet whether this is a regression.

Both of those turned up in the Azure 4.3.2->4.3.3 update job.

wking · 2020-02-25T20:36:57Z

Also some of the nominally-OOMKilled reports discussed in rhbz#1782601.

wking · 2020-02-25T20:56:57Z

Also for the 4.2 -> 4.3.3 transitions, there's the etcd tls: bad certificate flake, rhbz#1805569.

wking · 2020-02-25T22:20:27Z

New round of 4.3.2 -> 4.3.3 CI:

GCP
- Passed here, here, and here.
- API was unreachable during upgrade for at least 2m7s here.
AWS
- Passed here and here.
- Died in setup here

One AWS job and three Azure jobs are still running.

So summary of 4.3 -> 4.3.3 jobs to date:

4.3.2 -> 4.3.3. 11 successes (7 AWS, 4 GCP). 1 GCP and 1 Azure unreachable failure. 1 AWS and 1 Azure setup-time error (irrelevant to 4.3.3 promotion).
4.3.1 -> 4.3.3. 2 successes (both AWS). 1 Azure unreachable failure. 3 image-build failures ("the following tags from the release could not be imported", irrelevant to 4.3.3 promotion).
4.3.0 -> 4.3.3. 1 success (AWS). 2 Azure unreachable failures.

wking · 2020-02-25T22:36:39Z

4.3.2 -> 4.3.3 Azure jobs are back. Two passed; one died with "API was unreachable during upgrade for at least 8m18s" (!). The outstanding AWS job is in teardown, but the test container completed successfully. So the updated 4.3 -> 4.3.3 jobs to date:

4.3.2 -> 4.3.3. 14 successes (8 AWS, 4 GCP, 2 Azure). 1 GCP and 2 Azure unreachable failures. 1 AWS and 1 Azure setup-time error (irrelevant to 4.3.3 promotion).
4.3.1 -> 4.3.3. 2 successes (both AWS). 1 Azure unreachable failure. 3 image-build failures ("the following tags from the release could not be imported", irrelevant to 4.3.3 promotion).
4.3.0 -> 4.3.3. 1 success (AWS). 2 Azure unreachable failures.
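
To put those counts side by side, here is a rough sketch that computes a pass rate per update edge over the runs that actually reached the update. The `pass_rates` helper is hypothetical, and a pass rate is not a promotion criterion anyone in this thread has proposed; it is only a way to read the tallies above. Setup-time and image-import failures are left out because they never exercised the update.

```shell
#!/bin/sh
# pass_rates (hypothetical helper): read "<edge> <passed> <update-failures>"
# rows on stdin and print "<edge> <passed>/<total> <percent>%" for each row.
# Rows with no runs are skipped to avoid dividing by zero.
pass_rates() {
  awk '$2 + $3 > 0 {
    total = $2 + $3
    printf "%s %d/%d %.0f%%\n", $1, $2, total, 100 * $2 / total
  }'
}

# Counts taken from the summary above (unreachable failures only).
pass_rates <<'EOF'
4.3.2 14 3
4.3.1 2 1
4.3.0 1 2
EOF
# prints:
# 4.3.2 14/17 82%
# 4.3.1 2/3 67%
# 4.3.0 1/3 33%
```

With so few runs on the 4.3.0 and 4.3.1 edges, those percentages are indicative at best.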

sttts · 2020-02-26T08:28:14Z

The Azure job that died has super slow etcd responses around the time of the issue, at 21:17.

LalatenduMohanty · 2020-02-26T19:14:07Z

We are still not sure what we should do with this PR because of the bugs (mentioned in the comments above) found in recent CI testing.

sdodson · 2020-02-27T14:01:19Z

The service and API availability checks are all new tests added in openshift/origin#24479, so it's very hard to compare them to previous results and know whether or not they're regressions.

LalatenduMohanty · 2020-02-27T14:14:48Z

As per the above discussion, we expect flakes for GCP and Azure. Also, this PR only adds the release to candidate. So we should add this to candidate and do further testing.
/lgtm

LalatenduMohanty · 2020-02-27T17:08:57Z

/hold

eparis · 2020-02-27T17:35:54Z

/approve
/hold cancel

openshift-ci-robot · 2020-02-27T17:36:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, LalatenduMohanty, thiagoalessio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~channels/OWNERS~~ [eparis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

candidate-4.3: add 4.3.3

ffd9dfb

openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 19, 2020

openshift-ci-robot requested review from lucab and wking February 19, 2020 19:08

openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 20, 2020

wking mentioned this pull request Feb 21, 2020

channels/fast-4.3: Promote 4.3.3 #71

Closed

openshift-ci-robot assigned LalatenduMohanty Feb 27, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2020

openshift-merge-robot merged commit 6a550a1 into openshift:master Feb 27, 2020

wking mentioned this pull request Mar 20, 2020

blocked-edges: Details on bugs for 4.3.2 and 4.3.3 #119

Merged
