Conversation

@thiagoalessio
Member

No description provided.

@openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Feb 19, 2020

@wking
Member

wking commented Feb 19, 2020

publish failed with:

error: failed to load configuration: Response from configresolver == 504 (Gateway Timeout)

Smells like a CI flake.

/retest

@wking
Member

wking commented Feb 19, 2020

Waiting on some CI update jobs to complete...

@eparis
Member

eparis commented Feb 20, 2020

/approve
/hold

@openshift-ci-robot added the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels on Feb 20, 2020

@wking
Member

wking commented Feb 20, 2020

Summarizing * -> 4.3.3 CI:

  • Each baked in update source had at least one green CI run.
  • .* -> 4.3.3 shuffles through aws, gcp, and azure,mirror and all of them had at least some passing. But there are no successful azure,mirror on 4.3 -> 4.3.3 and no successful gcp on 4.2.20 -> 4.3.3. Kicking off some more jobs now.
  • 4.2.20 -> 4.3.3 has failures for:
    • AWS failed with:
      fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 19 21:07:31.731: Service was unreachable during upgrade for at least 1m29s:
      
      That sounds a bit like rhbz#1801885 and rhbz#1802246 (both API was unreachable...), but I don't see an existing bug for Service was unreachable....
    • AWS failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 21:09:59.119: API was unreachable during upgrade for at least 2m47s:
      
      That's the two bugs linked above.
    • AWS failed in setup with:
      ...listing hosted zones: Throttling: Rate exceeded...
      
      That's rhbz#1767936 and the setup-time flake means it's not a 4.3.3 or update issue.
    • GCP failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 19 20:51:44.664: API was unreachable during upgrade for at least 2m32s:
      
      discussed above. But for GCP, that might also be new symptoms for rhbz#1793635. Same result in this other GCP job.
    • GCP failed with:
      Pools did not complete upgrade: timed out waiting for the condition
      
      Mentioned on Azure in rhbz#1768262, but not sure if that's what's going on here.
  • 4.3.1 -> 4.3.3 has failures for AWS, Azure, and GCP, which all failed with:
    the following tags from the release could not be imported to stable-initial after five minutes...
    
    which is a CI-infra flake.
  • 4.3.2 -> 4.3.3 has failures for:
    • Azure failed with platform-throttling at setup-time:
      level=error msg="Error: Error waiting for Azure Storage Account \"clusterypay8\" to be created: Future#WaitForCompletion: the number of retries has been exceeded: StatusCode=429 -- Original Error: Code=\"TooManyRequests\" Message=\"The request is being throttled as the limit has been reached for operation type - Read. For more information, see - https://aka.ms/srpthrottlinglimits\""
      
    • Azure failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 00:01:36.132: API was unreachable during upgrade for at least 2m45s:
      
      which is discussed above.
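
The triage above groups failures by the text of the error message. A minimal sketch of that grouping as a shell function follows; the function name and the category labels are hypothetical, and the match strings are copied from the failure messages quoted in this comment.

```sh
# Hypothetical helper: map one failure message to a coarse triage category.
# The category names are illustrative labels, not anything CI itself reports.
classify_failure() {
  case "$1" in
    *'Throttling: Rate exceeded'*|*'TooManyRequests'*)
      # Cloud-provider throttling at setup time; not a 4.3.3 or update issue.
      echo 'setup-throttling' ;;
    *'could not be imported to stable-initial'*)
      # Release tags failed to import; a CI-infra flake.
      echo 'ci-infra' ;;
    *'API was unreachable during upgrade'*|*'Service was unreachable during upgrade'*)
      # Availability checks that failed during the update itself.
      echo 'unreachable' ;;
    *'Pools did not complete upgrade'*)
      echo 'pools-timeout' ;;
    *)
      echo 'unknown' ;;
  esac
}
```

For example, `classify_failure "$(grep -m1 '^fail ' build-log.txt)"` would label the first `fail` line of a downloaded log, assuming the log has been saved locally as `build-log.txt`.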

@wking
Member

wking commented Feb 20, 2020

I'm going to tentatively hang the ...was unreachable during upgrade for at least... failures on Missing CNI default network and rhbz#1802246.

@wking
Member

wking commented Feb 21, 2020

Checking in on the replacement jobs I launched earlier:

  • 4.2.20 -> 4.3.3:
    • GCP failed with:
      fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 20 21:33:12.461: Service was unreachable during upgrade for at least 15m34s:
      
    • GCP failed with:
      fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:124]: Feb 20 21:33:06.040: Service was unreachable during upgrade for at least 18m5s:
      
    • GCP failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:26:33.317: API was unreachable during upgrade for at least 2m24s:
      
  • 4.3.0 -> 4.3.3:
    • Azure failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:43:49.222: API was unreachable during upgrade for at least 3m2s:
      
    • Azure failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 21:57:57.456: API was unreachable during upgrade for at least 2m37s:
      
  • 4.3.1 -> 4.3.3:
    • Azure failed with:
      fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 20 22:08:17.206: API was unreachable during upgrade for at least 2m32s:
      

This is not filling me with confidence in this release ;).
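
To compare how long each outage lasted, the durations in the failure lines above can be pulled out and normalized to seconds. The following is a rough sketch, not part of any CI tooling: both function names are hypothetical, and it assumes durations only use minutes and seconds (for example `2m24s` or `45s`), which is all that appears in this thread.

```sh
# Hypothetical helper: read failure lines on stdin and print just the
# duration token that follows "was unreachable during upgrade for at least".
unreachable_durations() {
  sed -n 's/.*was unreachable during upgrade for at least \([0-9ms]*\).*/\1/p'
}

# Hypothetical helper: convert a duration such as 15m34s or 45s to seconds.
# Returns non-zero for anything that is not in <minutes>m<seconds>s form.
duration_to_seconds() {
  printf '%s\n' "$1" | awk '
    /^([0-9]+m)?([0-9]+s)?$/ && length($0) > 0 {
      m = 0; s = 0; rest = $0
      i = index($0, "m")
      if (i > 0) {
        m = substr($0, 1, i - 1)       # digits before the "m"
        rest = substr($0, i + 1)       # whatever follows, e.g. "34s"
      }
      if (rest != "") {
        s = substr(rest, 1, length(rest) - 1)   # drop the trailing "s"
      }
      print m * 60 + s
      ok = 1
    }
    END { exit ok ? 0 : 1 }'
}
```

With these, `duration_to_seconds 15m34s` prints `934` and `duration_to_seconds 2m24s` prints `144`, which makes the gap between the Service and API outages in the GCP jobs above easier to see.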

@wking
Member

wking commented Feb 25, 2020

OK, checking the 4.3 -> 4.3.3 update failures again: they're all on Azure, which struggles with slow disks (rhbz#1798785). Looking for that in these jobs:

$ # 4.3.0 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/64/build-log.txt | grep -c 'etcdserver: leader changed'
5
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/65/build-log.txt | grep -c 'etcdserver: leader changed'
17
$ # 4.3.1 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/66/build-log.txt | grep -c 'etcdserver: leader changed'
14
$ # 4.3.2 -> 4.3.3
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade/61/build-log.txt | grep -c 'etcdserver: leader changed'
16

Those numbers aren't higher than normal for Azure, based on the existing examples in the bug.
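
The same check can be wrapped in a small loop so it is easy to rerun for more jobs. This is only a sketch: the function names are hypothetical, while the bucket path and the search string are the ones used in the commands above.

```sh
# Count lines on stdin that mention an etcd leader change.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e" while still printing 0.
count_leader_changes() {
  grep -c 'etcdserver: leader changed' || true
}

# Hypothetical wrapper: for each job number given as an argument, fetch that
# job's build log and print "<job> <count>".
leader_changes_for_jobs() {
  base='https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade'
  for job in "$@"; do
    count=$(curl -s "${base}/${job}/build-log.txt" | count_leader_changes)
    printf '%s %s\n' "$job" "$count"
  done
}
```

Running `leader_changes_for_jobs 64 65 66 61` covers the four jobs checked above, assuming the logs are still hosted at that location.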

@wking
Member

wking commented Feb 25, 2020

Other bugs turned up during review:

  • Turns out the invalid memory address or nil pointer dereference panics out of the openshift-kube-scheduler-operator pods weren't actually fixed in 4.3.0, rhbz#1774212. This is not a regression; the claimed fix had not landed in 4.3 yet.
  • killing connection/stream because serving request timed out and response had been started panics out of kube-apiserver pods, rhbz#1807192. Not clear on whether this is a regression or not yet.

Both of those turned up in the Azure 4.3.2->4.3.3 update job.

@wking
Member

wking commented Feb 25, 2020

Also some of the nominally-OOMKilled reports discussed in rhbz#1782601.

@wking
Member

wking commented Feb 25, 2020

Also for the 4.2 -> 4.3.3 transitions, there's the etcd tls: bad certificate flake, rhbz#1805569.

@wking
Member

wking commented Feb 25, 2020

New round of 4.3.2 -> 4.3.3 CI:

One AWS job and three Azure jobs are still running.

So summary of 4.3 -> 4.3.3 jobs to date:

  • 4.3.2 -> 4.3.3. 11 successes (7 AWS, 4 GCP). 1 GCP and 1 Azure unreachable failure. 1 AWS and 1 Azure setup-time error (irrelevant to 4.3.3 promotion).
  • 4.3.1 -> 4.3.3. 2 successes (both AWS). 1 Azure unreachable failure. 3 image-build failures (the following tags from the release could not be imported, irrelevant to 4.3.3 promotion).
  • 4.3.0 -> 4.3.3. 1 success (AWS). 2 Azure unreachable failures.

@wking
Member

wking commented Feb 25, 2020

4.3.2 -> 4.3.3 Azure jobs are back. Two passed; one died with API was unreachable during upgrade for at least 8m18s (!). The outstanding AWS job is in teardown, but the test container completed successfully. So updated 4.3 -> 4.3.3 jobs to date:

  • 4.3.2 -> 4.3.3. 14 successes (8 AWS, 4 GCP, 2 Azure). 1 GCP and 2 Azure unreachable failures. 1 AWS and 1 Azure setup-time error (irrelevant to 4.3.3 promotion).
  • 4.3.1 -> 4.3.3. 2 successes (both AWS). 1 Azure unreachable failure. 3 image-build failures (the following tags from the release could not be imported, irrelevant to 4.3.3 promotion).
  • 4.3.0 -> 4.3.3. 1 success (AWS). 2 Azure unreachable failures.
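
One rough way to read these tallies is a pass rate that leaves out the failures marked irrelevant to 4.3.3 promotion. This is only a sketch: the `pass_rate` helper is hypothetical, it uses integer arithmetic (so it rounds down), and nobody in this thread computed such a metric.

```sh
# Hypothetical helper: integer pass percentage.
#   $1 = number of successful jobs
#   $2 = number of failures that count against the release
# Setup-time and image-build failures are assumed to be excluded by the caller.
pass_rate() {
  total=$(( $1 + $2 ))
  [ "$total" -gt 0 ] || return 1   # no relevant runs, nothing to report
  echo $(( $1 * 100 / total ))
}

pass_rate 14 3   # 4.3.2 -> 4.3.3: prints 82
pass_rate 2 1    # 4.3.1 -> 4.3.3: prints 66
pass_rate 1 2    # 4.3.0 -> 4.3.3: prints 33
```

The sample sizes for 4.3.0 and 4.3.1 are small, so those two percentages say little on their own.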

@sttts

sttts commented Feb 26, 2020

Azure job that died has super slow etcd responses around the issue at 21:17.

@LalatenduMohanty
Member

We are still not sure what we should do with this PR because of the bugs (mentioned in the comments above) found in recent CI testing.

@sdodson
Member

sdodson commented Feb 27, 2020

The service and API availability checks are all new tests added in openshift/origin#24479, so it's very hard to compare them to previous results and know whether or not they're regressions.

@LalatenduMohanty
Member

As per the above discussion, we expect flakes for GCP and Azure. Also, this PR is adding the release to candidate. So we should merge this to candidate and do further testing.
/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Feb 27, 2020

@LalatenduMohanty
Member

/hold

@eparis
Member

eparis commented Feb 27, 2020

/approve
/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Feb 27, 2020

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, LalatenduMohanty, thiagoalessio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 6a550a1 into openshift:master on Feb 27, 2020
