
Conversation

@thiagoalessio
Member

No description provided.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 21, 2020
@eparis
Member

eparis commented Feb 21, 2020

/approve
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 21, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, thiagoalessio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 21, 2020
@vrutkovs

Starter clusters have encountered node issues on the 4.3.2 update

@vrutkovs

Starter clusters have hit a few issues:

  • us-east2 was a false alarm - a new crashlooping pod had started in an openshift-* namespace and thus triggered the KubePodCrashLooping alert
  • us-east1 had two nodes stuck - one was unresponsive, the other had several important services restarting.

@bradmwilliams any update on the us-east-1 starter cluster situation?
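
For anyone triaging a similar alert, a minimal sketch of how to spot the pod behind a firing KubePodCrashLooping, assuming cluster-admin access with oc; the namespace and pod names are placeholders:

    # List pods in CrashLoopBackOff across openshift-* namespaces
    oc get pods --all-namespaces | grep '^openshift-' | grep CrashLoopBackOff

    # Check restart counts and recent logs for a suspect pod
    # <namespace> and <pod> are placeholders
    oc -n <namespace> describe pod <pod>
    oc -n <namespace> logs <pod> --previous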

@bradmwilliams

> Starter clusters have hit a few issues:
>
>   • us-east2 was a false alarm - a new crashlooping pod had started in an openshift-* namespace and thus triggered the KubePodCrashLooping alert
>   • us-east1 had two nodes stuck - one was unresponsive, the other had several important services restarting.
>
> @bradmwilliams any update on the us-east-1 starter cluster situation?

@vrutkovs There are two 4.3.2 starter clusters (us-east-1 and us-west-1). Both have been struggling with nodes falling into NotReady states, ultimately resulting in multiple degraded operators. It is possible to combat the NotReady nodes manually by power cycling them in the AWS Console, but it inevitably turns into a game of whack-a-mole. Eventually the operators catch up and things resume operating as usual, but only for a short period of time. There definitely seems to be some correlation between the impacted node and the prometheus-k8s pods in the openshift-monitoring namespace. The only other information I have is that, prior to the node losing SSH connectivity, there are multiple system service failures on the node:

[systemd]
Failed Units: 6
  chronyd.service
  irqbalance.service
  polkit.service
  rhsmcertd.service
  rpc-statd.service
  sssd.service
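
For reference, a minimal sketch of the manual triage described above, assuming cluster-admin access with oc plus AWS CLI credentials for the account hosting the cluster; node and instance identifiers are placeholders:

    # Find nodes that have fallen into NotReady
    oc get nodes | grep NotReady

    # Inspect failed systemd units on an affected node while it is still reachable
    oc debug node/<node-name> -- chroot /host systemctl list-units --state=failed

    # Reboot the backing EC2 instance (one way to power cycle it from the CLI
    # instead of the AWS Console)
    aws ec2 reboot-instances --instance-ids <instance-id>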

I have also observed that the Prometheus alerts API no longer appears to be functioning. Code that worked previously now simply returns a 503.
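
A minimal sketch for checking that API directly, assuming the prometheus-k8s route in the openshift-monitoring namespace and a logged-in oc session; a healthy endpoint should return 200 with a JSON alert list instead of a 503:

    # Grab a bearer token and the Prometheus route host
    TOKEN=$(oc whoami -t)
    HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')

    # Print only the HTTP status code of the alerts endpoint
    curl -sk -o /dev/null -w '%{http_code}\n' \
      -H "Authorization: Bearer $TOKEN" \
      "https://$HOST/api/v1/alerts"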

@LalatenduMohanty
Member

LalatenduMohanty commented Feb 28, 2020

@LalatenduMohanty
Member

#87 supersedes this.

@LalatenduMohanty
Member

/close

@openshift-ci-robot

@LalatenduMohanty: Closed this PR.


In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
