Description
Problems to solve
Occasionally, a self-hosted runner is killed by the OOM killer or fails for some other reason. GitHub Actions reports this as a "lost communication with the server" error.
- Ephemeral (single use) runner registrations actions/runner#510
- Dealing with jobs failing with "lost communication with the server" errors actions/actions-runner-controller#466
When the error occurs, GitHub Actions adds an annotation with the following message:
The self-hosted runner: POD_NAME lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
Currently, we send the annotation message to Slack using this action:
https://github.com/int128/workflow-run-summary-action/blob/216f94dd10d099652cfb393e598c2a8f604c3bd0/src/run.ts#L60
How to solve
It would be nice to monitor the number of "lost communication with the server" errors so that we can make fact-based decisions.
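As a rough illustration of the counting step, the sketch below tallies annotations whose message matches the text quoted above, grouped by runner name. It is only a sketch under assumptions: the `Annotation` type and the `countLostCommunication` function are hypothetical names, not part of any existing action, and the regular expression assumes the message always begins with the exact wording shown in this issue.

```typescript
// Hypothetical shape: only the annotation message text is needed here.
type Annotation = { message: string }

// Assumption: the message starts with the wording quoted in this issue.
// The capture group holds the runner (pod) name.
const LOST_COMMUNICATION =
  /^The self-hosted runner: (.+?) lost communication with the server\./

// Count "lost communication with the server" annotations per runner name.
function countLostCommunication(annotations: Annotation[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const { message } of annotations) {
    const m = LOST_COMMUNICATION.exec(message)
    if (m === null) continue // unrelated annotation, skip it
    const runner = m[1]
    counts.set(runner, (counts.get(runner) ?? 0) + 1)
  }
  return counts
}
```

The resulting per-runner counts could then be reported wherever the monitoring is done; how they are exported is left open here.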