
Metric of "lost communication with the server" error #444

@int128

Problems to solve

Occasionally, a self-hosted runner is killed by the OOM killer or some other issue. This is reported as the "lost communication with the server" error.

When the error occurs, GitHub Actions adds an annotation with the following message:

The self-hosted runner: POD_NAME lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
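
To count the error, it first has to be recognized. Below is a minimal sketch, assuming the annotation text keeps exactly the wording quoted above. The hypothetical helper `parseLostCommunicationRunner` matches the message and extracts the runner name that appears in place of POD_NAME. GitHub could change this wording at any time, so the pattern is an assumption, not a stable contract.

```typescript
// Assumption: the annotation starts with the exact wording quoted in this issue.
// The lazy group captures the runner (pod) name that precedes " lost communication".
const LOST_COMMUNICATION_PATTERN =
  /^The self-hosted runner: (.+?) lost communication with the server\./;

// Hypothetical helper: returns the runner name if the annotation message is a
// "lost communication with the server" error, otherwise undefined.
const parseLostCommunicationRunner = (message: string): string | undefined => {
  const match = LOST_COMMUNICATION_PATTERN.exec(message);
  return match?.[1];
};
```

A message such as `The self-hosted runner: my-runner-abc12 lost communication with the server. ...` yields `my-runner-abc12`, while an unrelated annotation like `Process completed with exit code 1.` yields `undefined`.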

Currently, we send the annotation message to Slack using this action:
https://github.com/int128/workflow-run-summary-action/blob/216f94dd10d099652cfb393e598c2a8f604c3bd0/src/run.ts#L60

How to solve

It would be nice to monitor the count of "lost communication with the server" errors so that we can make fact-based decisions.
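
Below is a sketch of how such a count could be aggregated, assuming the annotation messages of each workflow run have already been fetched (how they are fetched is left out). The input type `WorkflowRunAnnotations` and the function `countLostCommunicationErrors` are hypothetical and not part of workflow-run-summary-action. Grouping by workflow name is one possible choice; it would show which workflows hit the error most often.

```typescript
// Hypothetical input shape: one entry per completed workflow run,
// with the messages of its annotations already collected.
type WorkflowRunAnnotations = {
  workflowName: string;
  annotationMessages: string[];
};

// Substring taken from the annotation message quoted in this issue (assumed stable).
const LOST_COMMUNICATION = "lost communication with the server";

// Count "lost communication" annotations per workflow name.
// Workflows with no such annotation are kept with a count of 0,
// so they can still be reported as a zero-valued metric.
const countLostCommunicationErrors = (
  runs: WorkflowRunAnnotations[],
): Map<string, number> => {
  const counts = new Map<string, number>();
  for (const run of runs) {
    const n = run.annotationMessages.filter((m) => m.includes(LOST_COMMUNICATION)).length;
    counts.set(run.workflowName, (counts.get(run.workflowName) ?? 0) + n);
  }
  return counts;
};
```

For example, two `build` runs where only one carries the annotation, plus a `test` run with no annotations, give `build → 1` and `test → 0`.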
