[Feature]: Fabric Manager service-level health monitoring #889

@dmvevents

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Feature Description

Related to #883. NVSentinel's existing health monitors (DCGM-based and syslog-based) cannot detect when Fabric Manager (FM) itself is down or degraded, because individual GPUs still appear healthy to DCGM even when FM is not running.

Problem

A customer's Fabric Manager failed after maintenance and stayed broken for 20 days undetected. NVSentinel was deployed and running, but no health event was generated because:

  1. DCGM reports per-GPU metrics: temperature, power, and XID errors are all per-GPU. When FM is down, GPUs still report normal telemetry.
  2. Syslog monitor watches for specific patterns: FM failure messages are written to the FM journal and do not always reach syslog in a form that matches the monitor's patterns.
  3. fabric.state stays "In Progress" indefinitely: nvidia-smi shows fabric.state: In Progress, but this is not surfaced as a health event.

Proposed Solution

Add a health check that monitors Fabric Manager at the service level:

  1. Service status: Check systemctl status nvidia-fabricmanager (or the GPU Operator container equivalent) — detect not-running, failed, flapping
  2. Fabric state: Query nvidia-smi --query-gpu=fabric.state,fabric.status — detect stuck "In Progress" or "Unknown Error" states
  3. NVLink correlation: When FM is down AND NVLink bandwidth is zero, generate a critical health event
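
The three checks above could be combined roughly as in the following Python sketch. It is an illustration under assumptions, not NVSentinel's implementation: the function names, the severity strings, and the `--format=csv,noheader` invocation are choices made here, and the state strings are the two quoted in this issue. A real check would also need to require that a state persists for some time before calling it stuck.

```python
import subprocess

# State strings quoted in this issue; other values are treated as healthy here.
STUCK_STATES = {"In Progress", "Unknown Error"}


def service_is_active(unit="nvidia-fabricmanager", runner=subprocess.run):
    """Return True if `systemctl is-active <unit>` prints "active"."""
    result = runner(["systemctl", "is-active", unit],
                    capture_output=True, text=True)
    return result.stdout.strip() == "active"


def parse_fabric_rows(csv_text):
    """Parse one "state, status" row per GPU, as assumed to be printed by
    nvidia-smi --query-gpu=fabric.state,fabric.status --format=csv,noheader
    """
    rows = []
    for line in csv_text.strip().splitlines():
        state, _, status = line.partition(",")
        rows.append((state.strip(), status.strip()))
    return rows


def evaluate(fm_active, fabric_rows, nvlink_bandwidth):
    """Combine the three signals into a severity (hypothetical levels)."""
    if not fm_active and nvlink_bandwidth == 0:
        return "critical"  # FM down AND no NVLink traffic
    if not fm_active:
        return "warning"   # FM down, NVLink still carrying traffic
    if any(state in STUCK_STATES or status in STUCK_STATES
           for state, status in fabric_rows):
        return "warning"   # FM running but fabric not initialized
    return "healthy"
```

The `runner` parameter exists only so the service check can be exercised without systemd; in a GPU Operator deployment the equivalent check would target the Fabric Manager container instead.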

Workaround

We built a standalone DaemonSet (fabric-manager-monitor) that performs these checks and exposes Prometheus metrics. On every GPU node it uses nsenter -t 1 -m to enter the host mount namespace and check host systemd services. Validated on p4d.24xlarge with 8x A100 GPUs.

Source: https://github.com/dmvevents/nvsentinel-eks-deployment/tree/master/fabric-manager-monitor
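
As a rough illustration of the nsenter approach (not the code in the linked repository), a containerized check might look like the sketch below. It assumes the pod shares the host PID namespace and has enough privilege to enter PID 1's mount namespace; the helper names and the metric name `fabric_monitor_service_up` are hypothetical.

```python
import subprocess


def host_systemctl_cmd(unit, pid=1):
    """Build a command that runs `systemctl is-active` in the mount
    namespace of the given PID (PID 1 is the host init when the pod
    shares the host PID namespace)."""
    return ["nsenter", "-t", str(pid), "-m", "--",
            "systemctl", "is-active", unit]


def host_unit_active(unit, runner=subprocess.run):
    """Return True if the host reports the unit as active."""
    result = runner(host_systemctl_cmd(unit), capture_output=True, text=True)
    return result.stdout.strip() == "active"


def render_metric(unit, active):
    """Render one gauge in Prometheus text exposition format.
    The metric name is hypothetical."""
    value = 1 if active else 0
    return f'fabric_monitor_service_up{{unit="{unit}"}} {value}\n'
```

For example, `render_metric("nvidia-fabricmanager", host_unit_active("nvidia-fabricmanager"))` yields a single line that could be served from a `/metrics` endpoint.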

Additional Context

This class of failure (service-level rather than telemetry-level) also applies to:

  • nvidia-persistenced — persistence daemon crashes silently
  • nv-hostengine (DCGM) — if DCGM itself dies, no GPU metrics are generated but NVSentinel doesn't detect the absence
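
One way to catch the "metrics silently stop" case is a staleness check that flags any source that has not reported within a time window. The sketch below is a generic illustration with hypothetical names and an arbitrary default threshold; it is not an existing NVSentinel component.

```python
import time


class StalenessDetector:
    """Track when each source last reported and flag the silent ones."""

    def __init__(self, max_age_seconds=120.0, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so tests can fake time
        self._last_seen = {}

    def record(self, source):
        """Call whenever a metric or heartbeat arrives from `source`."""
        self._last_seen[source] = self._clock()

    def stale_sources(self):
        """Return sources whose last report is older than the threshold."""
        now = self._clock()
        return sorted(name for name, seen in self._last_seen.items()
                      if now - seen > self._max_age)
```

A monitor would call `record("nv-hostengine")` each time DCGM metrics arrive and periodically raise a health event for anything returned by `stale_sources()`.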

Component

Health Monitor

Labels: enhancement (New feature or request)