Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Feature Description
Related to #883 — NVSentinel's existing health monitors (DCGM-based, syslog-based) cannot detect when Fabric Manager itself is down or degraded because individual GPUs appear healthy to DCGM even when FM is not running.
Problem
A customer's Fabric Manager failed after maintenance and stayed broken for 20 days undetected. NVSentinel was deployed and running, but no health event was generated because:
- DCGM reports per-GPU metrics — temperature, power, XID errors are all per-GPU. When FM is down, GPUs still report normal telemetry.
- Syslog monitor watches for specific patterns — FM failure logs go to the FM journal, not always to syslog in a pattern the monitor catches.
- `fabric.state` stays "In Progress" indefinitely — `nvidia-smi` shows `fabric.state: In Progress`, but this isn't surfaced as a health event.
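To illustrate the stuck-state symptom, the fabric state can be classified by parsing `nvidia-smi --query-gpu` CSV output. A minimal sketch — the helper names are hypothetical, and the exact `fabric.status` values ("Success", "N/A") are examples rather than an exhaustive list:

```python
import subprocess

# States considered healthy. "In Progress" is normal only briefly during
# fabric initialization; seeing it persistently suggests FM is stuck or down.
HEALTHY_STATES = {"Completed"}

def classify_fabric_state(csv_line: str) -> str:
    """Classify one line of `nvidia-smi --query-gpu=fabric.state,fabric.status
    --format=csv,noheader` output as 'healthy', 'stuck', or 'error'."""
    state, status = [field.strip() for field in csv_line.split(",")]
    if state in HEALTHY_STATES:
        return "healthy"
    if state == "In Progress":
        return "stuck"   # FM never finished fabric initialization
    return "error"       # e.g. "Unknown Error"

def query_fabric_states() -> list[str]:
    """Run nvidia-smi and classify each GPU's fabric state (illustrative wrapper)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=fabric.state,fabric.status",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [classify_fabric_state(line) for line in out.splitlines() if line.strip()]
```

On a healthy node every GPU would classify as `healthy`; a node in the failure mode described above would report `stuck` for all GPUs indefinitely.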
Proposed Solution
Add a health check that monitors Fabric Manager at the service level:
- Service status: Check `systemctl status nvidia-fabricmanager` (or the GPU Operator container equivalent) — detect not-running, failed, or flapping states
- Fabric state: Query `nvidia-smi --query-gpu=fabric.state,fabric.status` — detect stuck "In Progress" or "Unknown Error" states
- NVLink correlation: When FM is down AND NVLink bandwidth is zero, generate a critical health event
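The three checks above could be combined into a single correlation rule along these lines. This is a sketch, not NVSentinel's API; the type and field names are hypothetical, and the severity levels are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeHealth:
    fm_service_active: bool       # systemd reports nvidia-fabricmanager running
    fabric_state: str             # from nvidia-smi, e.g. "Completed", "In Progress"
    nvlink_bandwidth_gbps: float  # measured aggregate NVLink bandwidth

def evaluate(h: NodeHealth) -> Optional[str]:
    """Return a health-event severity for the node, or None if it looks healthy."""
    # Critical: FM is down AND NVLink is dead -> multi-GPU workloads will fail.
    if not h.fm_service_active and h.nvlink_bandwidth_gbps == 0:
        return "critical"
    # Warning: FM down but NVLink still passing traffic (e.g. a brief restart).
    if not h.fm_service_active:
        return "warning"
    # Warning: service up but fabric never initialized, or errored out.
    if h.fabric_state in ("In Progress", "Unknown Error"):
        return "warning"
    return None
```

The point of the correlation is to avoid paging on transient FM restarts while still catching the silent 20-day outage described above, which would evaluate as `critical`.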
Workaround
We built a standalone DaemonSet (`fabric-manager-monitor`) that performs these checks and exposes Prometheus metrics. It runs on every GPU node and uses `nsenter -t 1 -m` to check host systemd services from inside the container. Validated on p4d.24xlarge with 8x A100 GPUs.
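For reference, the host-service check at the core of that DaemonSet can be sketched as follows. `nsenter -t 1 -m` enters PID 1's mount namespace so `systemctl` sees the host's units; this requires a privileged container with `hostPID: true`. The wrapper below is illustrative, not our exact implementation:

```python
import subprocess

def nsenter_systemctl_cmd(unit: str) -> list[str]:
    """Build the command to query a host systemd unit's ActiveState from
    inside a privileged container, via the host mount namespace."""
    return ["nsenter", "-t", "1", "-m",
            "systemctl", "show", unit, "--property=ActiveState", "--value"]

def host_service_state(unit: str) -> str:
    """Return the unit's ActiveState, e.g. "active", "failed", "inactive".

    Must run in a privileged pod with hostPID: true so that PID 1 is the
    host's init process.
    """
    result = subprocess.run(nsenter_systemctl_cmd(unit),
                            capture_output=True, text=True)
    return result.stdout.strip() or "unknown"

# Example (on a GPU node):
#   host_service_state("nvidia-fabricmanager")
```

`systemctl show --property=ActiveState --value` gives a single machine-readable token, which is easier to turn into a Prometheus gauge than parsing `systemctl status` output.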
Additional Context
This class of failure (service-level rather than telemetry-level) also applies to:
- `nvidia-persistenced` — the persistence daemon can crash silently
- `nv-hostengine` (DCGM) — if DCGM itself dies, no GPU metrics are generated, but NVSentinel doesn't detect the absence
Component
Health Monitor