Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Feature Description
Related to #883 — NVSentinel's existing health monitors (DCGM-based, syslog-based) cannot detect when Fabric Manager itself is down or degraded because individual GPUs appear healthy to DCGM even when FM is not running.
Problem
A customer's Fabric Manager failed after maintenance and stayed broken for 20 days undetected. NVSentinel was deployed and running, but no health event was generated because:
- DCGM reports per-GPU metrics — temperature, power, XID errors are all per-GPU. When FM is down, GPUs still report normal telemetry.
- Syslog monitor watches for specific patterns — FM failure logs go to the FM journal, not always to syslog in a pattern the monitor catches.
- `fabric.state` stays "In Progress" indefinitely — `nvidia-smi` shows `fabric.state: In Progress`, but this isn't surfaced as a health event.
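To illustrate the stuck-state symptom, the fabric state can be classified by parsing `nvidia-smi --query-gpu` CSV output. A minimal sketch — the helper names are hypothetical, and the exact `fabric.status` values ("Success", "N/A") are examples rather than an exhaustive list:

```python
import subprocess

# States considered healthy. "In Progress" is normal only briefly during
# fabric initialization; seeing it persistently suggests FM is stuck or down.
HEALTHY_STATES = {"Completed"}

def classify_fabric_state(csv_line: str) -> str:
    """Classify one line of `nvidia-smi --query-gpu=fabric.state,fabric.status
    --format=csv,noheader` output as 'healthy', 'stuck', or 'error'."""
    state, status = [field.strip() for field in csv_line.split(",")]
    if state in HEALTHY_STATES:
        return "healthy"
    if state == "In Progress":
        return "stuck"   # FM never finished fabric initialization
    return "error"       # e.g. "Unknown Error"

def query_fabric_states() -> list[str]:
    """Run nvidia-smi and classify each GPU's fabric state (illustrative wrapper)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=fabric.state,fabric.status",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [classify_fabric_state(line) for line in out.splitlines() if line.strip()]
```

On a healthy node every GPU would classify as `healthy`; a node in the failure mode described above would report `stuck` for all GPUs indefinitely.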
Proposed Solution
Add a health check that monitors Fabric Manager at the service level:
- Service status: Check `systemctl status nvidia-fabricmanager` (or the GPU Operator container equivalent) — detect not-running, failed, or flapping states
- Fabric state: Query `nvidia-smi --query-gpu=fabric.state,fabric.status` — detect stuck "In Progress" or "Unknown Error" states
- NVLink correlation: When FM is down AND NVLink bandwidth is zero, generate a critical health event
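The three checks above could be combined into a single correlation rule along these lines. This is a sketch, not NVSentinel's API; the type and field names are hypothetical, and the severity levels are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeHealth:
    fm_service_active: bool       # systemd reports nvidia-fabricmanager running
    fabric_state: str             # from nvidia-smi, e.g. "Completed", "In Progress"
    nvlink_bandwidth_gbps: float  # measured aggregate NVLink bandwidth

def evaluate(h: NodeHealth) -> Optional[str]:
    """Return a health-event severity for the node, or None if it looks healthy."""
    # Critical: FM is down AND NVLink is dead -> multi-GPU workloads will fail.
    if not h.fm_service_active and h.nvlink_bandwidth_gbps == 0:
        return "critical"
    # Warning: FM down but NVLink still passing traffic (e.g. a brief restart).
    if not h.fm_service_active:
        return "warning"
    # Warning: service up but fabric never initialized, or errored out.
    if h.fabric_state in ("In Progress", "Unknown Error"):
        return "warning"
    return None
```

The point of the correlation is to avoid paging on transient FM restarts while still catching the silent 20-day outage described above, which would evaluate as `critical`.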
Workaround
We built a standalone DaemonSet (`fabric-manager-monitor`) that performs these checks and exposes Prometheus metrics. It runs on every GPU node and uses `nsenter -t 1 -m` to check host systemd services from inside the container. Validated on p4d.24xlarge with 8x A100 GPUs.
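For reference, the host-service check at the core of that DaemonSet can be sketched as follows. `nsenter -t 1 -m` enters PID 1's mount namespace so `systemctl` sees the host's units; this requires a privileged container with `hostPID: true`. The wrapper below is illustrative, not our exact implementation:

```python
import subprocess

def nsenter_systemctl_cmd(unit: str) -> list[str]:
    """Build the command to query a host systemd unit's ActiveState from
    inside a privileged container, via the host mount namespace."""
    return ["nsenter", "-t", "1", "-m",
            "systemctl", "show", unit, "--property=ActiveState", "--value"]

def host_service_state(unit: str) -> str:
    """Return the unit's ActiveState, e.g. "active", "failed", "inactive".

    Must run in a privileged pod with hostPID: true so that PID 1 is the
    host's init process.
    """
    result = subprocess.run(nsenter_systemctl_cmd(unit),
                            capture_output=True, text=True)
    return result.stdout.strip() or "unknown"

# Example (on a GPU node):
#   host_service_state("nvidia-fabricmanager")
```

`systemctl show --property=ActiveState --value` gives a single machine-readable token, which is easier to turn into a Prometheus gauge than parsing `systemctl status` output.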
Additional Context
This class of failure (service-level rather than telemetry-level) also applies to:
- `nvidia-persistenced` — the persistence daemon can crash silently
- `nv-hostengine` (DCGM) — if DCGM itself dies, no GPU metrics are generated, but NVSentinel doesn't detect the absence
Component
Health Monitor