
feat(health-monitor): host-side silent-fail detection and operator alerting#2498

Open
alexli-77 wants to merge 1 commit into nanocoai:main from alexli-77:feat/health-monitor


Summary

  • Adds src/modules/health-monitor/ module: 5-min timer that detects silent task failures and posts direct Discord alerts
  • Adds pre-spawn OAuth token refresh from macOS Keychain in buildMounts()

Problem

Stuck-container detection catches hung containers but misses containers that complete with zero output — the silent 401 auth failure pattern:

  1. Token expires → container spawns → Claude binary fails with a 401
  2. Agent-runner writes processing_ack = completed (processed the message, result was empty)
  3. Container idles → 30 min later absolute-ceiling kills it
  4. Operator has no idea. Scheduled task looks like it ran.
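
The failure signature in the steps above can be sketched as a pure predicate. This is only an illustration: the `SessionSnapshot` shape and its field names are hypothetical (the PR does not show the real schema), and the 2-hour window is taken from the description of `checkSilentFail()`.

```typescript
// Hypothetical snapshot of one session; the real DB schema is not shown in the PR.
interface SessionSnapshot {
  sessionId: string;
  processingAck: "pending" | "completed" | null;
  ackAtMs: number | null;       // when the ack was written (epoch ms)
  messagesOutInWindow: number;  // outbound messages counted in the same window
  containerRunning: boolean;
}

const WINDOW_MS = 2 * 60 * 60 * 1000; // 2h window, per the PR description

// True when the session acked "completed" inside the window, produced no
// output, and its container has already stopped.
function isSilentFail(s: SessionSnapshot, nowMs: number): boolean {
  if (s.processingAck !== "completed" || s.ackAtMs === null) return false;
  const ageMs = nowMs - s.ackAtMs;
  const ackInWindow = ageMs >= 0 && ageMs <= WINDOW_MS;
  return ackInWindow && s.messagesOutInWindow === 0 && !s.containerRunning;
}
```

Requiring the container to be stopped avoids flagging a session that is still working and simply has not produced output yet.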

Changes

src/modules/health-monitor/ (new module):

  • setup.ts: idempotent DB bootstrap (agent group + messaging group + wiring + named destination)
  • checks.ts: checkSilentFail() (ack with no output in 2h window, container stopped) + checkTokenExpiry()
  • alert.ts: direct Discord REST to configurable keepalive channel (bypasses routing, so it works even if routing is broken) + task injection into health-monitor session
  • index.ts: 5-min timer, 1h dedup per issue key, startHealthMonitor() called after initDb() via MODULE-HOOK
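
The timer and dedup behaviour listed for index.ts might look roughly like the sketch below. The names `AlertDeduper`, `runTick`, and `startTimerSketch` are hypothetical and not taken from the PR; only the 5-minute interval and the 1-hour dedup window come from the description.

```typescript
const CHECK_INTERVAL_MS = 5 * 60 * 1000; // 5-min timer
const DEDUP_WINDOW_MS = 60 * 60 * 1000;  // 1h dedup per issue key

// Remembers when each issue key last alerted and suppresses repeats
// that arrive inside the dedup window.
class AlertDeduper {
  private lastSent = new Map<string, number>();

  shouldSend(issueKey: string, nowMs: number): boolean {
    const prev = this.lastSent.get(issueKey);
    if (prev !== undefined && nowMs - prev < DEDUP_WINDOW_MS) return false;
    this.lastSent.set(issueKey, nowMs);
    return true;
  }
}

// A check returns the issue keys it currently sees (empty when healthy).
type Check = () => string[];

// One timer tick: run every check and alert on keys that pass dedup.
function runTick(
  checks: Check[],
  deduper: AlertDeduper,
  alert: (issueKey: string) => void,
  nowMs: number,
): void {
  for (const check of checks) {
    for (const key of check()) {
      if (deduper.shouldSend(key, nowMs)) alert(key);
    }
  }
}

// Schedules a tick every 5 minutes; the caller keeps the handle to stop it.
function startTimerSketch(tick: () => void): ReturnType<typeof setInterval> {
  return setInterval(tick, CHECK_INTERVAL_MS);
}
```

Keeping the dedup state in memory means a service restart can re-send an alert within the hour; whether the real module persists this state is not stated in the PR.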

src/container-runner.ts: pre-spawn Keychain token refresh in buildMounts(). try/catch, no-op on non-macOS.
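
A minimal sketch of the Keychain read, assuming it shells out to the macOS `security` CLI. The function name, the injectable `Runner`, and the 5-second timeout are illustrative choices, not the PR's code; the service name `Claude Code-credentials` is the one given in the commit message. What the real code does with the value afterwards is not shown here.

```typescript
import { execFileSync } from "node:child_process";

// Injectable command runner so the logic can be exercised off macOS.
type Runner = (cmd: string, args: string[]) => string;

const defaultRunner: Runner = (cmd, args) =>
  execFileSync(cmd, args, {
    encoding: "utf8",
    timeout: 5000,
    stdio: ["ignore", "pipe", "ignore"],
  });

// Returns the raw Keychain secret, or null when not on macOS or when the
// lookup fails. It never throws, so a failed refresh cannot block a spawn.
function readKeychainCredentials(
  platform: string = process.platform,
  run: Runner = defaultRunner,
): string | null {
  if (platform !== "darwin") return null; // no-op on non-macOS
  try {
    const out = run("security", [
      "find-generic-password",
      "-s",
      "Claude Code-credentials",
      "-w", // print only the password data
    ]);
    const trimmed = out.trim();
    return trimmed.length > 0 ? trimmed : null;
  } catch {
    return null; // item missing, Keychain locked, or command timed out
  }
}
```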

src/index.ts + src/modules/index.ts: MODULE-HOOK wiring.

Test plan

  • Service starts cleanly, [health-monitor] Started in logs
  • DB rows created on startup (idempotent)
  • Health-monitor container spawns and delivers <message to="keepalive"> to Discord
  • pnpm run build passes

Related upstream issues

🤖 Generated with Claude Code

…erting

Detects "can't run" level failures that the existing stuck-container
detection misses: sessions that produce a processing_ack=completed but
zero messages_out in the same 2-hour window — the signature of a silent
OAuth 401 auth failure swallowed by the agent-runner.

New module: src/modules/health-monitor/
  - setup.ts: idempotent DB bootstrap — agent group, Discord messaging
    group, wiring, named 'keepalive' destination. Discord guild/channel
    IDs read from HEALTH_MONITOR_DISCORD_GUILD_ID and
    HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID in .env.
  - checks.ts: checkSilentFail() (ack=completed + messages_out=0 in 2h,
    container stopped) + checkTokenExpiry() (minutesLeft < 60)
  - alert.ts: direct Discord REST POST to keepalive channel (bypasses
    nanoclaw routing, fires even when the host is degraded) + task
    injection into the health-monitor agent session for investigation
  - index.ts: 5-min timer, 1h dedup per issue key, startHealthMonitor()
    (must run after initDb)
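
As a rough illustration of how the `minutesLeft < 60` check could feed the direct Discord POST, the sketch below builds the request without sending it. The function names and the `botToken` parameter are assumptions (the PR does not say how the request is authenticated); the endpoint and the `Bot` authorization scheme follow Discord's public REST API, and the channel ID would come from `HEALTH_MONITOR_KEEPALIVE_CHANNEL_ID`.

```typescript
const EXPIRY_WARNING_MINUTES = 60; // threshold from the commit message

// Returns an issue description when the token is expired or close to
// expiry, otherwise null.
function tokenExpiryIssue(expiresAtMs: number, nowMs: number): string | null {
  if (expiresAtMs <= nowMs) return "OAuth token has expired";
  const minutesLeft = Math.floor((expiresAtMs - nowMs) / 60_000);
  return minutesLeft < EXPIRY_WARNING_MINUTES
    ? `OAuth token expires in ${minutesLeft} min`
    : null;
}

interface DiscordRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds a Create Message request for the keepalive channel.
// Discord caps message content at 2000 characters, so the text is truncated.
function buildDiscordAlert(
  channelId: string,
  botToken: string, // hypothetical: the auth mechanism is not shown in the PR
  content: string,
): DiscordRequest {
  return {
    url: `https://discord.com/api/v10/channels/${channelId}/messages`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bot ${botToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ content: content.slice(0, 2000) }),
    },
  };
}
```

A caller could then run `const req = buildDiscordAlert(channelId, token, issue); await fetch(req.url, req.init);`. Building the request separately from sending it keeps the alert path free of any dependency on the host's own routing.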

src/index.ts: MODULE-HOOK to start health-monitor after DB init
src/modules/index.ts: import health-monitor module
src/container-runner.ts: pre-spawn Keychain copy — reads
  'Claude Code-credentials' before every container spawn so the
  claude.json always reflects the current token

Upstream issues: nanocoai#730 (token expiry), nanocoai#2492 (health-monitor proposal)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>