Skip to content

fix(agent): reap agent process group on normal exit, not just cancellation#339

Open
Christopher-Li wants to merge 1 commit into
kunchenguid:mainfrom
Christopher-Li:fix/reap-agent-process-group-on-exit
Open

fix(agent): reap agent process group on normal exit, not just cancellation#339
Christopher-Li wants to merge 1 commit into
kunchenguid:mainfrom
Christopher-Li:fix/reap-agent-process-group-on-exit

Conversation

@Christopher-Li

Copy link
Copy Markdown

Problem

shellenv.ConfigureShellCommand isolates each agent/command in its own process group (Setpgid) and installs cmd.Cancel to SIGKILL the group - but cmd.Cancel only fires on context cancellation. The success path is never covered.

When the test step's agent (or a configured test/lint command) exits 0, nothing kills the group. Grandchildren that outlived the leader - a vitest tinypool worker pool, a build watcher, a dev server - reparent to init (PPID 1) and keep running. A clean vitest run leaks its worker pool every time, which is the observed RAM-exhaustion (17×~1 GB orphan workers).

So cancellation was handled; clean completion was not.

Fix

  • Add shellenv.TerminateShellCommandGroup across the build-tag split:
    • unix: SIGKILL of -pgid (the group persists while any member is alive, so it still reaps grandchildren that outlived the leader)
    • windows: best-effort taskkill /T /F /PID
    • other: no-op stub
  • defer shellenv.TerminateShellCommandGroup(cmd) right after Start() in every native agent runner (claude, pi, acpx, codex) and the shell-command path (runShellCommandWithEnv). The defer covers success, parse-error, and wait-error returns; cmd.Cancel still covers cancellation, where the defer is a harmless ESRCH no-op.
  • Install a 5s WaitDelay backstop on unix (mirroring the existing windows backstop) so a grandchild holding an inherited stdout/stderr pipe open can't wedge Wait forever.

This fixes the general class, not just vitest: any agent-spawned grandchild leaking after a clean step is now reaped.

Tests

  • internal/shellenv: TerminateShellCommandGroup reaps a backgrounded grandchild left by a normally-exited leader; nil/never-started cmd is a no-op.
  • internal/agent: drives claudeAgent.runOnce against a fake claude that backgrounds a grandchild and exits 0; asserts the grandchild is gone after runOnce returns. Confirmed this test fails without the defer (grandchild survives) and passes with it.

Verification

gofmt clean · make lint · go test -race ./... · make e2e · Windows + wasip1 cross-builds all pass.

…ation

ConfigureShellCommand isolates each agent/command in its own process group
(Setpgid) and installs cmd.Cancel to SIGKILL the group, but cmd.Cancel only
fires on context cancellation. The success path was never covered: when the
test step's agent (or a configured test/lint command) exits 0, nothing kills
the group, so grandchildren that outlived the leader - a vitest tinypool
worker pool, a build watcher, a dev server - reparent to init and keep
running. A clean vitest run leaked its worker pool every time, exhausting RAM.

Add TerminateShellCommandGroup (unix SIGKILL of -pgid, windows best-effort
taskkill /T, no-op elsewhere) and defer it after Start in every agent runner
(claude, pi, acpx, codex) and the shell command path (runShellCommandWithEnv).
The defer covers success, parse-error, and wait-error returns; cmd.Cancel still
covers cancellation, where the defer is a harmless ESRCH no-op. Also install a
5s WaitDelay backstop on unix (mirroring windows) so a grandchild holding an
inherited stdout/stderr pipe open can't wedge Wait forever.

This fixes the general class, not just vitest: any agent-spawned grandchild
leaking after a clean step is now reaped.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant