fix(agent): reap agent process group on normal exit, not just cancellation by Christopher-Li · Pull Request #339 · kunchenguid/no-mistakes

Christopher-Li · 2026-06-26T19:37:01Z

Problem

shellenv.ConfigureShellCommand isolates each agent/command in its own process group (Setpgid) and installs cmd.Cancel to SIGKILL the group - but cmd.Cancel only fires on context cancellation. The success path is never covered.

When the test step's agent (or a configured test/lint command) exits 0, nothing kills the group. Grandchildren that outlived the leader - a vitest tinypool worker pool, a build watcher, a dev server - reparent to init (PPID 1) and keep running. A clean vitest run leaks its worker pool every time, which is the observed RAM-exhaustion (17×~1 GB orphan workers).

So cancellation was handled; clean completion was not.

Fix

Add shellenv.TerminateShellCommandGroup across the build-tag split:
- unix: SIGKILL of -pgid (the group persists while any member is alive, so it still reaps grandchildren that outlived the leader)
- windows: best-effort taskkill /T /F /PID
- other: no-op stub
defer shellenv.TerminateShellCommandGroup(cmd) right after Start() in every native agent runner (claude, pi, acpx, codex) and the shell-command path (runShellCommandWithEnv). The defer covers success, parse-error, and wait-error returns; cmd.Cancel still covers cancellation, where the defer is a harmless ESRCH no-op.
Install a 5s WaitDelay backstop on unix (mirroring the existing windows backstop) so a grandchild holding an inherited stdout/stderr pipe open can't wedge Wait forever.

This fixes the general class, not just vitest: any agent-spawned grandchild leaking after a clean step is now reaped.

Tests

internal/shellenv: TerminateShellCommandGroup reaps a backgrounded grandchild left by a normally-exited leader; nil/never-started cmd is a no-op.
internal/agent: drives claudeAgent.runOnce against a fake claude that backgrounds a grandchild and exits 0; asserts the grandchild is gone after runOnce returns. Confirmed this test fails without the defer (grandchild survives) and passes with it.

Verification

gofmt clean · make lint · go test -race ./... · make e2e · Windows + wasip1 cross-builds all pass.

…ation ConfigureShellCommand isolates each agent/command in its own process group (Setpgid) and installs cmd.Cancel to SIGKILL the group, but cmd.Cancel only fires on context cancellation. The success path was never covered: when the test step's agent (or a configured test/lint command) exits 0, nothing kills the group, so grandchildren that outlived the leader - a vitest tinypool worker pool, a build watcher, a dev server - reparent to init and keep running. A clean vitest run leaked its worker pool every time, exhausting RAM. Add TerminateShellCommandGroup (unix SIGKILL of -pgid, windows best-effort taskkill /T, no-op elsewhere) and defer it after Start in every agent runner (claude, pi, acpx, codex) and the shell command path (runShellCommandWithEnv). The defer covers success, parse-error, and wait-error returns; cmd.Cancel still covers cancellation, where the defer is a harmless ESRCH no-op. Also install a 5s WaitDelay backstop on unix (mirroring windows) so a grandchild holding an inherited stdout/stderr pipe open can't wedge Wait forever. This fixes the general class, not just vitest: any agent-spawned grandchild leaking after a clean step is now reaped.

Christopher-Li mentioned this pull request Jun 26, 2026

no run started for branch #338

Open

kunchenguid mentioned this pull request Jun 28, 2026

fix(agent): reap agent process group on clean exit to prevent daemon OOM crash #357

Merged

This was referenced Jun 29, 2026

[no-mistakes#339] fix(agent): reap agent process group on normal exit, not just cance... kunchenguid/wheelhouse#35

Closed

[no-mistakes#338] no run started for branch kunchenguid/wheelhouse#166

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent): reap agent process group on normal exit, not just cancellation#339

fix(agent): reap agent process group on normal exit, not just cancellation#339
Christopher-Li wants to merge 1 commit into
kunchenguid:mainfrom
Christopher-Li:fix/reap-agent-process-group-on-exit

Christopher-Li commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Christopher-Li commented Jun 26, 2026

Problem

Fix

Tests

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant