cocoon CLI orphans on parent (sudo) death — propose PR_SET_PDEATHSIG=SIGTERM

## Problem

When `cocoon` runs under `sudo` (the common deployment), and the caller of `sudo`
SIGKILLs sudo to cancel the operation (`exec.CommandContext` does this on
ctx cancellation), `sudo` dies but the cocoon grandchild does **not** get a
signal — the kernel reparents it to PID 1 and it keeps running until its
in-progress work finishes naturally.

Concrete observation in vk-cocoon's `runPostCloneSetup`:

1. vk-cocoon spawns `sudo cocoon vm exec <vmid> -- powershell ...` via
   `exec.CommandContext(ctx, "sudo", ...)`.
2. cocoon dials cocoon-agent over hybrid-vsock and waits for the agent to
   complete the PowerShell PnP-rebind (sometimes 60s+ on Windows clones).
3. vk-cocoon's loopCtx (180s budget) fires; `cmd.Cancel()` → SIGKILL to sudo.
4. sudo dies. cocoon survives, holds vsock UDS open, keeps streaming bytes
   to/from the agent.
5. vk-cocoon's `cmd.Wait()` would block on the orphan's stdout pipe; we
   work around that with a `select { case <-done: case <-loopCtx.Done(): }`
   — but the orphan cocoon process (and its FDs) leak until the agent finally
   answers. With many stuck clones this accumulates orphan processes.

`cocoon vm exec` is in-process today (no subprocess from cocoon's side), so
the leak is bounded to one orphan per stuck call. Still ugly under load.

## Why this isn't fully a caller-side fix

The caller (vk-cocoon, or any sudo-wrapped invocation) can use
`SysProcAttr{Setpgid: true}` and kill the whole pgid — that does work and we
plan to do it on vk-cocoon's side regardless. But:

- It's a per-caller mitigation; every cocoon CLI consumer has to remember to
  do it.
- SIGKILL gives cocoon no chance to flush state (close vsock connection
  cleanly, release agent-side resources, write final logs).
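- Setting `Pdeathsig` on the immediate child does not reach cocoon: the setting
  applies to the direct child only and is cleared in children created by
  `fork`. Per prctl(2) it is also cleared when a set-user-ID binary such as
  sudo is executed, so even sudo may not receive the signal.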

## Proposed fix in cocoon

Add `prctl(PR_SET_PDEATHSIG, SIGTERM)` at startup in package `main` (Linux only, build-tagged).
When cocoon's parent process dies — sudo crashing, killed by ctx, or a
caller force-quitting — the kernel signals cocoon directly. cocoon already has
`signal.NotifyContext(ctx, SIGINT, SIGTERM)` at `cmd/root.go:86`, so the
existing ctx-cancellation paths (including `f1f641a`'s vsock CONNECT honor-ctx
fix and any future cancellable IO) take over and shut down gracefully.

Sketch:

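```go
// main_linux.go
//go:build linux

package main

import "golang.org/x/sys/unix"

func init() {
	// Ask the kernel to send SIGTERM to this process when its parent dies.
	// Per prctl(2), the setting is cleared in children created by fork and
	// kept across execve (except for set-user-ID/set-group-ID binaries).
	// Go runs init functions on the startup thread, before main begins.
	// The error is ignored: if prctl fails, cocoon behaves as it does today.
	_ = unix.Prctl(unix.PR_SET_PDEATHSIG, uintptr(unix.SIGTERM), 0, 0, 0)
}

// main_other.go
//go:build !linux

package main

// No parent-death signal on other platforms; the empty init is a no-op
// kept only to mirror the Linux file.
func init() {}
```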

This:

- Makes cocoon robust under any sudo / supervisor / docker-style parent
  without each caller having to engineer pgid handling.
- Plays well with the SIGINT/SIGTERM handler already at `cmd/root.go:86` —
  ctx is canceled, in-flight `vm exec` / `vm clone` / `snapshot save` paths
  unwind through their existing ctx-aware code, vsock connections close
  cleanly, agent sees EOF and reaps its child.

## Out of scope for this issue

Long-running internal subprocesses (cloud-hypervisor, firecracker) already
use `Setpgid: true` and survive cocoon's death intentionally — that's a
separate design decision and not affected by adding PR_SET_PDEATHSIG to
cocoon's own main.

## Test plan

- Add a test that runs `cocoon vm exec` against a stub agent under an
  intermediate wrapper process (spawned by the test harness, standing in for
  sudo), SIGKILLs the wrapper, and confirms that the cocoon process exits
  non-zero within ~1s.
- Manual: run `sudo cocoon vm exec <stuck-vm> -- some-hung-cmd`, `kill -9`
  the sudo process, and verify that cocoon exits (currently it stays alive
  until the command finishes).

## Related

- Caller-side fix (vk-cocoon `feat/post-clone-auto-exec` branch): Setpgid +
  pgid-kill. It will land separately and complements this fix.
- cocoonv2 `f1f641a` made vsock CONNECT honor ctx. This issue follows the same
  theme of graceful cancellation and extends it to caller-driven exits.
