Skip to content

fix(testgen): kill the whole subprocess tree on tg start/install shutdown#88

Merged
aarthy-dk merged 3 commits into
mainfrom
fix-tg-orphan-postgres
May 7, 2026
Merged

fix(testgen): kill the whole subprocess tree on tg start/install shutdown#88
aarthy-dk merged 3 commits into
mainfrom
fix-tg-orphan-postgres

Conversation

@aarthy-dk
Copy link
Copy Markdown
Contributor

@aarthy-dk aarthy-dk commented May 6, 2026

Summary

  • proc.terminate() on Windows calls TerminateProcess, which kills only the parent and leaves pixeltable-pgserver's postgres.exe children running. The visible symptom is tg delete failing with "Could not remove TestGen data directory" because postgres holds the data files open.
  • New stop_app_tree(proc, timeout=...) helper walks the descendant tree: taskkill /F /T /PID on Windows, os.killpg(SIGTERM) (with SIGKILL escalation) on POSIX.
  • start_testgen_app now spawns testgen with start_new_session=True on POSIX so the whole group can be signaled together. All three termination sites (port-readiness timeout, KeyboardInterrupt handler, finally cleanup) route through the new helper.

Scope note

This is a force-kill safety net for uncatchable termination paths: installer crash, terminal closed via the X button, hang in testgen, port timeout. Graceful shutdown on Ctrl+C is a complementary fix in pixeltable-pgserver / testgen's own signal handling — that change is happening separately on the TestGen side and doesn't make this one redundant. The two fixes cover different failure modes.

Also in this PR: explicit Docker Compose prereq check

Previously a machine with the docker CLI + a running engine but no Compose plugin (e.g. distro docker.io without docker-compose-plugin) would pass the auto-mode picker's prereq display, then fail later inside ComposeStartStep / docker compose exec.

  • New REQ_DOCKER_COMPOSE (probes docker compose version).
  • Threaded into every action that actually shells out to docker compose: Obs install/upgrade/delete and the Docker-mode TestGen actions (install/upgrade/start/delete/run-demo/delete-demo). DemoContainerAction and the pip+--export branch of tg run-demo keep the original Docker-only list since those paths only use docker run.
  • The auto-mode picker hides REQ_DOCKER from its rendered prereq line (Docker installed is implied by Docker Compose installed) — purely cosmetic; the underlying check still runs so a tg install --docker invocation on a Docker-less box still gets the precise "Install Docker and try again" message.

Test plan

  • pytest — 182 passing locally (4 new unit tests for the kill-tree helper covering: no-op when proc already exited, Windows taskkill path, POSIX killpg path, SIGKILL escalation on timeout)
  • ruff check + ruff format --check clean
  • Reviewer manual check on Windows: install → Ctrl+C → Get-Process postgres -ErrorAction SilentlyContinue (some orphans expected until the testgen-side graceful shutdown lands; this PR ensures the installer's own unilateral termination paths don't leak)
  • Reviewer manual check on a Linux box with docker.io but no docker-compose-plugin: tg install should now fail prereq check with "Install the Docker Compose plugin and try again" instead of crashing inside the install steps

🤖 Generated with Claude Code

aarthy-dk and others added 2 commits May 6, 2026 00:36
…down

`proc.terminate()` on Windows calls `TerminateProcess` and only kills the
parent — pixeltable-pgserver's `postgres.exe` children get orphaned. The
symptom: `tg delete` fails with "Could not remove TestGen data directory"
because postgres holds the data files open.

New `stop_app_tree(proc, timeout=...)` walks the descendant tree:
- Windows: shells out to `taskkill /F /T /PID` to walk the WMI process tree.
- POSIX: sends SIGTERM to the process group (parent is now spawned with
  `start_new_session=True`), escalating to SIGKILL on timeout.

Routed all three termination sites in `start_testgen_app` through the
helper (port-readiness timeout, KeyboardInterrupt handler, finally
cleanup). Added unit tests covering the platform branching and the
SIGKILL escalation path.

Note: this is a force-kill safety net for uncatchable termination paths
(installer crash, terminal X-button close, hang). Graceful shutdown on
Ctrl+C is a separate fix in pixeltable-pgserver / testgen's signal
handling — see the issue we drafted for the testgen team.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub's Linux runner runs the test container as root. Existing
Popen-mocked tests (like ``test_start_testgen_app_happy_path``) routed
through ``stop_app_tree`` in the new ``finally`` block, where
``os.killpg(os.getpgid(proc.pid), SIGTERM)`` resolved to ``pgid 1``
(``MagicMock.__index__`` defaults to 1) and actually signalled init —
killing the container and producing "The runner has received a shutdown
signal". macOS swallowed the same call as ``EPERM``; CI didn't.

Adds an autouse fixture that patches ``os.killpg`` and ``os.getpgid`` so
no test can accidentally signal a real process group. The 4 unit tests
for ``stop_app_tree`` already wrap their own ``patch`` over these and
keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aarthy-dk aarthy-dk requested review from diogodk, luis-dk and rboni-dk May 6, 2026 04:54
@aarthy-dk aarthy-dk force-pushed the fix-tg-orphan-postgres branch from 168a27e to 32d5775 Compare May 7, 2026 20:11
Previously the installer only verified that the docker CLI and engine
were available. A user with the CLI but no Compose plugin (e.g. distro
docker.io without docker-compose-plugin) would pass the picker's
prereq display, then fail later inside ComposeStartStep / docker
compose exec. Adds REQ_DOCKER_COMPOSE (probes docker compose version)
and threads it through every action that shells out to docker compose
— Obs install/upgrade/delete and the Docker-mode TestGen actions
(install/upgrade/start/delete/run-demo/delete-demo). DemoContainerAction
and the pip+--export branch of tg run-demo keep the original Docker-only
list since those paths only use docker run.

The auto-mode picker hides REQ_DOCKER from its prereq line — its failure
is implied by REQ_DOCKER_COMPOSE failing, and a precise per-prereq fix
message still fires at command-line invocation time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aarthy-dk aarthy-dk merged commit c4b1e70 into main May 7, 2026
3 checks passed
@aarthy-dk aarthy-dk deleted the fix-tg-orphan-postgres branch May 7, 2026 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants