Fix/orphaned worker processes#3481
Conversation
When pnpm sends SIGKILL to the CLI process tree, SIGINT/SIGTERM handlers never run, leaving trigger-dev-run-worker child processes alive as zombies consuming significant CPU. The existing watchdog (triggerdotdev#3191) handles server-side run cancellation but does not kill the OS-level worker processes. This fix: - Adds getAllPids() to TaskRunProcessPool to collect PIDs from both available and busy process maps - Periodically refreshes active-runs.json (every 2s) so workerPids stays current as processes are spawned and returned to the pool - Extends the watchdog's onParentDied() to SIGTERM all tracked worker PIDs, wait 3s, then SIGKILL any survivors before calling disconnect Fixes triggerdotdev#2909
🦋 Changeset detectedLatest commit: 85952d8 The changes in this PR will be included in the next version bump. This PR includes changesets to release 29 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
WalkthroughThis PR implements cleanup logic for orphaned Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Key observations
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.Comment |
|
Hi @thegodtune, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details. |
|
Note on approach vs. previous attempts (#2993, #3041): |
Closes #2909
✅ Checklist
Testing
taskRunProcess.test.tssuite, passes cleantaskRunProcessPool.test.tscoveringgetAllPids()on a fresh poolhttp://localhost:3030active-runs.jsoncontained real PIDs inworkerPidskill -9 <CLI_PID>, bypassing all signal handlers, exactly what pnpm doesps -p <pid1>,<pid2>,<pid3>-> All dead ✔️Changelog
Fix orphaned trigger-dev-run-worker processes that accumulate and consume significant CPU when the CLI is killed ungracefully via SIGKILL. The watchdog now reads worker PIDs from active-runs.json and kills them when the parent CLI process dies.
Screenshots
Fix: Kill orphaned worker processes when the CLI is killed ungracefully
Problem
When running
trigger.dev devthrough pnpm and the session is stopped,trigger-dev-run-workerchild processes are left alive on the machine, consuming significant CPU, up to 450%+ combined after several restarts.The root cause is how pnpm handles process termination. When you do Ctrl+C, pnpm sends SIGKILL directly to the CLI process, not SIGTERM. SIGKILL cannot be caught or handled. Node.js signal handlers (
process.on("SIGINT", ...),process.on("SIGTERM", ...)) never run. The gracefulshutdown()path, which callstaskRunProcessPool.shutdown()and kills all tracked worker processes, is bypassed entirely.On Linux and macOS, child processes are not automatically killed when their parent dies. So the
trigger-dev-run-workerprocesses spawned byTaskRunProcess.initialize()viafork()continue running indefinitely, orphaned, with no parent to report to and no work to do.What already existed
PR #3191 introduced a detached watchdog process (
devWatchdog.ts) that survives SIGKILL and handles server-side cleanup. It polls for parent death, then calls/engine/v1/dev/disconnectto cancel in-flight runs on the server. This is correct and important.However, the watchdog only addresses the server's view of those runs. It does not kill the actual OS-level worker processes on the user's machine. Those processes keep running regardless of what the API call does.
How I found it
Tracing the codebase from the issue report:
DevSupervisor.init()registers SIGINT/SIGTERM handlers and spawns the watchdog, but those handlers are unreachable under SIGKILL.TaskRunProcessPoolmanages two maps:availableProcessesByVersion(idle, reusable processes) andbusyProcessesByVersion(actively executing). Both are populated withTaskRunProcessinstances, each wrapping a forked child process with a known PID.DevSupervisor.#updateActiveRunsFile()writesactive-runs.jsonto.trigger/in the user's project directory, the file the watchdog reads on parent death. It containedparentPidandrunFriendlyIdsbut not the worker PIDs.devWatchdog.tsreads that file inonParentDied(), calls disconnect, and exits. No process killing.The gap: the watchdog had everything it needed to cancel runs on the server, but no information about which OS processes to kill locally.
What I changed
Three files, one new test file.
1.
packages/cli-v3/src/dev/taskRunProcessPool.tsAdded
getAllPids(), which collects PIDs from both the available and busy process maps:This includes both idle pooled processes and actively executing ones; both are orphaned under SIGKILL.
2.
packages/cli-v3/src/dev/devSupervisor.tsTwo changes here:
#updateActiveRunsFile(), now includesworkerPidsalongside the existing fields:Periodic refresh interval, I discovered during testing that a timing issue exists: worker processes are spawned after
#updateActiveRunsFile()is first called when a run is dequeued, so the file would be written before the PID existed, leavingworkerPidsempty. A 2-second refresh interval keeps the file current as processes enter and leave the pool:The interval is cleared on clean shutdown, so it doesn't interfere with the normal Ctrl+C exit path.
3.
packages/cli-v3/src/dev/devWatchdog.tsUpdated
readActiveRuns()to return the newworkerPidsfield, and addedkillWorkerProcesses()called at the start ofonParentDied():Worker processes are killed before the disconnect API call; there's no dependency between the two, but it makes sense to handle the local machine first.
How I tested it
Unit tests: ran the existing
taskRunProcess.test.tssuite (passes clean) and addedtaskRunProcessPool.test.tscoveringgetAllPids()returning an empty array on a fresh pool and returning only defined numeric values.Integration test against a local self-hosted instance: ran the full Docker stack and webapp locally, then:
http://localhost:3030active-runs.jsoncontained real PIDs inworkerPidskill -9 <pid>) — bypassing all signal handlers, exactly what pnpm doesps -p <pid1>,<pid2>,<pid3>→ All dead ✓Before this fix, the same sequence left all worker processes running indefinitely. After, they're gone within the watchdog's poll interval.
Backward compatibility
The change to
active-runs.jsonis additive;workerPidsdefaults to[]if the field is missing, so any existing watchdog reading an old-format file degrades gracefully. The periodic interval only runs during an active dev session and is always cleared on clean shutdown, leaving the normal Ctrl+C path completely unaffected.Fixes #2909