Fix/orphaned worker processes #3481

Closed
thegodtune wants to merge 2 commits into triggerdotdev:main from
thegodtune:fix/orphaned-worker-processes

Conversation

@thegodtune

Closes #2909

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention
  • I ran and tested the code works

Testing

  • Ran existing taskRunProcess.test.ts suite, passes clean
  • Added taskRunProcessPool.test.ts covering getAllPids() on a fresh pool
  • Integration test against local self-hosted instance:
    1. Started CLI with locally built binary against http://localhost:3030
    2. Triggered a 60-second sleep task to force worker processes to spawn
    3. Confirmed active-runs.json contained real PIDs in workerPids
    4. Sent kill -9 <CLI_PID>, bypassing all signal handlers, exactly what pnpm does
    5. Waited 5 seconds for watchdog poll cycle
    6. ps -p <pid1>,<pid2>,<pid3> -> All dead ✔️

Changelog

Fix orphaned trigger-dev-run-worker processes that accumulate and consume significant CPU when the CLI is killed ungracefully via SIGKILL. The watchdog now reads worker PIDs from active-runs.json and kills them when the parent CLI process dies.


Screenshots


Fix: Kill orphaned worker processes when the CLI is killed ungracefully

Problem

When running trigger.dev dev through pnpm and the session is stopped, trigger-dev-run-worker child processes are left alive on the machine, consuming significant CPU, up to 450%+ combined after several restarts.

The root cause is how pnpm handles process termination. When you press Ctrl+C, pnpm sends SIGKILL directly to the CLI process, not SIGTERM. SIGKILL cannot be caught or handled, so the Node.js signal handlers (process.on("SIGINT", ...), process.on("SIGTERM", ...)) never run. The graceful shutdown() path, which calls taskRunProcessPool.shutdown() and kills all tracked worker processes, is bypassed entirely.

On Linux and macOS, child processes are not automatically killed when their parent dies. So the trigger-dev-run-worker processes spawned by TaskRunProcess.initialize() via fork() continue running indefinitely, orphaned, with no parent to report to and no work to do.

What already existed

PR #3191 introduced a detached watchdog process (devWatchdog.ts) that survives SIGKILL and handles server-side cleanup. It polls for parent death, then calls /engine/v1/dev/disconnect to cancel in-flight runs on the server. This is correct and important.

However, the watchdog only addresses the server's view of those runs. It does not kill the actual OS-level worker processes on the user's machine. Those processes keep running regardless of what the API call does.
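
For context, the "poll for parent death" step can be sketched roughly as follows. This is a minimal illustration, not the code in devWatchdog.ts: waitForParentDeath, the IsAlive callback, and the pollMs parameter are hypothetical names chosen for this example. In Node.js the liveness check is commonly implemented with process.kill(pid, 0), which sends no signal and throws if the process no longer exists.

```typescript
// Hypothetical liveness probe; in Node.js this could wrap process.kill(pid, 0).
type IsAlive = (pid: number) => boolean;

// Resolves once the parent is no longer alive, checking every pollMs milliseconds.
function waitForParentDeath(
  parentPid: number,
  isAlive: IsAlive,
  pollMs: number
): Promise<void> {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (!isAlive(parentPid)) {
        // Stop polling before resolving so the timer does not keep firing.
        clearInterval(timer);
        resolve();
      }
    }, pollMs);
  });
}
```

Injecting the probe keeps the polling logic testable without spawning or killing real processes.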

How I found it

Tracing the codebase from the issue report:

  1. DevSupervisor.init() registers SIGINT/SIGTERM handlers and spawns the watchdog, but those handlers are unreachable under SIGKILL.
  2. TaskRunProcessPool manages two maps: availableProcessesByVersion (idle, reusable processes) and busyProcessesByVersion (actively executing). Both are populated with TaskRunProcess instances, each wrapping a forked child process with a known PID.
  3. DevSupervisor.#updateActiveRunsFile() writes active-runs.json to .trigger/ in the user's project directory, the file the watchdog reads on parent death. It contained parentPid and runFriendlyIds but not the worker PIDs.
  4. devWatchdog.ts reads that file in onParentDied(), calls disconnect, and exits. No process killing.

The gap: the watchdog had everything it needed to cancel runs on the server, but no information about which OS processes to kill locally.

What I changed

Three files, one new test file.

1. packages/cli-v3/src/dev/taskRunProcessPool.ts

Added getAllPids(), which collects PIDs from both the available and busy process maps:

getAllPids(): number[] {
  const pids: number[] = [];
  for (const processes of this.availableProcessesByVersion.values()) {
    for (const process of processes) {
      if (process.pid !== undefined) pids.push(process.pid);
    }
  }
  for (const processSet of this.busyProcessesByVersion.values()) {
    for (const process of processSet) {
      if (process.pid !== undefined) pids.push(process.pid);
    }
  }
  return pids;
}

This includes both idle pooled processes and actively executing ones; both are orphaned under SIGKILL.
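
To illustrate the behaviour on a populated pool, here is a self-contained sketch. ProcessPoolSketch and PooledProcess are simplified stand-ins invented for this example, not the real TaskRunProcessPool or TaskRunProcess; only the two map names and the body of getAllPids() are taken from the PR, and the PIDs are made up.

```typescript
// Simplified stand-in for TaskRunProcess: only the pid matters here.
interface PooledProcess {
  pid: number | undefined;
}

class ProcessPoolSketch {
  // Keyed by version, mirroring the two maps described in the PR.
  availableProcessesByVersion = new Map<string, PooledProcess[]>();
  busyProcessesByVersion = new Map<string, Set<PooledProcess>>();

  // Collects defined PIDs from idle processes first, then busy ones.
  getAllPids(): number[] {
    const pids: number[] = [];
    for (const processes of this.availableProcessesByVersion.values()) {
      for (const proc of processes) {
        if (proc.pid !== undefined) pids.push(proc.pid);
      }
    }
    for (const processSet of this.busyProcessesByVersion.values()) {
      for (const proc of processSet) {
        if (proc.pid !== undefined) pids.push(proc.pid);
      }
    }
    return pids;
  }
}

const pool = new ProcessPoolSketch();
// One idle process with a PID, one that has not been assigned a PID yet.
pool.availableProcessesByVersion.set("v1", [{ pid: 101 }, { pid: undefined }]);
pool.busyProcessesByVersion.set("v1", new Set([{ pid: 202 }]));
console.log(pool.getAllPids()); // [ 101, 202 ]
```

The undefined entry is skipped, which matches the "only defined numeric values" property the unit test is described as checking.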

2. packages/cli-v3/src/dev/devSupervisor.ts

Two changes here:

#updateActiveRunsFile() now includes workerPids alongside the existing fields:

const data = {
  parentPid: process.pid,
  runFriendlyIds: Array.from(this.runControllers.keys()),
  workerPids: this.taskRunProcessPool?.getAllPids() ?? [],
};

Periodic refresh interval: during testing I found a timing issue. Worker processes are spawned after #updateActiveRunsFile() is first called when a run is dequeued, so the file was written before the PID existed, leaving workerPids empty. A 2-second refresh interval keeps the file current as processes enter and leave the pool:

// In init():
this.activeRunsUpdateInterval = setInterval(() => {
  this.#updateActiveRunsFile();
}, 2_000);

// In shutdown():
if (this.activeRunsUpdateInterval) {
  clearInterval(this.activeRunsUpdateInterval);
}

The interval is cleared on clean shutdown, so it doesn't interfere with the normal Ctrl+C exit path.
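
The start/stop pairing can also be sketched in isolation. startPeriodicRefresh and its parameters are hypothetical names for this example; in the PR itself the timer is stored on this.activeRunsUpdateInterval and the refresh callback is #updateActiveRunsFile().

```typescript
// Calls refresh() every intervalMs and returns a function that stops the timer.
// Hypothetical helper: refresh stands in for #updateActiveRunsFile().
function startPeriodicRefresh(
  refresh: () => void,
  intervalMs: number
): () => void {
  const timer = setInterval(refresh, intervalMs);
  return () => clearInterval(timer);
}
```

An active interval keeps the Node.js event loop alive until it is cleared, which is one more reason the clearInterval() call in shutdown() matters.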

3. packages/cli-v3/src/dev/devWatchdog.ts

Updated readActiveRuns() to return the new workerPids field, and added killWorkerProcesses() called at the start of onParentDied():

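async function killWorkerProcesses(pids: number[]): Promise<void> {
  // First pass: ask every tracked worker to exit gracefully.
  for (const pid of pids) {
    try { process.kill(pid, "SIGTERM"); } catch { /* Already dead */ }
  }

  if (pids.length === 0) return;

  // Grace period before escalating.
  await new Promise((resolve) => setTimeout(resolve, 3_000));

  // Second pass: signal 0 only probes for existence and throws if the
  // process is gone, so SIGKILL is sent only to survivors.
  for (const pid of pids) {
    try {
      process.kill(pid, 0);
      process.kill(pid, "SIGKILL");
    } catch { /* Already dead, good */ }
  }
}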

Worker processes are killed before the disconnect API call; there's no dependency between the two, but it makes sense to handle the local machine first.
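
The SIGTERM, wait, SIGKILL escalation can be exercised without real processes if the kill call is injected. The sketch below is an assumption-laden variant, not the code in this PR: terminateWithGrace, KillFn, and the graceMs parameter are hypothetical names, and returning the list of force-killed PIDs is an addition made purely for illustration.

```typescript
type Signal = "SIGTERM" | "SIGKILL" | 0;

// Injected so the logic can run against a fake process table.
// In Node.js this could be (pid, signal) => process.kill(pid, signal).
type KillFn = (pid: number, signal: Signal) => void;

// Sends SIGTERM to every pid, waits graceMs, then SIGKILLs any survivors.
// Returns the pids that needed SIGKILL.
async function terminateWithGrace(
  pids: number[],
  kill: KillFn,
  graceMs: number
): Promise<number[]> {
  for (const pid of pids) {
    try {
      kill(pid, "SIGTERM");
    } catch {
      /* already gone */
    }
  }
  if (pids.length === 0) return [];

  await new Promise<void>((resolve) => setTimeout(resolve, graceMs));

  const forced: number[] = [];
  for (const pid of pids) {
    try {
      kill(pid, 0); // existence probe; throws if the process is gone
      kill(pid, "SIGKILL");
      forced.push(pid);
    } catch {
      /* exited during the grace period */
    }
  }
  return forced;
}
```

A test can then pass a fake kill function backed by a Set of "live" PIDs and assert which ones were escalated to SIGKILL.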

How I tested it

Unit tests: ran the existing taskRunProcess.test.ts suite (passes clean) and added taskRunProcessPool.test.ts covering getAllPids() returning an empty array on a fresh pool and returning only defined numeric values.

Integration test against a local self-hosted instance: ran the full Docker stack and webapp locally, then:

  1. Started the CLI with the locally built binary pointed at http://localhost:3030
  2. Triggered a long-running task (60-second sleep) to force worker processes to spawn
  3. Confirmed active-runs.json contained real PIDs in workerPids
  4. Sent SIGKILL to the CLI process (kill -9 <pid>) — bypassing all signal handlers, exactly what pnpm does
  5. Waited 5 seconds for the watchdog's poll cycle to detect parent death
  6. Checked all worker PIDs: ps -p <pid1>,<pid2>,<pid3> -> All dead ✓

Before this fix, the same sequence left all worker processes running indefinitely. After, they're gone within the watchdog's poll interval.

Backward compatibility

The change to active-runs.json is additive; workerPids defaults to [] if the field is missing, so a watchdog that reads an old-format file degrades gracefully and simply has no local processes to kill. The periodic interval only runs during an active dev session and is always cleared on clean shutdown, leaving the normal Ctrl+C path completely unaffected.
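
The defaulting behaviour can be sketched as a small parser. This is a hypothetical illustration, not the actual readActiveRuns() implementation: the ActiveRuns interface and parseActiveRuns are names invented here, the sample JSON values are made up, and the filter for positive integers is an extra assumption rather than something the PR describes.

```typescript
interface ActiveRuns {
  parentPid: number;
  runFriendlyIds: string[];
  workerPids: number[];
}

// Parses the file contents, treating a missing or malformed workerPids as [].
function parseActiveRuns(raw: string): ActiveRuns {
  const data = JSON.parse(raw) as Partial<ActiveRuns>;
  return {
    parentPid: Number(data.parentPid),
    runFriendlyIds: Array.isArray(data.runFriendlyIds) ? data.runFriendlyIds : [],
    // Assumption: keep only positive integers so a bad entry is never signalled.
    workerPids: Array.isArray(data.workerPids)
      ? data.workerPids.filter((pid) => Number.isInteger(pid) && pid > 0)
      : [],
  };
}

// Old-format file (no workerPids field); values are illustrative only.
const oldFormat = parseActiveRuns('{"parentPid":4242,"runFriendlyIds":["run_abc"]}');
console.log(oldFormat.workerPids.length); // 0

// New-format file with worker PIDs recorded.
const newFormat = parseActiveRuns(
  '{"parentPid":4242,"runFriendlyIds":["run_abc"],"workerPids":[5001,5002]}'
);
console.log(newFormat.workerPids.length); // 2
```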

Fixes #2909


When pnpm sends SIGKILL to the CLI process, SIGINT/SIGTERM
handlers never run, leaving trigger-dev-run-worker child processes
alive as orphans consuming significant CPU.

The existing watchdog (triggerdotdev#3191) handles server-side run cancellation
but does not kill the OS-level worker processes.

This fix:
- Adds getAllPids() to TaskRunProcessPool to collect PIDs from both
  available and busy process maps
- Periodically refreshes active-runs.json (every 2s) so workerPids
  stays current as processes are spawned and returned to the pool
- Extends the watchdog's onParentDied() to SIGTERM all tracked worker
  PIDs, wait 3s, then SIGKILL any survivors before calling disconnect

Fixes triggerdotdev#2909
@changeset-bot

changeset-bot Bot commented Apr 30, 2026

🦋 Changeset detected

Latest commit: 85952d8

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 29 packages
Name Type
trigger.dev Major
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@trigger.dev/build Major
@trigger.dev/core Major
@trigger.dev/python Major
@trigger.dev/react-hooks Major
@trigger.dev/redis-worker Major
@trigger.dev/rsc Major
@trigger.dev/schema-to-json Major
@trigger.dev/sdk Major
@trigger.dev/database Major
@trigger.dev/otlp-importer Major
@internal/cache Patch
@internal/clickhouse Patch
@internal/llm-model-catalog Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch


@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 43efd465-a958-4eb4-818d-e856f54b4c41

📥 Commits

Reviewing files that changed from the base of the PR and between 19c1675 and 85952d8.

📒 Files selected for processing (5)
  • .changeset/silly-planes-march.md
  • packages/cli-v3/src/dev/devSupervisor.ts
  • packages/cli-v3/src/dev/devWatchdog.ts
  • packages/cli-v3/src/dev/taskRunProcessPool.test.ts
  • packages/cli-v3/src/dev/taskRunProcessPool.ts

Walkthrough

This PR implements cleanup logic for orphaned trigger-dev-run-worker processes when the CLI is forcefully terminated. It introduces a getAllPids() method to TaskRunProcessPool to enumerate process identifiers, updates devSupervisor to periodically record worker process IDs in active-runs.json, and modifies devWatchdog to terminate tracked worker processes using SIGTERM with a 3-second grace period followed by SIGKILL for any remaining processes. A new changeset documents this as a major release behavior, and a test suite validates the PID enumeration functionality.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Key observations

  • The changes introduce process lifecycle management across multiple modules with interdependent file I/O and signal handling
  • SIGTERM/SIGKILL termination logic in devWatchdog requires careful review for timing and edge cases (e.g., processes that die between SIGTERM and SIGKILL checks)
  • The periodic update interval in devSupervisor (2-second cadence) and its interaction with watchdog lifecycle should be validated
  • New getAllPids() method is straightforward but relies on correct iteration of both available and busy process maps
  • Test coverage for the new method is present but limited to empty-pool and numeric-values assertions

@github-actions
Contributor

Hi @thegodtune, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/triggerdotdev/trigger.dev/blob/main/CONTRIBUTING.md for more details.

@github-actions github-actions Bot closed this Apr 30, 2026
@thegodtune
Author

Note on approach vs. previous attempts (#2993, #3041):
Previous PRs addressing this issue added SIGINT/SIGTERM signal handlers in dev.ts to ensure clean worker shutdown on polite termination. That approach is correct for the graceful exit path.
This PR specifically targets the SIGKILL path, which is what pnpm uses when you press Ctrl+C during pnpm dlx trigger.dev dev. SIGKILL cannot be caught or handled by any signal handler, so the cleanup code in dev.ts never runs regardless of what handlers are registered. The existing watchdog introduced in #3191 already survives SIGKILL (since it's detached and unref'd), but it only handled server-side run cancellation, not the OS-level worker processes still running on the user's machine.
This fix extends the watchdog to also kill those processes directly, which is the only path available after SIGKILL. The two approaches are complementary: signal handlers cover graceful stops, and watchdog PID cleanup covers hard kills.
