fix: stop host-daemon from resurrecting destroyed environments (native watcher crash) #58

Draft
brsbl wants to merge 1 commit into ymichael:main from brsbl:bb/fix-host-daemon-env-watch-lifecycle-leak-destroy-thr_c5xxwwvknt

brsbl commented May 30, 2026

Summary

The desktop app (bb 0.0.12) hard-crashed repeatedly today with a native EXC_BAD_ACCESS/SIGSEGV inside the file-watcher addon (@parcel/watcher@2.5.6, stack FSEventsCallback -> DirTree::add/DirTree::find). The host-daemon logs were flooded with "Workspace status watch unavailable; retrying in background … Path is not a git repository: …/moss" messages and repeated "workspace.status … code: not_git_repo" errors, all for environments that are already status='destroyed' in the prod DB.

Root cause (confirmed in code)

The host-daemon resurrects destroyed environments. requireWorkspaceEnvironment (apps/host-daemon/src/command-dispatch-support.ts) → RuntimeManager.ensureEnvironment (apps/host-daemon/src/runtime-manager.ts) lazily re-provisions the workspace and re-subscribes an FSEvents watcher for any environment referenced by a workspace.* command, with no guard against environments the daemon already tore down. destroyEnvironment removed the entry and stopped the watcher, but the very next workspace.status poll recreated both. With ~300 destroyed managed worktrees (moss), the daemon perpetually re-created runtimes + watchers — an in-memory watch-lifecycle leak that churns FSEvents and feeds the native @parcel/watcher crash. (A native segfault can't be caught from JS; eliminating the churn is the real fix.)
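To make the failure mode concrete, here is a minimal sketch of the resurrection pattern described above. It is not the daemon's actual code: `LeakyRuntimeManager`, `RuntimeEntry`, and `provisionCount` are hypothetical stand-ins, and the real `ensureEnvironment` does far more than this. The sketch only shows why a lazy get-or-create path with no guard brings a destroyed environment back.

```typescript
// Hypothetical, simplified stand-ins; not the host-daemon's real classes.
type EnvironmentId = string;

interface RuntimeEntry {
  environmentId: EnvironmentId;
  watcherActive: boolean; // stands in for a subscribed FSEvents watcher
}

class LeakyRuntimeManager {
  private readonly entries = new Map<EnvironmentId, RuntimeEntry>();
  provisionCount = 0;

  // Lazily creates a runtime + watcher for whatever id is requested.
  // Nothing here remembers that an id was destroyed earlier.
  ensureEnvironment(id: EnvironmentId): RuntimeEntry {
    let entry = this.entries.get(id);
    if (entry === undefined) {
      this.provisionCount += 1;
      entry = { environmentId: id, watcherActive: true };
      this.entries.set(id, entry);
    }
    return entry;
  }

  // Removes the entry and stops the watcher, but leaves no trace behind.
  destroyEnvironment(id: EnvironmentId): void {
    const entry = this.entries.get(id);
    if (entry !== undefined) {
      entry.watcherActive = false;
      this.entries.delete(id);
    }
  }

  has(id: EnvironmentId): boolean {
    return this.entries.has(id);
  }
}

const leaky = new LeakyRuntimeManager();
leaky.ensureEnvironment("env_a");
leaky.destroyEnvironment("env_a");
// A later workspace.status poll for the same id takes the same lazy path,
// so the destroyed environment is provisioned and watched again.
leaky.ensureEnvironment("env_a");
```

After the final call, `leaky.has("env_a")` is true again and the environment has been provisioned twice. Repeated across roughly 300 destroyed worktrees on every poll, this is the churn the PR removes.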

What changed

Daemon-owned watch/runtime lifecycle:

  • RuntimeManager tracks a destroyedEnvironmentIds tombstone set. destroyEnvironment records it (even when there is no live entry); requireWorkspaceEnvironment refuses to reconnect a tombstoned env (ExpectedCommandDispatchError("environment_destroyed")) so it is never re-watched. ensureEnvironment clears the tombstone only when an env is explicitly (re)provisioned.
  • environment.destroy is now idempotent — a repeat destroy returns success instead of resurrecting the workspace.
  • New RuntimeManager.reconcileLiveEnvironments(liveIds) runs on every session (re)connect, driven by a new liveEnvironmentIds field on the session-open response (server sends all non-destroyed env ids for the host). It drops watchers + runtimes for idle environments the server no longer considers live (destroyed while the daemon was disconnected, whose environment.destroy never arrived) and tombstones them. Environments with active threads or terminals are never dropped (guards against transient gaps in the live set).
  • WorkspaceStatusWatcher retries are now bounded (give up after a capped number of attempts) so a permanently missing/invalid/non-git path stops re-subscribing instead of retrying forever.
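The tombstone and reconcile behavior in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in runtime-manager.ts: `TombstoningRuntimeManager`, `TrackedRuntime`, `hasActiveWork`, and `EnvironmentDestroyedError` are hypothetical names, and the real guard throws `ExpectedCommandDispatchError("environment_destroyed")` from command-dispatch-support.ts rather than from the manager itself.

```typescript
// Hypothetical sketch; names other than destroyedEnvironmentIds,
// ensureEnvironment, destroyEnvironment, requireWorkspaceEnvironment and
// reconcileLiveEnvironments are invented for illustration.
type EnvId = string;

interface TrackedRuntime {
  hasActiveWork: boolean; // active threads or terminals
  watching: boolean;
}

class EnvironmentDestroyedError extends Error {
  readonly code = "environment_destroyed";
  constructor(id: EnvId) {
    super(`Environment ${id} was destroyed`);
  }
}

class TombstoningRuntimeManager {
  private readonly entries = new Map<EnvId, TrackedRuntime>();
  private readonly destroyedEnvironmentIds = new Set<EnvId>();

  // Explicit (re)provision: the only path that clears a tombstone.
  ensureEnvironment(id: EnvId): TrackedRuntime {
    this.destroyedEnvironmentIds.delete(id);
    let entry = this.entries.get(id);
    if (entry === undefined) {
      entry = { hasActiveWork: false, watching: true };
      this.entries.set(id, entry);
    }
    return entry;
  }

  // Path taken by workspace.* commands: refuses tombstoned ids.
  requireWorkspaceEnvironment(id: EnvId): TrackedRuntime {
    if (this.destroyedEnvironmentIds.has(id)) {
      throw new EnvironmentDestroyedError(id);
    }
    return this.ensureEnvironment(id);
  }

  // Idempotent: records the tombstone even when there is no live entry.
  destroyEnvironment(id: EnvId): void {
    this.destroyedEnvironmentIds.add(id);
    const entry = this.entries.get(id);
    if (entry !== undefined) {
      entry.watching = false;
      this.entries.delete(id);
    }
  }

  // Drops idle entries the server no longer lists as live and tombstones
  // them. Entries with active work are kept. Returns the dropped ids.
  reconcileLiveEnvironments(liveIds: readonly EnvId[]): EnvId[] {
    const live = new Set<EnvId>(liveIds);
    const dropped: EnvId[] = [];
    this.entries.forEach((entry, id) => {
      if (live.has(id) || entry.hasActiveWork) return;
      this.destroyEnvironment(id);
      dropped.push(id);
    });
    return dropped;
  }
}

const manager = new TombstoningRuntimeManager();
manager.ensureEnvironment("env_idle");
manager.ensureEnvironment("env_busy").hasActiveWork = true;
// Simulate a reconnect where the server lists no environments as live.
const droppedIds = manager.reconcileLiveEnvironments([]);
```

In this sketch only `env_idle` is dropped and tombstoned. `env_busy` survives because it has active work, which mirrors the guard against transient gaps in the live set.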

Contract / server wiring:

  • hostDaemonSessionOpenResponseSchema gains required liveEnvironmentIds: string[].
  • New listLiveEnvironmentIdsOnHost DB query (status != 'destroyed' for the host); server /session/open returns it; daemon onSessionOpened calls reconcileLiveEnvironments.
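As a rough illustration of the wiring above, the following sketch filters an in-memory list the way the new query is described (status != 'destroyed' for the host) and validates the required field on the daemon side. Everything here is an assumption made for the example: the `EnvironmentRow` shape, the `hostId` column name, and both helper functions are hypothetical, and the real query runs against the database while the real schema lives in packages/host-daemon-contract.

```typescript
// Hypothetical in-memory stand-in for the DB query and the contract check.
type EnvironmentStatus = "ready" | "destroying" | "destroyed";

interface EnvironmentRow {
  id: string;
  hostId: string; // assumed column name
  status: EnvironmentStatus;
}

// Mirrors "status != 'destroyed' for the host": every non-destroyed
// environment on the given host counts as live.
function selectLiveEnvironmentIds(
  rows: readonly EnvironmentRow[],
  hostId: string,
): string[] {
  return rows
    .filter((row) => row.hostId === hostId && row.status !== "destroyed")
    .map((row) => row.id);
}

// The field is required: a response without it is rejected outright.
function parseLiveEnvironmentIds(response: unknown): string[] {
  if (typeof response !== "object" || response === null) {
    throw new Error("session-open response must be an object");
  }
  const value = (response as { liveEnvironmentIds?: unknown }).liveEnvironmentIds;
  if (!Array.isArray(value) || !value.every((item) => typeof item === "string")) {
    throw new Error("liveEnvironmentIds must be a string[]");
  }
  return value as string[];
}

const sampleRows: EnvironmentRow[] = [
  { id: "env_1", hostId: "host_a", status: "ready" },
  { id: "env_2", hostId: "host_a", status: "destroyed" },
  { id: "env_3", hostId: "host_b", status: "ready" },
  { id: "env_4", hostId: "host_a", status: "destroying" },
];

const liveOnHostA = selectLiveEnvironmentIds(sampleRows, "host_a");
const parsedLiveIds = parseLiveEnvironmentIds({ liveEnvironmentIds: liveOnHostA });
```

Note that under this filter an environment in `destroying` still counts as live, since only `destroyed` is excluded.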

Changed files

apps/host-daemon/src/runtime-manager.ts                 tombstone + reconcileLiveEnvironments + idempotent destroy
apps/host-daemon/src/command-dispatch-support.ts        resurrection guard in requireWorkspaceEnvironment
apps/host-daemon/src/command-dispatch.ts                environment.destroy idempotency
apps/host-daemon/src/app.ts                             onSessionOpened -> reconcileLiveEnvironments
packages/host-watcher/src/workspace-status-watcher.ts   bounded retry
packages/host-daemon-contract/src/session.ts            liveEnvironmentIds on session-open response
apps/server/src/internal/session.ts                     populate liveEnvironmentIds
packages/db/src/data/environments.ts (+ data/index.ts)  listLiveEnvironmentIdsOnHost
+ tests (runtime-manager, workspace-dispatch, watch-status, internal-session-correctness,
  contract, test-server helper, app.test fixtures)

Tests added + results

  • runtime-manager.test.ts: destroyed env is tombstoned & not resurrected; destroy with no live runtime still tombstones; explicit re-provision clears the tombstone; reconcile drops stale watchers/runtimes + tombstones them while keeping live ones; reconcile never drops envs with active work.
  • workspace-dispatch.test.ts: workspace.status on a destroyed env throws environment_destroyed with no re-provision and no further status read (no new watcher); repeated environment.destroy is idempotent (no resurrection).
  • watch-status.test.ts: workspace subscriptions give up after a bounded number of retries instead of looping forever.
  • internal-session-correctness.test.ts: session-open reports only non-destroyed environments as live.
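The bounded-retry behavior exercised by watch-status.test.ts can be sketched as below. This is a simplified illustration, not the WorkspaceStatusWatcher implementation: `subscribeWithBoundedRetry`, its options, and the linear backoff are all assumptions, and the scheduler is injected so the example runs without timers.

```typescript
// Hypothetical sketch of a capped retry loop; all names are invented.
type Subscribe = () => void; // throws when the path is missing or not a git repo
type Schedule = (retry: () => void, delayMs: number) => void;

interface BoundedRetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  schedule: Schedule; // e.g. (fn, ms) => setTimeout(fn, ms) in real use
  onGiveUp: (lastError: unknown) => void;
}

function subscribeWithBoundedRetry(
  subscribe: Subscribe,
  options: BoundedRetryOptions,
): void {
  let attempt = 0;
  const tryOnce = (): void => {
    attempt += 1;
    try {
      subscribe();
    } catch (error) {
      if (attempt >= options.maxAttempts) {
        // Permanently failing path: stop re-subscribing.
        options.onGiveUp(error);
        return;
      }
      // Linear backoff, chosen only to keep the sketch short.
      options.schedule(tryOnce, options.baseDelayMs * attempt);
    }
  };
  tryOnce();
}

const outcome = { subscribeCalls: 0, gaveUp: false };
subscribeWithBoundedRetry(
  () => {
    outcome.subscribeCalls += 1;
    throw new Error("Path is not a git repository");
  },
  {
    maxAttempts: 3,
    baseDelayMs: 100,
    schedule: (retry) => retry(), // run retries immediately for the demo
    onGiveUp: () => {
      outcome.gaveUp = true;
    },
  },
);
```

With a cap of three attempts, the always-failing subscribe is called three times and then abandoned instead of looping forever.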

All via Turbo:

  • typecheck: 30/30 packages pass (full repo).
  • test: @bb/host-daemon 362, @bb/host-watcher 23, @bb/host-daemon-contract 33, @bb/db 290, @bb/server 842 (+4 skipped) — all pass.
  • prettier --check clean on all changed files.

Worktree teardown (investigated — no code change)

removeWorktree (packages/host-workspace/src/provisioning.ts) already runs git worktree remove --force and fs.rm(path, { recursive, force }) (+ prunes the empty parent), so the normal teardown path fully removes the directory. The observed ".git removed but the dir remains" state is consistent with the process crashing mid-teardown (the segfault this PR removes), not a teardown logic bug. Eliminating the churn (above) prevents the crash that strands those dirs. A follow-up could add a startup sweep that removes (via fs.rm) any leftover destroyed-env worktree directories.

Caveats / follow-ups (intentionally NOT bundled)

  • Deferred: @parcel/watcher. Upgrading/patching past 2.5.6 and the darwin-x64-vs-arm64 prebuild mismatch are out of scope. A native segfault can't be caught from JS; this PR removes the churn that triggers it. (Note: in the test environment, better-sqlite3 and @parcel/watcher native addons had to be rebuilt for the running arch before the native-dependent test suites could execute.)
  • Deploy note: the currently running prod app must be restarted on a new build to pick up this fix — the in-memory leak persists in the live process until then.
  • Related stale-state bugs (separate modules → follow-ups, not bundled):
    • thread.rename on a provider-less thread (thr_nxddvksekb) → No provider associated with thread. Server queues thread.rename (apps/server/src/routes/threads/base.ts) gated only on environment.status === "ready", not on the thread having a registered provider identity.
    • status app posting app-data for a non-existent thread (thr_ryku96bvfd) → 404 Thread not found (apps/server/src/routes/threads/apps.ts). App-data write path has no destroyed/missing-thread guard.

Safety-review follow-ups (review came back GO — no P0/P1)

  • P2-A — reconcile does not heal a stuck tombstone (idle managed-worktree only, recoverable): destroyEnvironment (runtime-manager.ts) tombstones before the teardown that can throw (destroyedEnvironmentIds.add(...) then runtime.shutdown() / workspace.destroy()). If teardown throws anything other than path_not_found, the command fails and the server reverts the env destroying → ready, but the daemon stays tombstoned — so every workspace.status / workspace.diff for that idle env returns environment_destroyed until a thread.start/terminal lifts the tombstone via ensureEnvironment. reconcileLiveEnvironments only adds tombstones (it iterates entries, and a tombstoned env has no entry), so reconnect does not heal it. Suggested fix: in reconcileLiveEnvironments, also remove from destroyedEnvironmentIds any id present in liveEnvironmentIds; or only tombstone after teardown succeeds. Impact: idle managed-worktree only, recoverable, never affects active threads.
  • P2-B — mixed-version session-open is incompatible by design: the session-open response liveEnvironmentIds field is now required + strict on both sides, so an old-daemon ↔ new-server (or vice-versa) reconnect fails session-open. Fine for the bundled desktop app, which restarts server + daemon together (the hot-swap quits + relaunches the whole app), but noted for any independent/rolling deploy.
  • P3 (minor) — thread.start/terminal lifts the tombstone unconditionally via ensureEnvironment. A thread.start racing a just-processed destroy can lift the tombstone; it self-heals on the next reconcile and causes no FSEvents leak (createEntry provisions before subscribing).
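The first option suggested under P2-A (clearing tombstones for ids the server still reports as live) might look roughly like this. It is only a sketch of the idea: `healTombstones` is a hypothetical helper, not code in this PR, and in practice the logic would sit inside `reconcileLiveEnvironments`.

```typescript
// Hypothetical helper illustrating the P2-A suggestion: any id the server
// lists as live should not remain tombstoned on the daemon.
function healTombstones(
  destroyedEnvironmentIds: Set<string>,
  liveEnvironmentIds: readonly string[],
): string[] {
  const healed: string[] = [];
  for (const id of liveEnvironmentIds) {
    // Set.prototype.delete returns true only if the id was present.
    if (destroyedEnvironmentIds.delete(id)) {
      healed.push(id);
    }
  }
  return healed;
}

// env_stuck was tombstoned, its teardown failed, and the server reverted
// it to ready; env_gone really is destroyed.
const tombstones = new Set<string>(["env_stuck", "env_gone"]);
const healedIds = healTombstones(tombstones, ["env_stuck", "env_ok"]);
```

Here `env_stuck` is healed on reconnect while `env_gone` stays tombstoned, so idle environments would no longer need a thread.start or terminal to recover.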

🤖 Generated with Claude Code

The desktop app hard-crashed with a native @parcel/watcher segfault
(FSEventsCallback -> DirTree::add/find). Root cause is an in-memory
watch-lifecycle leak in the host-daemon: requireWorkspaceEnvironment ->
RuntimeManager.ensureEnvironment re-provisions and re-subscribes an
FSEvents watcher for ANY environment referenced by a workspace.* command,
with no guard against environments the daemon already destroyed. With ~300
destroyed managed worktrees in the moss project, every workspace.status
poll resurrected a dead environment + watcher, churning FSEvents and
feeding the native crash.

Fix (daemon-owned watch/runtime lifecycle):
- RuntimeManager tombstones destroyed environments; destroyEnvironment
  records the tombstone (even with no live entry) and requireWorkspaceEnvironment
  refuses to reconnect a tombstoned env (ExpectedCommandDispatchError
  "environment_destroyed"), so it is never re-watched. ensureEnvironment
  clears the tombstone when an env is explicitly (re)provisioned.
- environment.destroy is idempotent: a repeat destroy returns success
  instead of resurrecting the workspace.
- reconcileLiveEnvironments(liveIds), driven by a new liveEnvironmentIds
  field on the session-open response, runs on every (re)connect. It drops
  watchers + runtimes for idle environments the server no longer considers
  live (destroyed while the daemon was disconnected, whose destroy command
  never arrived) and tombstones them. Environments with active threads or
  terminals are never dropped.
- WorkspaceStatusWatcher retries are now bounded (give up after a capped
  number of attempts) so a permanently-missing/invalid path stops
  re-subscribing instead of retrying forever.

Tests: RuntimeManager tombstone + reconcile behavior; dispatch-level
resurrection guard + idempotent destroy; bounded watcher retry; server
session-open returns only non-destroyed environments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

brsbl commented May 30, 2026

Safety review: GO — no P0/P1. Tracking the reviewer's P2/P3 follow-ups here (documentation only; no code change in this PR). Also appended to the PR description's follow-ups section.

