fix: stop host-daemon from resurrecting destroyed environments (native watcher crash)#58
Draft
brsbl wants to merge 1 commit into
Conversation
The desktop app hard-crashed with a native @parcel/watcher segfault (FSEventsCallback -> DirTree::add/find). Root cause is an in-memory watch-lifecycle leak in the host-daemon: requireWorkspaceEnvironment -> RuntimeManager.ensureEnvironment re-provisions and re-subscribes an FSEvents watcher for ANY environment referenced by a workspace.* command, with no guard against environments the daemon already destroyed. With ~300 destroyed managed worktrees in the moss project, every workspace.status poll resurrected a dead environment + watcher, churning FSEvents and feeding the native crash. Fix (daemon-owned watch/runtime lifecycle): - RuntimeManager tombstones destroyed environments; destroyEnvironment records the tombstone (even with no live entry) and requireWorkspaceEnvironment refuses to reconnect a tombstoned env (ExpectedCommandDispatchError "environment_destroyed"), so it is never re-watched. ensureEnvironment clears the tombstone when an env is explicitly (re)provisioned. - environment.destroy is idempotent: a repeat destroy returns success instead of resurrecting the workspace. - reconcileLiveEnvironments(liveIds), driven by a new liveEnvironmentIds field on the session-open response, runs on every (re)connect. It drops watchers + runtimes for idle environments the server no longer considers live (destroyed while the daemon was disconnected, whose destroy command never arrived) and tombstones them. Environments with active threads or terminals are never dropped. - WorkspaceStatusWatcher retries are now bounded (give up after a capped number of attempts) so a permanently-missing/invalid path stops re-subscribing instead of retrying forever. Tests: RuntimeManager tombstone + reconcile behavior; dispatch-level resurrection guard + idempotent destroy; bounded watcher retry; server session-open returns only non-destroyed environments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
|
Safety review: GO — no P0/P1. Tracking the reviewer's P2/P3 follow-ups here (documentation only; no code change in this PR). Also appended to the PR description's follow-ups section. Safety-review follow-ups (review came back GO — no P0/P1)
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The desktop app (bb 0.0.12) hard-crashed repeatedly today with a native
EXC_BAD_ACCESS/SIGSEGVinside the file-watcher addon (@parcel/watcher@2.5.6, stackFSEventsCallback -> DirTree::add/DirTree::find). The host-daemon logs flooded withWorkspace status watch unavailable; retrying in background … Path is not a git repository: …/mossand repeatedworkspace.status … code: not_git_repofor environments that are alreadystatus='destroyed'in the prod DB.Root cause (confirmed in code)
The host-daemon resurrects destroyed environments.
requireWorkspaceEnvironment(apps/host-daemon/src/command-dispatch-support.ts) →RuntimeManager.ensureEnvironment(apps/host-daemon/src/runtime-manager.ts) lazily re-provisions the workspace and re-subscribes an FSEvents watcher for any environment referenced by aworkspace.*command, with no guard against environments the daemon already tore down.destroyEnvironmentremoved the entry and stopped the watcher, but the very nextworkspace.statuspoll recreated both. With ~300 destroyed managed worktrees (moss), the daemon perpetually re-created runtimes + watchers — an in-memory watch-lifecycle leak that churns FSEvents and feeds the native @parcel/watcher crash. (A native segfault can't be caught from JS; eliminating the churn is the real fix.)What changed
Daemon-owned watch/runtime lifecycle:
RuntimeManagertracks adestroyedEnvironmentIdstombstone set.destroyEnvironmentrecords it (even when there is no live entry);requireWorkspaceEnvironmentrefuses to reconnect a tombstoned env (ExpectedCommandDispatchError("environment_destroyed")) so it is never re-watched.ensureEnvironmentclears the tombstone only when an env is explicitly (re)provisioned.environment.destroyis now idempotent — a repeat destroy returns success instead of resurrecting the workspace.RuntimeManager.reconcileLiveEnvironments(liveIds)runs on every session (re)connect, driven by a newliveEnvironmentIdsfield on the session-open response (server sends all non-destroyed env ids for the host). It drops watchers + runtimes for idle environments the server no longer considers live (destroyed while the daemon was disconnected, whoseenvironment.destroynever arrived) and tombstones them. Environments with active threads or terminals are never dropped (guards against transient gaps in the live set).WorkspaceStatusWatcherretries are now bounded (give up after a capped number of attempts) so a permanently missing/invalid/non-git path stops re-subscribing instead of retrying forever.Contract / server wiring:
hostDaemonSessionOpenResponseSchemagains requiredliveEnvironmentIds: string[].listLiveEnvironmentIdsOnHostDB query (status != 'destroyed'for the host); server/session/openreturns it; daemononSessionOpenedcallsreconcileLiveEnvironments.Changed files
Tests added + results
runtime-manager.test.ts: destroyed env is tombstoned & not resurrected; destroy with no live runtime still tombstones; explicit re-provision clears the tombstone; reconcile drops stale watchers/runtimes + tombstones them while keeping live ones; reconcile never drops envs with active work.workspace-dispatch.test.ts:workspace.statuson a destroyed env throwsenvironment_destroyedwith no re-provision and no further status read (no new watcher); repeatedenvironment.destroyis idempotent (no resurrection).watch-status.test.ts: workspace subscriptions give up after a bounded number of retries instead of looping forever.internal-session-correctness.test.ts: session-open reports only non-destroyed environments as live.All via Turbo:
typecheck: 30/30 packages pass (full repo).test:@bb/host-daemon362,@bb/host-watcher23,@bb/host-daemon-contract33,@bb/db290,@bb/server842 (+4 skipped) — all pass.prettier --checkclean on all changed files.Worktree teardown (investigated — no code change)
removeWorktree(packages/host-workspace/src/provisioning.ts) already runsgit worktree remove --forceandfs.rm(path, { recursive, force })(+ prunes the empty parent), so the normal teardown path fully removes the directory. The observed ".gitremoved but the dir remains" state is consistent with the process crashing mid-teardown (the segfault this PR removes), not a teardown logic bug. Eliminating the churn (above) prevents the crash that strands those dirs. A follow-up could add a startup sweep thatfs.rms leftover destroyed-env worktree directories.Caveats / follow-ups (intentionally NOT bundled)
2.5.6and thedarwin-x64-vs-arm64prebuild mismatch are out of scope. A native segfault can't be caught from JS; this PR removes the churn that triggers it. (Note: in the test environment,better-sqlite3and@parcel/watchernative addons had to be rebuilt for the running arch before the native-dependent test suites could execute.)thread.renameon a provider-less thread (thr_nxddvksekb) →No provider associated with thread. Server queuesthread.rename(apps/server/src/routes/threads/base.ts) gated only onenvironment.status === "ready", not on the thread having a registered provider identity.thr_ryku96bvfd) → 404Thread not found(apps/server/src/routes/threads/apps.ts). App-data write path has no destroyed/missing-thread guard.Safety-review follow-ups (review came back GO — no P0/P1)
destroyEnvironment(runtime-manager.ts) tombstones before the teardown that can throw (destroyedEnvironmentIds.add(...)thenruntime.shutdown()/workspace.destroy()). If teardown throws anything other thanpath_not_found, the command fails and the server reverts the envdestroying → ready, but the daemon stays tombstoned — so everyworkspace.status/workspace.difffor that idle env returnsenvironment_destroyeduntil athread.start/terminal lifts the tombstone viaensureEnvironment.reconcileLiveEnvironmentsonly adds tombstones (it iteratesentries, and a tombstoned env has no entry), so reconnect does not heal it. Suggested fix: inreconcileLiveEnvironments, also remove fromdestroyedEnvironmentIdsany id present inliveEnvironmentIds; or only tombstone after teardown succeeds. Impact: idle managed-worktree only, recoverable, never affects active threads.liveEnvironmentIdsfield is now required + strict on both sides, so an old-daemon ↔ new-server (or vice-versa) reconnect fails session-open. Fine for the bundled desktop app, which restarts server + daemon together (the hot-swap quits + relaunches the whole app), but noted for any independent/rolling deploy.thread.start/terminal lifts the tombstone unconditionally viaensureEnvironment. Athread.startracing a just-processeddestroycan lift the tombstone; it self-heals on the next reconcile and causes no FSEvents leak (createEntryprovisions before subscribing).🤖 Generated with Claude Code