fix(agents): reclaim zombie sandbox containers and create them lazily #4513

Open
msfstef wants to merge 4 commits into main from worktree-zombie-sandbox-containers-fix

Conversation

@msfstef

@msfstef msfstef commented Jun 4, 2026

Contributor

What & why

Opening the desktop app could leave 15+ leftover electric-sbx-* Docker containers running that the user never started. They piled up because, in several places, a sandbox container could outlive the work it was created for. This PR closes each of those gaps so a container only exists while something is actually using it.

Why the containers piled up

Sandbox containers are meant to be short-lived: created for an agent's work, then torn down. Three gaps let them survive instead:

  1. We created containers we didn't need. Every time an agent woke up — even for trivial bookkeeping (a scheduled tick with nothing to do, a drained message queue) — it started a container before knowing whether it would run anything. When the app reconnected and replayed a backlog of these wakes, each one spun up its own container at once.
  2. We didn't clean up on quit. Idle containers are torn down by a short (~2-minute) delay timer. On quit, the app exits before those timers fire — so every quit left a container behind.
  3. We didn't clean up leftovers at the next startup either. The startup cleanup deliberately skipped running containers (so it wouldn't kill one another process is actively using). But a container left by a crash or quit is still running — its main process is an idle loop that never exits on its own — so the cleanup never touched the very leftovers it was meant to reclaim.

Two smaller bugs made it worse: a container could be stranded if its one-time setup step failed, and a Docker error code was occasionally misread so we'd "reconnect" to a container that had already been removed.

How we fix it

The approach attacks all three gaps so a container can't outlive its purpose:

  1. Create only when actually used (lazy creation). A container is now started the first time an agent really uses it — runs a command, or reads/writes a file. Trivial wakes create nothing. (Edge cases preserved: an agent finishing for good still cleans up any workspace earlier runs created, and a child agent that "inherits" its parent's sandbox still makes sure that sandbox exists first.)
  2. Clean up on quit. On shutdown we now run the pending teardowns immediately instead of letting the delay timers die with the process. This covers every exit path (desktop quit, restarts, Ctrl-C / kill) and is time-bounded so a stuck Docker daemon can't hang the quit.
  3. Reclaim true leftovers at startup. Every container is tagged with the process that created it. At startup we check whether that process is still alive: if it's gone, the container is a genuine leftover and we reclaim it — deleting throwaway ones and stopping reusable ones (so their files survive for later reuse). Containers a live process is still using are left untouched, so we never disturb a peer.
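
The quit-time cleanup in step 2 can be pictured with a small sketch. This is an illustration under assumptions, not the code in this PR: `PendingTeardown`, `flushPendingTeardowns` and the `boundMs` parameter are hypothetical names, and only the idea (run the pending teardowns now, give up after a fixed bound) comes from the description above.

```typescript
// Hypothetical shape of a debounced teardown that has not fired yet.
interface PendingTeardown {
  timer: ReturnType<typeof setTimeout>
  run: () => Promise<void>
}

// Run every pending teardown immediately, but never wait longer than
// `boundMs`. Resolves to true if all teardowns settled within the bound.
async function flushPendingTeardowns(
  pending: Map<string, PendingTeardown>,
  boundMs = 5_000
): Promise<boolean> {
  const entries = [...pending.values()]
  pending.clear()
  // Cancel the delay timers; we are running their work right now instead.
  for (const entry of entries) clearTimeout(entry.timer)

  // allSettled: one failing teardown must not abort the others.
  const flush = Promise.allSettled(
    entries.map(async (entry) => entry.run())
  ).then(() => true)

  let bound: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<boolean>((resolve) => {
    bound = setTimeout(() => resolve(false), boundMs)
  })
  try {
    return await Promise.race([flush, timeout])
  } finally {
    if (bound !== undefined) clearTimeout(bound)
  }
}
```

A caller would invoke this once from its shutdown path and continue quitting whether it returns `true` or `false`; anything that did not finish in time is left for the startup cleanup in step 3.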

As a quality-of-life touch, every container is also labelled so Docker Desktop groups them under one collapsible electric-sandboxes entry you can stop or delete together (docker compose -p electric-sandboxes down).

Implementation details (for reviewers)

Zombie reclamation

  • Containers carry a com.electric.sandbox.owner-pid label; the boot sweep probes it (kill(pid, 0)) and reclaims running orphans whose owner is dead — remove ephemeral, stop persistent (writable layer survives for reattach by key). The sweep probes + reclaims leftovers concurrently so boot latency stays flat as they accumulate.
  • Labels are immutable, so reattaching a dead creator's container records adoption in an in-container marker (/tmp/.electric-sbx-owner-pid, tmpfs ⇒ wiped on stop); the sweep probes it before reclaiming, so a live sibling's adopted container is never swept. The sweep is now awaited at boot so it can't race the first wake's reattach.
  • New shutdownAllDockerSandboxes() flushes pending debounced teardowns on runtime shutdown, wired through AgentHandlerResult.shutdownSandboxes → BuiltinAgentsServer.stop() (covers desktop quit, runtime restarts, CLI SIGINT/SIGTERM; bounded 5s). Live-leased entries are left to their own dispose (a sibling runtime in the same process may own them); if the process dies first, the pid-sweep reclaims them at next boot.
  • Post-start init verifies the mkdir exit code and force-removes the container on any failure; isNameConflict() now matches the daemon's actual name-conflict message instead of bare 409.
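
For reviewers less familiar with the kill(pid, 0) probe, the sweep decision described in the bullets above might look roughly like the sketch below. It is an illustration only: `isProcessAlive`, `decideSweepAction` and `SweepAction` are hypothetical names, and leaving containers without an owner label untouched is an assumption about what "legacy-label safety" means, not something confirmed here.

```typescript
// True if a process with this pid currently exists. Signal 0 runs the
// existence/permission check without delivering any signal.
function isProcessAlive(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false
  try {
    process.kill(pid, 0)
    return true
  } catch (err) {
    // EPERM: the process exists but we may not signal it, so it is alive.
    // ESRCH (or anything else): treat as no such process.
    return (err as NodeJS.ErrnoException).code === 'EPERM'
  }
}

type SweepAction = 'leave' | 'remove' | 'stop'

// Decide what the boot sweep does with one running container.
// `ownerPid` is the creator recorded in the label; `adoptedPid` is the pid
// read from the in-container marker, if a later process reattached.
function decideSweepAction(
  ownerPid: number | undefined,
  adoptedPid: number | undefined,
  ephemeral: boolean,
  isAlive: (pid: number) => boolean = isProcessAlive
): SweepAction {
  // Assumption: containers without an owner label are left alone.
  if (ownerPid === undefined) return 'leave'
  if (isAlive(ownerPid)) return 'leave'
  // A live process adopted a dead creator's container: never sweep it.
  if (adoptedPid !== undefined && isAlive(adoptedPid)) return 'leave'
  // Genuine leftover: delete throwaway ones, stop reusable ones.
  return ephemeral ? 'remove' : 'stop'
}
```

One caveat of any pid-based probe is pid reuse: if the operating system hands a dead owner's pid to an unrelated process, the container looks owned and is spared until a later sweep.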

Lazy sandbox creation

  • New lazySandbox() wrapper (agents-runtime/sandbox/lazy.ts) defers the provider factory until the sandbox is actually used (exec/fs/fetch). The bootstrap docker profile returns it, so trivial wakes never create a container. Materialization is single-flight and retried on failure.
  • Terminal reclaim still works without use: reclaimDockerSandboxByKey() wipes an earlier wake's persistent workspace by key without creating a container just to delete it (owner leases only; defers to live sibling leases — the last one draining wipes it).
  • Spawn-inherit force-materializes the owner's workspace (ensureSandboxMaterialized) before spawning, so a child can attach even when the parent never ran a tool.
  • Concurrent container creations are capped at 4 process-wide (withCreationSlot) — bursts queue against the daemon instead of stampeding it; reattaches/execs are unlimited, total creations unbounded.
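
The "single-flight and retried on failure" behaviour of lazy materialization can be sketched as follows. This is a simplified stand-in rather than the lazySandbox() implementation in lazy.ts: `lazySingleFlight`, `get` and `isMaterialized` are hypothetical names, and the real wrapper also handles dispose and reclaim, which this sketch leaves out.

```typescript
// Defers `factory` until the first get(). Concurrent callers share one
// in-flight call; a failed call is forgotten so the next use retries.
function lazySingleFlight<T>(factory: () => Promise<T>) {
  let inflight: Promise<T> | undefined
  let materialized = false

  return {
    get(): Promise<T> {
      if (inflight === undefined) {
        // Promise.resolve().then(...) also turns a synchronous throw from
        // the factory into a rejection.
        inflight = Promise.resolve()
          .then(factory)
          .then(
            (value) => {
              materialized = true
              return value
            },
            (err) => {
              // Drop the failed attempt so a later get() starts a new one.
              inflight = undefined
              throw err
            }
          )
      }
      return inflight
    },
    // True only once a factory call has succeeded.
    isMaterialized: (): boolean => materialized,
  }
}
```

With a wrapper of this shape, a trivial wake that never calls get() never invokes the factory, which is the property that keeps it from creating a container.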

Grouping

All sandboxes carry com.docker.compose.project=electric-sandboxes (+ com.docker.compose.service=<entity-type>).

Testing

  • 13 new docker integration tests (running-orphan reclaim, adoption sparing, legacy-label safety, init-failure cleanup, shutdown flush, lazy composition, reclaim-by-key) and 13 unit tests for lazySandbox — written first, confirmed red, then green.
  • Full suites: agents-runtime 803 tests / 63 files, agents 55 / 11; tsc and eslint clean.

Out of scope (follow-ups)

  • No cap on concurrent wakes (they're whole agent runs; capping would queue user-visible messages behind backlog replay — needs a product call).
  • The e2b remote profile stays eager (working directory not statically known at profile-build time); the same wrapper applies once it is.
  • Server-side orphaned-claim recovery (dispatchRecoveryIntervalMs is defined but unused) — expired runner leases still queue wakes forever.

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented Jun 4, 2026

Contributor

Electric Agents Desktop Builds

Build artifacts for commit b57b6fc.

| Platform | Status | Artifact |
| --- | --- | --- |
| macOS Apple Silicon | Passed | DMG |
| macOS Intel | Passed | DMG |
| Windows x64 | Passed | Installer |
| Linux x64 | Passed | AppImage / deb |

Workflow run

@codecov

codecov Bot commented Jun 4, 2026


Codecov Report

❌ Patch coverage is 81.54362% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.64%. Comparing base (7892079) to head (b57b6fc).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| packages/agents-runtime/src/sandbox/docker.ts | 85.71% | 23 Missing ⚠️ |
| packages/agents-runtime/src/sandbox/lazy.ts | 80.37% | 21 Missing ⚠️ |
| packages/agents/src/bootstrap.ts | 65.00% | 7 Missing ⚠️ |
| packages/agents-runtime/src/process-wake.ts | 60.00% | 2 Missing ⚠️ |
| packages/agents/src/server.ts | 60.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4513       +/-   ##
===========================================
+ Coverage   32.48%   56.64%   +24.15%     
===========================================
  Files         216      359      +143     
  Lines       18368    39324    +20956     
  Branches     6478    11049     +4571     
===========================================
+ Hits         5967    22274    +16307     
- Misses      12369    16979     +4610     
- Partials       32       71       +39     
| Flag | Coverage Δ |
| --- | --- |
| packages/agents | 70.53% <64.00%> (?) |
| packages/agents-mcp | 77.54% <ø> (?) |
| packages/agents-mobile | 66.92% <ø> (ø) |
| packages/agents-runtime | 80.10% <83.15%> (?) |
| packages/agents-server | 73.98% <ø> (+0.07%) ⬆️ |
| packages/agents-server-ui | 6.21% <ø> (ø) |
| packages/electric-ax | 46.42% <ø> (?) |
| packages/experimental | 87.73% <ø> (?) |
| packages/react-hooks | 86.48% <ø> (?) |
| packages/start | 82.83% <ø> (?) |
| packages/typescript-client | 91.83% <ø> (?) |
| packages/y-electric | 56.05% <ø> (?) |
| typescript | 56.64% <81.54%> (+24.15%) ⬆️ |
| unit-tests | 56.64% <81.54%> (+24.15%) ⬆️ |


@netlify

netlify Bot commented Jun 4, 2026


Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 8011831
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a26ce9fab3ee20007fec96a
😎 Deploy Preview https://deploy-preview-4513--electric-next.netlify.app

@github-actions

github-actions Bot commented Jun 4, 2026

Contributor

Electric Agents Mobile Build

Local mobile checks ran for commit b57b6fc.

The EAS Android preview build was skipped because the mobile-eas-build label is not present.
Add the mobile-eas-build label to this PR to produce an installable preview build.

Workflow run

@msfstef msfstef added the claude label Jun 8, 2026
msfstef and others added 2 commits June 8, 2026 17:15
Users opening the desktop app found 15+ electric-sbx-* containers
running that they never asked for. Several compounding bugs:

- The boot sweep only removed *exited ephemeral* leftovers, but crash/
  quit leftovers are RUNNING (PID 1 is an infinite sleep loop), so it
  never reclaimed anything real. Containers now carry an owner-pid
  label; the sweep reclaims running orphans whose owner is dead (remove
  ephemeral / stop persistent), consults an in-container adoption
  marker so a live process that reattached a dead creator's container
  is never swept, and is awaited at boot so it can't race reattaches.

- The debounced idle teardowns are unref'd timers that died with the
  process: every graceful quit leaked the recently-active containers as
  running zombies. Runtime shutdown now flushes pending teardowns
  (BuiltinAgentsServer.stop), leaving live-leased entries to their own
  dispose or the next boot's sweep.

- A failed post-start init (mkdir exec) left a started container that
  was never registered - invisible to dispose and, while running, to
  the sweep. Creation now verifies the init exit code and removes the
  container on any failure. Fixing this exposed that isNameConflict()
  treated any HTTP 409 as a lost create race (exec on an exited
  container is also 409), silently "reattaching" to a removed
  container; it now matches the daemon's name-conflict message.

- Sandbox creation was eager on every claimed wake, so a reconnect
  backlog of trivial wakes (cron ticks, bookkeeping) stampeded the
  daemon with containers. The docker profile now returns a lazySandbox
  wrapper that defers the provider factory to first actual use.
  Terminal reclaim without use goes through reclaimDockerSandboxByKey
  (no create-to-delete), spawn-inherit force-materializes the owner's
  workspace before the child can attach, and concurrent creations are
  capped at 4 to smooth real bursts.

- All sandbox containers now carry compose project/service labels
  (com.docker.compose.project=electric-sandboxes) so Docker Desktop
  groups them under one entry and they can be stopped/deleted together
  (docker compose -p electric-sandboxes down).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the changeset as a user-facing release note: lead with the symptom
(leftover containers piling up) and the three-part fix (create-only-when-used,
clean-up-on-quit, reclaim-at-startup), dropping implementation jargon. Matches
the rewritten PR description; no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@msfstef msfstef force-pushed the worktree-zombie-sandbox-containers-fix branch from f409434 to 8011831 Compare June 8, 2026 14:15
@claude

claude Bot commented Jun 8, 2026


Claude Code Review

Summary

This PR closes three gaps that let electric-sbx-* Docker sandbox containers outlive their work (eager creation on trivial wakes, no teardown on quit, a boot sweep that skipped running leftovers) via lazy materialization (lazySandbox), owner-pid-tagged zombie reclamation at boot, and an immediate shutdown flush. The only change since my last review is a small, faithful refactor addressing a reviewer comment. The change remains high-quality and ready to merge.

What's Working Well

  • Lifecycle reasoning is rigorous. The owner-pid label plus in-container adoption marker (/tmp/.electric-sbx-owner-pid on tmpfs) cleanly distinguishes a crash leftover from a live sibling's container, and the boot sweep is awaited so it can't race the first reattach. The persistent-vs-ephemeral split (STOP vs REMOVE) is applied consistently via sandboxWipesOnDispose.
  • withCreationSlot is a correct semaphore — slot hand-off without an intermediate increment avoids the release/re-acquire race, and the finally guarantees no slot leak on factory failure.
  • lazySandbox dispose semantics are well thought through: dispose({reclaim}) without materialization still runs the provider reclaim callback, and the in-flight-factory-failed branch still honours reclaim.
  • Test coverage is strong and behaviorally meaningful — running-orphan reclaim, adoption sparing, legacy-label safety, init-failure cleanup, shutdown flush, lazy composition, and reclaim-by-key.
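
For readers following the withCreationSlot point above, the slot hand-off pattern can be sketched as below. This is a generic illustration, not the code in docker.ts: `createLimiter`, `free` and `waiters` are hypothetical names standing in for the PR's creationSlots / creationQueue pair.

```typescript
// Returns a wrapper that lets at most `max` calls of `fn` run at once.
function createLimiter(max: number) {
  let free = max
  const waiters: Array<() => void> = []

  return async function withSlot<T>(fn: () => Promise<T>): Promise<T> {
    if (free > 0) {
      free--
    } else {
      // Queue up; a finishing task hands its slot to us directly.
      await new Promise<void>((resolve) => waiters.push(resolve))
    }
    try {
      return await fn()
    } finally {
      const next = waiters.shift()
      if (next) {
        // Hand-off: the slot passes straight to the waiter, so `free` is
        // never incremented and a newcomer cannot grab it in between.
        next()
      } else {
        free++
      }
    }
  }
}
```

Because the release happens in `finally`, a failing `fn` still gives its slot back, and only concurrency is limited: the total number of calls over time is unbounded.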

Issues Found

Critical (Must Fix)

None found.

Important (Should Fix)

None.

Suggestions (Nice to Have)

None new. The single remaining forward-looking note from iteration 2 stands (the boot-sweep reclaim fan-out is unbounded, unlike the creation cap — fine at the motivating "15+ leftovers" scale, only worth revisiting at hundreds of leftovers). Non-blocking.

Issue Conformance

No linked issue is referenced — a process note rather than a blocker, given the unusually thorough PR description (problem statement, root-cause breakdown, fix rationale, explicit out-of-scope/follow-ups). The implementation matches the described scope.

Previous Review Status

Since iteration 2 (commit 2b62715), the only code change is commit b57b6fc — a refactor of writeOwnerMarker that replaces the try/catch wrapper around the best-effort owner-marker write with .catch(() => {}), addressing kevin-dp's inline suggestion (docker.ts). The refactor is faithful: behavior is identical (errors still swallowed, best-effort), and keeping await ...catch(() => {}) rather than return runOneOff(...).catch(...) correctly keeps the function's Promise<void> return type from leaking the non-void RunOneOffResult — a thoughtful distinction over the literal suggestion. No regression.

The other open review thread — kevin-dp's suggestion to drop the separate creationSlots counter and derive availability from creationQueue.length — was discussed and resolved by a maintainer decision (msfstef) to keep the current explicit two-variable form for clarity. I concur: withCreationSlot is correct as written, and the alternative formulations proposed in the thread either don't enqueue correctly or overload the queue as both a queue and a counter. No change needed.

All four nice-to-haves from iteration 1 remain addressed.


Review iteration: 3 | 2026-06-09

From the Claude bot review (all non-blocking nice-to-haves):
- Boot sweep now probes + reclaims leftovers concurrently (Promise.all) instead
  of one awaited exec round-trip per orphan — keeps boot latency flat when many
  leftovers have accumulated (the sweep is awaited before profiles build).
- Document that reclaimDockerSandboxByKey / shutdownAllDockerSandboxes are
  best-effort: an unreachable daemon is swallowed and left to the next boot
  sweep; note why the shutdown flush runs after the wake drain (it only reclaims
  idle, lease-free containers).
- Add a lazySandbox unit test for dispose({reclaim}) racing an in-flight factory
  that then fails — the reclaim callback must still run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@msfstef msfstef marked this pull request as ready for review June 8, 2026 14:38
@msfstef msfstef requested review from kevin-dp and samwillis June 8, 2026 15:05
* unbounded, only their concurrency.
*/
const MAX_CONCURRENT_CREATIONS = 4
let creationSlots = MAX_CONCURRENT_CREATIONS

Contributor

Why use a separate creationSlots variable whose sole purpose is counting the number of available slots in creationQueue? This means we have to take care to keep creationSlots in sync with creationQueue. It would be simpler and more robust to use only creationQueue and rely on its length to determine available slots. So creationSlots becomes MAX_CONCURRENT_CREATIONS - creationQueue.length. But instead of writing it like that we would probably write it like this:

if (creationQueue.length >= MAX_CONCURRENT_CREATIONS) {
  await new Promise<void>((release) => creationQueue.push(release))
}
try {
  return await fn()
} finally {
  const next = creationQueue.shift()
  next?.()
}

Contributor Author

It's tricky if we don't keep track of the in-flight creations and blocked waiters separately. Your code doesn't work (since nothing is ever added to the creation queue), but I assume you mean that essentially we always add creations to the creation queue, something like:

await new Promise<void>((release) => {
  creationQueue.push(release)
  if (creationQueue.length <= MAX_CONCURRENT_CREATIONS) release()
})
try {
  return await fn()
} finally {
  creationQueue.shift()
  // unblock any waiter entering active window
  creationQueue[MAX_CONCURRENT_CREATIONS - 1]?.() 
}

Or something like this - we still essentially keep track of in-flight creations via the first MAX_CONCURRENT_CREATIONS slots of the queue, but at least to me this seems less straightforward to read, and it mixes the responsibility of the creation queue as both a queue and a counter/stack.

I agree with your sentiment but I think I won't try to make this more clever. An alternative could be to just have MAX_CONCURRENT_CREATIONS promise queues and distribute creations between them and then that sorts itself out automatically, with the loss of a slow/stuck creation blocking everyone in that queue.

If you don't mind I might leave this as is, I think it works and is clear in its purpose, and I don't think there's a real concern about keeping track of that variable given the conciseness and limited role it plays - perhaps encapsulating it more would make it less scary.

Comment thread packages/agents-runtime/src/sandbox/docker.ts Outdated
From kevin-dp's review: replace the try/catch wrapper around the
best-effort owner-marker write with `.catch(() => {})`. Kept `await`
(not a bare `return runOneOff(...).catch(...)`) so the non-void
RunOneOffResult doesn't leak into the Promise<void> return type.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2 participants