fix(agents): reclaim zombie sandbox containers and create them lazily #4513
msfstef wants to merge 4 commits
Codecov Report
❌ Patch coverage is […]
@@             Coverage Diff             @@
##             main    #4513       +/-   ##
===========================================
+ Coverage   32.48%   56.64%   +24.15%
===========================================
  Files         216      359      +143
  Lines       18368    39324    +20956
  Branches     6478    11049     +4571
===========================================
+ Hits         5967    22274    +16307
- Misses      12369    16979     +4610
- Partials       32       71       +39
✅ Deploy Preview for electric-next ready!
Electric Agents Mobile Build
Local mobile checks ran for commit […]. The EAS Android preview build was skipped because the […]
Users opening the desktop app found 15+ electric-sbx-* containers running that they never asked for. Several compounding bugs:

- The boot sweep only removed *exited ephemeral* leftovers, but crash/quit leftovers are RUNNING (PID 1 is an infinite sleep loop), so it never reclaimed anything real. Containers now carry an owner-pid label; the sweep reclaims running orphans whose owner is dead (remove ephemeral / stop persistent), consults an in-container adoption marker so a live process that reattached a dead creator's container is never swept, and is awaited at boot so it can't race reattaches.
- The debounced idle teardowns are unref'd timers that died with the process: every graceful quit leaked the recently-active containers as running zombies. Runtime shutdown now flushes pending teardowns (BuiltinAgentsServer.stop), leaving live-leased entries to their own dispose or the next boot's sweep.
- A failed post-start init (mkdir exec) left a started container that was never registered - invisible to dispose and, while running, to the sweep. Creation now verifies the init exit code and removes the container on any failure. Fixing this exposed that isNameConflict() treated any HTTP 409 as a lost create race (exec on an exited container is also 409), silently "reattaching" to a removed container; it now matches the daemon's name-conflict message.
- Sandbox creation was eager on every claimed wake, so a reconnect backlog of trivial wakes (cron ticks, bookkeeping) stampeded the daemon with containers. The docker profile now returns a lazySandbox wrapper that defers the provider factory to first actual use. Terminal reclaim without use goes through reclaimDockerSandboxByKey (no create-to-delete), spawn-inherit force-materializes the owner's workspace before the child can attach, and concurrent creations are capped at 4 to smooth real bursts.
- All sandbox containers now carry compose project/service labels (com.docker.compose.project=electric-sandboxes) so Docker Desktop groups them under one entry and they can be stopped/deleted together (docker compose -p electric-sandboxes down).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
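The owner-liveness check behind the boot sweep could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `isProcessAlive`, `planReclaim`, and the `Leftover` shape are hypothetical names, and the in-container adoption marker that the real sweep also consults is left out.

```typescript
// Hypothetical helper: reports whether a process with this pid appears to exist.
function isProcessAlive(pid: number): boolean {
  // Guard against pid 0 and negatives, which address process groups.
  if (!Number.isInteger(pid) || pid <= 0) return false
  try {
    // Signal 0 only tests for existence; no signal is delivered.
    process.kill(pid, 0)
    return true
  } catch (err) {
    // EPERM means the process exists but we may not signal it: treat as alive.
    return (err as { code?: string }).code === "EPERM"
  }
}

// Hypothetical view of a leftover container, built from its labels.
type Leftover = { name: string; ownerPid: number; ephemeral: boolean }
type Action = "keep" | "remove" | "stop"

// Decide what to do with one running leftover:
// dead owner + ephemeral  => remove it
// dead owner + persistent => stop it (the writable layer survives for reattach)
function planReclaim(
  c: Leftover,
  alive: (pid: number) => boolean = isProcessAlive,
): Action {
  if (alive(c.ownerPid)) return "keep"
  return c.ephemeral ? "remove" : "stop"
}
```

Passing the liveness check as a parameter keeps the decision logic testable without spawning real processes.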
Rewrite the changeset as a user-facing release note: lead with the symptom (leftover containers piling up) and the three-part fix (create-only-when-used, clean-up-on-quit, reclaim-at-startup), dropping implementation jargon. Matches the rewritten PR description; no code change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from f409434 to 8011831.
Claude Code Review

Summary
This PR closes three gaps that let […]

What's Working Well
[…]

Issues Found
Critical (Must Fix)
None found.
Important (Should Fix)
None.
Suggestions (Nice to Have)
None new. The single remaining forward-looking note from iteration 2 stands: the boot-sweep reclaim fan-out is unbounded, unlike the creation cap. That is fine at the motivating "15+ leftovers" scale and only worth revisiting at hundreds of leftovers. Non-blocking.

Issue Conformance
No linked issue is referenced. This is a process note rather than a blocker, given the unusually thorough PR description (problem statement, root-cause breakdown, fix rationale, explicit out-of-scope/follow-ups). The implementation matches the described scope.

Previous Review Status
Since iteration 2 (commit […]
The other open review thread, kevin-dp's suggestion to drop the separate […]
All four nice-to-haves from iteration 1 remain addressed.

Review iteration: 3 | 2026-06-09
From the Claude bot review (all non-blocking nice-to-haves):
- Boot sweep now probes + reclaims leftovers concurrently (Promise.all) instead
of one awaited exec round-trip per orphan — keeps boot latency flat when many
leftovers have accumulated (the sweep is awaited before profiles build).
- Document that reclaimDockerSandboxByKey / shutdownAllDockerSandboxes are
best-effort: an unreachable daemon is swallowed and left to the next boot
sweep; note why the shutdown flush runs after the wake drain (it only reclaims
idle, lease-free containers).
- Add a lazySandbox unit test for dispose({reclaim}) racing an in-flight factory
that then fails — the reclaim callback must still run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
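A minimal sketch of the concurrent, best-effort sweep described in this commit message follows. The `Orphan` type, the `sweepConcurrently` name, and the injected `reclaim` callback are assumptions made for illustration; the PR's actual sweep code is not shown on this page.

```typescript
// Hypothetical shape of one leftover container found at boot.
type Orphan = { id: string }

// Reclaim all orphans concurrently and return how many succeeded.
async function sweepConcurrently(
  orphans: Orphan[],
  reclaim: (o: Orphan) => Promise<void>,
): Promise<number> {
  // Every reclaim starts immediately. Each one swallows its own failure, so a
  // single unreachable container cannot reject the whole Promise.all.
  const results = await Promise.all(
    orphans.map(async (o) => {
      try {
        await reclaim(o)
        return true
      } catch {
        // Best-effort: leave this one for the next boot sweep.
        return false
      }
    }),
  )
  return results.filter(Boolean).length
}
```

With this shape the sweep takes about as long as its slowest reclaim rather than the sum of all of them, which is the latency property the commit is after. The fan-out is unbounded, as the review notes.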
 * unbounded, only their concurrency.
 */
const MAX_CONCURRENT_CREATIONS = 4
let creationSlots = MAX_CONCURRENT_CREATIONS
Why use a separate creationSlots variable whose sole purpose is counting the number of available slots in creationQueue? This means we have to take care of keeping creationSlots in sync with creationQueue. It would be simpler and more robust to use only creationQueue and rely on its length to determine available slots. So creationSlots becomes MAX_CONCURRENT_CREATIONS - creationQueue.length. But instead of writing it like that we would probably write it like this:
if (creationQueue.length >= MAX_CONCURRENT_CREATIONS) {
  await new Promise<void>((release) => creationQueue.push(release))
}
try {
  return await fn()
} finally {
  const next = creationQueue.shift()
  next?.()
}
It's tricky if we don't keep track of the in-flight creations and blocked waiters separately. Your code doesn't work (since nothing is ever added to the creation queue), but I assume you mean that essentially we always add creations to the creation queue, something like:
await new Promise<void>((release) => {
  creationQueue.push(release)
  if (creationQueue.length <= MAX_CONCURRENT_CREATIONS) release()
})
try {
  return await fn()
} finally {
  creationQueue.shift()
  // unblock any waiter entering active window
  creationQueue[MAX_CONCURRENT_CREATIONS - 1]?.()
}
Or something like this - we still essentially keep track of in-flight creations via the first MAX_CONCURRENT_CREATIONS slots of the queue, but at least to me this seems less straightforward to read and mixes the responsibility of the creation queue as both a queue and a counter/stack.
I agree with your sentiment but I think I won't try to make this more clever. An alternative could be to just have MAX_CONCURRENT_CREATIONS promise queues and distribute creations between them, and then that sorts itself out automatically, at the cost of a slow/stuck creation blocking everyone in that queue.
If you don't mind I might leave this as is, I think it works and is clear in its purpose, and I don't think there's a real concern about keeping track of that variable given the conciseness and limited role it plays - perhaps encapsulating it more would make it less scary.
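For readers following the thread, here is a sketch of the design being defended: a counter for free slots plus a separate queue of blocked waiters. The names `MAX_CONCURRENT_CREATIONS`, `creationSlots`, `creationQueue`, and `withCreationSlot` appear in the PR, but the function body below is an assumption reconstructed from the discussion, not the merged implementation.

```typescript
const MAX_CONCURRENT_CREATIONS = 4
// Free slots right now (the counter).
let creationSlots = MAX_CONCURRENT_CREATIONS
// Callers parked because no slot was free (the waiter queue).
const creationQueue: Array<() => void> = []

async function withCreationSlot<T>(fn: () => Promise<T>): Promise<T> {
  if (creationSlots > 0) {
    creationSlots--
  } else {
    // No free slot: wait until a finishing creation hands its slot over.
    await new Promise<void>((release) => creationQueue.push(release))
  }
  try {
    return await fn()
  } finally {
    const next = creationQueue.shift()
    if (next) {
      // Hand the slot straight to the next waiter; the counter stays unchanged.
      next()
    } else {
      // Nobody is waiting: return the slot to the pool.
      creationSlots++
    }
  }
}
```

Handing the slot directly to a waiter, instead of incrementing and letting waiters race for it, keeps the number of in-flight creations at or below the cap at every point. It is also the only place where the counter and the queue interact.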
From kevin-dp's review: replace the try/catch wrapper around the
best-effort owner-marker write with `.catch(() => {})`. Kept `await`
(not a bare `return runOneOff(...).catch(...)`) so the non-void
RunOneOffResult doesn't leak into the Promise<void> return type.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
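The typing point in this commit message can be illustrated with stand-ins. `RunOneOffResult` and `runOneOff` below are hypothetical stubs with assumed shapes (the real ones are not shown on this page), and `writeOwnerMarker` is an illustrative name.

```typescript
// Stand-in result type; the real RunOneOffResult's fields are an assumption.
type RunOneOffResult = { exitCode: number }

// Stub that fails on demand, to exercise the best-effort path.
async function runOneOff(cmd: string): Promise<RunOneOffResult> {
  if (cmd.includes("fail")) throw new Error("daemon unreachable")
  return { exitCode: 0 }
}

async function writeOwnerMarker(cmd: string): Promise<void> {
  // `.catch(() => {})` swallows the failure. Awaiting the chain and discarding
  // its value keeps the declared Promise<void>; the chain itself is typed
  // Promise<RunOneOffResult | void>, which is why the commit avoids a bare
  // `return runOneOff(cmd).catch(() => {})`.
  await runOneOff(cmd).catch(() => {})
}
```

Either outcome resolves to `undefined`, so callers cannot tell whether the marker write succeeded. That is the intended best-effort behaviour.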
What & why
Opening the desktop app could leave 15+ leftover electric-sbx-* Docker containers running that the user never started. They piled up because, in several places, a sandbox container could outlive the work it was created for. This PR closes each of those gaps so a container only exists while something is actually using it.
Why the containers piled up
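Sandbox containers are meant to be short-lived: created for an agent's work, then torn down. Three gaps let them survive instead:
- Containers were created eagerly on every wake, including trivial ones (cron ticks, bookkeeping) that never used a sandbox.
- Quitting the app leaked recently active containers, because their pending idle teardowns died with the process.
- The startup sweep only removed exited leftovers, but leftovers from a crash or quit are still running, so it never reclaimed them.
Two smaller bugs made it worse: a container could be stranded if its one-time setup step failed, and a Docker error code was occasionally misread so we'd "reconnect" to a container that had already been removed.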
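How we fix it
The approach attacks all three gaps so a container can't outlive its purpose:
- Create only when used: a container is created the first time a sandbox is actually used, not on every wake.
- Clean up on quit: shutting down flushes pending teardowns, so idle containers are not left running when the process exits.
- Reclaim at startup: the boot sweep reclaims running leftovers whose owner process is dead.
As a quality-of-life touch, every container is also labelled so Docker Desktop groups them under one collapsible electric-sandboxes entry you can stop or delete together (docker compose -p electric-sandboxes down).
Implementation details (for reviewers)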
Zombie reclamation
- Containers carry a com.electric.sandbox.owner-pid label; the boot sweep probes it (kill(pid, 0)) and reclaims running orphans whose owner is dead: remove ephemeral, stop persistent (writable layer survives for reattach by key). The sweep probes and reclaims leftovers concurrently, so boot latency stays flat as they accumulate.
- A process that reattaches a container writes an in-container adoption marker (/tmp/.electric-sbx-owner-pid, tmpfs ⇒ wiped on stop); the sweep probes it before reclaiming, so a live sibling's adopted container is never swept. The sweep is now awaited at boot so it can't race the first wake's reattach.
- shutdownAllDockerSandboxes() flushes pending debounced teardowns on runtime shutdown, wired through AgentHandlerResult.shutdownSandboxes → BuiltinAgentsServer.stop() (covers desktop quit, runtime restarts, CLI SIGINT/SIGTERM; bounded 5s). Live-leased entries are left to their own dispose (a sibling runtime in the same process may own them); if the process dies first, the pid-sweep reclaims them at next boot.
- Creation verifies the post-start mkdir exit code and force-removes the container on any failure; isNameConflict() now matches the daemon's actual name-conflict message instead of a bare 409.
Lazy sandbox creation
- The lazySandbox() wrapper (agents-runtime/sandbox/lazy.ts) defers the provider factory until the sandbox is actually used (exec/fs/fetch). The bootstrap docker profile returns it, so trivial wakes never create a container. Materialization is single-flight and retried on failure.
- reclaimDockerSandboxByKey() wipes an earlier wake's persistent workspace by key without creating a container just to delete it (owner leases only; defers to live sibling leases, and the last one draining wipes it).
- Spawn inherit force-materializes the owner's workspace (ensureSandboxMaterialized) before spawning, so a child can attach even when the parent never ran a tool.
- Concurrent creations are capped at 4 (withCreationSlot): bursts queue against the daemon instead of stampeding it; reattaches/execs are unlimited, total creations unbounded.
Grouping
All sandboxes carry com.docker.compose.project=electric-sandboxes (+ com.docker.compose.service=<entity-type>).
Testing
- […] lazySandbox: written first, confirmed red, then green.
- tsc and eslint clean.
Out of scope (follow-ups)
- The e2b remote profile stays eager (working directory not statically known at profile-build time); the same wrapper applies once it is.
- […] dispatchRecoveryIntervalMs is defined but unused); expired runner leases still queue wakes forever.
🤖 Generated with Claude Code
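The "single-flight and retried on failure" behaviour described under Lazy sandbox creation can be sketched generically. This is a simplified illustration, not the wrapper in agents-runtime/sandbox/lazy.ts: the `lazy` name and signature are hypothetical, and the real wrapper's exec/fs/fetch surface and dispose({reclaim}) handling are omitted.

```typescript
// Hypothetical generic helper: defer `factory` until first use, share one
// in-flight materialization between concurrent callers, and retry after failure.
function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined
  return () => {
    if (!pending) {
      // Promise.resolve().then(...) also turns a synchronous throw from the
      // factory into a rejection, so callers always receive a promise.
      pending = Promise.resolve()
        .then(factory)
        .catch((err) => {
          // Forget the failed attempt so the next use retries the factory.
          pending = undefined
          throw err
        })
    }
    return pending
  }
}
```

A wake that never calls the returned function never runs the factory, which is how trivial wakes avoid creating a container under this scheme.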