fix(agents): reclaim zombie sandbox containers and create them lazily#4513
Draft
msfstef wants to merge 1 commit into
Draft
fix(agents): reclaim zombie sandbox containers and create them lazily#4513msfstef wants to merge 1 commit into
msfstef wants to merge 1 commit into
Conversation
Users opening the desktop app found 15+ electric-sbx-* containers running that they never asked for. Several compounding bugs: - The boot sweep only removed *exited ephemeral* leftovers, but crash/ quit leftovers are RUNNING (PID 1 is an infinite sleep loop), so it never reclaimed anything real. Containers now carry an owner-pid label; the sweep reclaims running orphans whose owner is dead (remove ephemeral / stop persistent), consults an in-container adoption marker so a live process that reattached a dead creator's container is never swept, and is awaited at boot so it can't race reattaches. - The debounced idle teardowns are unref'd timers that died with the process: every graceful quit leaked the recently-active containers as running zombies. Runtime shutdown now flushes pending teardowns (BuiltinAgentsServer.stop), leaving live-leased entries to their own dispose or the next boot's sweep. - A failed post-start init (mkdir exec) left a started container that was never registered - invisible to dispose and, while running, to the sweep. Creation now verifies the init exit code and removes the container on any failure. Fixing this exposed that isNameConflict() treated any HTTP 409 as a lost create race (exec on an exited container is also 409), silently "reattaching" to a removed container; it now matches the daemon's name-conflict message. - Sandbox creation was eager on every claimed wake, so a reconnect backlog of trivial wakes (cron ticks, bookkeeping) stampeded the daemon with containers. The docker profile now returns a lazySandbox wrapper that defers the provider factory to first actual use. Terminal reclaim without use goes through reclaimDockerSandboxByKey (no create-to-delete), spawn-inherit force-materializes the owner's workspace before the child can attach, and concurrent creations are capped at 4 to smooth real bursts. - All sandbox containers now carry compose project/service labels (com.docker.compose.project=electric-sandboxes) so Docker Desktop groups them under one entry and they can be stopped/deleted together (docker compose -p electric-sandboxes down). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4513 +/- ##
==========================================
+ Coverage 55.80% 56.01% +0.20%
==========================================
Files 300 301 +1
Lines 34047 34291 +244
Branches 9671 9739 +68
==========================================
+ Hits 18999 19207 +208
- Misses 15029 15065 +36
Partials 19 19
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
✅ Deploy Preview for electric-next ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Contributor
Electric Agents Mobile BuildLocal mobile checks ran for commit The EAS Android preview build was skipped because the |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Users opening the desktop app found 15+
electric-sbx-*Docker containers running that they never explicitly started. Investigation found several compounding bugs in the sandbox container lifecycle:sleeploop that never exits on its own). In practice the sweep only removed exited-ephemeral containers, which the normal teardown already handles.createStarted→mkdirexec): started but never registered — unreachable by dispose and, while running, by the sweep. Fixing it exposed thatisNameConflict()treated any HTTP 409 as a lost create race (exec-on-exited-container is also 409), silently "reattaching" to a removed container.Fix
Zombie reclamation
com.electric.sandbox.owner-pidlabel; the boot sweep probes it (kill(pid, 0)) and reclaims running orphans whose owner is dead — remove ephemeral, stop persistent (writable layer survives for reattach by key)./tmp/.electric-sbx-owner-pid, tmpfs ⇒ wiped on stop); the sweep probes it before reclaiming, so a live sibling's adopted container is never swept. The sweep is now awaited at boot so it can't race the first wake's reattach.shutdownAllDockerSandboxes()flushes pending debounced teardowns on runtime shutdown, wired throughAgentHandlerResult.shutdownSandboxes→BuiltinAgentsServer.stop()(covers desktop quit, runtime restarts, CLI SIGINT/SIGTERM; bounded 5s). Live-leased entries are left to their own dispose (a sibling runtime in the same process may own them) — if the process dies first, the pid-sweep reclaims them at next boot.mkdirexit code and force-removes the container on any failure;isNameConflict()now matches the daemon's actual name-conflict message instead of bare 409.Lazy sandbox creation
lazySandbox()wrapper (agents-runtime/sandbox/lazy.ts) defers the provider factory until the sandbox is actually used (exec/fs/fetch). The bootstrapdockerprofile returns it, so trivial wakes (cron ticks deciding there's nothing to do, bookkeeping) never create a container. Materialization is single-flight and retried on failure.reclaimDockerSandboxByKey()wipes an earlier wake's persistent workspace by key without creating a container just to delete it (owner leases only; defers to live sibling leases — the last one draining wipes it).inheritforce-materializes the owner's workspace (ensureSandboxMaterialized) before spawning, so a child can attach even when the parent never ran a tool.withCreationSlot) — bursts queue against the daemon instead of stampeding it; reattaches/execs are unlimited, total creations unbounded.Grouping (QoL)
All sandboxes carry
com.docker.compose.project=electric-sandboxes(+com.docker.compose.service=<entity-type>), so Docker Desktop shows them as one collapsible group, anddocker compose -p electric-sandboxes downclears them all.Testing
lazySandbox— written first, confirmed red, then green.tscand eslint clean.Out of scope (follow-ups)
e2bremote profile stays eager (working directory not statically known at profile-build time); the same wrapper applies once it is.dispatchRecoveryIntervalMsis defined but unused) — expired runner leases still queue wakes forever.🤖 Generated with Claude Code