
fix(agents): reclaim zombie sandbox containers and create them lazily #4513

Draft
msfstef wants to merge 1 commit into main from worktree-zombie-sandbox-containers-fix

@msfstef msfstef commented Jun 4, 2026

Problem

Users opening the desktop app found 15+ electric-sbx-* Docker containers running that they never explicitly started. Investigation found several compounding bugs in the sandbox container lifecycle:

  1. The boot sweep couldn't reclaim real leftovers. It skipped every running container — but crash/quit leftovers are running (PID 1 is an infinite sleep loop that never exits on its own). In practice the sweep only removed exited-ephemeral containers, which the normal teardown already handles.
  2. Every graceful quit leaked containers. Idle teardown is a debounced (~2 min) unref'd timer; on quit the wakes drain, disposes schedule teardowns, and then the process exits before they fire. Combined with (1), these accumulated forever.
  3. A failed post-start init stranded an invisible container (createStarted → mkdir exec): it was started but never registered, so it was unreachable by dispose and, while running, by the sweep. Fixing it exposed that isNameConflict() treated any HTTP 409 as a lost create race (exec-on-exited-container is also 409), silently "reattaching" to a removed container.
  4. Eager sandbox creation on every claimed wake. On runner reconnect, the backlog (cron ticks, scheduled sends, queued messages) is replayed with no concurrency cap, and each wake created/started its container before knowing whether it would run any tool — N pending entities ⇒ N containers at once on app open.
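
The 409 ambiguity in item 3 can be made concrete with a small sketch. The error shape, both function names, the regular expression, and the two sample messages are assumptions for illustration; the PR only states that isNameConflict() now matches the daemon's name-conflict message instead of a bare 409.

```typescript
// Hypothetical error shape: an HTTP status plus the daemon's message text.
interface DockerApiError {
  statusCode: number;
  message: string;
}

// Loose check: any 409 is read as "another creator won the name race".
function isNameConflictLoose(err: DockerApiError): boolean {
  return err.statusCode === 409;
}

// Assumed wording of the daemon's name-conflict message.
const NAME_IN_USE = /container name .* is already in use/i;

// Strict check: require both the status and the name-conflict wording.
function isNameConflictStrict(err: DockerApiError): boolean {
  return err.statusCode === 409 && NAME_IN_USE.test(err.message);
}

// Sample errors (illustrative text, not captured daemon output).
const lostCreateRace: DockerApiError = {
  statusCode: 409,
  message:
    'Conflict. The container name "/electric-sbx-demo" is already in use by container "abc123".',
};
const execOnStopped: DockerApiError = {
  statusCode: 409,
  message: "container abc123 is not running",
};
```

With these samples, the loose check accepts both errors while the strict check accepts only lostCreateRace, so an exec against a stopped container is no longer mistaken for a lost create race.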

Fix

Zombie reclamation

  • Containers carry a com.electric.sandbox.owner-pid label; the boot sweep probes it (kill(pid, 0)) and reclaims running orphans whose owner is dead — remove ephemeral, stop persistent (writable layer survives for reattach by key).
  • Labels are immutable, so reattaching a dead creator's container records adoption in an in-container marker (/tmp/.electric-sbx-owner-pid, tmpfs ⇒ wiped on stop); the sweep probes it before reclaiming, so a live sibling's adopted container is never swept. The sweep is now awaited at boot so it can't race the first wake's reattach.
  • New shutdownAllDockerSandboxes() flushes pending debounced teardowns on runtime shutdown, wired through AgentHandlerResult.shutdownSandboxes → BuiltinAgentsServer.stop() (covers desktop quit, runtime restarts, CLI SIGINT/SIGTERM; bounded to 5s). Live-leased entries are left to their own dispose (a sibling runtime in the same process may own them); if the process dies first, the pid-sweep reclaims them at next boot.
  • Post-start init verifies the mkdir exit code and force-removes the container on any failure; isNameConflict() now matches the daemon's actual name-conflict message instead of bare 409.
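
The owner-pid probe and the resulting sweep decision for a running container could look roughly like the sketch below. Everything here is hypothetical except the kill(pid, 0) technique, the label and marker names, and the remove/stop split described above: the types, function names, and the choice to skip unlabeled containers are assumptions. The kill function is injected so the logic can be exercised without real processes; in Node it would be process.kill.

```typescript
// Same calling shape as Node's process.kill, injected for testability.
type KillFn = (pid: number, signal: number) => unknown;

// Signal 0 delivers nothing; it only checks whether the pid can be signalled.
function isProcessAlive(pid: number, kill: KillFn): boolean {
  try {
    kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but is owned by someone else: still alive.
    // Any other error (typically ESRCH) is treated as "no such process".
    return (err as { code?: string }).code === "EPERM";
  }
}

type SweepAction = "skip" | "remove" | "stop";

// Hypothetical view of one running sandbox container at boot.
interface RunningSandbox {
  persistent: boolean;
  labelOwnerPid?: number; // parsed from com.electric.sandbox.owner-pid
  markerOwnerPid?: number; // parsed from /tmp/.electric-sbx-owner-pid, if present
}

function decideSweep(c: RunningSandbox, kill: KillFn): SweepAction {
  // Assumption: containers without the label (older builds) are left alone.
  if (c.labelOwnerPid === undefined) return "skip";
  // Spare the container if either the creator or an adopter is still alive.
  for (const pid of [c.labelOwnerPid, c.markerOwnerPid]) {
    if (pid !== undefined && isProcessAlive(pid, kill)) return "skip";
  }
  // Orphan: ephemeral containers are removed, persistent ones only stopped.
  return c.persistent ? "stop" : "remove";
}
```

Checking the marker as well as the label is what keeps a live sibling's adopted container safe, since the label still names the dead creator.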

Lazy sandbox creation

  • New lazySandbox() wrapper (agents-runtime/sandbox/lazy.ts) defers the provider factory until the sandbox is actually used (exec/fs/fetch). The bootstrap docker profile returns it, so trivial wakes (cron ticks deciding there's nothing to do, bookkeeping) never create a container. Materialization is single-flight and retried on failure.
  • Terminal reclaim still works without use: reclaimDockerSandboxByKey() wipes an earlier wake's persistent workspace by key without creating a container just to delete it (owner leases only; defers to live sibling leases — the last one draining wipes it).
  • Spawn-inherit force-materializes the owner's workspace (ensureSandboxMaterialized) before spawning, so a child can attach even when the parent never ran a tool.
  • Concurrent container creations are capped at 4 process-wide (withCreationSlot) — bursts queue against the daemon instead of stampeding it; reattaches/execs are unlimited, total creations unbounded.
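
The "single-flight and retried on failure" behaviour of the lazy wrapper can be sketched in isolation. This is not the lazySandbox() implementation; lazyOnce and its return shape are hypothetical, and the real wrapper presumably proxies exec/fs/fetch rather than exposing a get() method.

```typescript
// Wraps an async factory so it runs only on first use, at most one attempt
// at a time, and is retried by the next caller if an attempt fails.
function lazyOnce<T>(factory: () => Promise<T>): {
  get: () => Promise<T>;
  materialized: () => boolean;
} {
  let inflight: Promise<T> | undefined;
  let done = false;

  return {
    get(): Promise<T> {
      if (!inflight) {
        // new Promise turns a synchronous throw from the factory into a rejection.
        inflight = new Promise<T>((resolve) => resolve(factory())).then(
          (value) => {
            done = true;
            return value;
          },
          (err) => {
            // Drop the failed attempt so the next get() starts a fresh one.
            inflight = undefined;
            throw err;
          },
        );
      }
      // Concurrent callers share the same in-flight (or settled) promise.
      return inflight;
    },
    materialized: () => done,
  };
}
```

A wake that never calls get() never runs the factory, which is the property that lets trivial wakes finish without creating a container.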

Grouping (QoL)

All sandboxes carry com.docker.compose.project=electric-sandboxes (+ com.docker.compose.service=<entity-type>), so Docker Desktop shows them as one collapsible group, and docker compose -p electric-sandboxes down clears them all.
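
As a small illustration of the grouping labels, a label map for one container might be built as follows. The helper name and the example entity type are hypothetical; the two label keys and the project value are the ones named above.

```typescript
// Builds the compose grouping labels for one sandbox container.
// composeGroupLabels is a hypothetical helper name used for illustration.
function composeGroupLabels(entityType: string): Record<string, string> {
  return {
    "com.docker.compose.project": "electric-sandboxes",
    "com.docker.compose.service": entityType,
  };
}

// "coder" is a made-up entity type.
const exampleLabels = composeGroupLabels("coder");
```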

Testing

  • 13 new docker integration tests (running-orphan reclaim, adoption sparing, legacy-label safety, init-failure cleanup, shutdown flush, lazy composition, reclaim-by-key) and 12 unit tests for lazySandbox — written first, confirmed red, then green.
  • Full suites: agents-runtime 803 tests / 63 files, agents 55 / 11; tsc and eslint clean.

Out of scope (follow-ups)

  • No cap on concurrent wakes (they're whole agent runs; capping would queue user-visible messages behind backlog replay — needs a product call).
  • The e2b remote profile stays eager (working directory not statically known at profile-build time); the same wrapper applies once it is.
  • Server-side orphaned-claim recovery (dispatchRecoveryIntervalMs is defined but unused) — expired runner leases still queue wakes forever.

🤖 Generated with Claude Code

Users opening the desktop app found 15+ electric-sbx-* containers
running that they never asked for. Several compounding bugs:

- The boot sweep only removed *exited ephemeral* leftovers, but crash/
  quit leftovers are RUNNING (PID 1 is an infinite sleep loop), so it
  never reclaimed anything real. Containers now carry an owner-pid
  label; the sweep reclaims running orphans whose owner is dead (remove
  ephemeral / stop persistent), consults an in-container adoption
  marker so a live process that reattached a dead creator's container
  is never swept, and is awaited at boot so it can't race reattaches.

- The debounced idle teardowns are unref'd timers that died with the
  process: every graceful quit leaked the recently-active containers as
  running zombies. Runtime shutdown now flushes pending teardowns
  (BuiltinAgentsServer.stop), leaving live-leased entries to their own
  dispose or the next boot's sweep.

- A failed post-start init (mkdir exec) left a started container that
  was never registered - invisible to dispose and, while running, to
  the sweep. Creation now verifies the init exit code and removes the
  container on any failure. Fixing this exposed that isNameConflict()
  treated any HTTP 409 as a lost create race (exec on an exited
  container is also 409), silently "reattaching" to a removed
  container; it now matches the daemon's name-conflict message.

- Sandbox creation was eager on every claimed wake, so a reconnect
  backlog of trivial wakes (cron ticks, bookkeeping) stampeded the
  daemon with containers. The docker profile now returns a lazySandbox
  wrapper that defers the provider factory to first actual use.
  Terminal reclaim without use goes through reclaimDockerSandboxByKey
  (no create-to-delete), spawn-inherit force-materializes the owner's
  workspace before the child can attach, and concurrent creations are
  capped at 4 to smooth real bursts.

- All sandbox containers now carry compose project/service labels
  (com.docker.compose.project=electric-sandboxes) so Docker Desktop
  groups them under one entry and they can be stopped/deleted together
  (docker compose -p electric-sandboxes down).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
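
The cap of 4 concurrent creations mentioned in the commit message amounts to a counting semaphore around the create call. The sketch below shows one common way to write such a limiter; createSlotLimiter is a hypothetical name and this is not the withCreationSlot source.

```typescript
// Returns a wrapper that lets at most `limit` tasks run at once.
// Extra callers wait in FIFO order; total work is not bounded.
function createSlotLimiter(limit: number) {
  let active = 0;
  const waiters: Array<() => void> = [];

  return async function withSlot<T>(task: () => Promise<T>): Promise<T> {
    if (active >= limit) {
      // All slots are busy: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiters.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await task();
    } finally {
      const next = waiters.shift();
      if (next) {
        next(); // hand the slot to the next waiter; `active` is unchanged
      } else {
        active -= 1;
      }
    }
  };
}

// Process-wide limiter with the cap described in the PR.
const withCreationSlotSketch = createSlotLimiter(4);
```

Only the wrapped creation call goes through the limiter, so reattaches and execs that bypass it stay unlimited, matching the behaviour described in the PR.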

github-actions Bot commented Jun 4, 2026

Electric Agents Desktop Builds

Build artifacts for commit da03e03.

| Platform | Status | Artifact |
| --- | --- | --- |
| macOS Apple Silicon | Passed | DMG |
| macOS Intel | Passed | DMG |
| Windows x64 | Passed | Installer |
| Linux x64 | Passed | AppImage / deb |

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 81.35593% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.01%. Comparing base (1149beb) to head (da03e03).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| packages/agents-runtime/src/sandbox/docker.ts | 86.07% | 22 Missing ⚠️ |
| packages/agents-runtime/src/sandbox/lazy.ts | 79.43% | 22 Missing ⚠️ |
| packages/agents/src/bootstrap.ts | 65.00% | 7 Missing ⚠️ |
| packages/agents-runtime/src/process-wake.ts | 60.00% | 2 Missing ⚠️ |
| packages/agents/src/server.ts | 60.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4513      +/-   ##
==========================================
+ Coverage   55.80%   56.01%   +0.20%     
==========================================
  Files         300      301       +1     
  Lines       34047    34291     +244     
  Branches     9671     9739      +68     
==========================================
+ Hits        18999    19207     +208     
- Misses      15029    15065      +36     
  Partials       19       19              

| Flag | Coverage Δ |
| --- | --- |
| packages/agents | 71.27% <64.00%> (-0.25%) ⬇️ |
| packages/agents-mobile | 84.09% <ø> (ø) |
| packages/agents-runtime | 81.05% <82.96%> (+0.09%) ⬆️ |
| packages/agents-server | 72.76% <ø> (ø) |
| packages/agents-server-ui | 5.79% <ø> (ø) |
| packages/electric-ax | 46.42% <ø> (ø) |
| typescript | 56.01% <81.35%> (+0.20%) ⬆️ |
| unit-tests | 56.01% <81.35%> (+0.20%) ⬆️ |

netlify Bot commented Jun 4, 2026

Deploy Preview for electric-next ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | da03e03 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/electric-next/deploys/6a21bc65f019340008be27fc |
| 😎 Deploy Preview | https://deploy-preview-4513--electric-next.netlify.app |

github-actions Bot commented Jun 4, 2026

Electric Agents Mobile Build

Local mobile checks ran for commit da03e03.

The EAS Android preview build was skipped because the mobile-eas-build label is not present.
Add the mobile-eas-build label to this PR to produce an installable preview build.
