fix(service): harden bridge service lifecycle #43

Open
drewstone wants to merge 2 commits into main from fix/bridge-health-watchdog-stability

@drewstone
Owner

Summary

  • move the systemd-facing bridge entrypoint to scripts/service-entry.ts so broad tsx.*src/server process kills do not match the long-lived bridge service
  • export startServer() from src/server.ts and keep direct tsx src/server.ts execution working for local development
  • preserve the existing server implementation while isolating service process identity
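
A minimal sketch of the entrypoint split described above, not the code in this PR. The `startServer(port)` signature, the `isDirectRun` helper, and the handler body are assumptions made for illustration; only the file names, `BRIDGE_PORT`, and the `/health` route come from this description.

```typescript
import * as http from "node:http";
import { resolve } from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical signature: the PR only states that startServer() is exported.
export function startServer(port: number): http.Server {
  const server = http.createServer((req, res) => {
    if (req.url === "/health") {
      res.writeHead(200, { "content-type": "text/plain" });
      res.end("ok");
      return;
    }
    res.writeHead(404);
    res.end();
  });
  return server.listen(port);
}

// Hypothetical helper: true when the module at `moduleUrl` is the script that
// was handed to the runtime (process.argv[1]), false when it was imported.
export function isDirectRun(moduleUrl: string, argv1: string | undefined): boolean {
  if (argv1 === undefined) return false;
  return moduleUrl === pathToFileURL(resolve(argv1)).href;
}

// In an ESM src/server.ts the guard could look like this (assumed, not from the PR):
//   if (isDirectRun(import.meta.url, process.argv[1])) {
//     startServer(Number(process.env.BRIDGE_PORT));
//   }
// scripts/service-entry.ts would then only import startServer and call it,
// so the service's command line no longer contains "src/server".
```

With a guard of this kind, `tsx src/server.ts` still starts the server for local development, while importing the module from `scripts/service-entry.ts` has no side effects until `startServer()` is called.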

Verification

  • pnpm typecheck
  • smoke-tested scripts/service-entry.ts on BRIDGE_PORT=3399 with /health returning ok

Context

This addresses the PR reviewer outage mode where an unrelated sudo pkill -f 'tsx.*src/server' in another repo matched tsx /home/drew/code/cli-bridge/src/server.ts, killed cli-bridge mid-stream, and caused PR reviewer to receive an empty bridge stream.
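
For illustration, `pkill -f` matches its pattern, an extended regular expression, against each process's full command line. The sketch below replays that match in TypeScript. The second command line is an assumption about how the service is launched after this change; the real one may carry extra interpreter or loader arguments.

```typescript
// Same pattern as the pkill call quoted above, written as a JS regular expression.
const killPattern = /tsx.*src\/server/;

// Hypothetical helper: would `pkill -f <pattern>` select this command line?
export function wouldBeKilled(cmdline: string, pattern: RegExp = killPattern): boolean {
  return pattern.test(cmdline);
}

// Command line from the outage described above.
const oldCmdline = "tsx /home/drew/code/cli-bridge/src/server.ts";
// Assumed command line once the service runs the new entrypoint.
const newCmdline = "tsx /home/drew/code/cli-bridge/scripts/service-entry.ts";

console.log(wouldBeKilled(oldCmdline)); // true
console.log(wouldBeKilled(newCmdline)); // false
```

The isolation depends only on the command line: `scripts/service-entry.ts` may still import `src/server.ts`, because the imported path never appears in the process's argv.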

drewstone added 2 commits May 16, 2026 04:05
…fety

Bridges (3395-3399) crash every 1-2h under review load. Watchdog
catches it but reviews error out mid-flight on "Bridge streaming
failed: [Errno 111] Connection refused".

Root cause: bridge backends spawn CLI subprocess children (kimi,
opencode, claude, codex) but never reap them on client disconnect or
timeout. Children stay alive for 20+ hours, leak file descriptors and
memory, and eventually trigger an OOM in the bridge process itself.

Fixes:
- New executors/process-tree.ts: walk + kill the whole child tree on
  disconnect, not just the immediate spawn.
- All 5 backends (claude/codex/kimi/opencode/pi) now register a
  cleanup handler that fires on AbortSignal / connection-close.
- Health probe (routes/health.ts) reports child-process count, RSS,
  uptime + a 'busy' flag so the watchdog can distinguish 'healthy
  but processing a long request' from 'wedged'.
- Concurrent-stream safety: backends previously assumed serial
  requests; they now use a per-client request slot so N parallel SSE
  consumers don't step on each other's stdout.
- url-translate helper + routes/translate.ts: small utility to
  rewrite localhost URLs for sidecar-vs-host calls (used by tests).
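
A rough sketch of the "walk + kill the whole child tree" idea from the first fix, under stated assumptions: the `ProcRow` shape, `descendantsOf`, and `killTree` are hypothetical names, and this is not the contents of executors/process-tree.ts. The process table is passed in as data (for example parsed from `ps -A -o pid=,ppid=`) and the kill function is injected, so the walk itself stays a pure function.

```typescript
// One row of a process-table snapshot (hypothetical shape).
export interface ProcRow {
  pid: number;
  ppid: number;
}

// Returns every descendant of `root`, deepest first, so that children are
// signalled before their parents. `root` itself is not included.
export function descendantsOf(root: number, table: ProcRow[]): number[] {
  const childrenOf = new Map<number, number[]>();
  for (const { pid, ppid } of table) {
    const list = childrenOf.get(ppid) ?? [];
    list.push(pid);
    childrenOf.set(ppid, list);
  }
  const out: number[] = [];
  const seen = new Set<number>([root]);
  const visit = (pid: number): void => {
    for (const child of childrenOf.get(pid) ?? []) {
      if (seen.has(child)) continue; // guard against cycles in a stale snapshot
      seen.add(child);
      visit(child);
      out.push(child);
    }
  };
  visit(root);
  return out;
}

// Signals the whole tree: descendants first, then the root.
// In real use `kill` could be (pid) => process.kill(pid, "SIGTERM").
export function killTree(root: number, table: ProcRow[], kill: (pid: number) => void): void {
  for (const pid of descendantsOf(root, table)) kill(pid);
  kill(root);
}
```

A cleanup handler registered on an AbortSignal or connection close could call `killTree` with the spawned CLI's pid, which covers grandchildren that killing only the immediate spawn would miss.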

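One way a watchdog might consume the extended health probe is sketched below. The field names, the verdict labels, and the restart policy are all assumptions for illustration; the commit only states which values the probe reports and that the 'busy' flag separates a long-running request from a wedged bridge.

```typescript
// Hypothetical shape of the /health payload; field names are assumed.
export interface HealthReport {
  childProcesses: number;
  rssBytes: number;
  uptimeSeconds: number;
  busy: boolean;
}

export type Verdict = "healthy" | "busy" | "leaking" | "wedged";

// `null` stands for a probe that failed or timed out.
export function classify(report: HealthReport | null): Verdict {
  if (report === null) return "wedged"; // no answer at all
  if (report.busy) return "busy"; // answering, but mid-request: leave it alone
  if (report.childProcesses > 0) return "leaking"; // idle, yet children remain
  return "healthy";
}

// Assumed policy: restart only when the bridge is wedged or leaking,
// never while it reports a request in flight.
export function shouldRestart(verdict: Verdict): boolean {
  return verdict === "wedged" || verdict === "leaking";
}
```

Under this policy a bridge that is slow because it is streaming a long review is not restarted mid-flight, while an idle bridge that still holds child processes is treated as a leak.
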
Tests:
- docker-executor.test.ts + smoke.test.ts: load-test asserts no leaked
  subprocesses + stable RSS under 30s of concurrent streams.

Branch out, force-push.