fix: scalable process lifecycle via PGID-based cleanup#180
fix: scalable process lifecycle via PGID-based cleanup#180baudbot-agent wants to merge 6 commits intomainfrom
Conversation
Greptile SummaryAdded defensive cleanup to kill orphaned Key changes:
The fix is targeted and low-risk, though the pattern match could potentially be more specific to avoid false positives. Confidence Score: 4/5
Important Files Changed
Last reviewed commit: cde4397 |
| # Kill any standalone start-bridge.sh supervisors (from manual restarts or | ||
| # deployments before tmux-managed bridge was introduced). These can survive | ||
| # control-agent restarts and cause port conflicts with the new bridge. | ||
| START_BRIDGE_PIDS=$(pgrep -f "start-bridge\.sh" 2>/dev/null || true) |
There was a problem hiding this comment.
Pattern "start-bridge\.sh" will match any process with this substring in its command line, including:
- Editor sessions editing a file named
start-bridge.sh greporfindcommands searching for this file- This very
pgrepcommand itself (though|| truehandles that)
Consider making it more specific, e.g., "bash.*start-bridge\.sh" or "/start-bridge\.sh" to match only actual script executions.
Prompt To Fix With AI
This is a comment left during a code review.
Path: pi/skills/control-agent/startup-pi.sh
Line: 100
Comment:
Pattern `"start-bridge\.sh"` will match any process with this substring in its command line, including:
- Editor sessions editing a file named `start-bridge.sh`
- `grep` or `find` commands searching for this file
- This very `pgrep` command itself (though `|| true` handles that)
Consider making it more specific, e.g., `"bash.*start-bridge\.sh"` or `"/start-bridge\.sh"` to match only actual script executions.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
No longer applicable — this entire manual pgrep-based cleanup section has been removed. The branch now uses PGID-based process group management (kill -TERM -$OLD_PGID in start.sh), so we don't track individual process names at all.
Responded by bentlegen using anthropic/claude-sonnet-4-20250514.
cde4397 to
743b177
Compare
Prevent port 7890 conflicts by thoroughly cleaning up ALL bridge-related processes and tmux sessions during startup-pi.sh cleanup. This fixes the recurring failure mode where orphaned bridge processes survive control-agent restarts and block the port. Root cause investigation showed: 1. Control-agent restarts with new session ID 2. Old bridge restart loop survives (was in wrong tmux session tree) 3. Both old and new bridge compete for port 7890 4. New bridge crash-loops with EADDRINUSE The new cleanup strategy: 1. Kill ALL tmux sessions named 'slack-bridge' (handles duplicate sessions) 2. Kill ALL bridge processes via pkill -9 (catches processes anywhere in tree) 3. Final safety: kill anything still on port 7890 Why this approach: - Simple and thorough (no complex process group tracking) - Low false positive risk (broker-bridge.mjs and bridge.mjs are unique names) - Handles edge cases like processes in wrong tmux session - Works with existing tmux-based architecture - Control-agent keeps full bridge access (can still tmux attach) Tested: Manually verified cleanup kills orphaned processes from different tmux sessions that previous approach missed.
Replace manual process-name-based cleanup with automatic process group management. This makes cleanup scalable and bulletproof. Current approach requires manually tracking and killing specific process names (start-bridge.sh, broker-bridge.mjs, tmux sessions, etc.). This: - Doesn't scale: each new service = more cleanup code - Is brittle: orphaned processes can survive if not explicitly killed - Creates port conflicts when manual interventions leave strays behind Use UNIX process groups for automatic lifecycle management: 1. start.sh launches control-agent with 'setsid' (new process group) 2. All spawned services (bridge, workers, etc.) inherit the PGID 3. On restart: kill -TERM -$OLD_PGID terminates entire tree automatically 4. No tracking of individual PIDs or process names needed **start.sh:** - Track control-agent PGID in ~/.pi/agent/control-agent.pgid - On startup: kill old PGID (if exists) before launching new one - Launch control-agent via 'setsid' to create new process group - Remove manual bridge cleanup (now handled by PGID termination) **startup-pi.sh:** - Remove ALL manual cleanup code (tmux, pkill, port checks, PID files) - Just launch services - cleanup happens automatically via process groups - Add comment explaining that tmux session is killed via PGID ✅ Scales to unlimited services (zero code per new service) ✅ Impossible to have orphaned processes (PGID kills all children) ✅ No manual process name tracking or port conflict handling ✅ Portable (works on any UNIX, no systemd dependency) ✅ Simple: just process groups, no complex lifecycle management -81 lines (cleanup code removed) +48 lines (PGID management added) = 33 fewer lines, zero ongoing maintenance burden Future services just need to be spawned from control-agent - they'll automatically be cleaned up on restart without any code changes.
Tmux creates its own session via setsid(), giving it a separate PGID from the control-agent. This means the PGID-based cleanup in start.sh doesn't kill tmux sessions, causing 'duplicate session' errors on restart. Fix: Explicitly check for and kill old tmux session before creating new one. This is a hybrid approach: - Most services: killed automatically via PGID (no code needed) - Tmux sessions: require explicit cleanup (they escape PGID isolation) Addresses review comment about orphaned tmux sessions surviving restart.
Replace per-session cleanup with prefix-based pattern. Now adding new
tmux sessions requires zero cleanup code.
## Changes
1. Rename session: slack-bridge → baudbot-slack-bridge
2. Kill all 'baudbot-*' sessions instead of individual session names
## Scalability
Before (not scalable):
• Add new tmux session → add new kill-session command
• O(n) cleanup code per session
After (scalable):
• Add new tmux session with baudbot- prefix
• Zero additional cleanup code needed
• O(1) regardless of number of sessions
## Pattern
All agent tmux sessions use 'baudbot-' prefix:
- baudbot-slack-bridge (Slack integration)
- baudbot-metrics (future: metrics collection)
- baudbot-health-monitor (future: health checks)
- etc.
One command kills them all:
tmux list-sessions -F '#{session_name}' | grep '^baudbot-' | xargs tmux kill-session -t
## Hybrid Approach
- Most services: killed via PGID (automatic, zero code)
- Tmux sessions: killed via prefix pattern (scalable, minimal code)
This combines the best of both: automatic cleanup for regular processes,
convention-based cleanup for tmux (which escapes PGID due to setsid).
6c392ba to
fd63a6f
Compare
exec setsid caused systemd (Type=simple) to lose track of the process, since setsid forks and the original PID exits immediately. The service would be marked 'inactive (dead)' despite pi running fine in background. Fix: save PID directly and exec pi in-place. When systemd launches start.sh, it's already a process group leader, so kill -TERM -$PID still terminates the entire tree on restart.
Problem
Recurrent port 7890 conflicts causing bridge restart failures. Multiple processes compete for the same port after
baudbot restart, and adding new services requires adding more manual cleanup code.Root Cause
Manual process-name-based cleanup is brittle and does not scale:
EADDRINUSEerrorsSolution
Replace manual cleanup with UNIX process group (PGID) management:
start.sh: Launch control-agent viasetsid(new process group), track PGID in~/.pi/agent/control-agent.pgidstart.sh: On restart,kill -TERM -$OLD_PGIDterminates the entire process tree automaticallystartup-pi.sh: Remove all manual cleanup code (bridge pids, port checks, PID files) — handled by PGIDstartup-pi.sh: Rename tmux session tobaudbot-slack-bridgeand addbaudbot-*prefix-based tmux cleanup (tmux escapes PGID due to its ownsetsid)Why tmux needs special handling
Tmux calls
setsid()internally, giving it a separate PGID from the control-agent. So PGID-based cleanup doesn't reach tmux sessions. The fix uses abaudbot-*naming convention — one grep kills all agent tmux sessions, zero code changes needed when adding new ones.Changes
start.sh:exec setsidfor process group isolationstartup-pi.sh:baudbot-*prefix-based tmux session cleanupslack-bridge→baudbot-slack-bridgeBenefits