Skip to content

fix: scalable process lifecycle via PGID-based cleanup#180

Open
baudbot-agent wants to merge 6 commits intomainfrom
fix/kill-orphaned-start-bridge-supervisors
Open

fix: scalable process lifecycle via PGID-based cleanup#180
baudbot-agent wants to merge 6 commits intomainfrom
fix/kill-orphaned-start-bridge-supervisors

Conversation

@baudbot-agent
Copy link
Collaborator

@baudbot-agent baudbot-agent commented Feb 25, 2026

Problem

Recurrent port 7890 conflicts causing bridge restart failures. Multiple processes compete for the same port after baudbot restart, and adding new services requires adding more manual cleanup code.

Root Cause

Manual process-name-based cleanup is brittle and does not scale:

  • Each service requires explicit kill logic (tmux sessions, bridge PIDs, port holders, PID files)
  • Orphaned processes survive if not explicitly tracked
  • Race conditions between cleanup and respawn cause EADDRINUSE errors

Solution

Replace manual cleanup with UNIX process group (PGID) management:

  1. start.sh: Launch control-agent via setsid (new process group), track PGID in ~/.pi/agent/control-agent.pgid
  2. start.sh: On restart, kill -TERM -$OLD_PGID terminates the entire process tree automatically
  3. startup-pi.sh: Remove all manual cleanup code (bridge pids, port checks, PID files) — handled by PGID
  4. startup-pi.sh: Rename tmux session to baudbot-slack-bridge and add baudbot-* prefix-based tmux cleanup (tmux escapes PGID due to its own setsid)

Why tmux needs special handling

Tmux calls setsid() internally, giving it a separate PGID from the control-agent. So PGID-based cleanup doesn't reach tmux sessions. The fix uses a baudbot-* naming convention — one grep kills all agent tmux sessions, zero code changes needed when adding new ones.

Changes

start.sh:

  • Replace bridge-specific cleanup (PID file, tmux kill, port release) with PGID-based lifecycle
  • Track control-agent PGID, graceful SIGTERM → SIGKILL on restart
  • Launch via exec setsid for process group isolation

startup-pi.sh:

  • Remove ~45 lines of manual cleanup (bridge processes, port checks, PID files)
  • Add baudbot-* prefix-based tmux session cleanup
  • Rename tmux session: slack-bridgebaudbot-slack-bridge

Benefits

  • ✅ Adding new services requires zero cleanup code
  • ✅ No orphaned processes (PGID kills all children)
  • ✅ No port conflicts from stale bridge processes
  • ✅ Portable (POSIX process groups, no systemd dependency)
  • ✅ Net reduction in code (~33 fewer lines)

@greptile-apps
Copy link

greptile-apps bot commented Feb 25, 2026

Greptile Summary

Added defensive cleanup to kill orphaned start-bridge.sh supervisor processes before starting the tmux-managed bridge. This prevents port 7890 conflicts when manual interventions or old deployments leave supervisor processes running.

Key changes:

  • New cleanup step kills start-bridge.sh processes with SIGTERM, then SIGKILL if needed
  • Follows existing pattern of graceful termination with fallback
  • Addresses production issue where orphaned supervisors survived control-agent restarts

The fix is targeted and low-risk, though the pattern match could potentially be more specific to avoid false positives.

Confidence Score: 4/5

  • Safe to merge with minimal risk - addresses a real production issue with defensive cleanup
  • The change is straightforward and follows existing patterns in the script. The SIGTERM → SIGKILL approach is consistent with other cleanup logic. Minor concern about the pattern match being overly broad, but the risk of accidentally killing unrelated processes is low in this environment.
  • No files require special attention - single file change with clear purpose

Important Files Changed

Filename Overview
pi/skills/control-agent/startup-pi.sh Added cleanup for orphaned start-bridge.sh supervisors to prevent port 7890 conflicts. Uses SIGTERM → SIGKILL pattern. Pattern match could be overly broad.

Last reviewed commit: cde4397

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

# Kill any standalone start-bridge.sh supervisors (from manual restarts or
# deployments before tmux-managed bridge was introduced). These can survive
# control-agent restarts and cause port conflicts with the new bridge.
START_BRIDGE_PIDS=$(pgrep -f "start-bridge\.sh" 2>/dev/null || true)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern "start-bridge\.sh" will match any process with this substring in its command line, including:

  • Editor sessions editing a file named start-bridge.sh
  • grep or find commands searching for this file
  • This very pgrep command itself (though || true handles that)

Consider making it more specific, e.g., "bash.*start-bridge\.sh" or "/start-bridge\.sh" to match only actual script executions.

Prompt To Fix With AI
This is a comment left during a code review.
Path: pi/skills/control-agent/startup-pi.sh
Line: 100

Comment:
Pattern `"start-bridge\.sh"` will match any process with this substring in its command line, including:
- Editor sessions editing a file named `start-bridge.sh`
- `grep` or `find` commands searching for this file
- This very `pgrep` command itself (though `|| true` handles that)

Consider making it more specific, e.g., `"bash.*start-bridge\.sh"` or `"/start-bridge\.sh"` to match only actual script executions.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer applicable — this entire manual pgrep-based cleanup section has been removed. The branch now uses PGID-based process group management (kill -TERM -$OLD_PGID in start.sh), so we don't track individual process names at all.

Responded by bentlegen using anthropic/claude-sonnet-4-20250514.

@baudbot-agent baudbot-agent force-pushed the fix/kill-orphaned-start-bridge-supervisors branch from cde4397 to 743b177 Compare February 26, 2026 02:37
Baudbot added 4 commits February 25, 2026 22:27
Prevent port 7890 conflicts by thoroughly cleaning up ALL bridge-related
processes and tmux sessions during startup-pi.sh cleanup. This fixes the
recurring failure mode where orphaned bridge processes survive control-agent
restarts and block the port.

Root cause investigation showed:
1. Control-agent restarts with new session ID
2. Old bridge restart loop survives (was in wrong tmux session tree)
3. Both old and new bridge compete for port 7890
4. New bridge crash-loops with EADDRINUSE

The new cleanup strategy:
1. Kill ALL tmux sessions named 'slack-bridge' (handles duplicate sessions)
2. Kill ALL bridge processes via pkill -9 (catches processes anywhere in tree)
3. Final safety: kill anything still on port 7890

Why this approach:
- Simple and thorough (no complex process group tracking)
- Low false positive risk (broker-bridge.mjs and bridge.mjs are unique names)
- Handles edge cases like processes in wrong tmux session
- Works with existing tmux-based architecture
- Control-agent keeps full bridge access (can still tmux attach)

Tested: Manually verified cleanup kills orphaned processes from different
tmux sessions that previous approach missed.
Replace manual process-name-based cleanup with automatic process group
management. This makes cleanup scalable and bulletproof.

Current approach requires manually tracking and killing specific process
names (start-bridge.sh, broker-bridge.mjs, tmux sessions, etc.). This:
- Doesn't scale: each new service = more cleanup code
- Is brittle: orphaned processes can survive if not explicitly killed
- Creates port conflicts when manual interventions leave strays behind

Use UNIX process groups for automatic lifecycle management:

1. start.sh launches control-agent with 'setsid' (new process group)
2. All spawned services (bridge, workers, etc.) inherit the PGID
3. On restart: kill -TERM -$OLD_PGID terminates entire tree automatically
4. No tracking of individual PIDs or process names needed

**start.sh:**
- Track control-agent PGID in ~/.pi/agent/control-agent.pgid
- On startup: kill old PGID (if exists) before launching new one
- Launch control-agent via 'setsid' to create new process group
- Remove manual bridge cleanup (now handled by PGID termination)

**startup-pi.sh:**
- Remove ALL manual cleanup code (tmux, pkill, port checks, PID files)
- Just launch services - cleanup happens automatically via process groups
- Add comment explaining that tmux session is killed via PGID

✅ Scales to unlimited services (zero code per new service)
✅ Impossible to have orphaned processes (PGID kills all children)
✅ No manual process name tracking or port conflict handling
✅ Portable (works on any UNIX, no systemd dependency)
✅ Simple: just process groups, no complex lifecycle management

-81 lines (cleanup code removed)
+48 lines (PGID management added)
= 33 fewer lines, zero ongoing maintenance burden

Future services just need to be spawned from control-agent - they'll
automatically be cleaned up on restart without any code changes.
Tmux creates its own session via setsid(), giving it a separate PGID
from the control-agent. This means the PGID-based cleanup in start.sh
doesn't kill tmux sessions, causing 'duplicate session' errors on restart.

Fix: Explicitly check for and kill old tmux session before creating new one.

This is a hybrid approach:
- Most services: killed automatically via PGID (no code needed)
- Tmux sessions: require explicit cleanup (they escape PGID isolation)

Addresses review comment about orphaned tmux sessions surviving restart.
Replace per-session cleanup with prefix-based pattern. Now adding new
tmux sessions requires zero cleanup code.

## Changes

1. Rename session: slack-bridge → baudbot-slack-bridge
2. Kill all 'baudbot-*' sessions instead of individual session names

## Scalability

Before (not scalable):
  • Add new tmux session → add new kill-session command
  • O(n) cleanup code per session

After (scalable):
  • Add new tmux session with baudbot- prefix
  • Zero additional cleanup code needed
  • O(1) regardless of number of sessions

## Pattern

All agent tmux sessions use 'baudbot-' prefix:
  - baudbot-slack-bridge (Slack integration)
  - baudbot-metrics (future: metrics collection)
  - baudbot-health-monitor (future: health checks)
  - etc.

One command kills them all:
  tmux list-sessions -F '#{session_name}' | grep '^baudbot-' | xargs tmux kill-session -t

## Hybrid Approach

- Most services: killed via PGID (automatic, zero code)
- Tmux sessions: killed via prefix pattern (scalable, minimal code)

This combines the best of both: automatic cleanup for regular processes,
convention-based cleanup for tmux (which escapes PGID due to setsid).
@benvinegar benvinegar force-pushed the fix/kill-orphaned-start-bridge-supervisors branch from 6c392ba to fd63a6f Compare February 26, 2026 03:29
@benvinegar benvinegar changed the title fix: kill orphaned start-bridge.sh supervisors on restart fix: scalable process lifecycle via PGID-based cleanup Feb 26, 2026
exec setsid caused systemd (Type=simple) to lose track of the process,
since setsid forks and the original PID exits immediately. The service
would be marked 'inactive (dead)' despite pi running fine in background.

Fix: save PID directly and exec pi in-place. When systemd launches
start.sh, it's already a process group leader, so kill -TERM -$PID
still terminates the entire tree on restart.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants