
Fix: Aggressively kill all bridge processes on startup #179

Merged
benvinegar merged 1 commit into main from
fix/bridge-aggressive-cleanup
Feb 25, 2026

Conversation

@baudbot-agent
Collaborator

Problem

The slack-bridge frequently fails to start with EADDRINUSE on port 7890 after control-agent restarts. This happens because:

  1. Control-agent restarts with a new session ID
  2. startup-pi.sh kills the tmux session and tries to kill port holders
  3. But a bridge process sometimes survives (detached, zombied, or in a wrong session tree)
  4. The orphaned process holds port 7890, causing the new bridge to crash-loop
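
To confirm this failure mode on a live host, it may help to look at which process holds the port and who its parent is. The following is only a diagnostic sketch: the helper names `port_holders` and `describe_holders` are hypothetical and not part of startup-pi.sh, and it assumes `lsof` is installed.

```shell
#!/usr/bin/env bash
# Diagnostic sketch; helper names are hypothetical and not taken from startup-pi.sh.

# Print the PIDs listening on the given TCP port.
# Output is empty if nothing is listening or if lsof is not installed.
port_holders() {
  lsof -nP -t -iTCP:"$1" -sTCP:LISTEN 2>/dev/null || true
}

# For each holder, print PID, parent PID, process group, and command line.
describe_holders() {
  local pid
  for pid in $(port_holders "$1"); do
    ps -o pid=,ppid=,pgid=,args= -p "$pid" || true
  done
  return 0
}

describe_holders 7890
```

A holder whose parent PID is 1 has often been reparented after its original parent exited, which would be consistent with a bridge that outlived its tmux session.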

Solution

Make the cleanup more aggressive:

  1. Kill tmux session (stops the restart loop)
  2. Kill ALL bridge processes whose command line matches the pattern node (broker-)?bridge\.mjs
    • Use SIGTERM first for graceful shutdown
    • Wait 3s, then SIGKILL stragglers
  3. Final safety check: kill anything still on port 7890

This ensures no bridge processes survive across control-agent restarts.
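
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in startup-pi.sh: the function names, the `BRIDGE_TMUX_SESSION` variable, and its default session name are placeholders, since the PR text does not give them. It assumes `tmux`, `pgrep`, and `lsof` are available.

```shell
#!/usr/bin/env bash
# Sketch of the cleanup order described above. Function and variable names are
# illustrative assumptions; they are not copied from startup-pi.sh.

BRIDGE_PATTERN='node (broker-)?bridge\.mjs'   # pattern as quoted in this PR
BRIDGE_PORT=7890
GRACE_SECONDS=3

# Send SIGTERM to the given PIDs, wait up to <grace> seconds for them to exit,
# then send SIGKILL to any that are still alive.
# Usage: terminate_pids <grace-seconds> [pid...]
terminate_pids() {
  local grace="$1"; shift
  [ "$#" -eq 0 ] && return 0
  kill -TERM "$@" 2>/dev/null || true
  local waited=0 pid alive
  while [ "$waited" -lt "$grace" ]; do
    alive=0
    for pid in "$@"; do
      kill -0 "$pid" 2>/dev/null && alive=1
    done
    [ "$alive" -eq 0 ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  for pid in "$@"; do
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null
  done
  return 0
}

cleanup_bridges() {
  local pids

  # 1. Kill the tmux session first so nothing respawns the bridge.
  #    The session name is a placeholder; the PR does not state it.
  tmux kill-session -t "${BRIDGE_TMUX_SESSION:-bridge}" 2>/dev/null || true

  # 2. Kill every bridge process by command-line pattern (SIGTERM, then SIGKILL).
  pids="$(pgrep -f "$BRIDGE_PATTERN" 2>/dev/null || true)"
  # Word splitting of the PID list is intended here.
  terminate_pids "$GRACE_SECONDS" $pids

  # 3. Safety net: kill anything still listening on the bridge port.
  pids="$(lsof -t -i "tcp:${BRIDGE_PORT}" 2>/dev/null || true)"
  [ -n "$pids" ] && kill -KILL $pids 2>/dev/null
  return 0
}

# Call cleanup_bridges from the startup script before launching the new bridge.
```

Killing the tmux session before the processes matters in this ordering: if the restart loop were still running, it could respawn a bridge between the pattern kill and the port check.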

Testing

  • Manually tested cleanup logic
  • Verified pgrep pattern matches both bridge.mjs and broker-bridge.mjs
  • Confirmed graceful SIGTERM → SIGKILL fallback works
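
The pattern check can be reproduced without any running processes. The sketch below uses `grep -E` as a stand-in for `pgrep -f`, which also applies an extended regular expression to the full command line; `matches_bridge` is a hypothetical helper, and the sample command lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustrative check of the pattern against sample command lines.
# grep -E stands in for pgrep -f so the check needs no live processes.

BRIDGE_PATTERN='node (broker-)?bridge\.mjs'   # pattern as quoted in this PR

# Succeed if the given command line matches the bridge pattern.
matches_bridge() {
  printf '%s\n' "$1" | grep -Eq -- "$BRIDGE_PATTERN"
}

for cmd in 'node bridge.mjs' 'node broker-bridge.mjs' 'node server.mjs' 'node /opt/app/bridge.mjs'; do
  if matches_bridge "$cmd"; then
    echo "match:    $cmd"
  else
    echo "no match: $cmd"
  fi
done
```

With the pattern exactly as written in this PR, the last sample does not match, because a path sits between `node` and the file name. If the bridge were ever launched with an absolute path, the pattern would need to allow for that.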

Impact

  • Prevents the common port 7890 conflict that requires manual intervention
  • Makes control-agent restarts more reliable
  • No change to normal operation (only affects startup cleanup)

Prevent port 7890 conflicts by killing ALL bridge processes (not just
port holders) during startup-pi.sh cleanup. This fixes the common
failure mode where:

1. Control-agent restarts and gets a new session ID
2. Runs startup-pi.sh which kills the tmux session
3. But a bridge process survives (detached/zombied/wrong session tree)
4. Old process holds port 7890, new bridge crash-loops

The new cleanup strategy:
- Kill tmux session (stops restart loop)
- Kill ALL node (broker-)?bridge.mjs processes with SIGTERM
- Wait 3s for graceful exit, then SIGKILL stragglers
- Final safety check: kill anything still on port 7890

This is more aggressive than the old approach but necessary to prevent
orphaned bridge processes after agent restarts.
@greptile-apps

greptile-apps bot commented Feb 25, 2026

Greptile Summary

Improved bridge cleanup robustness by switching from port-based to process pattern matching. The new approach kills all bridge processes (bridge.mjs and broker-bridge.mjs) using SIGTERM with 3s graceful shutdown, then SIGKILL for stragglers, followed by a final port 7890 safety check. This prevents orphaned bridge processes that survive tmux session cleanup from causing EADDRINUSE errors on control-agent restarts.

Key improvements:

  • More aggressive cleanup targets all bridge processes by pattern, not just port holders
  • Maintains graceful shutdown (SIGTERM → SIGKILL) for clean process termination
  • Final safety net still kills anything on port 7890
  • Well-documented rationale for the defensive approach

Confidence Score: 5/5

  • Safe to merge - defensive cleanup improvement with proper signal handling and fallbacks
  • The change is well-designed and addresses a real operational issue. The pgrep pattern correctly matches both bridge types, uses proper SIGTERM→SIGKILL escalation, maintains the port 7890 safety check as fallback, and includes clear documentation. The implementation follows shell best practices with proper error suppression and the pattern won't match test files or other unrelated processes.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| pi/skills/control-agent/startup-pi.sh | Changed cleanup strategy from port-based to process pattern matching to prevent orphaned bridge processes |

Last reviewed commit: 859e043

@benvinegar benvinegar merged commit 264d718 into main Feb 25, 2026
10 checks passed
