fix(state): graceful worker shutdown when shared-state manager dies #476

Draft
mrosseel wants to merge 1 commit into brickbots:main from mrosseel:fix/graceful-worker-shutdown-shared-state

Problem

When the multiprocessing Manager process that owns SharedStateObj dies, every worker that holds a proxy (Solver, Camera, Integrator, ...) gets a BrokenPipeError / ConnectionResetError / EOFError on its next shared_state.* call. The proxy connection is then permanently broken — there is no recovery by retrying.

On a real device this showed up as the solver spinning forever and logging "Broken pipe" every ~2 s: pifinder.log held 70k+ such lines going back to 2026-03, which predates the cedar solver work, so cedar is not the cause. Representative trace:

solver.py  -> state_utils.sleep_for_framerate(shared_state)
           -> shared_state.power_state()
           -> manager conn.send(...)
           -> BrokenPipeError: [Errno 32] Broken pipe

Investigation / root cause

SharedStateObj is a multiprocessing Manager proxy created in the main process; each worker process inherits a proxy. When the manager process is gone, proxy calls raise connection errors.

The flood comes specifically from the solver's nested loop structure (solver.py):

  • An outer while True: "restart" loop recreates the cedar client and re-enters an inner while True: work loop.
  • sleep_for_framerate() is the first call in the inner loop and calls shared_state.power_state() — the documented spin site.
  • When it raised BrokenPipeError, the exception was not caught locally; it propagated to the outer except Exception, which logged and then fell through to the bottom of the outer while True — restarting the loop. Re-creating the cedar client takes ~2 s, then power_state() immediately fails again. Infinite ~2 s spin.
  • The inner except (BrokenPipeError, ConnectionResetError) around last_image_metadata() used continue, which went straight back to the failing sleep_for_framerate() call at the top of the inner loop; except EOFError likewise fell through to a restart.
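
The spin described in the list above can be reproduced in miniature. This is a hedged sketch, not the code in solver.py: FakeSharedState, buggy_solver and the max_restarts cap are hypothetical names introduced only so the demo terminates, and the ~2 s cedar client re-creation is reduced to a comment.

```python
class FakeSharedState:
    """Hypothetical stand-in for a proxy whose manager process has died."""

    def power_state(self):
        raise BrokenPipeError(32, "Broken pipe")


def sleep_for_framerate(shared_state):
    # Simplified: only the proxy call matters for this demo.
    return shared_state.power_state()


def buggy_solver(shared_state, max_restarts=5):
    """Mimics the old structure: the outer handler logs, then restarts.

    max_restarts exists only so the demo ends; the loop described in the
    PR had no such cap.
    """
    log = []
    restarts = 0
    while restarts < max_restarts:  # outer "restart" loop
        try:
            # (re-creating the cedar client would happen here, ~2 s)
            while True:  # inner work loop
                sleep_for_framerate(shared_state)
        except Exception as exc:  # broad handler: log and fall through
            log.append(str(exc))
            restarts += 1
    return log


spin_log = buggy_solver(FakeSharedState())
print(len(spin_log), "log lines, all alike:", spin_log[0])
```

Every pass through the outer loop produces one more identical log line, which is the pattern that filled pifinder.log.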

The other workers were inconsistent: the integrator caught only EOFError (a BrokenPipeError from sleep_for_framerate would propagate uncaught and crash the process); the camera caught (BrokenPipeError, EOFError, FileNotFoundError) but not ConnectionResetError, and logged a full traceback.

The exact event that kills the manager (main-process crash, manager subprocess exit, or shutdown ordering) is not determinable from the logs alone — it predates the cedar work and reproducing it needs a running device. This PR does not try to keep the manager alive; it makes workers degrade gracefully when it is gone. See the open question below.

Fix

Add a small shared helper in state_utils.py and apply it consistently:

  • DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError) — the proxy errors that mean "manager is gone".
  • SharedStateLost — a named signal that sleep_for_framerate() now raises (from the original error) so the documented spin site surfaces one intentional exception instead of a raw connection error.
  • is_dead_manager_error(exc) — detector used by worker loops; matches the tuple and SharedStateLost.
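
For readers who have not opened the diff, the three names above might look roughly like this. It is a minimal sketch under assumptions: the base class of SharedStateLost and the exact body of is_dead_manager_error are not stated in this description and are guessed here.

```python
# Proxy errors that mean "the manager process is gone".
DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Named signal for a dead manager (base class assumed to be Exception)."""


def is_dead_manager_error(exc: BaseException) -> bool:
    """Return True for the raw connection errors and for SharedStateLost."""
    return isinstance(exc, DEAD_MANAGER_EXCEPTIONS + (SharedStateLost,))


# Ordinary errors, including other OSError subclasses, are not matched.
print(is_dead_manager_error(EOFError()))           # dead-manager error
print(is_dead_manager_error(FileNotFoundError()))  # unrelated OSError
```

Keeping the tuple and the detector in one module lets every worker use the same definition of "manager is gone" instead of its own except clause.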

Applied in the worker loops so a dead manager is logged once and the loop exits cleanly instead of retrying:

  • solver.py — both the inner last_image_metadata() handler and the outer handler now return on a dead-manager error (no more outer-loop restart); the redundant except EOFError block is folded into the unified check.
  • integrator.py — broadened from EOFError-only to any dead-manager error; other exceptions still propagate as before.
  • camera_interface.py — added ConnectionResetError / SharedStateLost; FileNotFoundError handling preserved; other exceptions still propagate.
  • state_utils.sleep_for_framerate — translates a dead-manager error from power_state() into SharedStateLost.
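
The pattern shared by the worker loops can be sketched as follows. This is an illustration, not the code in solver.py, integrator.py or camera_interface.py: FlakyProxy and worker_loop are hypothetical, and sleep_for_framerate is reduced to the proxy call plus the error translation (the real helper also does framerate limiting and power-off sleep).

```python
import logging

logger = logging.getLogger("worker_sketch")

DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Assumed to derive from Exception; the base class is not stated."""


def sleep_for_framerate(shared_state):
    """Simplified: only the proxy call and its error translation are shown."""
    try:
        return shared_state.power_state()
    except DEAD_MANAGER_EXCEPTIONS as exc:
        # Chain the original error so it stays available as __cause__.
        raise SharedStateLost("shared-state manager is gone") from exc


class FlakyProxy:
    """Hypothetical proxy: answers a few calls, then acts like a dead manager."""

    def __init__(self, healthy_calls):
        self.remaining = healthy_calls

    def power_state(self):
        if self.remaining <= 0:
            raise BrokenPipeError(32, "Broken pipe")
        self.remaining -= 1
        return 1


def worker_loop(shared_state):
    """Run until the manager disappears, then log once and return."""
    frames = 0
    while True:
        try:
            sleep_for_framerate(shared_state)
            frames += 1  # placeholder for the real per-frame work
        except (SharedStateLost, *DEAD_MANAGER_EXCEPTIONS) as exc:
            logger.error("Shared state lost, stopping worker: %r", exc)
            return frames  # exit instead of retrying


frames_done = worker_loop(FlakyProxy(healthy_calls=3))
```

Returning from the loop, rather than continuing or restarting, is what turns the endless retry into a single log line followed by a clean exit.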

Behaviour is unchanged while the manager is healthy (same return values, same framerate limiting, same power-off sleep).

Tests

  • New tests/test_state_utils.py (13 cases): detector matches the dead-manager set and rejects ordinary errors; sleep_for_framerate returns the same awake/sleep values when healthy and translates dead-manager errors into SharedStateLost (preserving __cause__).
  • Full suite green: pytest -m unit (453 passed), pytest -m smoke (5 passed), plus test_solver_shmem.py / test_integrator_drift.py. ruff check/format and mypy clean on changed files.
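
To give a flavour of the cases listed above, here is a sketch in the spirit of tests/test_state_utils.py. It is not the actual test file: the helper is redefined in simplified form so the block runs on its own, the pass-through return value is a simplification, and the test names are hypothetical.

```python
from unittest.mock import Mock

DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Assumed to derive from Exception."""


def sleep_for_framerate(shared_state):
    """Simplified stand-in for the helper under test."""
    try:
        return shared_state.power_state()
    except DEAD_MANAGER_EXCEPTIONS as exc:
        raise SharedStateLost("shared-state manager is gone") from exc


def test_healthy_manager_is_unchanged():
    state = Mock()
    state.power_state.return_value = 1
    assert sleep_for_framerate(state) == 1


def test_dead_manager_errors_are_translated():
    for exc_type in DEAD_MANAGER_EXCEPTIONS:
        state = Mock()
        state.power_state.side_effect = exc_type("manager gone")
        try:
            sleep_for_framerate(state)
        except SharedStateLost as lost:
            # The original connection error must survive as __cause__.
            assert isinstance(lost.__cause__, exc_type)
        else:
            raise AssertionError(f"{exc_type.__name__} was not translated")


def test_ordinary_errors_still_propagate():
    state = Mock()
    state.power_state.side_effect = ValueError("unrelated")
    try:
        sleep_for_framerate(state)
    except ValueError:
        pass
    else:
        raise AssertionError("ValueError should propagate unchanged")


# Run directly so the sketch works without a test runner.
ALL_TESTS = (
    test_healthy_manager_is_unchanged,
    test_dead_manager_errors_are_translated,
    test_ordinary_errors_still_propagate,
)
for test in ALL_TESTS:
    test()
```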

Open question

Why the manager process dies in the first place is still unknown and likely needs runtime reproduction on a device (candidates: a main-process exception, the manager subprocess being reaped, or teardown ordering during shutdown). This PR ships the graceful-degradation fix regardless, so a dead manager can no longer flood the logs.

🤖 Generated with Claude Code

When the multiprocessing Manager that owns SharedStateObj dies, worker
proxy calls raise BrokenPipeError/ConnectionResetError/EOFError. The
solver's outer restart loop turned this into an infinite ~2s spin that
flooded the log (70k+ "Broken pipe" lines), and the integrator/camera
loops handled it inconsistently.

Add a shared helper in state_utils (DEAD_MANAGER_EXCEPTIONS, the
SharedStateLost signal raised by sleep_for_framerate, and
is_dead_manager_error) and apply it in the solver, integrator and camera
worker loops so a dead manager is logged once and the loop exits cleanly
instead of retrying forever. Behaviour is unchanged while the manager is
healthy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>