fix(state): graceful worker shutdown when shared-state manager dies by mrosseel · Pull Request #476 · brickbots/PiFinder

mrosseel · 2026-06-17T09:38:47Z

Problem

When the multiprocessing Manager process that owns SharedStateObj dies, every worker that holds a proxy (Solver, Camera, Integrator, ...) gets a BrokenPipeError / ConnectionResetError / EOFError on its next shared_state.* call. The proxy connection is then permanently broken — there is no recovery by retrying.

On a real device this manifested as the solver spinning forever, logging "Broken pipe" every ~2 s (70k+ such lines in pifinder.log, going back to 2026-03; not caused by the cedar solver). Representative trace:

solver.py  -> state_utils.sleep_for_framerate(shared_state)
           -> shared_state.power_state()
           -> manager conn.send(...)
           -> BrokenPipeError: [Errno 32] Broken pipe

Investigation / root cause

SharedStateObj is a multiprocessing Manager proxy created in the main process; each worker process inherits a proxy. When the manager process is gone, proxy calls raise connection errors.

The flood comes specifically from the solver's nested loop structure (solver.py):

An outer while True: "restart" loop recreates the cedar client and re-enters an inner while True: work loop.
sleep_for_framerate() is the first call in the inner loop and calls shared_state.power_state() — the documented spin site.
When it raised BrokenPipeError, the exception was not caught locally; it propagated to the outer except Exception, which logged and then fell through to the bottom of the outer while True — restarting the loop. Re-creating the cedar client takes ~2 s, then power_state() immediately fails again. Infinite ~2 s spin.
The inner except (BrokenPipeError, ConnectionResetError) around last_image_metadata() used continue, which jumped back to the top of the inner loop and straight into the failing power_state() call; except EOFError likewise fell through to a restart.
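The loop shape described above can be reproduced without a device. The following is a simplified stand-in, not the actual solver.py: FakeSharedState, old_solver_loop and the max_restarts cap are hypothetical names added so the example terminates, whereas the real outer loop has no cap and spends about 2 s re-creating the client on each pass.

```python
class FakeSharedState:
    """Hypothetical stand-in for a proxy whose manager process has died."""

    def power_state(self):
        raise BrokenPipeError(32, "Broken pipe")


def old_solver_loop(shared_state, max_restarts=3):
    """Simplified pre-fix loop shape; returns (restart count, log lines)."""
    restarts = 0
    log = []
    while restarts < max_restarts:  # stands in for the outer `while True`
        try:
            # Re-creating the solver client would happen here on a device.
            while True:  # inner work loop
                shared_state.power_state()  # first call in the loop: the spin site
        except Exception as exc:
            # Logged, then control falls through to the top of the outer loop.
            log.append(f"solver error: {exc}")
            restarts += 1
    return restarts, log


restarts, log = old_solver_loop(FakeSharedState())
print(restarts, log[0])
```

Every pass produces one log line and one restart, which is the pattern seen in pifinder.log; nothing in this structure can ever break out, because the broken proxy never recovers.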

The other workers were inconsistent: the integrator caught only EOFError (a BrokenPipeError from sleep_for_framerate would propagate uncaught and crash the process); the camera caught (BrokenPipeError, EOFError, FileNotFoundError) but not ConnectionResetError, and logged a full traceback.
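The gaps in the old handlers follow directly from Python's built-in exception hierarchy. A small check, where the tuple names camera_old, integrator_old and unified are illustrative labels for the handlers described above rather than identifiers from the code base:

```python
# Exception tuples as described for the pre-fix handlers and the unified set.
integrator_old = (EOFError,)
camera_old = (BrokenPipeError, EOFError, FileNotFoundError)
unified = (BrokenPipeError, ConnectionResetError, EOFError)


def caught_by(handler, exc_type):
    """True if an `except handler:` clause would catch exc_type."""
    return issubclass(exc_type, handler)


coverage = {
    exc_type.__name__: {
        "integrator_old": caught_by(integrator_old, exc_type),
        "camera_old": caught_by(camera_old, exc_type),
        "unified": caught_by(unified, exc_type),
    }
    for exc_type in unified
}

for name, row in coverage.items():
    print(name, row)
```

BrokenPipeError and ConnectionResetError are siblings under ConnectionError, and EOFError sits outside the OSError tree entirely, so catching any one of them says nothing about the other two.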

The exact event that kills the manager (main-process crash, manager subprocess exit, or shutdown ordering) is not determinable from the logs alone — it predates the cedar work and reproducing it needs a running device. This PR does not try to keep the manager alive; it makes workers degrade gracefully when it is gone. See the open question below.

Fix

Add a small shared helper in state_utils.py and apply it consistently:

DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError) — the proxy errors that mean "manager is gone".
SharedStateLost — a named signal that sleep_for_framerate() now raises (from the original error) so the documented spin site surfaces one intentional exception instead of a raw connection error.
is_dead_manager_error(exc) — detector used by worker loops; matches the tuple and SharedStateLost.
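A minimal sketch of what such a helper could look like, based only on the description above. The base class of SharedStateLost and the exact body of is_dead_manager_error are assumptions; the code in the PR may differ.

```python
# The three proxy errors named above as meaning "manager is gone".
DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Named signal for a dead manager (deriving from Exception is an assumption)."""


def is_dead_manager_error(exc: BaseException) -> bool:
    """Return True if exc indicates the shared-state manager is unreachable."""
    return isinstance(exc, DEAD_MANAGER_EXCEPTIONS + (SharedStateLost,))


print(is_dead_manager_error(BrokenPipeError(32, "Broken pipe")))
print(is_dead_manager_error(SharedStateLost()))
print(is_dead_manager_error(FileNotFoundError("no socket")))
print(is_dead_manager_error(ValueError("unrelated")))
```

FileNotFoundError is deliberately outside the set in this sketch, matching the description that the camera keeps its own separate handling for it.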

Applied in the worker loops so a dead manager is logged once and the loop exits cleanly instead of retrying:

solver.py — both the inner last_image_metadata() handler and the outer handler now return on a dead-manager error (no more outer-loop restart); the redundant except EOFError block is folded into the unified check.
integrator.py — broadened from EOFError-only to any dead-manager error; other exceptions still propagate as before.
camera_interface.py — added ConnectionResetError / SharedStateLost; FileNotFoundError handling preserved; other exceptions still propagate.
state_utils.sleep_for_framerate — translates a dead-manager error from power_state() into SharedStateLost.
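The pattern applied across the workers can be sketched as follows. This is an illustration under assumptions, not the PR's code: the sleep durations, the boolean return value of sleep_for_framerate, the power_state() <= 0 test, and the names worker_loop and FlakyState are all hypothetical.

```python
import logging
import time

logger = logging.getLogger("worker")

DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Raised when the shared-state manager is unreachable."""


def sleep_for_framerate(shared_state, limit_framerate=True):
    """Return True while powered off, else False (return values are assumed)."""
    try:
        power = shared_state.power_state()
    except DEAD_MANAGER_EXCEPTIONS as exc:
        # Surface one intentional exception instead of a raw connection error.
        raise SharedStateLost("shared-state manager is gone") from exc
    if power <= 0:
        time.sleep(0.2)
        return True
    if limit_framerate:
        time.sleep(0.02)
    return False


def worker_loop(shared_state, do_work):
    """Run until the manager disappears, then log once and return."""
    while True:
        try:
            if sleep_for_framerate(shared_state):
                continue
            do_work()
        except (SharedStateLost, *DEAD_MANAGER_EXCEPTIONS) as exc:
            logger.error("Shared state lost (%r); exiting worker", exc)
            return "shared-state-lost"


class FlakyState:
    """Hypothetical proxy that answers a few calls, then behaves as if the manager died."""

    def __init__(self, healthy_calls):
        self.remaining = healthy_calls

    def power_state(self):
        if self.remaining == 0:
            raise BrokenPipeError(32, "Broken pipe")
        self.remaining -= 1
        return 1


frames = []
reason = worker_loop(FlakyState(healthy_calls=2), lambda: frames.append(1))
print(reason, len(frames))
```

Unrelated exceptions raised by do_work are not caught here, which mirrors the statement that other exceptions still propagate as before.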

Behaviour is unchanged while the manager is healthy (same return values, same framerate limiting, same power-off sleep).

Tests

New tests/test_state_utils.py (13 cases): detector matches the dead-manager set and rejects ordinary errors; sleep_for_framerate returns the same awake/sleep values when healthy and translates dead-manager errors into SharedStateLost (preserving __cause__).
Full suite green: pytest -m unit (453 passed), pytest -m smoke (5 passed), plus test_solver_shmem.py / test_integrator_drift.py. ruff check/format and mypy clean on changed files.
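The cause-preservation check described above might look roughly like this. It is a sketch and not a copy of tests/test_state_utils.py: the trimmed sleep_for_framerate stand-in skips the sleeping, and check_translation is a hypothetical helper used in place of pytest fixtures.

```python
from unittest.mock import Mock

DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Raised when the shared-state manager is unreachable."""


def sleep_for_framerate(shared_state):
    """Trimmed stand-in: only the translation path, no sleeping."""
    try:
        return shared_state.power_state() <= 0
    except DEAD_MANAGER_EXCEPTIONS as exc:
        raise SharedStateLost("shared-state manager is gone") from exc


def check_translation(exc_type):
    """True if exc_type is translated to SharedStateLost with __cause__ kept."""
    original = exc_type("boom")
    state = Mock()
    state.power_state.side_effect = original  # proxy call raises this instance
    try:
        sleep_for_framerate(state)
    except SharedStateLost as lost:
        return lost.__cause__ is original
    return False


results = {t.__name__: check_translation(t) for t in DEAD_MANAGER_EXCEPTIONS}
print(results)
```

Raising with `from exc` is what sets __cause__, so the original connection error stays available in the traceback and to any handler that wants to inspect it.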

Open question

Why the manager process dies in the first place is still unknown and likely needs runtime reproduction on a device (candidates: a main-process exception, the manager subprocess being reaped, or teardown ordering during shutdown). This PR ships the graceful-degradation fix regardless, so a dead manager can no longer flood the logs.

🤖 Generated with Claude Code

When the multiprocessing Manager that owns SharedStateObj dies, worker proxy calls raise BrokenPipeError/ConnectionResetError/EOFError. The solver's outer restart loop turned this into an infinite ~2s spin that flooded the log (70k+ "Broken pipe" lines), and the integrator/camera loops handled it inconsistently.

Add a shared helper in state_utils (DEAD_MANAGER_EXCEPTIONS, the SharedStateLost signal raised by sleep_for_framerate, and is_dead_manager_error) and apply it in the solver, integrator and camera worker loops so a dead manager is logged once and the loop exits cleanly instead of retrying forever. Behaviour is unchanged while the manager is healthy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mrosseel wants to merge 1 commit into brickbots:main from mrosseel:fix/graceful-worker-shutdown-shared-state
