fix(state): graceful worker shutdown when shared-state manager dies #476
Draft
mrosseel wants to merge 1 commit into
When the multiprocessing Manager that owns SharedStateObj dies, worker proxy calls raise BrokenPipeError/ConnectionResetError/EOFError. The solver's outer restart loop turned this into an infinite ~2s spin that flooded the log (70k+ "Broken pipe" lines), and the integrator/camera loops handled it inconsistently.

Add a shared helper in state_utils (DEAD_MANAGER_EXCEPTIONS, the SharedStateLost signal raised by sleep_for_framerate, and is_dead_manager_error) and apply it in the solver, integrator and camera worker loops so a dead manager is logged once and the loop exits cleanly instead of retrying forever. Behaviour is unchanged while the manager is healthy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
When the `multiprocessing` Manager process that owns `SharedStateObj` dies, every worker that holds a proxy (Solver, Camera, Integrator, ...) gets a `BrokenPipeError` / `ConnectionResetError` / `EOFError` on its next `shared_state.*` call. The proxy connection is then permanently broken; retrying cannot recover it.

On a real device this manifested as the solver spinning forever, logging "Broken pipe" every ~2 s (70k+ such lines in `pifinder.log`, going back to 2026-03; not caused by the cedar solver). Representative trace:

Investigation / root cause
`SharedStateObj` is a `multiprocessing` Manager proxy created in the main process; each worker process inherits a proxy. When the manager process is gone, proxy calls raise connection errors.

The flood comes specifically from the solver's nested loop structure (`solver.py`):

- The outer `while True:` "restart" loop recreates the cedar client and re-enters an inner `while True:` work loop.
- `sleep_for_framerate()` is the first call in the inner loop and calls `shared_state.power_state()`, the documented spin site.
- On a `BrokenPipeError`, the exception was not caught locally; it propagated to the outer `except Exception`, which logged and then fell through to the bottom of the outer `while True`, restarting the loop. Re-creating the cedar client takes ~2 s, then `power_state()` immediately fails again. The result is an infinite ~2 s spin.
- The `except (BrokenPipeError, ConnectionResetError)` around `last_image_metadata()` used `continue`, which fed the same loop, and `except EOFError` likewise fell through to a restart.

The other workers were inconsistent: the integrator caught only `EOFError` (a `BrokenPipeError` from `sleep_for_framerate` would propagate uncaught and crash the process); the camera caught `(BrokenPipeError, EOFError, FileNotFoundError)` but not `ConnectionResetError`, and logged a full traceback.

The exact event that kills the manager (main-process crash, manager subprocess exit, or shutdown ordering) is not determinable from the logs alone: it predates the cedar work, and reproducing it needs a running device. This PR does not try to keep the manager alive; it makes workers degrade gracefully when it is gone. See the open question below.
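The spin described above can be reproduced in miniature. The sketch below is not the code from `solver.py`; `DeadProxy`, `old_style_loop`, `graceful_loop` and the `max_restarts` cap are hypothetical names introduced only for illustration. It assumes a proxy whose every call fails, and contrasts a broad outer handler that restarts with a handler that returns on a dead-manager error.

```python
class DeadProxy:
    """Hypothetical stand-in for a proxy whose manager process has exited."""

    def power_state(self):
        raise BrokenPipeError(32, "Broken pipe")


def old_style_loop(shared_state, max_restarts=5):
    """Pre-fix shape: outer restart loop wrapped around an inner work loop.

    max_restarts exists only so this demo terminates; the loop described
    in the PR is unbounded.
    """
    log = []
    restarts = 0
    while restarts < max_restarts:          # outer "restart" loop
        try:
            # (the real code recreates the cedar client here, ~2 s)
            while True:                     # inner work loop
                shared_state.power_state()  # first call; fails at once
        except Exception as exc:            # broad handler: log, then restart
            log.append(str(exc))
        restarts += 1
    return log


def graceful_loop(shared_state):
    """Post-fix shape: a dead-manager error is logged once, then we return."""
    log = []
    while True:
        try:
            while True:
                shared_state.power_state()
        except (BrokenPipeError, ConnectionResetError, EOFError) as exc:
            log.append(f"shared state lost: {exc!r}")
            return log
        except Exception as exc:            # other errors still restart
            log.append(str(exc))


spin_log = old_style_loop(DeadProxy())
clean_log = graceful_loop(DeadProxy())
```

With the dead proxy, `old_style_loop` records one log line per restart until the artificial cap stops it, while `graceful_loop` records a single line and exits.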
Fix
Add a small shared helper in `state_utils.py` and apply it consistently:

- `DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)`: the proxy errors that mean "manager is gone".
- `SharedStateLost`: a named signal that `sleep_for_framerate()` now raises (from the original error) so the documented spin site surfaces one intentional exception instead of a raw connection error.
- `is_dead_manager_error(exc)`: detector used by worker loops; matches the tuple and `SharedStateLost`.

Applied in the worker loops so a dead manager is logged once and the loop exits cleanly instead of retrying:

- `solver.py`: both the inner `last_image_metadata()` handler and the outer handler now `return` on a dead-manager error (no more outer-loop restart); the redundant `except EOFError` block is folded into the unified check.
- `integrator.py`: broadened from `EOFError`-only to any dead-manager error; other exceptions still propagate as before.
- `camera_interface.py`: added `ConnectionResetError` / `SharedStateLost`; `FileNotFoundError` handling preserved; other exceptions still propagate.
- `state_utils.sleep_for_framerate`: translates a dead-manager error from `power_state()` into `SharedStateLost`.

Behaviour is unchanged while the manager is healthy (same return values, same framerate limiting, same power-off sleep).
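A minimal sketch of how such a helper could look, based only on the description above. The base class of `SharedStateLost`, the error message, and the simplified `sleep_for_framerate_sketch` (which skips the real framerate limiting and power-off sleep) are assumptions, as are `DeadProxy` and `worker_loop`; the actual `state_utils.py` may differ.

```python
# Proxy errors that mean "manager is gone" (tuple taken from the PR text).
DEAD_MANAGER_EXCEPTIONS = (BrokenPipeError, ConnectionResetError, EOFError)


class SharedStateLost(Exception):
    """Named signal: the shared-state manager can no longer be reached.

    Deriving from Exception is an assumption made for this sketch.
    """


def is_dead_manager_error(exc):
    """Return True for the raw connection errors and for SharedStateLost."""
    return isinstance(exc, DEAD_MANAGER_EXCEPTIONS + (SharedStateLost,))


def sleep_for_framerate_sketch(shared_state):
    """Simplified stand-in: only shows the error translation step."""
    try:
        return shared_state.power_state()
    except DEAD_MANAGER_EXCEPTIONS as exc:
        # "raise ... from exc" keeps the original error as __cause__.
        raise SharedStateLost("shared-state manager is gone") from exc


class DeadProxy:
    """Hypothetical proxy whose manager process has exited."""

    def power_state(self):
        raise EOFError()


def worker_loop(shared_state, log):
    """Hypothetical worker: log a dead manager once, then exit cleanly."""
    while True:
        try:
            sleep_for_framerate_sketch(shared_state)
            # ... one unit of work would go here ...
        except Exception as exc:
            if is_dead_manager_error(exc):
                log.append("shared state lost; exiting worker")
                return
            raise  # anything else still propagates


messages = []
worker_loop(DeadProxy(), messages)
```

Routing every loop through one detector keeps the three workers consistent: adding another connection error later would mean changing one tuple rather than each `except` clause.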
Tests
- `tests/test_state_utils.py` (13 cases): detector matches the dead-manager set and rejects ordinary errors; `sleep_for_framerate` returns the same awake/sleep values when healthy and translates dead-manager errors into `SharedStateLost` (preserving `__cause__`).
- `pytest -m unit` (453 passed), `pytest -m smoke` (5 passed), plus `test_solver_shmem.py` / `test_integrator_drift.py`.
- `ruff check` / `format` and `mypy` clean on changed files.

Open question
Why the manager process dies in the first place is still unknown and likely needs runtime reproduction on a device (candidates: a main-process exception, the manager subprocess being reaped, or teardown ordering during shutdown). This PR ships the graceful-degradation fix regardless, so a dead manager can no longer flood the logs.
🤖 Generated with Claude Code