fix(sync-service): don't suspend a consumer mid-transaction (#4501)#4503
fix(sync-service): don't suspend a consumer mid-transaction (#4501)#4503erik-the-implementer wants to merge 1 commit into
Conversation
A consumer could suspend on its idle timeout while holding a pending_txn for an in-flight multi-fragment transaction, dropping that state. When a later fragment of the transaction arrived, a fresh consumer received a has_begin?: false fragment with pending_txn=nil and crashed in process_txn_fragment/2 (KeyError on :consider_flushed?). Guard consumer_can_suspend?/1 on is_nil(pending_txn) so the consumer hibernates instead and suspends only once the transaction completes. Fixes #4501 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ 108 Tests Failed:
View the top 3 failed test(s) by shortest run time
View the full list of 16 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
Claude Code ReviewSummaryThis PR fixes the production crash What's Working Well
Issues FoundCritical (Must Fix)None. Important (Should Fix)Defense in depth: the crash in File: This PR removes the most common trigger (idle-timeout suspend), but the underlying fragility remains. Consider adding a defensive clause so the consumer degrades gracefully regardless of why it lost its I haven't fully traced whether a single consumer crash triggers any Suggestions (Nice to Have)
Issue ConformanceThe implementation directly addresses #4501. The linked issue is a Sentry-sourced crash report with a clear stacktrace pointing at Previous Review StatusN/A — first review of this PR. Review iteration: 1 | 2026-06-04 |
Summary
Fixes #4501 —
KeyError: key :consider_flushed? not found in: nilinElectric.Shapes.Consumer.process_txn_fragment/2.A shape consumer could suspend (terminate to reclaim memory) on its idle timeout while still holding a
pending_txnfor an in-flight multi-fragment transaction. The producer'sEventRoutertracks "this shape already saw the begin for the current xid" keyed byshape_handle, independently of consumer liveness — so when a later fragment of that transaction arrived,ConsumerRegistrystarted a fresh consumer and delivered ahas_begin?: falsefragment to it. The fresh consumer haspending_txn: nil, soprocess_txn_fragment/2dereferencednilattxn.consider_flushed?.Fix
consumer_can_suspend?/1now also requiresis_nil(state.pending_txn). A consumer that is mid-transaction hibernates instead of suspending, and only suspends once the transaction completes and it goes idle again.Why it happened in production
It needs the gap between two fragments of one transaction, as delivered to a single shape's consumer, to exceed
hibernate_after(10 min default). That lines up with a connection-scaledown backlog being replayed slowly: I correlated this to a specific source whose largest multi-fragment transaction (21 fragments) was smeared across ~10 minutes ending exactly at the crash. (Source id shared internally.)Test
test/electric/shapes/consumer_test.exs— "should hibernate not suspend while a multi-fragment transaction is pending": sends a begin fragment, asserts the consumer hibernates (does not emit{:shutdown, :suspend}) while pending, then sends the commit fragment and asserts it suspends. Verified to fail without the guard and pass with it.🤖 Generated with Claude Code