feat(sync-service): self-heal stuck replication slot creation #4515
erik-the-implementer wants to merge 5 commits into
…pg_log_standby_snapshot

When CREATE_REPLICATION_SLOT is blocked waiting on pending transactions, Connection.Manager now periodically runs pg_log_standby_snapshot() on the admin pool so Postgres can reach a consistent snapshot and the source recovers without a manual restart. Degrades gracefully (one-time notice) when the function is unavailable (PG < 16 or missing EXECUTE privilege).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reation

Holds an xid-bearing transaction open to block CREATE_REPLICATION_SLOT, then commits it and keeps the database idle, asserting that Electric forces a standby snapshot and resumes replication on its own (no manual restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4515 +/- ##
===========================================
- Coverage 69.32% 42.20% -27.12%
===========================================
Files 77 273 +196
Lines 9277 22113 +12836
Branches 2896 7289 +4393
===========================================
+ Hits 6431 9333 +2902
- Misses 2828 12722 +9894
- Partials 18 58 +40
Claude Code Review
Summary
Solid, well-scoped fix that makes Electric self-heal from a stuck logical replication slot creation (slot blocked on a pending transaction on an otherwise-idle DB) by periodically calling pg_log_standby_snapshot(), instead of requiring a manual source restart. The implementation piggybacks cleanly on existing machinery and degrades gracefully. No blocking concerns; a few suggestions below.
What's Working Well
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
None blocking.
Suggestions (Nice to Have)
Issue Conformance
No linked issue. Per project convention this is worth a note (PRs should reference the issue/incident they address), though the PR description SRE background (the customer investigation, …
Previous Review Status
First review, no prior iteration.
Review iteration: 1 | 2026-06-05
Summary
When a source's logical replication slot creation is blocked waiting on pending transactions, Electric now self-heals instead of requiring a manual source restart.
- Connection.Manager periodically runs SELECT pg_log_standby_snapshot() on the admin pool while slot creation is blocked, so Postgres can emit an XLOG_RUNNING_XACTS record and the logical snapshot builder reaches CONSISTENT as soon as the blocking transaction ends.
- … :replication_configuration status-check timer (it already detects the blocked state and dispatches :replication_slot_creation_blocked_by_pending_transactions); mirrors the existing :check_lock_not_abandoned admin-pool query pattern.
- … EXECUTE privilege): falls back to the previous behavior and emits a one-time :replication_slot_unblock_unavailable stack event + warning with remediation.
Background
From an SRE investigation of a customer ("Ajax") with repeated source-inactivity incidents: a long-running transaction pins the slot's restart_lsn, retained WAL grows past max_slot_wal_keep_size (4 GB), and Postgres invalidates the slot. Recreating the slot then blocks on the same transaction, and on an otherwise-idle database Postgres does not emit a fresh XLOG_RUNNING_XACTS record for a long time after the transaction commits, so the source stays stuck until someone restarts it. pg_log_standby_snapshot() forces that record on demand, making recovery automatic.
Test plan
- mix test test/electric/connection/manager_test.exs: includes a new degrade-path test (Repatch simulates the function being unavailable; asserts the one-time notice and that the source keeps waiting). 23 tests, 0 failures.
- mix test test/electric/postgres/replication_client_test.exs: no regressions.
- integration-tests/tests/self-heal-stuck-slot-creation.lux: end-to-end against real Postgres. A held transaction blocks slot creation; after it commits on an idle DB, Electric forces a standby snapshot and resumes replication on its own. Passes (lux SUCCESS).
- … @core/sync-service patch).
🤖 Generated with Claude Code
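The self-healing loop described in the Summary can be modelled roughly as below. This is an illustrative Python sketch, not Electric's Elixir implementation: `SlotUnblocker`, `SnapshotUnavailable`, and the injected callables are hypothetical names. Only the query text and the event name come from the PR description. The sketch also assumes the query is not retried after the first failure; the PR only states that the notice is emitted once.

```python
from typing import Callable, List

SNAPSHOT_SQL = "SELECT pg_log_standby_snapshot()"  # query named in the PR description


class SnapshotUnavailable(Exception):
    """Hypothetical error: function missing or EXECUTE privilege not granted."""


class SlotUnblocker:
    """Toy model of the periodic status check while slot creation is blocked."""

    def __init__(
        self,
        run_query: Callable[[str], None],
        emit_event: Callable[[str], None],
    ) -> None:
        self._run_query = run_query    # stand-in for the admin-pool query
        self._emit_event = emit_event  # stand-in for the stack-event dispatcher
        self._unavailable = False      # set once the function proves unusable

    def on_status_check(self, slot_creation_blocked: bool) -> None:
        """Called on every status-check tick."""
        if not slot_creation_blocked or self._unavailable:
            return  # nothing to do, or already degraded to the old behavior
        try:
            self._run_query(SNAPSHOT_SQL)
        except SnapshotUnavailable:
            # Assumption: stop retrying and notify exactly once.
            self._unavailable = True
            self._emit_event("replication_slot_unblock_unavailable")


def demo() -> List[str]:
    """Simulate three blocked ticks against a database lacking the function."""
    log: List[str] = []

    def failing_query(sql: str) -> None:
        log.append(f"query: {sql}")
        raise SnapshotUnavailable()

    unblocker = SlotUnblocker(failing_query, lambda name: log.append(f"event: {name}"))
    for _ in range(3):
        unblocker.on_status_check(slot_creation_blocked=True)
    return log


if __name__ == "__main__":
    for line in demo():
        print(line)
```

In this model the degrade path issues the query once, records a single event, and then goes quiet, while the healthy path issues the query on every blocked tick. Injecting the query and event functions keeps the model testable without a database, loosely analogous to how the degrade-path test in the test plan simulates the function being unavailable.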