feat(sync-service): self-heal stuck replication slot creation by erik-the-implementer · Pull Request #4515 · electric-sql/electric

erik-the-implementer · 2026-06-05T13:59:46Z

Summary

When a source's logical replication slot creation is blocked waiting on pending transactions, Electric now self-heals instead of requiring a manual source restart.

Connection.Manager periodically runs SELECT pg_log_standby_snapshot() on the admin pool while slot creation is blocked, so Postgres can emit an XLOG_RUNNING_XACTS record and the logical snapshot builder reaches CONSISTENT as soon as the blocking transaction ends.
Piggybacks on the existing :replication_configuration status-check timer (it already detects the blocked state and dispatches :replication_slot_creation_blocked_by_pending_transactions); mirrors the existing :check_lock_not_abandoned admin-pool query pattern.
Degrades gracefully when the function is unavailable (PostgreSQL < 14 or missing EXECUTE privilege): falls back to the previous behavior and emits a one-time :replication_slot_unblock_unavailable stack event + warning with remediation.
Makes the connection-status-check interval configurable so the behavior can be tested deterministically and fast.
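The tick-driven flow described above can be modelled roughly as follows. This is a minimal Python sketch, not Electric's Elixir implementation: `ManagerState`, `on_status_check_tick`, and the `run_query` callback are hypothetical names, and the choice to keep retrying the query after a privilege failure is an assumption rather than something the PR states.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Version gate quoted in the review below (pg_version >= 140000).
MIN_PG_VERSION_NUM = 140000

# Reasons that mean the unblock query cannot work on this source.
UNAVAILABLE_REASONS = {"pg_version_too_old", "insufficient_privilege", "undefined_function"}


@dataclass
class ManagerState:
    """Hypothetical stand-in for the relevant parts of Connection.Manager state."""
    pg_version_num: int
    slot_creation_blocked: bool = True
    notice_sent: bool = False
    events: List[str] = field(default_factory=list)
    queries_run: int = 0


def on_status_check_tick(
    state: ManagerState,
    run_query: Callable[[str], Optional[str]],
) -> ManagerState:
    """Handle one status-check tick.

    run_query returns None on success or a reason string on failure.
    """
    if not state.slot_creation_blocked:
        return state  # nothing to unblock

    if state.pg_version_num < MIN_PG_VERSION_NUM:
        reason: Optional[str] = "pg_version_too_old"
    else:
        state.queries_run += 1
        reason = run_query("SELECT pg_log_standby_snapshot()")

    # Emit the stack event once; later ticks stay quiet and keep waiting.
    if reason in UNAVAILABLE_REASONS and not state.notice_sent:
        state.events.append("replication_slot_unblock_unavailable")
        state.notice_sent = True
    return state
```

In this model a source on an old Postgres version never runs the query and produces exactly one notice, however many ticks elapse while slot creation stays blocked.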

Background

This comes from an SRE investigation of a customer ("Ajax") with repeated source-inactivity incidents. A long-running transaction pins the slot's restart_lsn, retained WAL grows past max_slot_wal_keep_size (4 GB), and Postgres invalidates the slot. Recreating the slot then blocks on the same transaction. On an otherwise-idle database, Postgres does not emit a fresh XLOG_RUNNING_XACTS record for a long time after that transaction commits, so the source stays stuck until someone restarts it. pg_log_standby_snapshot() forces that record on demand, making recovery automatic.
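To make the WAL-retention arithmetic concrete, here is a small Python sketch that computes how much WAL a pinned restart_lsn holds back. The LSN values are made up for illustration, the helper names are hypothetical, and "4 GB" is assumed to mean 4 GiB. Postgres applies the limit at checkpoint time, so this only approximates when invalidation would happen.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '1/20000000' into a byte position.

    An LSN is printed as two hexadecimal numbers: the high and low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL that must be kept because the slot still needs restart_lsn."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


# Assumed interpretation of the 4 GB limit from the incident description.
MAX_SLOT_WAL_KEEP_SIZE = 4 * 1024**3


def exceeds_keep_size(current_lsn: str, restart_lsn: str) -> bool:
    """True once retained WAL has grown past the configured limit."""
    return retained_wal_bytes(current_lsn, restart_lsn) > MAX_SLOT_WAL_KEEP_SIZE


# Illustrative values only: a slot pinned at 0/10000000 while WAL has
# advanced to 1/20000000 retains 4.25 GiB, which is over the limit.
print(retained_wal_bytes("1/20000000", "0/10000000"))  # 4563402752
print(exceeds_keep_size("1/20000000", "0/10000000"))   # True
```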

Test plan

mix test test/electric/connection/manager_test.exs — includes a new degrade-path test (Repatch simulates the function being unavailable; asserts the one-time notice + that the source keeps waiting). 23 tests, 0 failures.
mix test test/electric/postgres/replication_client_test.exs — no regressions.
integration-tests/tests/self-heal-stuck-slot-creation.lux — end-to-end against real Postgres: a held transaction blocks slot creation; after it commits on an idle DB, Electric forces a standby snapshot and resumes replication on its own. Passes (lux SUCCESS).
Changeset added (@core/sync-service patch).

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…pg_log_standby_snapshot

When CREATE_REPLICATION_SLOT is blocked waiting on pending transactions, Connection.Manager now periodically runs pg_log_standby_snapshot() on the admin pool so Postgres can reach a consistent snapshot and the source recovers without a manual restart. Degrades gracefully (one-time notice) when the function is unavailable (PG < 14 or missing EXECUTE privilege).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


…reation

Holds an xid-bearing transaction open to block CREATE_REPLICATION_SLOT, then commits it and keeps the database idle, asserting that Electric forces a standby snapshot and resumes replication on its own (no manual restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-05T14:01:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.20%. Comparing base (b0030a1) to head (166fc11).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #4515       +/-   ##
===========================================
- Coverage   69.32%   42.20%   -27.12%     
===========================================
  Files          77      273      +196     
  Lines        9277    22113    +12836     
  Branches     2896     7289     +4393     
===========================================
+ Hits         6431     9333     +2902     
- Misses       2828    12722     +9894     
- Partials       18       58       +40

Flag	Coverage Δ
packages/agents	`71.78% <ø> (ø)`
packages/agents-mcp	`77.54% <ø> (?)`
packages/agents-mobile	`84.09% <ø> (?)`
packages/agents-server	`72.78% <ø> (-0.03%)`	⬇️
packages/agents-server-ui	`5.77% <ø> (?)`
packages/electric-ax	`46.42% <ø> (ø)`
packages/experimental	`87.73% <ø> (?)`
packages/react-hooks	`86.48% <ø> (?)`
packages/start	`82.83% <ø> (?)`
packages/typescript-client	`91.71% <ø> (?)`
packages/y-electric	`56.05% <ø> (?)`
typescript	`42.20% <ø> (-27.12%)`	⬇️
unit-tests	`42.20% <ø> (-27.12%)`	⬇️


claude · 2026-06-05T14:09:46Z

Claude Code Review

Summary

Solid, well-scoped fix that makes Electric self-heal from a stuck logical replication slot creation (slot blocked on a pending transaction on an otherwise-idle DB) by periodically calling pg_log_standby_snapshot(), instead of requiring a manual source restart. The implementation piggybacks cleanly on existing machinery and degrades gracefully. No blocking concerns — a few suggestions below.

What's Working Well

Reuses existing structure rather than adding new processes. Hooking into the existing :replication_configuration status-check tick (manager.ex:653) and mirroring the :check_lock_not_abandoned admin-pool query pattern keeps the blast radius small and avoids a new per-source GenServer/timer — good for the 10k–150k shape scale.
Graceful degradation is thorough. The pg_version >= 140000 gate, the explicit :insufficient_privilege/:undefined_function handling, the {:error, _}/catch fallback to :ok (retry next tick), and the one-time notice via slot_unblock_notice_sent together cover PG<14, missing EXECUTE, transient errors, and unexpected raises without ever crashing the manager.
The handle_continue(:unblock_slot_creation, ...) guard + catch-all is correct (manager.ex:578, :599). The guarded clause requires admin_pool_available/1 and the :configuring_connection step; the unguarded fallback prevents a FunctionClauseError — exactly the exhaustive-match discipline this codebase cares about.
pg_version is reliably set before this path runs. :pg_info_obtained is cast during the replication client :query_pg_info step, which precedes :acquire_lock (and thus the :replication_client_lock_acquired cast that arms the configuration timer). Since both casts originate from the same process, Erlang message ordering guarantees pg_version is populated before the first :unblock_slot_creation continue — so there is no spurious :pg_version_too_old. Nice that this holds without an explicit guard.
Relocating defguardp admin_pool_available to the top of the module (manager.ex:136) is necessary (the new usage precedes the old definition site) and harmless to existing callers.
Tests fit the change. The Repatch degrade-path unit test asserts the one-time notice + continued waiting, and the self-heal-stuck-slot-creation.lux test proves the real-Postgres recovery on an idle DB (matching on the forced-standby-snapshot debug line is a good, specific proof of the feature). Changeset is present (@core/sync-service patch).
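The degradation paths listed above (version gate, privilege and missing-function errors, transient errors, unexpected raises) can be sketched as a single classification step. This is a hedged Python model, not the PR's Elixir code: `QueryError`, `attempt_unblock`, and the three outcome labels are hypothetical. The SQLSTATE codes are the standard PostgreSQL ones for insufficient_privilege (42501) and undefined_function (42883).

```python
from typing import Callable, Optional, Tuple

# Standard PostgreSQL SQLSTATE codes for the two "cannot work here" errors.
SQLSTATE_UNAVAILABLE = {
    "42501": "insufficient_privilege",
    "42883": "undefined_function",
}


class QueryError(Exception):
    """Hypothetical database error carrying an optional SQLSTATE code."""

    def __init__(self, sqlstate: Optional[str], message: str = "") -> None:
        super().__init__(message)
        self.sqlstate = sqlstate


def attempt_unblock(
    pg_version_num: int,
    run_query: Callable[[str], object],
) -> Tuple[str, Optional[str]]:
    """Return ("ok", None), ("unavailable", reason) or ("retry", None).

    Never raises, so the caller cannot be crashed by this path.
    """
    if pg_version_num < 140000:  # gate quoted in the review
        return ("unavailable", "pg_version_too_old")
    try:
        run_query("SELECT pg_log_standby_snapshot()")
    except QueryError as err:
        reason = SQLSTATE_UNAVAILABLE.get(err.sqlstate or "")
        if reason is not None:
            return ("unavailable", reason)
        return ("retry", None)  # transient error: try again next tick
    except Exception:
        return ("retry", None)  # unexpected raise: swallow and retry
    return ("ok", None)
```

If the function is absent on a server that passes the version gate, the undefined_function branch still routes the source to the "unavailable" outcome, so the gate and the error handling back each other up.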

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None blocking.

Suggestions (Nice to Have)

Remediation message is privilege-centric even for the :pg_version_too_old case — manager.ex:1218. When the reason is :pg_version_too_old, the warning still says "Grant EXECUTE ON FUNCTION pg_log_standby_snapshot() … (PostgreSQL 14+)", which a user cannot act on by granting — they need to upgrade Postgres. The interpolated reason hints at it, but the actionable sentence is misleading for that branch. Consider tailoring the remediation per reason, e.g. for :pg_version_too_old: "This requires PostgreSQL 14+; upgrade so the source can recover automatically."
The unblock query blocks the manager process synchronously for up to 5s per tick. Postgrex.query(pool, "SELECT pg_log_standby_snapshot()", [], timeout: 5_000) runs inline in the handle_continue, so while it is in flight the Connection.Manager cannot service other messages (e.g. handle_call(:ping)). This mirrors the existing execute_lock_breaker_query precedent and only happens while slot creation is already blocked, so keeping it as-is is reasonable — just flagging it as a known property in case status/health reporting latency during the stuck window matters.
Consider a telemetry event for self-heal attempts. The feature currently surfaces via a stack event + Logger.debug. Since debug is typically filtered in production, a telemetry event/counter under [:electric, :replication, ...] would let SREs see how often this path fires across the fleet — directly relevant to the SRE incident this addresses. Optional.
Notice re-emits on every reconnect. slot_unblock_notice_sent resets to false whenever a new replication client starts (manager.ex:376), so a persistently-misconfigured source (PG<14 / no privilege) re-logs the warning each reconnection cycle. This is arguably desirable (re-surfaces the actionable problem), just noting it is per-connection-attempt rather than truly once-per-source.
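The first suggestion, tailoring the remediation per reason, could look something like the following Python sketch. The message wording for the privilege and missing-function cases is hypothetical (only the :pg_version_too_old sentence is taken from the suggestion above), and `remediation_for` is an illustrative name, not a function in the PR.

```python
# Hypothetical per-reason remediation text; only the pg_version_too_old
# sentence comes from the review suggestion, the rest is illustrative.
REMEDIATION_BY_REASON = {
    "pg_version_too_old": (
        "This requires PostgreSQL 14+; upgrade so the source can recover automatically."
    ),
    "insufficient_privilege": (
        "Grant EXECUTE on pg_log_standby_snapshot() to the role Electric connects as."
    ),
    "undefined_function": (
        "pg_log_standby_snapshot() is not available on this server; "
        "the source will keep waiting as before."
    ),
}

DEFAULT_REMEDIATION = "The source will keep waiting for the blocking transaction to end."


def remediation_for(reason: str) -> str:
    """Pick the actionable sentence that matches why self-heal is unavailable."""
    return REMEDIATION_BY_REASON.get(reason, DEFAULT_REMEDIATION)


def format_warning(reason: str) -> str:
    """Build the one-time warning with the interpolated reason and a matching remedy."""
    return f"Cannot unblock replication slot creation ({reason}). {remediation_for(reason)}"
```

Keeping the mapping in one table means a new reason only needs one new entry, and an unknown reason still yields a sensible default instead of a misleading instruction.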

Issue Conformance

No linked issue. Per project convention this is worth a note, since PRs should reference the issue or incident they address. That said, the SRE background in the PR description (the customer investigation, the max_slot_wal_keep_size slot invalidation, and the XLOG_RUNNING_XACTS-on-idle-DB root cause) is unusually thorough and stands in well for one. The implementation matches the described intent precisely, with no scope creep beyond the self-heal path and its supporting test/config plumbing.

Previous Review Status

First review — no prior iteration.

Review iteration: 1 | 2026-06-05

alco and others added 5 commits June 5, 2026 15:25

feat(sync-service): make connection status check interval configurable

c572651

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

style(sync-service): apply mix format to manager.ex

1952463

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: add changeset for slot-creation self-heal

fd5b163

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

erik-the-implementer added the claude label Jun 5, 2026

erik-the-implementer wants to merge 5 commits into main from feat/slot-creation-self-heal
