
feat(sync-service): self-heal stuck replication slot creation#4515

Draft
erik-the-implementer wants to merge 5 commits into main from feat/slot-creation-self-heal

@erik-the-implementer
Contributor

Summary

When a source's logical replication slot creation is blocked waiting on pending transactions, Electric now self-heals instead of requiring a manual source restart.

  • Connection.Manager periodically runs SELECT pg_log_standby_snapshot() on the admin pool while slot creation is blocked, so Postgres can emit an XLOG_RUNNING_XACTS record and the logical snapshot builder reaches CONSISTENT as soon as the blocking transaction ends.
  • Piggybacks on the existing :replication_configuration status-check timer (it already detects the blocked state and dispatches :replication_slot_creation_blocked_by_pending_transactions); mirrors the existing :check_lock_not_abandoned admin-pool query pattern.
  • Degrades gracefully when the function is unavailable (PostgreSQL < 14 or missing EXECUTE privilege): falls back to the previous behavior and emits a one-time :replication_slot_unblock_unavailable stack event + warning with remediation.
  • Makes the connection-status-check interval configurable so the behavior can be tested deterministically and fast.
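The per-tick decision described in the bullets above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the Elixir code in Connection.Manager: `ManagerState`, `on_status_check_tick`, the exception classes, and the returned strings are all hypothetical names. The version threshold mirrors the `pg_version >= 140000` gate quoted in the review; the PostgreSQL documentation lists `pg_log_standby_snapshot()` under version 16, so the constant is kept as a plain parameter here rather than a claim about availability.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical constant: mirrors the gate value quoted in the review of this PR.
MIN_PG_VERSION = 140000


class InsufficientPrivilege(Exception):
    """Stand-in for the database's insufficient_privilege error."""


class UndefinedFunction(Exception):
    """Stand-in for the database's undefined_function error."""


@dataclass
class ManagerState:
    pg_version: int
    slot_creation_blocked: bool
    notice_sent: bool = False          # guards the one-time notice
    events: List[str] = field(default_factory=list)


def on_status_check_tick(state: ManagerState,
                         run_query: Callable[[str], object]) -> str:
    """Run once per status-check tick; returns a label for what happened."""
    if not state.slot_creation_blocked:
        return "idle"

    if state.pg_version < MIN_PG_VERSION:
        reason = "pg_version_too_old"
    else:
        try:
            run_query("SELECT pg_log_standby_snapshot()")
            return "snapshot_forced"
        except InsufficientPrivilege:
            reason = "insufficient_privilege"
        except UndefinedFunction:
            reason = "undefined_function"
        except Exception:
            # Transient failure: keep waiting and try again on the next tick.
            return "retry_next_tick"

    # Function unavailable: fall back to waiting, emit the notice only once.
    if not state.notice_sent:
        state.notice_sent = True
        state.events.append(f"replication_slot_unblock_unavailable:{reason}")
    return "unavailable"
```

Calling `on_status_check_tick` repeatedly with a query function that raises `InsufficientPrivilege` keeps returning `"unavailable"` while appending only a single event, which is the "one-time notice, keep waiting" behavior the degrade-path test is described as asserting.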

Background

From an SRE investigation of a customer ("Ajax") with repeated source-inactivity incidents: a long-running transaction pins the slot's restart_lsn, retained WAL grows past max_slot_wal_keep_size (4 GB) and Postgres invalidates the slot. Recreating the slot then blocks on the same transaction — and on an otherwise-idle database, Postgres does not emit a fresh XLOG_RUNNING_XACTS record for a long time after the transaction commits, so the source stays stuck until someone restarts it. pg_log_standby_snapshot() forces that record on demand, making recovery automatic.
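To make the WAL-retention part of that failure mode concrete, here is a small arithmetic sketch. It assumes the standard textual LSN format (two hexadecimal numbers separated by a slash, high and low 32 bits) and treats the 4 GB limit as 4 × 1024³ bytes. The function names are hypothetical, and Postgres applies its own rules for when a slot is actually invalidated; this only shows how far a pinned `restart_lsn` lags behind the current write position.

```python
# Assumption: the "4 GB" limit from the incident, expressed in bytes (1024-based).
MAX_SLOT_WAL_KEEP_SIZE = 4 * 1024**3


def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current position."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def slot_at_risk(current_lsn: str, restart_lsn: str,
                 limit: int = MAX_SLOT_WAL_KEEP_SIZE) -> bool:
    """True once retained WAL exceeds the configured limit."""
    return retained_wal_bytes(current_lsn, restart_lsn) > limit
```

For example, a slot pinned at `0/0` while the server has written up to `1/1` retains one byte more than 4 × 1024³ bytes, so `slot_at_risk("1/1", "0/0")` is true under these assumptions.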

Test plan

  • mix test test/electric/connection/manager_test.exs — includes a new degrade-path test (Repatch simulates the function being unavailable; asserts the one-time notice + that the source keeps waiting). 23 tests, 0 failures.
  • mix test test/electric/postgres/replication_client_test.exs — no regressions.
  • integration-tests/tests/self-heal-stuck-slot-creation.lux — end-to-end against real Postgres: a held transaction blocks slot creation; after it commits on an idle DB, Electric forces a standby snapshot and resumes replication on its own. Passes (lux SUCCESS).
  • Changeset added (@core/sync-service patch).

🤖 Generated with Claude Code

alco and others added 5 commits June 5, 2026 15:25
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pg_log_standby_snapshot

When CREATE_REPLICATION_SLOT is blocked waiting on pending transactions,
Connection.Manager now periodically runs pg_log_standby_snapshot() on the
admin pool so Postgres can reach a consistent snapshot and the source
recovers without a manual restart. Degrades gracefully (one-time notice)
when the function is unavailable (PG < 14 or missing EXECUTE privilege).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reation

Holds an xid-bearing transaction open to block CREATE_REPLICATION_SLOT, then
commits it and keeps the database idle, asserting that Electric forces a
standby snapshot and resumes replication on its own (no manual restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.20%. Comparing base (b0030a1) to head (166fc11).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4515       +/-   ##
===========================================
- Coverage   69.32%   42.20%   -27.12%     
===========================================
  Files          77      273      +196     
  Lines        9277    22113    +12836     
  Branches     2896     7289     +4393     
===========================================
+ Hits         6431     9333     +2902     
- Misses       2828    12722     +9894     
- Partials       18       58       +40     
Flag                         Coverage Δ
packages/agents              71.78% <ø> (ø)
packages/agents-mcp          77.54% <ø> (?)
packages/agents-mobile       84.09% <ø> (?)
packages/agents-server       72.78% <ø> (-0.03%) ⬇️
packages/agents-server-ui     5.77% <ø> (?)
packages/electric-ax         46.42% <ø> (ø)
packages/experimental        87.73% <ø> (?)
packages/react-hooks         86.48% <ø> (?)
packages/start               82.83% <ø> (?)
packages/typescript-client   91.71% <ø> (?)
packages/y-electric          56.05% <ø> (?)
typescript                   42.20% <ø> (-27.12%) ⬇️
unit-tests                   42.20% <ø> (-27.12%) ⬇️


@claude

claude Bot commented Jun 5, 2026

Claude Code Review

Summary

Solid, well-scoped fix that makes Electric self-heal from a stuck logical replication slot creation (slot blocked on a pending transaction on an otherwise-idle DB) by periodically calling pg_log_standby_snapshot(), instead of requiring a manual source restart. The implementation piggybacks cleanly on existing machinery and degrades gracefully. No blocking concerns — a few suggestions below.

What's Working Well

  • Reuses existing structure rather than adding new processes. Hooking into the existing :replication_configuration status-check tick (manager.ex:653) and mirroring the :check_lock_not_abandoned admin-pool query pattern keeps the blast radius small and avoids a new per-source GenServer/timer — good for the 10k–150k shape scale.
  • Graceful degradation is thorough. The pg_version >= 140000 gate, the explicit :insufficient_privilege/:undefined_function handling, the {:error, _}/catch fallback to :ok (retry next tick), and the one-time notice via slot_unblock_notice_sent together cover PG<14, missing EXECUTE, transient errors, and unexpected raises without ever crashing the manager.
  • The handle_continue(:unblock_slot_creation, ...) guard + catch-all is correct (manager.ex:578, :599). The guarded clause requires admin_pool_available/1 and the :configuring_connection step; the unguarded fallback prevents a FunctionClauseError — exactly the exhaustive-match discipline this codebase cares about.
  • pg_version is reliably set before this path runs. :pg_info_obtained is cast during the replication client :query_pg_info step, which precedes :acquire_lock (and thus the :replication_client_lock_acquired cast that arms the configuration timer). Since both casts originate from the same process, Erlang message ordering guarantees pg_version is populated before the first :unblock_slot_creation continue — so there is no spurious :pg_version_too_old. Nice that this holds without an explicit guard.
  • Relocating defguardp admin_pool_available to the top of the module (manager.ex:136) is necessary (the new usage precedes the old definition site) and harmless to existing callers.
  • Tests fit the change. The Repatch degrade-path unit test asserts the one-time notice + continued waiting, and the self-heal-stuck-slot-creation.lux test proves the real-Postgres recovery on an idle DB (matching on the forced-standby-snapshot debug line is a good, specific proof of the feature). Changeset is present (@core/sync-service patch).

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None blocking.

Suggestions (Nice to Have)

  1. Remediation message is privilege-centric even for the :pg_version_too_old case (manager.ex:1218). When the reason is :pg_version_too_old, the warning still says "Grant EXECUTE ON FUNCTION pg_log_standby_snapshot() … (PostgreSQL 14+)". A user on an older server cannot act on that by granting anything; they need to upgrade Postgres. The interpolated reason hints at it, but the actionable sentence is misleading for that branch. Consider tailoring the remediation per reason, e.g. for :pg_version_too_old: "This requires PostgreSQL 14+; upgrade so the source can recover automatically."

  2. The unblock query blocks the manager process synchronously for up to 5s per tick. Postgrex.query(pool, "SELECT pg_log_standby_snapshot()", [], timeout: 5_000) runs inline in the handle_continue, so while it is in flight the Connection.Manager cannot service other messages (e.g. handle_call(:ping)). This mirrors the existing execute_lock_breaker_query precedent and only happens while slot creation is already blocked, so keeping it as-is is reasonable — just flagging it as a known property in case status/health reporting latency during the stuck window matters.

  3. Consider a telemetry event for self-heal attempts. The feature currently surfaces via a stack event + Logger.debug. Since debug is typically filtered in production, a telemetry event/counter under [:electric, :replication, ...] would let SREs see how often this path fires across the fleet — directly relevant to the SRE incident this addresses. Optional.

  4. Notice re-emits on every reconnect. slot_unblock_notice_sent resets to false whenever a new replication client starts (manager.ex:376), so a persistently-misconfigured source (PG<14 / no privilege) re-logs the warning each reconnection cycle. This is arguably desirable (re-surfaces the actionable problem), just noting it is per-connection-attempt rather than truly once-per-source.
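Suggestion 1 could take a shape like the sketch below: a lookup from the failure reason to a remediation sentence, with a generic fallback. It is written in Python for illustration only; `REMEDIATION`, `DEFAULT_REMEDIATION`, and `remediation_for` are hypothetical names. Apart from the :pg_version_too_old sentence quoted in suggestion 1, the message wording is invented for this example and is not the text in manager.ex.

```python
# Hypothetical mapping from failure reason to an actionable remediation sentence.
REMEDIATION = {
    # Wording taken from suggestion 1 above.
    "pg_version_too_old": (
        "This requires PostgreSQL 14+; upgrade so the source can recover "
        "automatically."
    ),
    # Illustrative wording only (assumption, not the message in manager.ex).
    "insufficient_privilege": (
        "Grant EXECUTE on pg_log_standby_snapshot() to the role Electric "
        "connects as."
    ),
    # Illustrative wording only (assumption, not the message in manager.ex).
    "undefined_function": (
        "pg_log_standby_snapshot() was not found on this server; check the "
        "PostgreSQL version."
    ),
}

# Fallback for reasons the table does not know about.
DEFAULT_REMEDIATION = "Restart the source manually once the blocking transaction ends."


def remediation_for(reason: str) -> str:
    """Pick the remediation sentence that matches the failure reason."""
    return REMEDIATION.get(reason, DEFAULT_REMEDIATION)
```

Keeping the text in a table keyed by reason means the version branch never inherits the privilege-oriented advice, and an unexpected reason still yields a usable sentence instead of a crash.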

Issue Conformance

No linked issue. Per project convention this is worth a note, since PRs should reference the issue or incident they address. That said, the SRE background in the PR description (the customer investigation, the max_slot_wal_keep_size slot invalidation, and the XLOG_RUNNING_XACTS-on-idle-DB root cause) is unusually thorough and stands in well for one. The implementation matches the described intent precisely, with no scope creep beyond the self-heal path and its supporting test/config plumbing.

Previous Review Status

First review — no prior iteration.


Review iteration: 1 | 2026-06-05
