blueprint: logical decoding of archived WAL (v0.1 draft) #24

Open
NikolayS wants to merge 2753 commits into master from blueprint/logical-decoding-archived-wals

@NikolayS NikolayS commented Apr 16, 2026

Summary

Adds blueprints/LOGICAL_DECODING_ARCHIVED_WALS.md — a draft spec for producing a logical change stream from archived WAL files via a physical standby fed by restore_command, decoupling logical consumers from the production primary.

Status: Draft, ready to start Sprint 0 (post-Sprint-0 cleanup expected as v0.6).

Core thesis: Don't build an offline decoder (multi-year effort). Instead, exploit the fact that a PG16+ physical standby already supports logical slots — and that standby can replay from archive via restore_command, needing no connection to the primary.

What's in the spec

  • Goal & motivation — three pain points: WAL-retention risk, no logical PITR, managed-service lock-in (aspirational).
  • 4 user stories — decoupling, windowed logical extraction (PITR-adjacent), PII-free staging, WAL correctness verification via paused-state inspection.
  • Architecture — standby-as-decoder; consumer pipeline (JSON / SQL / replay / Kafka).
  • Phased PoC plan (~9 weeks):
    • Sprint 0 (3 weeks): four gates G1–G4, Outcome A/B/C scope split, observability.
    • Sprint 1: Dockerized harness + consumer + PII filter TDD.
    • Sprint 2: controlled pause/consume/resume replay; DDL handling; slot-invalidation cookbook; US-4 paused-state inspection controller.
    • Sprint 3: CLI polish, WAL-G / pgBackRest integration, demo.
  • Testing strategy — TDD scope (filters/formatters/CLI) vs. integration (orchestrator, slot lifecycle).
  • Team composition — ~3.5 FTE.
  • Future work — core patches (slot-aware replay throttling, walsender restore_command), catalog snapshot export, DBLab integration.
  • References — Ringer 2020 thread, Kukushkin 2025 revival, PG16 standby logical decoding (Drouvot), PG18 pg_logicalinspect, PG19 proposals (dynamic wal_level, pg_waldump tar support) — all clearly marked as under review.

Attribution

  • Author: Claude Opus 4.6 (1M context)
  • Key researcher: Claude Opus 4.6 (deep research mode)
  • Architecture lead: @x4m (Andrey Borodin) — Postgres.tv hacking session
  • Contributors: @NikolayS (idea generator, AI coordinator), @kirkw
  • Meeting secretary / transcription: Circleback.ai — transcribed the Postgres.tv hacking session
  • Reviewers (four rounds): another Claude Opus 4.6 instance, Gemini 3.1 Pro, GPT 5.4

Test plan

danielgustafsson and others added 30 commits April 6, 2026 01:55
If the background worker for processing databases manages to finish
before the launcher starts waiting for it, the launcher would treat
it erroneously as an error.  Fix by making sure to check the result
state in this case.  Identified on CI and synthetically reproduced
during local testing.

Also, while at it, make sure to properly lock the shared memory
structure before updating the result state.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4
Postcommit review and buildfarm/CI failures revealed a few issues in
the test code which this commit attempts to resolve.  These failures
are verified using synthetic means.

  * Wait for launcher exit in enable/disable checksum tests

    When enabling or disabling data checksums in a test that waits
    for an end state (on or off), the test typically wants to perform
    more tests against the cluster immediately. Make sure to wait for
    the launcher to exit in these cases before returning, so that the
    cluster can be acted on immediately.  This is a more generic way
    of implementing 0036232.

  * Refactor injection point tests to use the injection_points test
    extension. Two injection points added for online checksums were
    better expressed using the injection_points extension with the
    test code embedded in datachecksum_state.c.

  * Make tests less timing dependent and allow transitions to "on"
    and not just "inprogress-on" in case a test manages to finish
    before it's checked for state.

  * When waiting on a blocking background psql keeping a temporary
    table open, the test first closed the background session and
    then the server.  This could cause data checksums to manage to
    get enabled in the brief window between dropping the temporary
    table and closing the server.  Fix by closing the server first
    before the background session.

  * Remove a few superfluous duplicate checks and general cleanup
    of comments as well as making LSN logging consistent.

These issues were reported by Andres as well as spotted in the
buildfarm and on CI.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se
On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p
is not available. Use pg_integer_constant_p and add appropriate
guards. There is a similar potential hazard for the x86 path, but
for now let's get the buildfarm green.

Oversight in commit fbc57f2, per buildfarm member hoatzin.
…ation

Previously, during shutdown, walsenders always waited until all pending data
was replicated to receivers. This ensures sender and receiver stay in sync
after shutdown, which is important for physical replication switchovers,
but it can significantly delay shutdown. For example, in logical replication,
if apply workers are blocked on locks, walsenders may wait until those locks
are released, preventing shutdown from completing for a long time.

This commit introduces a new GUC, wal_sender_shutdown_timeout,
which specifies the maximum time a walsender waits during shutdown for all
pending data to be replicated. When set, shutdown completes once all data is
replicated or the timeout expires. A value of -1 (the default) disables
the timeout.

This can reduce shutdown time when replication is slow or stalled. However,
if the timeout is reached, the sender and receiver may be left out of sync,
which can be problematic for physical replication switchovers.
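
The timeout semantics described above can be sketched as a small wait loop. This is a toy model, not the walsender code: the function name, the `all_data_replicated` callback, and the polling interval are hypothetical, and treating every negative value as "no timeout" is an assumption (the commit message only documents -1).

```python
import time
from typing import Callable


def wait_for_replication_on_shutdown(
    all_data_replicated: Callable[[], bool],
    timeout_ms: int = -1,
    poll_interval_s: float = 0.01,
) -> bool:
    """Return True if receivers caught up, False if the timeout expired first.

    timeout_ms < 0 models the default (-1): wait without a deadline.
    """
    deadline = None if timeout_ms < 0 else time.monotonic() + timeout_ms / 1000.0
    while not all_data_replicated():
        if deadline is not None and time.monotonic() >= deadline:
            # Shutdown proceeds; sender and receiver may now be out of sync.
            return False
        time.sleep(poll_interval_s)
    return True
```

A False result corresponds to the case the message warns about: shutdown completes, but a physical switchover may find the receiver behind.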

Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Ronan Dunklau <ronan@dunklau.fr>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
When determining if it is safe to use an expression as a grouping key
for partial aggregation, eager aggregation relies on the B-tree
equalimage support function to ensure that equality implies image
equality.

Previously, the code incorrectly passed the default collation of the
expression's data type to the equalimage procedure, rather than the
expression's actual collation.  As a result, if a column used a
non-deterministic collation but the base type's default collation was
deterministic, eager aggregation would incorrectly assume that the
column was safe for byte-level grouping.  This could cause rows to be
prematurely grouped and subsequently discarded by strict join
conditions, resulting in incorrect query results.

This patch fixes the issue by passing the expression's actual
collation to the equalimage procedure.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
Pushing aggregates containing volatile functions below a join can
violate volatility semantics by changing the number of times the
function is executed.

Here we check the Aggref nodes in the targetlist and havingQual for
volatile functions and disable eager aggregation when such functions
are present.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
…ion test

Previously the stats.sql regression test used conditions like
"datname = (SELECT current_database())" to check the current database name.

The subquery is unnecessary, so this commit simplifies these expressions to
"datname = current_database()".

Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/A1535A8F-65AF-4C3D-ACBE-25891CB5D38B@gmail.com
Alexander Lakhin has noticed that it can be possible on machines with
slow storage to have the spawned workers be stuck in
initialize_worker_spi(), before they reach their main loop.  Waiting for
a flush to happen would block the interrupt attempts done by the
database commands, causing the test to fail on timeout once the number
of interrupt attempts is reached in CountOtherDBBackends().

This commit switches the test to wait for the spawned bgworkers to reach
their main loops before attempting the database commands that would
trigger the interrupts, napping for a time larger than the default, with
worker_spi.naptime set at 10 minutes.  Another thing that could be
attempted is to enforce a larger number of tries in
CountOtherDBBackends(), if what is done here is not enough.  Let's see
first if what this commit does is enough for the buildfarm members
widowbird and jay.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com
Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently.  This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL.  A
good chunk of the callbacks get simpler thanks to this change.

The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data.  This would cause some
stats data loss.
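
As a rough illustration of why a single lock removes the partial-flush hazard, the sketch below models a nowait flush guarded by one lock: either every pending entry is merged and the pending data is reset, or nothing is touched. The names (`flush_lock_stats`, `try_acquire`) and the dictionary representation are hypothetical; the real callbacks are C code operating on shared memory.

```python
from typing import Callable, Dict


def flush_lock_stats(
    pending: Dict[str, int],
    shared: Dict[str, int],
    try_acquire: Callable[[], bool],
) -> bool:
    """Merge all pending counters into shared stats under one lock.

    Returns True if the flush happened. On contention (try_acquire()
    returns False) the pending data is left intact so a later flush
    can retry, which is what avoids losing counts.
    """
    if not try_acquire():
        return False
    for lock_type, count in pending.items():
        shared[lock_type] = shared.get(lock_type, 0) + count
    pending.clear()  # reset only after every entry was merged
    return True
```

With per-type locks, the same reset after a partially successful loop would discard the entries that were skipped.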

Oversight in 4019f72.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me
This module allows plan advice strings to be provided automatically
from an in-memory advice stash. Advice stashes are stored in dynamic
shared memory and must be recreated and repopulated after a server
restart. If pg_stash_advice.stash_name is set to the name of an advice
stash, and if query identifiers are enabled, the query identifier
for each query will be looked up in the advice stash and the
associated advice string, if any, will be used each time that query
is planned.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com
Some compilers didn't like the empty initializer when compiled without
USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual
Studio 2019.

Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz
When freeing pending_shmem_requests we should also free the ->options.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com
Child processes do not need the postmaster's working memory context and
normally release it at the start of their main entry point. However,
the slotsync worker forgot to do so.

This commit makes the slotsync worker release the postmaster's working
memory context at startup, preventing unintended use.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com
This commit updates 011_lock_stats.pl to verify log_lock_waits behavior.

The tests check that messages are emitted both when a wait occurs and
when the lock is acquired, and that the "still waiting for" message is logged
exactly once per wait, even if the backend wakes up during the wait.

The latter covers the behavior introduced by commit fd6ecbf.

Author: Hüseyin Demir <huseyin.d3r@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com
Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
The restructuring in commit 53b8ca6 revealed an interesting
corner case: if a table needs vacuuming for wraparound prevention
and autovacuum is disabled for it, we might still choose to analyze
it.  Research seems to indicate this was an accidental addition by
commit 48188e1, and further discussion indicates there is
consensus that it is unnecessary and can be removed.

Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan
It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
This code missed the need to update the combined state's
nullbitmap if state1 already had a bitmap but state2 didn't.
We need to extend the existing bitmap with 1's but didn't.
This could result in wrong output from a parallelized
array_agg(anyarray) calculation, if the input has a mix of
null and non-null elements.  The errors depended on timing
of the parallel workers, and therefore would vary from one
run to another.
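
The missing case can be pictured with a toy model of the combine step. Here a bitmap is a Python list of booleans (True meaning "not null"), and `None` stands for "no bitmap, all elements non-null"; this representation and the function name are assumptions for illustration, not the C data structures.

```python
from typing import List, Optional


def combine_null_bitmaps(
    n1: int,
    bitmap1: Optional[List[bool]],
    n2: int,
    bitmap2: Optional[List[bool]],
) -> Optional[List[bool]]:
    """Combine the null bitmaps of two partial aggregate states.

    A missing bitmap means every element on that side is non-null, so
    it is materialized as all-True before concatenation.
    """
    if bitmap1 is None and bitmap2 is None:
        return None  # still no nulls anywhere
    left = bitmap1 if bitmap1 is not None else [True] * n1
    right = bitmap2 if bitmap2 is not None else [True] * n2
    return left + right
```

The case fixed by the commit is `bitmap1` present and `bitmap2` absent: the combined bitmap must grow by `n2` set bits, otherwise elements contributed by the second state are misreported.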

Also install guards against integer overflow when calculating
the combined object's sizes, and make some trivial cosmetic
improvements.

Author: Dmytro Astapov <dastapov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
Backpatch-through: 16
contrib/pg_stash_advice and src/test/modules/test_shmem
missed these, leading to complaints from git after an
in-tree check-world run.

Use our standard boilerplate list of ignorable subdirectories,
although the two modules presently create different subsets
of that.
These columns haven't been computed yet when the filtering happens
(since we've not written the candidate tuple into the table); so
any check on them is wrong or useless.  Worse, since aa606b9 such a
reference results in an access off the end of a TupleDesc, potentially
causing a phony "generated columns are not supported in COPY FROM
WHERE conditions" error; and since c98ad08 it throws an Assert
instead.

Actually we could allow tableoid, which has been set to the OID of the
table named as the COPY target.  However, plausible uses for tests of
tableoid would involve a partitioned target table, and the user would
wish it to read as the OID of the destination partition.  There has
been some discussion of changing things to make it work like that,
but pending that happening we should just disallow tableoid along
with other system columns.

It seems best though to install this prohibition only in HEAD.
In the back branches we'll just guard the unsafe TupleDesc access,
and people will keep getting whatever semantics they got before.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com
CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql
Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.

Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.
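
The interaction between the two settings can be summarized in a few lines. This is a minimal sketch of the rules as stated in the message; the function name is hypothetical, and other limits that the server may apply (for example, the number of indexes or available workers) are deliberately ignored.

```python
def effective_parallel_workers(reloption: int, guc_max: int) -> int:
    """Resolve the per-table worker cap against the GUC.

    reloption models autovacuum_parallel_workers (storage parameter),
    guc_max models autovacuum_max_parallel_workers.
    """
    if reloption < -1 or guc_max < 0:
        raise ValueError("invalid setting")
    if reloption == 0:
        return 0            # parallel vacuum disabled for this table
    if reloption == -1:
        return guc_max      # defer to the GUC
    return min(reloption, guc_max)  # positive value, still bounded by the GUC
```

Because the GUC defaults to 0, every combination resolves to 0 workers until it is raised explicitly, which matches the "preserving existing behavior" statement.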

To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().

A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.

The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order.  However, it is nowhere near being capable of doing
that correctly.  Nor is there reason to think that it ever will be,
or that that is a well-defined requirement.  (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.)  Moreover, the problem will get worse as
we add more subcommand types.  Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway.  The foreign-key issue
will be handled in a follow-up patch.

This will result in a release-note-worthy incompatibility,
which is that forward references like
	CREATE SCHEMA myschema
	    CREATE VIEW myview AS SELECT * FROM mytable
	    CREATE TABLE mytable (...);
used to work and no longer will.  Considering how many closely
related variants never worked, this isn't much of a loss.

Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
The previous patch simplified CREATE SCHEMA's behavior to "execute all
subcommands in the order they are written".  However, that's a bit too
simple, as the spec clearly requires forward references in foreign key
constraint clauses to work, see feature F311-01.  (Most other SQL
implementations seem to read more into the spec than that, but it's
not clear that there's justification for more in the text, and this is
the only case that doesn't introduce unresolvable issues.)  We never
implemented that before, but let's do so now.

To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD
FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's
subcommand list.  This works because the foreign key constraints are
independent and don't affect any other DDL that might be in CREATE
SCHEMA.  For simplicity, we do this for all FOREIGN KEY clauses even
if they would have worked where they were.
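
The reordering rule can be sketched on a simplified model of the subcommand list. The `CreateTable` class, the string-based SQL rendering, and `plan_create_schema` are hypothetical stand-ins for the parse-tree transformation; only the rule itself (every FOREIGN KEY clause becomes an ALTER TABLE ... ADD FOREIGN KEY appended at the end) comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CreateTable:
    name: str
    columns: List[str]
    # Each entry is the text after FOREIGN KEY, e.g. "(b_id) REFERENCES b (id)".
    foreign_keys: List[str] = field(default_factory=list)


def plan_create_schema(subcommands: List[Union[CreateTable, str]]) -> List[str]:
    """Keep subcommands in written order, deferring all foreign keys."""
    ordered: List[str] = []
    deferred: List[str] = []
    for cmd in subcommands:
        if isinstance(cmd, CreateTable):
            ordered.append(f"CREATE TABLE {cmd.name} ({', '.join(cmd.columns)})")
            for fk in cmd.foreign_keys:
                deferred.append(f"ALTER TABLE {cmd.name} ADD FOREIGN KEY {fk}")
        else:
            ordered.append(cmd)  # any other subcommand passes through unchanged
    return ordered + deferred
```

In this model a table `a` that references a later table `b` works, because the constraint is only added after both tables exist.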

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
Having rejected the principle that we should know how to re-order
the sub-commands of CREATE SCHEMA, there is not really anything
except a little coding to stop us from supporting more object types.
This patch adds support for creating functions (including procedures
and aggregates), operators, types (including domains), collations,
and text search objects.

SQL:2021 specifies that we should allow functions, procedures,
types, domains, and collations, so this moves us a great deal
closer to full SQL compatibility of CREATE SCHEMA.  What remains
missing from their list are casts, transforms, roles, and some
object types we don't support yet (e.g. CREATE CHARACTER SET).
Supporting casts or transforms would be problematic because
they don't have names at all, let alone schema-qualified names,
so it'd be quite a stretch to say that they belong to a schema.
Roles likewise are not schema-qualified, plus they are global
to a cluster, making it even less reasonable to consider them
as belonging to a schema.  So I don't see us trying to complete
the list.

User-defined aggregates and operators are outside the spec's ken,
as are text search objects, so adding them does not do anything for
spec compatibility.  But they go along with these other object types,
plus it takes no additional code to support them since they are
represented as DefineStmts like some variants of CREATE TYPE.
It would indeed take some effort to reject them.

Author: Kirill Reshke <reshkekirill@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com
The associated value should look like something that could be
part of an EXPLAIN options list, but restricted to EXPLAIN options
added by extensions.

For example, if pg_overexplain is loaded, you could set
auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'.
You can also specify arguments to these options in the same manner
as normal e.g. 'DEBUG 1, RANGE_TABLE false'.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
This function is a thin wrapper around relation_needs_vacanalyze()
that handles fetching and freeing the pgstat entry for the table.
Since all callers of relation_needs_vacanalyze() do that anyway, we
can teach that function to fetch/free the pgstat entry and use it
instead.

Suggested-by: Álvaro Herrera <alvherre@kurilemu.de>
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.

Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

Rather than pre-checking pg_is_in_recovery() on the standby (which
would add an extra round-trip on every call), we issue WAIT FOR LSN
directly and handle the 'not in recovery' result as a signal to fall
back to polling.

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), when the standby has been
promoted, or when WAIT FOR LSN is interrupted by a recovery conflict,
the function falls back to the original polling-based approach using
pg_stat_replication on the upstream.  The recovery conflict fallback
is necessary because some conflicts are unavoidable - for example,
ResolveRecoveryConflictWithTablespace() kills all backends
unconditionally, regardless of what they are doing.

The recovery conflict detection matches the English error message
"conflict with recovery", which is reliable because the test suite
runs with LC_MESSAGES=C.
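
The decision logic above can be sketched as follows. The real implementation is Perl in the test framework; this Python model uses hypothetical names, and the outcome labels ("success", "not in recovery", "recovery conflict") are illustrative strings rather than the actual return values.

```python
from typing import Callable

VALID_MODES = ("replay", "write", "flush", "sent")
FALLBACK_OUTCOMES = ("not in recovery", "recovery conflict")


def can_use_wait_for_lsn(standby_is_cluster: bool, mode: str) -> bool:
    """WAIT FOR LSN applies only to Cluster objects and non-'sent' modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return standby_is_cluster and mode != "sent"


def wait_for_catchup(
    standby_is_cluster: bool,
    mode: str,
    wait_for_lsn: Callable[[], str],
    poll_upstream: Callable[[], None],
) -> str:
    """Return which strategy completed the wait."""
    if can_use_wait_for_lsn(standby_is_cluster, mode):
        outcome = wait_for_lsn()  # issued directly, no pg_is_in_recovery() pre-check
        if outcome == "success":
            return "wait_for_lsn"
        if outcome not in FALLBACK_OUTCOMES:
            raise RuntimeError(f"WAIT FOR LSN failed: {outcome}")
    poll_upstream()  # original pg_stat_replication polling on the upstream
    return "polling"
```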

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
Add a note to the WAIT FOR documentation explaining that sessions
using this command on a standby server may be interrupted by recovery
conflicts.  Some conflicts are unavoidable - for example, replaying
a tablespace drop terminates all backends unconditionally.

Discussion: https://postgr.es/m/CAPpHfds7oSCbZqob7ytT_Lso8fv-NW8LnedUTE4Krde%2B3rkJeA%40mail.gmail.com
Author: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
NikolayS and others added 28 commits April 22, 2026 11:19
… disclosed

Four adjustments in this commit, reflecting the actual production model:

1. Author is Claude Opus 4.6 (me) — not @NikolayS. The blueprint is
   AI-authored; @NikolayS frames problems, provides judgement,
   coordinates agents, and makes final calls.

2. Attribution block expanded:
   - Author: Claude Opus 4.6 (1M context)
   - Key researcher: Claude Opus 4.6 (deep research mode)
   - Architecture lead: @x4m (Andrey Borodin)
   - Contributors: @NikolayS (idea generator, AI coordinator),
     @kirkw
   - Meeting secretary / transcription: Circleback.ai (transcribed
     the Postgres.tv hacking session; transcript fed into research
     and drafting)
   - Reviewers (four rounds): another Claude Opus 4.6 instance,
     Gemini 3.1 Pro, GPT 5.4

3. GitHub handles (@NikolayS, @x4m, @kirkw) are now clickable
   markdown links throughout the document.

4. Author column removed from the Changelog table — with a single
   author it was noise; version/date/changes remain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added https://www.youtube.com/watch?v=LjiU6kB6izw as a clickable link
wherever the hacking session is mentioned — attribution block
(architecture lead + meeting secretary lines) and reference [13].

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three parallel review agents converged on the same smell: across v0.1
to v0.5, "(added in v0.X)" and "(corrected in v0.Y)" parentheticals
accumulated throughout the prose, plus defensive meta-commentary about
prior versions. The changelog already captures the revision history;
keeping it in the current-state sections turns the doc into archaeology
for fresh readers.

Stripped:

- ~10 "(added in v0.3)" / "(added in v0.5)" / "(corrected in v0.2)"
  parentheticals on section headers and inline prose.
- "Previous version of this doc was wrong" lead-in in §4.2.2 — current
  prose stands on its own.
- "Revised in v0.2 after feasibility pushback" header in §4.2.3.
- Sprint 0 "Revised in v0.2" introduction.
- "was 2" historical annotations on S0-6 and S2-2 day budgets.
- "v0.3 said ~8, v0.5 reconciles..." version-history parenthetical
  after the Gantt total.
- "The decode historic WAL reading of v0.1 was wrong" in G2 gate note.
- "v0.2 phrasing conflated..." archaeology in §4.1.4 risk 1.
- Gantt heading "(v0.5 - Sprint 0 extended to 3 weeks per budget
  reconciliation)" simplified to just "Gantt overview."

Compressed:

- "Note on authorship" preamble: 90 words -> 15 words. The US-4
  retention detail belongs in §2 US-4, not the header.
- US-4 §2 justification: three paragraphs of attribution + value
  defense + granularity essay collapsed to one sentence of
  attribution + one short comparison + granularity caveat.
- Outcome A US-4 bullet: dropped "per v0.4 retention decision" and
  "core value-add from @x4m"; now just "in scope; see §2 US-4."
- Sprint 0 budget note: dropped the "v0.3/v0.4 claimed 2 weeks while
  tasks totalled 11.5-14.5d" relitigation; states the current budget.

Fixed:

- Broken §7.0.1 / §7.0.2 internal refs (section numbers that don't
  exist) replaced with heading names.
- `postgresql.conf` comment pointing to "§4.2 for the correct recipe"
  redirected to §2 US-2 (§4.2 is about controlled replay, not the
  US-2 recipe).
- Components table `Python (psycopg2/3)` reconciled to `Python
  (psycopg)` — the Python example currently imports psycopg2 only;
  the "/3" was aspirational.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All three round-5 reviewers endorsed proceeding to Sprint 0. Two
items were small enough to fix before closing out:

- Outcome B's US-4 bullet softened. v0.5 reframed US-4 around
  paused-state inspection as independently valuable; Outcome B still
  said "likely cut," which partially contradicted that framing.
  Now: "narrows substantially, keep as internal debug tool, don't
  feature in shipped product positioning."

- Gantt post-early-exit timeline tightened from "~7-8 week" to "~8
  week" — the math (15d sprint 0, exit at day 8, save 7d = 1.4
  weeks, 9 - 1.4 = 7.6 ≈ 8) made 7 aspirational.
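
The arithmetic behind the tightened estimate can be checked in a few lines. This is only a sanity check; the 5-day work week is an assumption, the other figures come from the bullet above.

```python
# Figures taken from the Gantt note; the 5-day work week is an assumption.
SPRINT0_DAYS = 15
EARLY_EXIT_DAY = 8
PLAN_WEEKS = 9
WORKDAYS_PER_WEEK = 5

saved_days = SPRINT0_DAYS - EARLY_EXIT_DAY        # days not spent after an early exit
saved_weeks = saved_days / WORKDAYS_PER_WEEK      # converted to calendar weeks
remaining_weeks = PLAN_WEEKS - saved_weeks        # total plan length after the exit

print(f"saved {saved_days}d = {saved_weeks} weeks; {remaining_weeks:.1f} weeks remain")
```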

Explicitly NOT applied:
- Authorship block reframing for pgsql-hackers audience (a reviewer's
  judgment-call suggestion; the current AI-authorship transparency is
  a deliberate choice by @NikolayS).
- Changelog v0.5 entry (h) defensive-tone trim (cosmetic).
- R1's Sprint 0 execution tips (READ ONLY hooks, VACUUM pg_attribute,
  S3-vs-local latency isolation) — belong in the execution thread,
  not the spec.

Per round-5 consensus: stop iterating, start Sprint 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Self-contained ~20 second reproducer that spins up a local PG18 primary
+ archive-only standby, creates a logical slot, triggers G3 invalidation
via pg_statistic dead-tuple generation + VACUUM, and captures the
Heap2/PRUNE_ON_ACCESS WAL record with its snapshotConflictHorizon.

Verified working with both test_decoding and pgoutput plugins — same
~3s time-to-invalidation, same WAL trigger record, demonstrating the
G3 mechanism is slot-level (not plugin-dependent).

Usage:
  ./blueprints/repro_g3.sh            # run experiment
  ./blueprints/repro_g3.sh cleanup    # tear down clusters + files

Requires PostgreSQL 18 server + client (pgdg apt packages), unprivileged
user, ~300 MB disk under /tmp/sprint0-repro/. No sudo needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Demonstrates the working US-2 approach: pre-stage archive up to the
segment containing L_start, start standby with restore_command pointing
at gated archive, let replay pause at archive end, create slot (restart_lsn
pins at L_start), then release subsequent segments to advance replay
through the window.

Tested: DELETE with REPLICA IDENTITY FULL is decoded with full old-tuple
data, enabling reconstruction of the deleted rows.

This contradicts an earlier finding that called US-2 unimplementable —
the prior attempts used recovery_target_lsn=L_start + action=pause which
blocks snapbuild, not the gated-archive variant demonstrated here.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Demonstrates that US-2 (windowed logical extraction) is achievable even
when the primary is dead at recovery time — provided a production-side
backgrounder has been recording pg_log_standby_snapshot() calls at
quiet moments into the WAL archive.

The recipe: gate archive to stop BEFORE the segment containing the
quiet-moment path-(a) running_xacts record. Start standby (stops at
archive end). Launch slot creation (blocks waiting for snapbuild
consistency). Release the snapshot's segment mid-slot-creation — the
standby's restore_command retries pick it up, replay advances into
the segment, snapbuild reads the path-(a) record forward from
restart_lsn, hits CONSISTENT immediately. Slot creation completes in
~1 second with restart_lsn pinned at the quiet-moment LSN. Remaining
segments released, full decoded window extracted.
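
One possible way to implement the "gated archive" is a small helper behind restore_command that refuses to hand out segments beyond a gate. The sketch below is illustrative only: `gated_restore`, the gate file, and the single-timeline name comparison are assumptions, not the mechanism used by the reproducer scripts.

```python
import os
import re
import shutil

# WAL segment file names are 24 hex digits (timeline + log + segment).
SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")

def gated_restore(archive_dir, gate_path, fname, dest):
    """Copy fname from the archive to dest unless it lies beyond the gate.

    Returns 0 on success and 1 otherwise, mirroring the exit status that
    restore_command is expected to produce. The gate file (hypothetical)
    holds the name of the highest segment currently released.
    """
    src = os.path.join(archive_dir, fname)
    if SEGMENT_RE.match(fname):
        with open(gate_path) as fh:
            gate = fh.read().strip()
        # Assumption: a single timeline, so plain string order equals WAL order.
        if fname > gate:
            return 1  # segment not released yet; the standby will retry
    if not os.path.exists(src):
        return 1
    shutil.copyfile(src, dest)
    return 0
```

Releasing the next segment then amounts to rewriting the gate file; a thin wrapper script would pass `%f` and `%p` from restore_command into this function.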

Tested under 30s sustained OLTP + primary killed before recovery.
Decoded output: 2906 changes including 51 DELETEs from the accident.

This contradicts the earlier "primary must be reachable during
recovery" finding — the pre-positioned quiet-moment snapshot IS
visible to snapbuild if you gate the archive to place restart_lsn
BEFORE the snapshot LSN, not after.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sprint 0 was executed on a lab VM (PG18 + PG17 cross-validated) and
produced ~35 comments of raw evidence on issue #25 plus three committed
reproducer scripts. v0.6 incorporates those findings into the blueprint.

Major changes:

- Status line updated: "Post Sprint 0 execution — findings incorporated;
  pgsql-hackers RFC draft ready"

- US-2 recipe in §2 REWRITTEN. The v0.5 recovery_target_lsn+pause recipe
  does not work in practice — creating a slot on a paused standby blocks
  snapbuild indefinitely because no forward WAL arrives. Replaced with
  the gated-archive + quiet-moment-snapshot recipe validated in Sprint 0.
  The new recipe works with the PRIMARY DEAD during recovery provided
  production has recorded periodic pg_log_standby_snapshot() calls.

- New §10 Sprint 0 Execution Findings section with:
  - Gate-by-gate results (G1 PASS, G2 PASS, G3 FAILS, G4 PASS)
  - G3 MTTI table across 5 workload regimes
  - US-2 working recipe validation with raw decode output
  - US-1 verdict: requires core patch (elevated to Future Work §8.1)
  - Outcome determination: Outcome B is the shipping scope

- New §11 Production-Side Prerequisites documenting the
  pg_log_standby_snapshot() backgrounder as a hard requirement for US-2.

- References §9: added pointer to issue #25 as source of full raw evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Given an archive directory, reports two key LSNs:
- path-(a) anchor: earliest RUNNING_XACTS record where snapbuild can
  bootstrap a slot with oldestRunningXid == nextXid
- MTTI ceiling: first Heap2/PRUNE_ON_ACCESS on a catalog relation at
  or after the anchor, which invalidates any logical slot replaying
  past it

Delta between them is the archive's practical US-2 window, in bytes.
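
For readers unfamiliar with LSN arithmetic, a minimal sketch of the delta computation follows. The helper names and the example LSNs are hypothetical; only the `hi/lo` hexadecimal LSN notation is standard PostgreSQL.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN in 'hi/lo' hex notation to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def window_bytes(anchor_lsn: str, ceiling_lsn: str) -> int:
    """Bytes of WAL between the path-(a) anchor and the MTTI ceiling."""
    return lsn_to_int(ceiling_lsn) - lsn_to_int(anchor_lsn)

# Hypothetical example values, not taken from a real archive.
print(window_bytes("0/5000000", "0/9000000"))
```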

Sprint 1 placeholder for the "pg_waldump --find-first-catalog-prune"
tooling ask called out in the Sprint 0 consolidation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Logical slots are per-database; a catalog prune in database A does not
invalidate a slot in database B. The prior version over-predicted by
reporting the first catalog prune in ANY database.

Add --db <oid> to restrict the scan. Without --db, behavior is unchanged
(conservative) and the output notes the filter is off.

Validated against a real failed-recovery archive:
- Without --db: ceiling at 0/29009D98 (rel 1663/1/2619, db=1 template1)
- With --db 5:  ceiling at 0/33004780 (rel 1663/5/2619, db=5 postgres)
Actual invalidation during standby replay was at 0/33004780 — the --db 5
prediction matches exactly.
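
The effect of the filter can be modelled in a few lines. This is a sketch, not the tool's code: `PruneRecord`, `parse_rel`, and `first_ceiling` are hypothetical names, and the input is assumed to be catalog-prune records already extracted in LSN order. The two records reuse the LSNs and rel identifiers quoted above.

```python
from typing import Iterable, NamedTuple, Optional

class PruneRecord(NamedTuple):
    lsn: str
    tablespace: int
    database: int
    relfilenode: int

def parse_rel(lsn: str, rel: str) -> PruneRecord:
    """Split a 'tablespace/database/relfilenode' triple into its parts."""
    ts, db, node = (int(part) for part in rel.split("/"))
    return PruneRecord(lsn, ts, db, node)

def first_ceiling(records: Iterable[PruneRecord],
                  db_oid: Optional[int] = None) -> Optional[PruneRecord]:
    """Return the first prune record, optionally restricted to one database."""
    for rec in records:
        if db_oid is None or rec.database == db_oid:
            return rec
    return None

records = [
    parse_rel("0/29009D98", "1663/1/2619"),  # template1
    parse_rel("0/33004780", "1663/5/2619"),  # postgres
]
print(first_ceiling(records).lsn, first_ceiling(records, db_oid=5).lsn)
```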

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three additions from continued issue #25 iteration:

1. §10.7 — new static-analysis tool wal_archive_ceiling.sh, end-to-end
   validated against a real failed-recovery archive (predicted LSN,
   snapshotConflictHorizon, rel/blk all match standby's dynamic
   invalidation exactly).

2. §10.2 table — two new data points quantifying the US-2 ceiling under
   sustained 300s OLTP: baseline (default autovacuum) invalidates at
   t+138s; tuned primary (autovacuum_naptime=600s) survives the full
   window and drains 30,413 rows.

3. §11.5 — formalize the operator-facing rule: US-2 window is controlled
   by primary-side autovacuum, not any standby GUC. Rule of thumb:
   autovacuum_naptime >= 2 x desired-window-seconds.
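
The rule of thumb in item 3 is simple enough to encode directly. The function name and the `safety_factor` parameter are hypothetical; the factor of 2 is the one stated above.

```python
def min_autovacuum_naptime(window_seconds: int, safety_factor: int = 2) -> int:
    """Smallest primary-side autovacuum_naptime (seconds) for a desired US-2 window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return safety_factor * window_seconds

# A 300 s window suggests 600 s, the setting used for the tuned run in item 2.
print(min_autovacuum_naptime(300))
```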

Also captures the per-database specificity of InvalidatePossiblyObsoleteSlot
(slot->data.database gate) which caused an initial over-prediction in the
tool before --db <oid> was added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When enabled and a PRUNE_ON_ACCESS WAL record on a catalog relation is
about to cause InvalidateObsoleteReplicationSlots to invalidate at least
one active logical slot in the affected database, pause recovery
instead. The operator can then drain, advance, or drop the slot via
hot-standby SQL and call pg_wal_replay_resume() to continue. On resume,
the caller falls through to the normal invalidation path: if the slot
is gone or advanced past the conflict horizon, invalidation is a no-op;
otherwise the slot is invalidated as before.
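
From the operator's side, the pause/drain/resume cycle might look like the sketch below. It assumes a psycopg 3 style connection (any object whose `execute()` returns a cursor) on a standby running the patched build; `drain_and_resume` is a hypothetical helper, while the three SQL functions are stock PostgreSQL.

```python
def drain_and_resume(conn, slot_name):
    """If replay is paused, drain the slot and resume replay.

    Returns the decoded rows, or None when replay was not paused.
    conn is assumed to behave like a psycopg 3 connection.
    """
    state = conn.execute("SELECT pg_get_wal_replay_pause_state()").fetchone()[0]
    if state != "paused":
        return None
    # Consume everything decodable up to the current replay position.
    rows = conn.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
        (slot_name,),
    ).fetchall()
    # Let replay continue past the conflict record.
    conn.execute("SELECT pg_wal_replay_resume()")
    return rows
```

An orchestrator would call this in a polling loop and hand the returned rows to the consumer pipeline before the next iteration.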

Motivated by blueprints/LOGICAL_DECODING_ARCHIVED_WALS.md §4.2.3 / US-4
and the Sprint 0 findings in issue #25: an archive-only logical-decoding
standby cannot feed hot_standby_feedback to its primary, so the primary
has no back-pressure against catalog vacuuming. Any logical slot on
such a standby is invalidated the first time replay applies a catalog
prune whose snapshotConflictHorizon exceeds the slot's catalog_xmin.
MTTI under default autovacuum on the primary is ~2 * autovacuum_naptime.

Hooks in at ResolveRecoveryConflictWithSnapshot right before the existing
InvalidateObsoleteReplicationSlots call, which is the single point the
recovery path reaches on a logical-slot-relevant conflict. Reuses the
existing recoveryNotPausedCV and SetRecoveryPause API, so this
integrates with pg_wal_replay_pause/resume and recovery_target_action
= pause without new shared-memory state.

Default is off (unchanged behavior). PGC_SIGHUP so operators can flip
it on an already-running standby if they decide to retrofit pause
semantics mid-incident.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this, monitoring tools see pg_get_wal_replay_pause_state() stuck
at "pause requested" forever — the transition to "paused" normally
happens in the main replay loop, but MaybePauseOnLogicalSlotConflict
blocks inside ResolveRecoveryConflictWithSnapshot before returning to
that loop. Call ConfirmRecoveryPaused() inside our local wait loop so
the pause shows up correctly to SQL-level observers.

Also drop the static qualifier on ConfirmRecoveryPaused in xlogrecovery.c
and expose it in xlogrecovery.h for use by standby.c.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes the last gap for US-1 continuous CDC from an archive-only
logical-decoding standby.

Before this, MaybePauseOnLogicalSlotConflict paused correctly, the
operator could drain the slot, but on resume the fall-through to
InvalidateObsoleteReplicationSlots still invalidated the slot — because
the drain's logical-decoding machinery has no way to advance
catalog_xmin past the conflict horizon (the conflict record hasn't
been replayed yet, so the xids involved are still considered
potentially-active by snapbuild).

After resume, scan slots in the conflicted database for those whose
confirmed_flush_lsn has reached (or passed) the pause LSN — i.e., the
operator drained up to the conflict. For each such slot, advance both
catalog_xmin and xmin past the conflict horizon (using
TransactionIdAdvance so the value is strictly > horizon, satisfying
DetermineSlotInvalidationCause's TransactionIdPrecedesOrEquals check).

Slots the operator did NOT drain are left untouched and invalidated
normally — the old "I'll let the slot die" path is preserved.

End-to-end verified against a 300s sustained-OLTP archive with two
catalog prune events:
- Without this change: slot invalidated on first pause's resume.
- With this change: both pauses handled, 45,469 decode events extracted
  (roughly 3x the 15,153 workload INSERTs, i.e. BEGIN/INSERT/COMMIT each),
  slot final state wal_status=reserved, no invalidation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WIP — functional assertions are in place but the archive-restore timing
on init_from_backup(has_restoring => 1) is flaky: the standby stalls
waiting for a segment that was written to the primary's pg_wal but not
yet archived at the moment init_from_backup snapshots archive_command
state.

Saving the skeleton so reviewers can see the intended assertions and
iterate on the timing harness. The equivalent workflow has already been
validated end-to-end via /tmp/us1_v2.sh (45,469 events decoded, 2 pauses
handled, slot survived — documented in issue #25 comment 4261479941);
this is about replicating that in Perl TAP form for upstream CI.

Known failure mode in this revision: PostgreSQL::Test::Cluster's standby
init with has_restoring=1 polls for replay progress before the last
pre-backup segment has landed in the archive. Need to either
* wait for `pg_stat_archiver.last_archived_wal` to match primary's
  current LSN before taking the backup, or
* use has_streaming=1 for the initial catchup and swap to restoring-
  only mid-test.

Not blocking the patch — the prototype still stands on the bash demo.
This revision of the TAP test surfaces an interaction that wasn't
visible in the bash demo at /tmp/us1_v2.sh:

When the catalog-prune WAL record is already in the archive by the
time the standby starts replaying, slot creation can block in
DecodingContextFindStartpoint waiting for more WAL *just as* replay
reaches the prune record and MaybePauseOnLogicalSlotConflict fires.
The result is a deadlock: slot creation waits for replay to advance,
replay is paused waiting for the slot to be drained, but the slot has
not yet reached SNAPBUILD_CONSISTENT and is not drainable.

Two candidate fixes for the patch, for followup work:
1. In MaybePauseOnLogicalSlotConflict, skip slots where the
   effective_catalog_xmin is still InvalidTransactionId (slot not yet
   consistent). Rationale: an in-progress slot hasn't produced output,
   so invalidating it is fine — it gets retried.
2. Decouple slot-creation from the startup process more deeply so a
   paused recovery can still make the tiny bit of progress
   DecodingContextFindStartpoint needs.

Updated top-of-file doc reflects the edge case and points reviewers
at the bash demo as the current authoritative proof of the US-1 path.
Closes the deadlock window the TAP test scaffold surfaced in
a5844b0.

Before: MaybePauseOnLogicalSlotConflict pauses replay for any slot in
the affected database whose data.catalog_xmin precedes the conflict
horizon — including slots still inside DecodingContextFindStartpoint,
whose catalog_xmin has been preliminarily assigned but whose snapbuild
has not yet hit SNAPBUILD_CONSISTENT (effective_catalog_xmin is still
InvalidTransactionId). Pausing for such a slot deadlocks: snapbuild
needs replay to advance to find the path-(a) anchor, but replay is
paused waiting for the slot to be drained, and the slot is not
drainable.

After: also require effective_catalog_xmin to be valid. An in-progress
slot is allowed to be invalidated — the caller just retries creation.
A slot that has reached SNAPBUILD_CONSISTENT (and hence produced
output a consumer may have committed) still pauses.
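
The before/after predicate can be modelled as follows. This is a Python sketch of the decision only, not the C code: `SlotModel` and `should_pause_for` are hypothetical names, and the plain integer comparison stands in for PostgreSQL's wraparound-aware XID comparison.

```python
from dataclasses import dataclass

INVALID_XID = 0  # stands in for InvalidTransactionId

@dataclass
class SlotModel:
    name: str
    database: int
    catalog_xmin: int
    effective_catalog_xmin: int

def should_pause_for(slot: SlotModel, conflict_db: int, horizon: int) -> bool:
    """Decide whether replay should pause on behalf of this slot."""
    if slot.database != conflict_db:
        return False  # slots are per-database
    if slot.effective_catalog_xmin == INVALID_XID:
        return False  # still being created; pausing would deadlock
    # Mirrors "precedes the conflict horizon" above; ignores XID wraparound.
    return slot.catalog_xmin < horizon

in_progress = SlotModel("new", 5, catalog_xmin=900, effective_catalog_xmin=INVALID_XID)
consistent = SlotModel("cdc", 5, catalog_xmin=900, effective_catalog_xmin=900)
print(should_pause_for(in_progress, 5, 1000), should_pause_for(consistent, 5, 1000))
```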

Expected impact on the US-1 happy path: none. In the bash reproducer
at /tmp/us1_v2.sh the slot is always consistent by the time a prune
record is replayed, because of the gated-archive ordering.

Expected impact on the TAP harness: the startup process no longer
hangs when it races into a prune record before slot creation has
completed. The TAP test still needs a two-phase restructure (create
slot BEFORE the catalog-churn workload on the primary) to reach a
passing state — that's tracked in the a5844b0 commit message.
Two fixes, tightly related:

1. standby.c MaybePauseOnLogicalSlotConflict: use
   TransactionIdPrecedesOrEquals to match the semantics of
   DetermineSlotInvalidationCause. Otherwise a slot whose catalog_xmin
   was just advanced past horizon H by our own resume code (catalog_xmin
   now == H+1) will NOT pause when the next prune arrives with horizon
   == H+1, yet it WILL still be invalidated by the fall-through
   InvalidateObsoleteReplicationSlots call — since that function uses
   PrecedesOrEquals. Off-by-one caused slot invalidation one prune
   after a successful drain-and-advance cycle.

2. TAP test 050 restructured into an explicit two-phase flow:
   Phase 1: basebackup + quiet-moment pg_log_standby_snapshot, wait
            for that segment to archive, then start standby and create
            slot — guaranteed no prune records in the archive yet, so
            slot reaches SNAPBUILD_CONSISTENT in seconds.
   Phase 2: run catalog churn on the primary (table create+drop
            iterations, ANALYZE x2, VACUUM on catalog relations), wait
            for those segments to archive, let the standby replay
            through them. The GUC pauses on each prune; orchestrator
            drains and resumes.
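
The off-by-one in fix 1 is easy to reproduce in a small model of 32-bit XID arithmetic. The helpers below are Python re-implementations written for illustration (the names are mine); they follow the usual modulo-2^32 comparison and the "skip special XIDs on wrap" behaviour of TransactionIdAdvance.

```python
FIRST_NORMAL_XID = 3  # XIDs 0..2 are reserved

def _signed32(x: int) -> int:
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def xid_precedes(a: int, b: int) -> bool:
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a < b
    return _signed32(a - b) < 0

def xid_precedes_or_equals(a: int, b: int) -> bool:
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a <= b
    return _signed32(a - b) <= 0

def xid_advance(x: int) -> int:
    x = (x + 1) & 0xFFFFFFFF
    return FIRST_NORMAL_XID if x < FIRST_NORMAL_XID else x

def would_invalidate(catalog_xmin: int, horizon: int) -> bool:
    return xid_precedes_or_equals(catalog_xmin, horizon)

def would_pause_before_fix(catalog_xmin: int, horizon: int) -> bool:
    return xid_precedes(catalog_xmin, horizon)

def would_pause_after_fix(catalog_xmin: int, horizon: int) -> bool:
    return xid_precedes_or_equals(catalog_xmin, horizon)

H = 1000
xmin = xid_advance(H)   # resume code moved catalog_xmin to H + 1
next_horizon = H + 1    # the next prune arrives with horizon == H + 1
print(would_pause_before_fix(xmin, next_horizon),  # no pause ...
      would_invalidate(xmin, next_horizon),        # ... yet the slot is invalidated
      would_pause_after_fix(xmin, next_horizon))   # the fix pauses first
```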

Test result: **all 5 assertions pass**, 22 pause events handled,
3092 decoded events, zero invalidations, slot final wal_status=reserved.

Runs in ~36 wallclock seconds on a modest VM.

[17:31:23] ok 1 - recovery_pause_on_logical_slot_conflict GUC is registered
[17:31:37] ok 2 - slot created cleanly in Phase 1 (state: reserved)
[17:31:58] ok 3 - slot survived catalog prune with GUC on
[17:31:58] ok 4 - at least one pause event was handled (22 seen)
[17:31:58] ok 5 - at least 2000 decoded events (3092 got)
§4.2.3 / §8.1 "slot-aware replay throttling (mechanism TBD)" is now a
concrete 215-line prototype with a passing TAP test on the branch.

Main additions:

- New §12 "Sprint 1 Core Patch: recovery_pause_on_logical_slot_conflict":
  describes the hook point in ResolveRecoveryConflictWithSnapshot, the
  pause+wait+advance flow, edge cases handled (in-progress slot
  deadlock avoidance, PrecedesOrEquals semantics), validation (TAP
  passing, bash end-to-end 45k events at 100% coverage, 102-test
  regression sweep), and files touched.

- Changelog row for 0.8: compresses the 5-commit arc
  (2d70df8, 8a3b95d, 8761b6e, bbd5d4e, 7d16094)
  into the milestone's record. Outcome determination upgraded from B
  (forensic-only) to A (continuous US-1 viable).

Existing sections (§10 Sprint 0 findings, §11 production-side
prereqs) unchanged — they are still accurate for unpatched PG. §12
adds the "with the patch" story on top.
Adds a second standby node (standby_off) brought up in the same Phase 1
as the main standby — quiet archive, no prune records yet. Both
standbys reach SNAPBUILD_CONSISTENT cleanly and create their slots.

Phase 2's catalog churn then hits both standbys. The GUC-on one pauses
and drains (existing assertions); the GUC-off one invalidates — the
existing PG upstream behavior, unchanged by this patch.

New assertions:
  ok 3 - baseline slot created cleanly in Phase 1
  ok 7 - baseline (GUC off): slot invalidates as expected

Before/after comparison now runs in the same harness. If GUC-off ever
stops invalidating, either the test stopped triggering the conflict or
the patch accidentally benefits the off-path — both are real
regressions to catch.

Test summary: 7/7 tests pass, 39 wallclock seconds.

[17:55:58] ok 1 - recovery_pause_on_logical_slot_conflict GUC is registered
[17:56:14] ok 2 - slot created cleanly in Phase 1 (state: reserved)
[17:56:14] ok 3 - baseline slot created cleanly in Phase 1 (state: reserved)
[17:56:35] ok 4 - slot survived catalog prune with GUC on (state: reserved|)
[17:56:35] ok 5 - at least one pause event was handled (22 seen)
[17:56:35] ok 6 - at least 2000 decoded events (3092 got)
[17:56:35] ok 7 - baseline (GUC off): slot invalidates as expected under catalog prune
Stores the draft email body for posting
recovery_pause_on_logical_slot_conflict to pgsql-hackers (as
blueprints/pgsql_hackers_rfc_email.md) alongside the squashed
2-commit patch series (in blueprints/rfc-v1-patches/).

Patch series also lives on the rfc-v1-recovery-pause-on-slot-conflict
branch of this fork for reviewers who prefer git over mailing-list
attachments.

The 5 prototype commits on this branch (2d70df8, 8a3b95d,
8761b6e, bbd5d4e, 7d16094) are squashed into a single
0001-Pause-recovery-on-logical-slot-conflict.patch in the series; the
TAP test commits (7d16094 + 253ad28) are squashed into
0002-Add-TAP-test-for-recovery_pause_on_logical_slot_conf.patch.
Three fixes from an adversarial review pass:

1. (HIGH) Add PromoteIsTriggered() check in the wait loop so
   pg_promote() while paused doesn't stall. Existing recoveryPausesHere
   uses CheckForStandbyTrigger, which is static in xlogrecovery.c;
   PromoteIsTriggered is the exposed equivalent.

2. (HIGH) Filter out synced slots (data.synced == true) in both the
   pause-check scan and the advance scan. Writing to a synced slot from
   the startup process would race the slot-sync worker. Also, the
   error hint "drain, advance, or drop the slot" does not apply to
   synced slots: ALTER / DROP_REPLICATION_SLOT on a synced slot
   raise an ERROR at slot.c:932 / :982.

3. (MEDIUM → documented) The advance marks slots dirty but does not
   force a restartpoint, so a crash between resume and the next
   restartpoint loses the advance. Added a note in the code. On
   restart, we re-encounter the same conflict record, re-pause, and
   the operator re-drains — idempotent, no data loss, but a
   hiccup. A proper fix needs SaveSlotToPath to be callable from the
   startup process; that is out of scope for this prototype.

Test result: 050 TAP still 7/7 passing in 37s. The fixes add 3 new
safety edges without breaking the happy path.
Reflects commit e896c73 (promote check, synced-slot filter,
durability-gap doc comment) rolled into the squashed patches.

Both patches still apply cleanly to upstream postgres/postgres master
(commit 191a037 as of this writing). Regression sweep on the
fresh tree includes:
  t/050_recovery_pause_on_slot_conflict.pl .. ok  (7/7 asserts)
  t/035_standby_logical_decoding.pl ......... ok
  t/040_standby_failover_slots_sync.pl ...... ok  (synced-slot filter)
  t/044_invalidate_inactive_slots.pl ........ ok
  (135 tests total, PASS)

0001 now 216 → 254 insertions (the three safety fixes).
0002 unchanged in substance.

The rfc-v1-recovery-pause-on-slot-conflict branch is force-pushed to
the new squashed pair (fb514d2 + a16bce1).
Expands the "Design notes" section from 3 edge cases to 6, folding in
the three safety fixes caught during adversarial review:
(4) PromoteIsTriggered escape, (5) synced-slot filter, (6) durability
gap — the latter flagged as known-limitation / deferred with a brief
explanation of why it's not catastrophic (idempotent re-drain on
restart, no data loss) and why the proper fix is out of scope for the
prototype (SaveSlotToPath is static, requires MyReplicationSlot).

No other changes to the email text. The email body is still the
candidate send-text pending human sanity-check before posting.
The previous version checked PromoteIsTriggered() in the wait loop,
which only reads the LocalPromoteIsTriggered cache. The cache is
populated by CheckForStandbyTrigger(), which does the real work:
detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered().
Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays
false forever — so pg_promote()'s signal went unnoticed in our loop.
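
The difference between reading the cache and doing the real check can be shown with a toy model. This is a simplified Python analogy, not the xlogrecovery.c logic: the class and method names are hypothetical stand-ins, and the signal is reduced to the existence of a file.

```python
import os

class StartupProcessModel:
    """Toy model of the promote-trigger cache described above."""

    def __init__(self, signal_path: str):
        self.signal_path = signal_path
        self._local_promote_triggered = False  # the cached flag

    def promote_is_triggered(self) -> bool:
        # Only reads the cache; never looks at the signal file.
        return self._local_promote_triggered

    def check_for_standby_trigger(self) -> bool:
        # Does the real work: detect the file, remove it, set the cache.
        if self._local_promote_triggered:
            return True
        if os.path.exists(self.signal_path):
            os.unlink(self.signal_path)
            self._local_promote_triggered = True
            return True
        return False

def wait_while_paused(proc, use_real_check: bool, max_polls: int = 5):
    """Return the poll index at which promotion was noticed, or None."""
    for poll in range(max_polls):
        if use_real_check:
            triggered = proc.check_for_standby_trigger()
        else:
            triggered = proc.promote_is_triggered()
        if triggered:
            return poll
    return None
```

With a signal file present, the cache-only loop runs out of polls while the loop that performs the real check exits on its first iteration.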

Empirically verified the bug and fix:
- Before (PromoteIsTriggered only): pg_promote(wait => true,
  wait_seconds => 30) returns FALSE after 30 seconds, standby still
  in_recovery.
- After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second,
  standby is promoted.

Raw test run from /tmp/verify_promote.sh:
  [18:56:24] wait for standby to enter pause
  [18:56:29] pause state: paused
  [18:56:29] *** calling pg_promote() now ***
  [18:56:30] pg_promote returned in 0.937961725s
  [18:56:30] pg_is_in_recovery post-promote: f
  [18:56:30] PASS: promote escaped the pause in under 10s

CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes
it extern (drop the static, add the extern in xlogrecovery.h). Mirrors
the existing recoveryPausesHere() escape loop in the same file.
Two additions:

1. **Promote-during-pause assertions** (ok 8, 9, 10). Brings up a
   third standby with the GUC on, waits for it to enter the paused
   state, then calls pg_promote(wait=>true, wait_seconds=>30) and
   asserts:
   - promote returned true
   - promote completed in under 10 seconds
   Without the CheckForStandbyTrigger() escape in the wait loop (the
   fix in ee42817), the standby stays paused for 30s and
   pg_promote returns false. This test guards against regression of
   that fix.

2. **Phase-1 WAL stabilization**. Previously the Phase-1 sequence was
   pg_log_standby_snapshot + pg_switch_wal. The first post-backup
   segment would archive and standbys could start, but slot creation
   on them could block in DecodingContextFindStartpoint 'waiting for
   WAL to become available' at segment N+1 (the active-but-not-yet-
   archived segment on primary). Adding a second pg_log_standby_snapshot
   + pg_switch_wal after the first one gives snapbuild enough forward
   WAL to decide the slot is consistent without waiting for primary
   activity. Eliminates the intermittent 'slot creation on standby_off
   timed out' flake seen in earlier runs.
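
The stabilization sequence in item 2 can be sketched as a small helper run against the primary. It assumes a psycopg 3 style connection whose `execute()` returns a cursor; `stabilize_phase1` is a hypothetical name, and the TAP test itself is written in Perl, so this only illustrates the order of calls.

```python
def stabilize_phase1(conn, rounds: int = 2):
    """Log a standby snapshot and switch WAL, `rounds` times in a row.

    Returns the LSNs reported by pg_switch_wal(), one per round.
    """
    switch_lsns = []
    for _ in range(rounds):
        # Emit a RUNNING_XACTS record for snapbuild to bootstrap from.
        conn.execute("SELECT pg_log_standby_snapshot()")
        # Close the current segment so it becomes eligible for archiving.
        row = conn.execute("SELECT pg_switch_wal()").fetchone()
        switch_lsns.append(row[0])
    return switch_lsns
```

The caller still has to wait for the switched segments to reach the archive before starting the standbys.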

Test result: 10/10 assertions pass, ~30 wallclock seconds (down from
~40s, and without the flake).

  ok 1 - GUC registered
  ok 2 - slot created cleanly in Phase 1 (state: reserved)
  ok 3 - baseline slot created cleanly in Phase 1 (state: reserved)
  ok 4 - slot survived catalog prune with GUC on (state: reserved|)
  ok 5 - at least one pause event was handled (18 seen)
  ok 6 - at least 2000 decoded events (3094 got)
  ok 7 - baseline (GUC off): slot invalidates as expected
  ok 8 - promote-test standby reached paused state before promotion
  ok 9 - pg_promote returned true while standby was paused by GUC
  ok 10 - pg_promote completed in under 10s (actual: 1s)
…10-test TAP

Reflects commits ee42817 (promote-escape actually working) and
68b62ce (TAP hardening + promote-during-pause coverage).

v3 patches:
  0001 — 265 insertions (was 256 in v2)
  0002 — 296 insertions (was 247 in v2)

Both patches still apply cleanly to upstream postgres/postgres master.
Test: 10/10 passing, ~30s runtime.

The rfc-v1-recovery-pause-on-slot-conflict branch was force-pushed to
the new squashed pair (87e7c8f + d63deab).
The July 2025 "Requested WAL segment has already been removed" thread
was authored by Japin Li, not Kukushkin. Kukushkin is a discussion
participant alongside Fujii Masao et al.; his contribution was
mailing-list commentary, not a patch.

Fixed in four live references:
  - §1  "What exists today" bullet
  - §8.2 Future Work entry #2 (walsender restore_command integration)
  - §9  Reference #2
  - §10.4 US-1 verdict prior-art note

Changelog entries for v0.1 and v0.2 left as frozen history — their
misattribution is preserved as part of the iteration record, not
rewritten.

Sawada (dynamic wal_level PoC, §9 ref #7) and Amul Sul (pg_waldump
tarfile support, §9 ref #8) verified correctly attributed — both are
thread originators / patch authors on their respective pgsql-hackers
threads, no change needed.
@NikolayS NikolayS force-pushed the blueprint/logical-decoding-archived-wals branch from 2731368 to a485e95 Compare April 22, 2026 18:19