blueprint: logical decoding of archived WAL (v0.1 draft) by NikolayS · Pull Request #24 · NikolayS/postgres

NikolayS · 2026-04-16T03:11:39Z

Summary

Adds blueprints/LOGICAL_DECODING_ARCHIVED_WALS.md — a draft spec for producing a logical change stream from archived WAL files via a physical standby fed by restore_command, decoupling logical consumers from the production primary.

Status: Draft, ready to start Sprint 0 (post-Sprint-0 cleanup expected as v0.6).

Core thesis: Don't build an offline decoder (multi-year effort). Instead, exploit the fact that a PG16+ physical standby already supports logical slots — and that standby can replay from archive via restore_command, needing no connection to the primary.

What's in the spec

Goal & motivation — three pain points: WAL-retention risk, no logical PITR, managed-service lock-in (aspirational).
4 user stories — decoupling, windowed logical extraction (PITR-adjacent), PII-free staging, WAL correctness verification via paused-state inspection.
Architecture — standby-as-decoder; consumer pipeline (JSON / SQL / replay / Kafka).
Phased PoC plan (~9 weeks):
- Sprint 0 (3 weeks): four gates G1–G4, Outcome A/B/C scope split, observability.
- Sprint 1: Dockerized harness + consumer + PII filter TDD.
- Sprint 2: controlled pause/consume/resume replay; DDL handling; slot-invalidation cookbook; US-4 paused-state inspection controller.
- Sprint 3: CLI polish, WAL-G / pgBackRest integration, demo.
Testing strategy — TDD scope (filters/formatters/CLI) vs. integration (orchestrator, slot lifecycle).
Team composition — ~3.5 FTE.
Future work — core patches (slot-aware replay throttling, walsender restore_command), catalog snapshot export, DBLab integration.
References — Ringer 2020 thread, Kukushkin 2025 revival, PG16 standby logical decoding (Drouvot), PG18 pg_logicalinspect, PG19 proposals (dynamic wal_level, pg_waldump tar support) — all clearly marked as under review.

Attribution

Author: Claude Opus 4.6 (1M context)
Key researcher: Claude Opus 4.6 (deep research mode)
Architecture lead: @x4m (Andrey Borodin) — Postgres.tv hacking session
Contributors: @NikolayS (idea generator, AI coordinator), @kirkw
Meeting secretary / transcription: Circleback.ai — transcribed the Postgres.tv hacking session
Reviewers (four rounds): another Claude Opus 4.6 instance, Gemini 3.1 Pro, GPT 5.4

Test plan

Spec reviewed by @x4m for architectural soundness (especially the standby-as-decoder viability thesis)
Spec reviewed by @kirkw for completeness / missing scenarios
Sprint 0 gate experiment executed — confirm all four gates G1–G4 per tracking issue Sprint 0: Prove the Pipe — logical decoding from an archive-fed standby #25
Post-Sprint-0 findings committed as v0.6 appendix + changelog entry, mapping results to Outcome A/B/C

If the background worker for processing databases manages to finish before the launcher starts waiting for it, the launcher would erroneously treat it as an error. Fix by making sure to check the result state in this case. Identified on CI and synthetically reproduced during local testing. While at it, make sure to properly lock the shared memory structure before updating the result state. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4

Postcommit review and buildfarm/CI failures revealed a few issues in the test code which this commit attempts to resolve. These failures are verified using synthetic means.
* Wait for launcher exit in enable/disable checksum tests. When enabling or disabling data checksums in a test that waits for an end state (on or off), the test typically wants to perform more tests against the cluster immediately. Make sure to wait for the launcher to exit in these cases before returning in order to know it can immediately be acted on. This is a more generic way of implementing 0036232.
* Refactor injection point tests to use the injection_points test extension. Two injection points added for online checksums were better expressed using the injection_points extension with the test code embedded in datachecksum_state.c.
* Make tests less timing dependent and allow transitions to "on" and not just "inprogress-on" in case a test manages to finish before it's checked for state.
* When waiting on a blocking background psql keeping a temporary table open, the test first closed the background session and then the server. This could cause data checksums to manage to get enabled in the brief window between dropping the temporary table and closing the server. Fix by closing the server first before the background session.
* Remove a few superfluous duplicate checks and general cleanup of comments as well as making LSN logging consistent.
These issues were reported by Andres as well as spotted in the buildfarm and on CI. Author: Daniel Gustafsson <daniel@yesql.se> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se

On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p is not available. Use pg_integer_constant_p and add appropriate guards. There is a similar potential hazard for the x86 path, but for now let's get the buildfarm green. Oversight in commit fbc57f2, per buildfarm member hoatzin.

…ation Previously, during shutdown, walsenders always waited until all pending data was replicated to receivers. This ensures sender and receiver stay in sync after shutdown, which is important for physical replication switchovers, but it can significantly delay shutdown. For example, in logical replication, if apply workers are blocked on locks, walsenders may wait until those locks are released, preventing shutdown from completing for a long time. This commit introduces a new GUC, wal_sender_shutdown_timeout, which specifies the maximum time a walsender waits during shutdown for all pending data to be replicated. When set, shutdown completes once all data is replicated or the timeout expires. A value of -1 (the default) disables the timeout. This can reduce shutdown time when replication is slow or stalled. However, if the timeout is reached, the sender and receiver may be left out of sync, which can be problematic for physical replication switchovers. Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Ronan Dunklau <ronan@dunklau.fr> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
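The shutdown rule described above (finish once everything is replicated or the timeout expires, with -1 disabling the timeout) can be sketched as a small predicate. This is only an illustration in Python, not the server's C code: the function name, the parameter names, and the use of milliseconds are assumptions, and treating 0 as "expire immediately" is also an assumption because the commit message only describes -1.

```python
def walsender_may_exit(pending_bytes: int, waited_ms: int,
                       shutdown_timeout_ms: int = -1) -> bool:
    """Toy model of the walsender shutdown wait (names are hypothetical)."""
    if pending_bytes == 0:
        # All pending data has been replicated: shutdown can complete.
        return True
    if shutdown_timeout_ms < 0:
        # -1 (the default) disables the timeout: keep waiting.
        return False
    # Timeout enabled: give up waiting once it has elapsed, accepting that
    # sender and receiver may be left out of sync.
    return waited_ms >= shutdown_timeout_ms
```

With the default of -1 the predicate only becomes true when nothing is pending, which matches the previous always-wait behaviour.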

When determining if it is safe to use an expression as a grouping key for partial aggregation, eager aggregation relies on the B-tree equalimage support function to ensure that equality implies image equality. Previously, the code incorrectly passed the default collation of the expression's data type to the equalimage procedure, rather than the expression's actual collation. As a result, if a column used a non-deterministic collation but the base type's default collation was deterministic, eager aggregation would incorrectly assume that the column was safe for byte-level grouping. This could cause rows to be prematurely grouped and subsequently discarded by strict join conditions, resulting in incorrect query results. This patch fixes the issue by passing the expression's actual collation to the equalimage procedure. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com

Pushing aggregates containing volatile functions below a join can violate volatility semantics by changing the number of times the function is executed. Here we check the Aggref nodes in the targetlist and havingQual for volatile functions and disable eager aggregation when such functions are present. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com

…ion test Previously the stats.sql regression test used conditions like "datname = (SELECT current_database())" to check the current database name. The subquery is unnecessary, so this commit simplifies these expressions to "datname = current_database()". Author: Chao Li <lic@highgo.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/A1535A8F-65AF-4C3D-ACBE-25891CB5D38B@gmail.com

Alexander Lakhin has noticed that it can be possible on machines with slow storage to have the spawned workers be stuck in initialize_worker_spi(), before they reach their main loop. Waiting for a flush to happen would block the interrupt attempts done by the database commands, causing the test to fail on timeout once the number of interrupt attempts is reached in CountOtherDBBackends(). This commit switches the test to wait for the spawned bgworkers to reach their main loops before attempting the database commands that would trigger the interrupts, napping for a time larger than the default, with worker_spi.naptime set at 10 minutes. Another thing that could be attempted is to enforce a larger number of tries in CountOtherDBBackends(), if what is done here is not enough. Let's see first if what this commit does is enough for the buildfarm members widowbird and jay. Analyzed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com

Previously, one LWLock was used for each lock type, adding complexity without an observable performance benefit as data is gathered only for paths involving lock waits, at least currently. This commit replaces the per-type set of LWLocks with a single LWLock protecting the stats data of all the lock types, like the stats kinds for SLRU or WAL. A good chunk of the callbacks get simpler thanks to this change. The previous approach also had one bug in the flush callback when nowait was called with "true": a backend iterating over all entries could successfully flush some entries while skipping others due to contention, then unconditionally reset the pending data. This would cause some stats data loss. Oversight in 4019f72. Reported-by: Tomas Vondra <tomas@vondra.me> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me
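The flush bug described above can be illustrated with a toy model. This is a hedged sketch in Python rather than the actual pgstat callback: the dictionaries, the function names, and the lock callbacks are all hypothetical stand-ins for the pending and shared stats data.

```python
from typing import Callable, Dict


def flush_per_type_locks(pending: Dict[str, int], shared: Dict[str, int],
                         try_lock: Callable[[str], bool]) -> None:
    """Buggy pattern: one lock per lock type, nowait flush."""
    for lock_type, count in pending.items():
        if try_lock(lock_type):
            shared[lock_type] = shared.get(lock_type, 0) + count
        # else: entry skipped because of contention
    # Unconditional reset: counts of skipped entries are lost.
    pending.clear()


def flush_single_lock(pending: Dict[str, int], shared: Dict[str, int],
                      try_lock: Callable[[], bool]) -> bool:
    """Single-lock pattern: flush everything or nothing."""
    if not try_lock():
        return False  # pending data is kept for a later attempt
    for lock_type, count in pending.items():
        shared[lock_type] = shared.get(lock_type, 0) + count
    pending.clear()
    return True
```

With a single lock there is no partially flushed state, so the reset of pending data can only happen after every entry has been applied.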

This module allows plan advice strings to be provided automatically from an in-memory advice stash. Advice stashes are stored in dynamic shared memory and must be recreated and repopulated after a server restart. If pg_stash_advice.stash_name is set to the name of an advice stash, and if query identifiers are enabled, the query identifier for each query will be looked up in the advice stash and the associated advice string, if any, will be used each time that query is planned. Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com

Some compilers didn't like the empty initializer when compiled without USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual Studio 2019. Author: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz

When freeing pending_shmem_requests we should also free the ->options. Author: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com

Child processes do not need the postmaster's working memory context and normally release it at the start of their main entry point. However, the slotsync worker forgot to do so. This commit makes the slotsync worker release the postmaster's working memory context at startup, preventing unintended use. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com

This commit updates 011_lock_stats.pl to verify log_lock_waits behavior. The tests check that messages are emitted both when a wait occurs and when the lock is acquired, and that the "still waiting for" message is logged exactly once per wait, even if the backend wakes up during the wait. The latter covers the behavior introduced by commit fd6ecbf. Author: Hüseyin Demir <huseyin.d3r@gmail.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com

Previously, this logic was embedded within SplitIdentifierString, SplitDirectoriesString, and SplitGUCList. Factoring it out saves a bit of duplicated code, and also makes it available to extensions that might want to do similar things without necessarily wanting to do exactly the same thing. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

The restructuring in commit 53b8ca6 revealed an interesting corner case: if a table needs vacuuming for wraparound prevention and autovacuum is disabled for it, we might still choose to analyze it. Research seems to indicate this was an accidental addition by commit 48188e1, and further discussion indicates there is consensus that it is unnecessary and can be removed. Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Shinya Kato <shinya11.kato@gmail.com> Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan

It would be useful to be able to tell auto_explain to set a custom EXPLAIN option, but it would be bad if it tried to do so and the option name or value wasn't valid, because then every query would fail with a complaint about the EXPLAIN option. So add a guc_check_handler that auto_explain will be able to use to only try to set option name/value/type combinations that have been determined to be legal, and to emit useful messages about ones that aren't. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

This code missed the need to update the combined state's nullbitmap if state1 already had a bitmap but state2 didn't. We need to extend the existing bitmap with 1's but didn't. This could result in wrong output from a parallelized array_agg(anyarray) calculation, if the input has a mix of null and non-null elements. The errors depended on timing of the parallel workers, and therefore would vary from one run to another. Also install guards against integer overflow when calculating the combined object's sizes, and make some trivial cosmetic improvements. Author: Dmytro Astapov <dastapov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com Backpatch-through: 16
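The bitmap handling described above can be modelled in a few lines. This is a simplified Python sketch, not the C combine function: bitmaps are lists of 0/1 where 1 marks a non-null element (following the "extend with 1's" wording), `None` stands for "no bitmap", and the function name is hypothetical.

```python
from typing import List, Optional


def combine_null_bitmaps(bitmap1: Optional[List[int]], n1: int,
                         bitmap2: Optional[List[int]], n2: int
                         ) -> Optional[List[int]]:
    """Combine the null bitmaps of two partial states with n1 and n2 items."""
    if bitmap1 is None and bitmap2 is None:
        return None  # no nulls on either side, no bitmap needed
    # A missing bitmap means "all elements non-null", so it expands to 1's.
    # The case bitmap1 present / bitmap2 missing is the one the fix covers.
    left = bitmap1 if bitmap1 is not None else [1] * n1
    right = bitmap2 if bitmap2 is not None else [1] * n2
    return left + right
```

The combined bitmap always has `n1 + n2` entries whenever either input carried nulls, which is the invariant the original code violated in the one-sided case.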

contrib/pg_stash_advice and src/test/modules/test_shmem missed these, leading to complaints from git after an in-tree check-world run. Use our standard boilerplate list of ignorable subdirectories, although the two modules presently create different subsets of that.

These columns haven't been computed yet when the filtering happens (since we've not written the candidate tuple into the table); so any check on them is wrong or useless. Worse, since aa606b9 such a reference results in an access off the end of a TupleDesc, potentially causing a phony "generated columns are not supported in COPY FROM WHERE conditions" error; and since c98ad08 it throws an Assert instead. Actually we could allow tableoid, which has been set to the OID of the table named as the COPY target. However, plausible uses for tests of tableoid would involve a partitioned target table, and the user would wish it to read as the OID of the destination partition. There has been some discussion of changing things to make it work like that, but pending that happening we should just disallow tableoid along with other system columns. It seems best though to install this prohibition only in HEAD. In the back branches we'll just guard the unsafe TupleDesc access, and people will keep getting whatever semantics they got before. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com

CLUSTER is no longer the favored way to invoke this functionality, and the code is about to shift its focus more ambitiously to REPACK. Rename the file to avoid leaving an unnecessary historical artifact around. Author: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql

Previously, autovacuum always disabled parallel vacuum regardless of the table's index count or configuration. This commit enables autovacuum workers to use parallel index vacuuming and index cleanup, using the same parallel vacuum infrastructure as manual VACUUM. Two new configuration options control the feature. The GUC autovacuum_max_parallel_workers sets the maximum number of parallel workers a single autovacuum worker may launch; it defaults to 0, preserving existing behavior unless explicitly enabled. The per-table storage parameter autovacuum_parallel_workers provides per-table limits. A value of 0 disables parallel vacuum for the table, a positive value caps the worker count (still bounded by the GUC), and -1 (the default) defers to the GUC. To handle cases where autovacuum workers receive a SIGHUP and update their cost-based vacuum delay parameters mid-operation, a new propagation mechanism is added to vacuumparallel.c. The leader stores its effective cost parameters in a DSM segment. Parallel vacuum workers poll for changes in vacuum_delay_point(); if an update is detected, they apply the new values locally via VacuumUpdateCosts(). A new test module, src/test/modules/test_autovacuum, is added to verify that parallel autovacuum workers are correctly launched and that cost-parameter updates are propagated as expected. The patch was originally proposed by Maxim Orlov, but the implementation has undergone significant architectural changes since then during the review process. Author: Daniil Davydov <3danissimo@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: zengman <zengman@halodbtech.com> Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
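The interaction between the GUC and the per-table storage parameter described above can be written as a short function. This is a sketch in Python with hypothetical names. It models only the upper limit stated in the commit message, not how many workers are actually launched, which presumably also depends on the table's indexes.

```python
def autovacuum_parallel_limit(reloption: int, guc_max: int) -> int:
    """Upper bound on parallel workers for one autovacuum worker on a table.

    reloption: per-table autovacuum_parallel_workers (-1 is the default)
    guc_max:   autovacuum_max_parallel_workers (0 is the default)
    """
    if reloption == 0:
        return 0                 # parallel vacuum disabled for this table
    if reloption == -1:
        return guc_max           # defer to the GUC
    return min(reloption, guc_max)  # positive value, still bounded by the GUC
```

With both settings at their defaults the limit is 0, which preserves the existing non-parallel behaviour.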

transformCreateSchemaStmtElements has always believed that it is supposed to re-order the subcommands of CREATE SCHEMA into a safe execution order. However, it is nowhere near being capable of doing that correctly. Nor is there reason to think that it ever will be, or that that is a well-defined requirement. (The SQL standard does say that it should be possible to do foreign-key forward references within CREATE SCHEMA, but it's not clear that the text requires anything more than that.) Moreover, the problem will get worse as we add more subcommand types. Let's just drop the whole idea and execute the commands in the order given, which seems like a much less astonishment-prone definition anyway. The foreign-key issue will be handled in a follow-up patch. This will result in a release-note-worthy incompatibility, which is that forward references like CREATE SCHEMA myschema CREATE VIEW myview AS SELECT * FROM mytable CREATE TABLE mytable (...); used to work and no longer will. Considering how many closely related variants never worked, this isn't much of a loss. Along the way, pass down a ParseState so that we can provide an error cursor for "wrong schema name" and related errors, and fix transformCreateSchemaStmtElements so that it doesn't scribble on the parsetree passed to it. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us

The previous patch simplified CREATE SCHEMA's behavior to "execute all subcommands in the order they are written". However, that's a bit too simple, as the spec clearly requires forward references in foreign key constraint clauses to work, see feature F311-01. (Most other SQL implementations seem to read more into the spec than that, but it's not clear that there's justification for more in the text, and this is the only case that doesn't introduce unresolvable issues.) We never implemented that before, but let's do so now. To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's subcommand list. This works because the foreign key constraints are independent and don't affect any other DDL that might be in CREATE SCHEMA. For simplicity, we do this for all FOREIGN KEY clauses even if they would have worked where they were. Author: Jian He <jian.universality@gmail.com> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
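The rewrite described above (lift every FOREIGN KEY clause out of its CREATE TABLE and append it as ALTER TABLE ... ADD FOREIGN KEY at the end of the subcommand list) can be sketched on a toy representation. This is Python over plain strings, not the parse-tree transformation in the patch; the `CreateTable` class and the function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CreateTable:
    name: str
    columns: List[str]
    foreign_keys: List[str] = field(default_factory=list)


def defer_foreign_keys(elements: List[Union[CreateTable, str]]) -> List[str]:
    """Return subcommands in written order, with all FKs moved to the end."""
    body: List[str] = []
    deferred: List[str] = []
    for el in elements:
        if isinstance(el, CreateTable):
            body.append(f"CREATE TABLE {el.name} ({', '.join(el.columns)})")
            # Every FK is deferred, even one that would have worked in place.
            for fk in el.foreign_keys:
                deferred.append(f"ALTER TABLE {el.name} ADD {fk}")
        else:
            body.append(el)  # any other subcommand keeps its position
    return body + deferred
```

For example, a table `a` that references a table `b` created later in the same CREATE SCHEMA ends up as two CREATE TABLE commands followed by one ALTER TABLE, so the forward reference resolves.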

Having rejected the principle that we should know how to re-order the sub-commands of CREATE SCHEMA, there is not really anything except a little coding to stop us from supporting more object types. This patch adds support for creating functions (including procedures and aggregates), operators, types (including domains), collations, and text search objects. SQL:2021 specifies that we should allow functions, procedures, types, domains, and collations, so this moves us a great deal closer to full SQL compatibility of CREATE SCHEMA. What remains missing from their list are casts, transforms, roles, and some object types we don't support yet (e.g. CREATE CHARACTER SET). Supporting casts or transforms would be problematic because they don't have names at all, let alone schema-qualified names, so it'd be quite a stretch to say that they belong to a schema. Roles likewise are not schema-qualified, plus they are global to a cluster, making it even less reasonable to consider them as belonging to a schema. So I don't see us trying to complete the list. User-defined aggregates and operators are outside the spec's ken, as are text search objects, so adding them does not do anything for spec compatibility. But they go along with these other object types, plus it takes no additional code to support them since they are represented as DefineStmts like some variants of CREATE TYPE. It would indeed take some effort to reject them. Author: Kirill Reshke <reshkekirill@gmail.com> Author: Jian He <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com

The associated value should look like something that could be part of an EXPLAIN options list, but restricted to EXPLAIN options added by extensions. For example, if pg_overexplain is loaded, you could set auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'. You can also specify arguments to these options in the same manner as normal e.g. 'DEBUG 1, RANGE_TABLE false'. Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com

This function is a thin wrapper around relation_needs_vacanalyze() that handles fetching and freeing the pgstat entry for the table. Since all callers of relation_needs_vacanalyze() do that anyway, we can teach that function to fetch/free the pgstat entry and use it instead. Suggested-by: Álvaro Herrera <alvherre@kurilemu.de> Author: Sami Imseih <samimseih@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com

Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building the result tuple descriptor for the WAIT FOR command. This avoids a syscache access that could re-establish a catalog snapshot after we've explicitly released all snapshots before the wait. Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>

When the standby is passed as a PostgreSQL::Test::Cluster instance, use the WAIT FOR LSN command on the standby server to implement wait_for_catchup() for replay, write, and flush modes. This is more efficient than polling pg_stat_replication on the upstream, as the WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

Rather than pre-checking pg_is_in_recovery() on the standby (which would add an extra round-trip on every call), we issue WAIT FOR LSN directly and handle the 'not in recovery' result as a signal to fall back to polling.

For 'sent' mode, when the standby is passed as a string (e.g., a subscription name for logical replication), when the standby has been promoted, or when WAIT FOR LSN is interrupted by a recovery conflict, the function falls back to the original polling-based approach using pg_stat_replication on the upstream. The recovery conflict fallback is necessary because some conflicts are unavoidable - for example, ResolveRecoveryConflictWithTablespace() kills all backends unconditionally, regardless of what they are doing. The recovery conflict detection matches the English error message "conflict with recovery", which is reliable because the test suite runs with LC_MESSAGES=C.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
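The decision logic in this commit message can be summarised as two predicates. The real helper is Perl in PostgreSQL::Test::Cluster; the sketch below is Python with hypothetical function names. The two matched substrings are the ones quoted in the message, while how they surface in the command output is an assumption.

```python
WAIT_FOR_LSN_MODES = {"replay", "write", "flush"}


def can_use_wait_for_lsn(standby_is_cluster: bool, mode: str) -> bool:
    """True when the latch-based WAIT FOR LSN path is applicable."""
    return standby_is_cluster and mode in WAIT_FOR_LSN_MODES


def needs_polling_fallback(wait_output: str) -> bool:
    """True when the WAIT FOR LSN attempt must fall back to polling."""
    # A promoted standby reports that it is not in recovery; a recovery
    # conflict is detected by its English error text (LC_MESSAGES=C).
    return ("not in recovery" in wait_output
            or "conflict with recovery" in wait_output)
```

Issuing the command first and inspecting the outcome afterwards is what avoids the extra pg_is_in_recovery() round-trip on every call.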

Add a note to the WAIT FOR documentation explaining that sessions using this command on a standby server may be interrupted by recovery conflicts. Some conflicts are unavoidable - for example, replaying a tablespace drop terminates all backends unconditionally. Discussion: https://postgr.es/m/CAPpHfds7oSCbZqob7ytT_Lso8fv-NW8LnedUTE4Krde%2B3rkJeA%40mail.gmail.com Author: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>

@NikolayS

… disclosed Four adjustments in this commit, reflecting the actual production model:
1. Author is Claude Opus 4.6 (me), not @NikolayS. The blueprint is AI-authored; @NikolayS frames problems, provides judgement, coordinates agents, and makes final calls.
2. Attribution block expanded:
   - Author: Claude Opus 4.6 (1M context)
   - Key researcher: Claude Opus 4.6 (deep research mode)
   - Architecture lead: @x4m (Andrey Borodin)
   - Contributors: @NikolayS (idea generator, AI coordinator), @kirkw
   - Meeting secretary / transcription: Circleback.ai (transcribed the Postgres.tv hacking session; transcript fed into research and drafting)
   - Reviewers (four rounds): another Claude Opus 4.6 instance, Gemini 3.1 Pro, GPT 5.4
3. GitHub handles (@NikolayS, @x4m, @kirkw) are now clickable markdown links throughout the document.
4. Author column removed from the Changelog table: with a single author it was noise; version/date/changes remain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Added https://www.youtube.com/watch?v=LjiU6kB6izw as a clickable link wherever the hacking session is mentioned — attribution block (architecture lead + meeting secretary lines) and reference [13]. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@x4m

Three parallel review agents converged on the same smell: across v0.1 to v0.5, "(added in v0.X)" and "(corrected in v0.Y)" parentheticals accumulated throughout the prose, plus defensive meta-commentary about prior versions. The changelog already captures the revision history; keeping it in the current-state sections turns the doc into archaeology for fresh readers.

Stripped:
- ~10 "(added in v0.3)" / "(added in v0.5)" / "(corrected in v0.2)" parentheticals on section headers and inline prose.
- "Previous version of this doc was wrong" lead-in in §4.2.2; current prose stands on its own.
- "Revised in v0.2 after feasibility pushback" header in §4.2.3.
- Sprint 0 "Revised in v0.2" introduction.
- "was 2" historical annotations on S0-6 and S2-2 day budgets.
- "v0.3 said ~8, v0.5 reconciles..." version-history parenthetical after the Gantt total.
- "The decode historic WAL reading of v0.1 was wrong" in G2 gate note.
- "v0.2 phrasing conflated..." archaeology in §4.1.4 risk 1.
- Gantt heading "(v0.5 - Sprint 0 extended to 3 weeks per budget reconciliation)" simplified to just "Gantt overview."

Compressed:
- "Note on authorship" preamble: 90 words -> 15 words. The US-4 retention detail belongs in §2 US-4, not the header.
- US-4 §2 justification: three paragraphs of attribution + value defense + granularity essay collapsed to one sentence of attribution + one short comparison + granularity caveat.
- Outcome A US-4 bullet: dropped "per v0.4 retention decision" and "core value-add from @x4m"; now just "in scope; see §2 US-4."
- Sprint 0 budget note: dropped the "v0.3/v0.4 claimed 2 weeks while tasks totalled 11.5-14.5d" relitigation; states the current budget.

Fixed:
- Broken §7.0.1 / §7.0.2 internal refs (section numbers that don't exist) replaced with heading names.
- `postgresql.conf` comment pointing to "§4.2 for the correct recipe" redirected to §2 US-2 (§4.2 is about controlled replay, not the US-2 recipe).
- Components table `Python (psycopg2/3)` reconciled to `Python (psycopg)`: the Python example currently imports psycopg2 only; the "/3" was aspirational.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@NikolayS

All three round-5 reviewers endorsed proceeding to Sprint 0. Two items were small enough to fix before closing out:
- Outcome B's US-4 bullet softened. v0.5 reframed US-4 around paused-state inspection as independently valuable; Outcome B still said "likely cut," which partially contradicted that framing. Now: "narrows substantially, keep as internal debug tool, don't feature in shipped product positioning."
- Gantt post-early-exit timeline tightened from "~7-8 week" to "~8 week": the math (15d sprint 0, exit at day 8, save 7d = 1.4 weeks, 9 - 1.4 = 7.6 ≈ 8) made 7 aspirational.

Explicitly NOT applied:
- Authorship block reframing for pgsql-hackers audience (a reviewer's judgment-call suggestion; the current AI-authorship transparency is a deliberate choice by @NikolayS).
- Changelog v0.5 entry (h) defensive-tone trim (cosmetic).
- R1's Sprint 0 execution tips (READ ONLY hooks, VACUUM pg_attribute, S3-vs-local latency isolation): these belong in the execution thread, not the spec.

Per round-5 consensus: stop iterating, start Sprint 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
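The Gantt arithmetic quoted in this commit message can be re-derived in a few lines. This is only a check of the stated numbers in Python; the 5-day working week is an assumption implied by "7d = 1.4 weeks", and the variable names are illustrative.

```python
WORKDAYS_PER_WEEK = 5      # assumption: 7 days saved is stated as 1.4 weeks

plan_weeks = 9             # full phased PoC plan
sprint0_days = 15          # Sprint 0 budget (3 weeks)
early_exit_day = 8         # day on which an early exit would happen

saved_days = sprint0_days - early_exit_day                  # 7 days
saved_weeks = saved_days / WORKDAYS_PER_WEEK                # 1.4 weeks
remaining_days = plan_weeks * WORKDAYS_PER_WEEK - saved_days
remaining_weeks = remaining_days / WORKDAYS_PER_WEEK        # 7.6 weeks

print(saved_days, saved_weeks, remaining_weeks, round(remaining_weeks))
```

Since 7.6 rounds to 8, a "~7-8 week" range overstated how short the early-exit timeline could be.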

Self-contained ~20 second reproducer that spins up a local PG18 primary + archive-only standby, creates a logical slot, triggers G3 invalidation via pg_statistic dead-tuple generation + VACUUM, and captures the Heap2/PRUNE_ON_ACCESS WAL record with its snapshotConflictHorizon.

Verified working with both test_decoding and pgoutput plugins: same ~3s time-to-invalidation, same WAL trigger record, demonstrating the G3 mechanism is slot-level (not plugin-dependent).

Usage:
    ./blueprints/repro_g3.sh          # run experiment
    ./blueprints/repro_g3.sh cleanup  # tear down clusters + files

Requires PostgreSQL 18 server + client (pgdg apt packages), unprivileged user, ~300 MB disk under /tmp/sprint0-repro/. No sudo needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Demonstrates the working US-2 approach: pre-stage archive up to the segment containing L_start, start standby with restore_command pointing at gated archive, let replay pause at archive end, create slot (restart_lsn pins at L_start), then release subsequent segments to advance replay through the window.

Tested: DELETE with REPLICA IDENTITY FULL is decoded with full old-tuple data, enabling reconstruction of the deleted rows.

This contradicts an earlier finding that called US-2 unimplementable: the prior attempts used recovery_target_lsn=L_start + action=pause, which blocks snapbuild, not the gated-archive variant demonstrated here.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
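The "pre-stage up to the segment containing L_start" step comes down to mapping an LSN to a WAL segment file name and splitting the archive listing around it. A minimal Python sketch follows. It assumes the default 16 MB segment size and a single timeline; the helper names and the example LSN are hypothetical and are not taken from the reproducer script.

```python
# Sketch: decide which archived segments to pre-stage for the gated archive.
# Assumptions: default 16 MB WAL segments, one timeline, illustrative names.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text: str) -> int:
    """Convert 'X/Y' (hex halves) into a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_segment_name(lsn: int, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """24-hex-digit segment file name that contains `lsn`."""
    segno = lsn // segment_size
    segs_per_xlogid = 0x100000000 // segment_size
    return (f"{timeline:08X}"
            f"{segno // segs_per_xlogid:08X}"
            f"{segno % segs_per_xlogid:08X}")


def split_archive(segment_names, l_start: str, timeline: int = 1):
    """Return (prestage, held_back): segments up to and including the one
    containing L_start, and the later ones to release afterwards."""
    gate = wal_segment_name(parse_lsn(l_start), timeline)
    # Keep plain segment files only; skip .history / .backup / .partial.
    ordered = sorted(n for n in segment_names if len(n) == 24)
    prestage = [n for n in ordered if n <= gate]
    held_back = [n for n in ordered if n > gate]
    return prestage, held_back


archive = [
    "000000010000000000000003",
    "000000010000000000000004",
    "000000010000000000000005",
    "000000010000000000000006",
    "00000002.history",
]
prestage, held_back = split_archive(archive, "0/5000028")  # hypothetical L_start
print(prestage, held_back)
```

Lexicographic ordering of segment names is only meaningful within one timeline, so an archive that spans a promotion would need a timeline-aware comparison.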

Demonstrates that US-2 (windowed logical extraction) is achievable even when the primary is dead at recovery time, provided a production-side backgrounder has been recording pg_log_standby_snapshot() calls at quiet moments into the WAL archive.

The recipe:
1. Gate the archive to stop BEFORE the segment containing the quiet-moment path-(a) running_xacts record.
2. Start the standby (it stops at archive end).
3. Launch slot creation (it blocks waiting for snapbuild consistency).
4. Release the snapshot's segment mid-slot-creation: the standby's restore_command retries pick it up, replay advances into the segment, snapbuild reads the path-(a) record forward from restart_lsn and hits CONSISTENT immediately. Slot creation completes in ~1 second with restart_lsn pinned at the quiet-moment LSN.
5. Release the remaining segments and extract the full decoded window.

Tested under 30s sustained OLTP + primary killed before recovery. Decoded output: 2906 changes including 51 DELETEs from the accident.

This contradicts the earlier "primary must be reachable during recovery" finding: the pre-positioned quiet-moment snapshot IS visible to snapbuild if you gate the archive to place restart_lsn BEFORE the snapshot LSN, not after.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Sprint 0 was executed on a lab VM (PG18 + PG17 cross-validated) and produced ~35 comments of raw evidence on issue #25 plus three committed reproducer scripts. v0.6 incorporates those findings into the blueprint.

Major changes:
- Status line updated: "Post Sprint 0 execution — findings incorporated; pgsql-hackers RFC draft ready"
- US-2 recipe in §2 REWRITTEN. The v0.5 recovery_target_lsn+pause recipe does not work in practice: creating a slot on a paused standby blocks snapbuild indefinitely because no forward WAL arrives. Replaced with the gated-archive + quiet-moment-snapshot recipe validated in Sprint 0. The new recipe works with the PRIMARY DEAD during recovery provided production has recorded periodic pg_log_standby_snapshot() calls.
- New §10 Sprint 0 Execution Findings section with:
  - Gate-by-gate results (G1 PASS, G2 PASS, G3 FAILS, G4 PASS)
  - G3 MTTI table across 5 workload regimes
  - US-2 working recipe validation with raw decode output
  - US-1 verdict: requires core patch (elevated to Future Work §8.1)
  - Outcome determination: Outcome B is the shipping scope
- New §11 Production-Side Prerequisites documenting the pg_log_standby_snapshot() backgrounder as a hard requirement for US-2.
- References §9: added pointer to issue #25 as source of full raw evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Given an archive directory, reports two key LSNs:
- path-(a) anchor: earliest RUNNING_XACTS record where snapbuild can bootstrap a slot with oldestRunningXid == nextXid
- MTTI ceiling: first Heap2/PRUNE_ON_ACCESS on a catalog relation at or after the anchor, which invalidates any logical slot replaying past it

Delta between them is the archive's practical US-2 window, in bytes.

Sprint 1 placeholder for the "pg_waldump --find-first-catalog-prune" tooling ask called out in the Sprint 0 consolidation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
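The "delta in bytes" between the two LSNs is plain 64-bit subtraction once each `X/Y` pair is converted to an integer. The Python sketch below illustrates that step only. It is not the tool's code, and the two LSNs in the example are made up.

```python
# Sketch: size of the US-2 window between the path-(a) anchor and the
# MTTI ceiling. Function names and example LSNs are illustrative assumptions.

def lsn_to_int(text: str) -> int:
    """'X/Y' -> (X << 32) | Y, both halves hexadecimal."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def window_bytes(anchor_lsn: str, ceiling_lsn: str) -> int:
    """Bytes of WAL a slot created at the anchor can replay before the
    first catalog prune at or after it."""
    delta = lsn_to_int(ceiling_lsn) - lsn_to_int(anchor_lsn)
    if delta < 0:
        raise ValueError("ceiling precedes anchor; no usable window")
    return delta


size = window_bytes("0/3000060", "0/5000028")  # hypothetical LSNs
print(f"US-2 window: {size} bytes (~{size / (1024 * 1024):.1f} MiB)")
```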

Logical slots are per-database; a catalog prune in database A does not invalidate a slot in database B. The prior version over-predicted by reporting the first catalog prune in ANY database.

Add --db <oid> to restrict the scan. Without --db, behavior is unchanged (conservative) and the output notes the filter is off.

Validated against a real failed-recovery archive:
- Without --db: ceiling at 0/29009D98 (rel 1663/1/2619, db=1 template1)
- With --db 5: ceiling at 0/33004780 (rel 1663/5/2619, db=5 postgres)

Actual invalidation during standby replay was at 0/33004780; the --db 5 prediction matches exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
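The effect of the `--db` filter can be modelled in Python using the two prune records quoted above. This sketch makes two labelled assumptions: records arrive in WAL order, and a relation counts as a catalog when its relfilenode is below 16384 (which stops holding once a catalog has been rewritten). The real script may detect catalogs differently.

```python
# Sketch: first catalog prune, optionally restricted to one database OID.
# Assumptions: records are in WAL order; catalog == relfilenode < 16384.
FIRST_NORMAL_OBJECT_ID = 16384


def first_catalog_prune(records, db_oid=None):
    """records: iterable of (lsn, 'tablespace/database/relfilenode').
    Returns the LSN of the first matching catalog prune, or None."""
    for lsn, rel in records:
        _tablespace, database, relfilenode = (int(p) for p in rel.split("/"))
        if relfilenode >= FIRST_NORMAL_OBJECT_ID:
            continue                      # user relation, not a catalog
        if db_oid is not None and database != db_oid:
            continue                      # prune in another database
        return lsn
    return None


prunes = [
    ("0/29009D98", "1663/1/2619"),   # pg_statistic in template1 (db=1)
    ("0/33004780", "1663/5/2619"),   # pg_statistic in postgres (db=5)
]
print(first_catalog_prune(prunes))            # conservative, filter off
print(first_catalog_prune(prunes, db_oid=5))  # ceiling for a slot in db 5
```

With the filter off the earlier template1 record wins, which reproduces the over-prediction the commit describes.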

Three additions from continued issue #25 iteration:

1. §10.7: new static-analysis tool wal_archive_ceiling.sh, end-to-end validated against a real failed-recovery archive (predicted LSN, snapshotConflictHorizon, rel/blk all match standby's dynamic invalidation exactly).
2. §10.2 table: two new data points quantifying the US-2 ceiling under sustained 300s OLTP: baseline (default autovacuum) invalidates at t+138s; tuned primary (autovacuum_naptime=600s) survives the full window and drains 30,413 rows.
3. §11.5: formalize the operator-facing rule: US-2 window is controlled by primary-side autovacuum, not any standby GUC. Rule of thumb: autovacuum_naptime >= 2 x desired-window-seconds.

Also captures the per-database specificity of InvalidatePossiblyObsoleteSlot (slot->data.database gate) which caused an initial over-prediction in the tool before --db <oid> was added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
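The §11.5 rule of thumb is easy to encode as a pre-flight check. The Python sketch below is illustrative and uses hypothetical function names. The 60-second figure is PostgreSQL's default `autovacuum_naptime`; the other inputs are the two data points quoted above.

```python
# Sketch of the operator-facing rule: autovacuum_naptime >= 2 x window.
# Function names are illustrative; values are in seconds.
DEFAULT_AUTOVACUUM_NAPTIME_S = 60  # PostgreSQL default (1min)


def min_naptime_for_window(window_s: int) -> int:
    """Smallest primary-side autovacuum_naptime the rule accepts."""
    return 2 * window_s


def naptime_satisfies_rule(naptime_s: int, window_s: int) -> bool:
    return naptime_s >= min_naptime_for_window(window_s)


window = 300  # the sustained-OLTP window used in the §10.2 data points
print(naptime_satisfies_rule(DEFAULT_AUTOVACUUM_NAPTIME_S, window))  # baseline
print(naptime_satisfies_rule(600, window))                           # tuned primary
```

The rule is a heuristic about when autovacuum first prunes a catalog, so it indicates likely survival of the window rather than guaranteeing it.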

When enabled and a PRUNE_ON_ACCESS WAL record on a catalog relation is about to cause InvalidateObsoleteReplicationSlots to invalidate at least one active logical slot in the affected database, pause recovery instead. The operator can then drain, advance, or drop the slot via hot-standby SQL and call pg_wal_replay_resume() to continue. On resume, the caller falls through to the normal invalidation path: if the slot is gone or advanced past the conflict horizon, invalidation is a no-op; otherwise the slot is invalidated as before.

Motivated by blueprints/LOGICAL_DECODING_ARCHIVED_WALS.md §4.2.3 / US-4 and the Sprint 0 findings in issue #25: an archive-only logical-decoding standby cannot feed hot_standby_feedback to its primary, so the primary has no back-pressure against catalog vacuuming. Any logical slot on such a standby is invalidated the first time replay applies a catalog prune whose snapshotConflictHorizon exceeds the slot's catalog_xmin. MTTI under default autovacuum on the primary is ~2 * autovacuum_naptime.

Hooks in at ResolveRecoveryConflictWithSnapshot right before the existing InvalidateObsoleteReplicationSlots call, which is the single point the recovery path reaches on a logical-slot-relevant conflict. Reuses the existing recoveryNotPausedCV and SetRecoveryPause API, so this integrates with pg_wal_replay_pause/resume and recovery_target_action = pause without new shared-memory state.

Default is off (unchanged behavior). PGC_SIGHUP so operators can flip it on an already-running standby if they decide to retrofit pause semantics mid-incident.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Without this, monitoring tools see pg_get_wal_replay_pause_state() stuck at "pause requested" forever — the transition to "paused" normally happens in the main replay loop, but MaybePauseOnLogicalSlotConflict blocks inside ResolveRecoveryConflictWithSnapshot before returning to that loop. Call ConfirmRecoveryPaused() inside our local wait loop so the pause shows up correctly to SQL-level observers. Also drop the static qualifier on ConfirmRecoveryPaused in xlogrecovery.c and expose it in xlogrecovery.h for use by standby.c. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Closes the last gap for US-1 continuous CDC from an archive-only logical-decoding standby.

Before this, MaybePauseOnLogicalSlotConflict paused correctly, the operator could drain the slot, but on resume the fall-through to InvalidateObsoleteReplicationSlots still invalidated the slot, because the drain's logical-decoding machinery has no way to advance catalog_xmin past the conflict horizon (the conflict record hasn't been replayed yet, so the xids involved are still considered potentially-active by snapbuild).

After resume, scan slots in the conflicted database for those whose confirmed_flush_lsn has reached (or passed) the pause LSN, i.e., the operator drained up to the conflict. For each such slot, advance both catalog_xmin and xmin past the conflict horizon (using TransactionIdAdvance so the value is strictly > horizon, satisfying DetermineSlotInvalidationCause's TransactionIdPrecedesOrEquals check). Slots the operator did NOT drain are left untouched and invalidated normally; the old "I'll let the slot die" path is preserved.

End-to-end verified against a 300s sustained-OLTP archive with two catalog prune events:
- Without this change: slot invalidated on first pause's resume.
- With this change: both pauses handled, 45,469 decode events extracted (exactly 3x the 15,153 workload INSERTs = BEGIN/INSERT/COMMIT each), slot final state wal_status=reserved, no invalidation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

WIP: functional assertions are in place but the archive-restore timing on init_from_backup(has_restoring => 1) is flaky: the standby stalls waiting for a segment that was written to the primary's pg_wal but not yet archived at the moment init_from_backup snapshots archive_command state. Saving the skeleton so reviewers can see the intended assertions and iterate on the timing harness.

The equivalent workflow has already been validated end-to-end via /tmp/us1_v2.sh (45,469 events decoded, 2 pauses handled, slot survived; documented in issue #25 comment 4261479941); this is about replicating that in Perl TAP form for upstream CI.

Known failure mode this revision: PostgreSQL::Test::Cluster's standby init with has_restoring=1 polls for replay progress before the last pre-backup segment has landed in the archive. Need to either
* wait for `pg_stat_archiver.last_archived_wal` to match primary's current LSN before taking the backup, or
* use has_streaming=1 for the initial catchup and swap to restoring-only mid-test.

Not blocking the patch: the prototype still stands on the bash demo.

This revision of the TAP test surfaces an interaction that wasn't visible in the bash demo at /tmp/us1_v2.sh:

When the catalog-prune WAL record is already in the archive by the time the standby starts replaying, slot creation can block in DecodingContextFindStartpoint waiting for more WAL *just as* replay reaches the prune record and MaybePauseOnLogicalSlotConflict fires. The result is a deadlock: slot creation waits for replay to advance, replay is paused waiting for the slot to be drained, but the slot has not yet reached SNAPBUILD_CONSISTENT and is not drainable.

Two candidate fixes for the patch, for followup work:

1. In MaybePauseOnLogicalSlotConflict, skip slots where the effective_catalog_xmin is still InvalidTransactionId (slot not yet consistent). Rationale: an in-progress slot hasn't produced output, so invalidating it is fine; it gets retried.
2. Decouple slot-creation from the startup process more deeply so a paused recovery can still make the tiny bit of progress DecodingContextFindStartpoint needs.

Updated top-of-file doc reflects the edge case and points reviewers at the bash demo as the current authoritative proof of the US-1 path.

Closes the deadlock window the TAP test scaffold surfaced in a5844b0.

Before: MaybePauseOnLogicalSlotConflict pauses replay for any slot in the affected database whose data.catalog_xmin precedes the conflict horizon, including slots still inside DecodingContextFindStartpoint, whose catalog_xmin has been preliminarily assigned but whose snapbuild has not yet hit SNAPBUILD_CONSISTENT (effective_catalog_xmin is still InvalidTransactionId). Pausing for such a slot deadlocks: snapbuild needs replay to advance to find the path-(a) anchor, but replay is paused waiting for the slot to be drained, and the slot is not drainable.

After: also require effective_catalog_xmin to be valid. An in-progress slot is allowed to be invalidated; the caller just retries creation. A slot that has reached SNAPBUILD_CONSISTENT (and hence produced output a consumer may have committed) still pauses.

Expected impact on the US-1 happy path: none. In the bash reproducer at /tmp/us1_v2.sh the slot is always consistent by the time a prune record is replayed, because of the gated-archive ordering.

Expected impact on the TAP harness: the startup process no longer hangs when it races into a prune record before slot creation has completed. The TAP test still needs a two-phase restructure (create slot BEFORE the catalog-churn workload on the primary) to reach a passing state; that's tracked in the a5844b0 commit message.

Two fixes, tightly related:

1. standby.c MaybePauseOnLogicalSlotConflict: use TransactionIdPrecedesOrEquals to match the semantics of DetermineSlotInvalidationCause. Otherwise a slot whose catalog_xmin was just advanced past horizon H by our own resume code (catalog_xmin now == H+1) will NOT pause when the next prune arrives with horizon == H+1, yet it WILL still be invalidated by the fall-through InvalidateObsoleteReplicationSlots call, since that function uses PrecedesOrEquals. Off-by-one caused slot invalidation one prune after a successful drain-and-advance cycle.

2. TAP test 050 restructured into an explicit two-phase flow:
   - Phase 1: basebackup + quiet-moment pg_log_standby_snapshot, wait for that segment to archive, then start standby and create slot. Guaranteed no prune records in the archive yet, so slot reaches SNAPBUILD_CONSISTENT in seconds.
   - Phase 2: run catalog churn on the primary (table create+drop iterations, ANALYZE x2, VACUUM on catalog relations), wait for those segments to archive, let the standby replay through them. The GUC pauses on each prune; orchestrator drains and resumes.

Test result: **all 5 assertions pass**, 22 pause events handled, 3092 decoded events, zero invalidations, slot final wal_status=reserved. Runs in ~36 wallclock seconds on a modest VM.

    [17:31:23] ok 1 - recovery_pause_on_logical_slot_conflict GUC is registered
    [17:31:37] ok 2 - slot created cleanly in Phase 1 (state: reserved)
    [17:31:58] ok 3 - slot survived catalog prune with GUC on
    [17:31:58] ok 4 - at least one pause event was handled (22 seen)
    [17:31:58] ok 5 - at least 2000 decoded events (3092 got)
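The off-by-one in fix 1 is easier to see with the transaction-ID comparison rules written out. The sketch below is a Python model of PostgreSQL's modulo-2^32 semantics for `TransactionIdPrecedes`, `TransactionIdPrecedesOrEquals` and `TransactionIdAdvance`. It is not the patch's C code, and the horizon value is arbitrary.

```python
# Python model of xid comparison semantics (not the patch itself).
FIRST_NORMAL_XID = 3          # xids 0..2 are special in PostgreSQL
MASK = 0xFFFFFFFF


def xid_precedes(a: int, b: int) -> bool:
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a < b                       # special xids compare plainly
    return ((a - b) & MASK) >= 0x80000000  # signed 32-bit diff < 0


def xid_precedes_or_equals(a: int, b: int) -> bool:
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        return a <= b
    diff = (a - b) & MASK
    return diff == 0 or diff >= 0x80000000  # signed 32-bit diff <= 0


def xid_advance(x: int) -> int:
    x = (x + 1) & MASK
    return x if x >= FIRST_NORMAL_XID else FIRST_NORMAL_XID  # skip special xids


horizon = 1000                          # arbitrary conflict horizon H
catalog_xmin = xid_advance(horizon)     # resume code moves the slot to H+1
next_horizon = horizon + 1              # next prune arrives with horizon H+1

# Strict check (the bug): no pause, yet the invalidation check still fires.
print(xid_precedes(catalog_xmin, next_horizon))            # pause decision
print(xid_precedes_or_equals(catalog_xmin, next_horizon))  # invalidation decision
```

Using the same comparison on both sides removes the window in which a slot is skipped by the pause check but still caught by invalidation.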

§4.2.3 / §8.1 "slot-aware replay throttling (mechanism TBD)" is now a concrete 215-line prototype with a passing TAP test on the branch.

Main additions:
- New §12 "Sprint 1 Core Patch: recovery_pause_on_logical_slot_conflict": describes the hook point in ResolveRecoveryConflictWithSnapshot, the pause+wait+advance flow, edge cases handled (in-progress slot deadlock avoidance, PrecedesOrEquals semantics), validation (TAP passing, bash end-to-end 45k events at 100% coverage, 102-test regression sweep), and files touched.
- Changelog row for 0.8: compresses the 5-commit arc (2d70df8 → 8a3b95d → 8761b6e → bbd5d4e → 7d16094) into the milestone's record. Outcome determination upgraded from B (forensic-only) to A (continuous US-1 viable).

Existing sections (§10 Sprint 0 findings, §11 production-side prereqs) unchanged; they are still accurate for unpatched PG. §12 adds the "with the patch" story on top.

Adds a second standby node (standby_off) brought up in the same Phase 1 as the main standby (quiet archive, no prune records yet). Both standbys reach SNAPBUILD_CONSISTENT cleanly and create their slots.

Phase 2's catalog churn then hits both standbys. The GUC-on one pauses and drains (existing assertions); the GUC-off one invalidates (the existing PG upstream behavior, unchanged by this patch).

New assertions:

    ok 3 - baseline slot created cleanly in Phase 1
    ok 7 - baseline (GUC off): slot invalidates as expected

Before/after comparison now runs in the same harness. If GUC-off ever stops invalidating, either the test stopped triggering the conflict or the patch accidentally benefits the off-path; both are real regressions to catch.

Test summary: 7/7 tests pass, 39 wallclock seconds.

    [17:55:58] ok 1 - recovery_pause_on_logical_slot_conflict GUC is registered
    [17:56:14] ok 2 - slot created cleanly in Phase 1 (state: reserved)
    [17:56:14] ok 3 - baseline slot created cleanly in Phase 1 (state: reserved)
    [17:56:35] ok 4 - slot survived catalog prune with GUC on (state: reserved|)
    [17:56:35] ok 5 - at least one pause event was handled (22 seen)
    [17:56:35] ok 6 - at least 2000 decoded events (3092 got)
    [17:56:35] ok 7 - baseline (GUC off): slot invalidates as expected under catalog prune

Stores the draft email body for posting recovery_pause_on_logical_slot_conflict to pgsql-hackers (as blueprints/pgsql_hackers_rfc_email.md) alongside the squashed 2-commit patch series (in blueprints/rfc-v1-patches/). Patch series also lives on the rfc-v1-recovery-pause-on-slot-conflict branch of this fork for reviewers who prefer git over mailing-list attachments. The 5 prototype commits on this branch (2d70df8, 8a3b95d, 8761b6e, bbd5d4e, 7d16094) are squashed into a single 0001-Pause-recovery-on-logical-slot-conflict.patch in the series; the TAP test commits (7d16094 + 253ad28) are squashed into 0002-Add-TAP-test-for-recovery_pause_on_logical_slot_conf.patch.

Three fixes from an adversarial review pass:

1. (HIGH) Add PromoteIsTriggered() check in the wait loop so pg_promote() while paused doesn't stall. Existing recoveryPausesHere uses CheckForStandbyTrigger, which is static in xlogrecovery.c; PromoteIsTriggered is the exposed equivalent.

2. (HIGH) Filter out synced slots (data.synced == true) in both the pause-check scan and the advance scan. Writing to a synced slot from the startup process would race the slot-sync worker. Also, the error hint "drain, advance, or drop the slot" does not apply to synced slots: ALTER / DROP_REPLICATION_SLOT on a synced slot ERROR out at slot.c:932 / :982.

3. (MEDIUM → documented) The advance marks slots dirty but does not force a restartpoint, so a crash between resume and the next restartpoint loses the advance. Added a note in the code. On restart, we re-encounter the same conflict record, re-pause, and the operator re-drains: idempotent, no data loss, but a hiccup. A proper fix needs SaveSlotToPath to be callable from the startup process; that is out of scope for this prototype.

Test result: 050 TAP still 7/7 passing in 37s. The fixes add 3 new safety edges without breaking the happy path.
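Taken together, the commit messages describe four conditions a slot must meet before replay pauses for it: same database as the conflict, not a synced slot, snapbuild already consistent, and catalog_xmin at or before the horizon. The Python sketch below models that predicate. The class and function names are hypothetical, xid wraparound is deliberately ignored, and the actual C code in standby.c may order or express the checks differently.

```python
# Model of the pause-eligibility predicate described in the commit messages.
# Assumptions: illustrative names; plain integer xid comparison (no wraparound).
from dataclasses import dataclass

INVALID_XID = 0  # PostgreSQL's InvalidTransactionId


@dataclass
class SlotModel:
    database: int                 # database OID the logical slot belongs to
    catalog_xmin: int
    effective_catalog_xmin: int   # INVALID_XID until snapbuild is consistent
    synced: bool = False          # True for slots owned by the slot-sync worker


def should_pause_for(slot: SlotModel, conflict_db: int, horizon: int) -> bool:
    if slot.database != conflict_db:
        return False   # prune in another database cannot invalidate this slot
    if slot.synced:
        return False   # startup process must not write to synced slots
    if slot.effective_catalog_xmin == INVALID_XID:
        return False   # still being created: let it be invalidated and retried
    if slot.catalog_xmin == INVALID_XID:
        return False
    return slot.catalog_xmin <= horizon   # PrecedesOrEquals, sans wraparound


slots = [
    SlotModel(database=5, catalog_xmin=900, effective_catalog_xmin=900),
    SlotModel(database=5, catalog_xmin=900, effective_catalog_xmin=INVALID_XID),
    SlotModel(database=5, catalog_xmin=900, effective_catalog_xmin=900, synced=True),
    SlotModel(database=1, catalog_xmin=900, effective_catalog_xmin=900),
]
print([should_pause_for(s, conflict_db=5, horizon=1000) for s in slots])
```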

Reflects commit e896c73 (promote check, synced-slot filter, durability-gap doc comment) rolled into the squashed patches. Both patches still apply cleanly to upstream postgres/postgres master (commit 191a037 as of this writing).

Regression sweep on the fresh tree includes:

    t/050_recovery_pause_on_slot_conflict.pl .. ok (7/7 asserts)
    t/035_standby_logical_decoding.pl ......... ok
    t/040_standby_failover_slots_sync.pl ...... ok (synced-slot filter)
    t/044_invalidate_inactive_slots.pl ........ ok
    (135 tests total, PASS)

0001 now 216 → 254 insertions (the three safety fixes). 0002 unchanged in substance.

The rfc-v1-recovery-pause-on-slot-conflict branch is force-pushed to the new squashed pair (fb514d2 + a16bce1).

Expands the "Design notes" section from 3 edge cases to 6, folding in the three safety fixes caught during adversarial review: (4) PromoteIsTriggered escape, (5) synced-slot filter, (6) durability gap — the latter flagged as known-limitation / deferred with a brief explanation of why it's not catastrophic (idempotent re-drain on restart, no data loss) and why the proper fix is out of scope for the prototype (SaveSlotToPath is static, requires MyReplicationSlot). No other changes to the email text. The email body is still the candidate send-text pending human sanity-check before posting.

The previous version checked PromoteIsTriggered() in the wait loop, which only reads the LocalPromoteIsTriggered cache. The cache is populated by CheckForStandbyTrigger(), which does the real work: detect PROMOTE_SIGNAL_FILE, unlink it, call SetPromoteIsTriggered(). Without calling CheckForStandbyTrigger, LocalPromoteIsTriggered stays false forever, so pg_promote()'s signal went unnoticed in our loop.

Empirically verified the bug and fix:
- Before (PromoteIsTriggered only): pg_promote(wait => true, wait_seconds => 30) returns FALSE after 30 seconds, standby still in_recovery.
- After (CheckForStandbyTrigger): pg_promote returns TRUE in <1 second, standby is promoted.

Raw test run from /tmp/verify_promote.sh:

    [18:56:24] wait for standby to enter pause
    [18:56:29] pause state: paused
    [18:56:29] *** calling pg_promote() now ***
    [18:56:30] pg_promote returned in 0.937961725s
    [18:56:30] pg_is_in_recovery post-promote: f
    [18:56:30] PASS: promote escaped the pause in under 10s

CheckForStandbyTrigger was static in xlogrecovery.c. This commit makes it extern (drop the static, add the extern in xlogrecovery.h). Mirrors the existing recoveryPausesHere() escape loop in the same file.

Two additions:

1. **Promote-during-pause assertions** (ok 8, 9, 10). Brings up a third standby with the GUC on, waits for it to enter the paused state, then calls pg_promote(wait=>true, wait_seconds=>30) and asserts:
   - promote returned true
   - promote completed in under 10 seconds

   Without the CheckForStandbyTrigger() escape in the wait loop (the fix in ee42817), the standby stays paused for 30s and pg_promote returns false. This test guards against regression of that fix.

2. **Phase-1 WAL stabilization**. Previously the Phase-1 sequence was pg_log_standby_snapshot + pg_switch_wal. The first post-backup segment would archive and standbys could start, but slot creation on them could block in DecodingContextFindStartpoint 'waiting for WAL to become available' at segment N+1 (the active-but-not-yet-archived segment on primary). Adding a second pg_log_standby_snapshot + pg_switch_wal after the first one gives snapbuild enough forward WAL to decide the slot is consistent without waiting for primary activity. Eliminates the intermittent 'slot creation on standby_off timed out' flake seen in earlier runs.

Test result: 10/10 assertions pass, ~30 wallclock seconds (down from ~40s, and without the flake).

    ok 1 - GUC registered
    ok 2 - slot created cleanly in Phase 1 (state: reserved)
    ok 3 - baseline slot created cleanly in Phase 1 (state: reserved)
    ok 4 - slot survived catalog prune with GUC on (state: reserved|)
    ok 5 - at least one pause event was handled (18 seen)
    ok 6 - at least 2000 decoded events (3094 got)
    ok 7 - baseline (GUC off): slot invalidates as expected
    ok 8 - promote-test standby reached paused state before promotion
    ok 9 - pg_promote returned true while standby was paused by GUC
    ok 10 - pg_promote completed in under 10s (actual: 1s)

…10-test TAP

Reflects commits ee42817 (promote-escape actually working), 68b62ce (TAP hardening + promote-during-pause coverage).

v3 patches:
- 0001: 265 insertions (was 256 in v2)
- 0002: 296 insertions (was 247 in v2)

Both patches still apply cleanly to upstream postgres/postgres master. Test: 10/10 passing, ~30s runtime.

The rfc-v1-recovery-pause-on-slot-conflict branch was force-pushed to the new squashed pair (87e7c8f + d63deab).

The July 2025 "Requested WAL segment has already been removed" thread was authored by Japin Li, not Kukushkin. Kukushkin is a discussion participant alongside Fujii Masao et al.; his contribution was mailing-list commentary, not a patch.

Fixed in four live references:
- §1 "What exists today" bullet
- §8.2 Future Work entry #2 (walsender restore_command integration)
- §9 Reference #2
- §10.4 US-1 verdict prior-art note

Changelog entries for v0.1 and v0.2 left as frozen history; their misattribution is preserved as part of the iteration record, not rewritten.

Sawada (dynamic wal_level PoC, §9 ref #7) and Amul Sul (pg_waldump tarfile support, §9 ref #8) verified correctly attributed: both are thread originators / patch authors on their respective pgsql-hackers threads, no change needed.

danielgustafsson and others added 30 commits April 6, 2026 01:55

NikolayS and others added 28 commits April 22, 2026 11:19

NikolayS force-pushed the blueprint/logical-decoding-archived-wals branch from 2731368 to a485e95 Compare April 22, 2026 18:19

NikolayS force-pushed the master branch from 34be85f to 1f62dbf Compare May 17, 2026 07:48

NikolayS wants to merge 2753 commits into master from blueprint/logical-decoding-archived-wals


20 participants
