Skip to content

Add index-only scan counters to index stats#35

Open
NikolayS wants to merge 2752 commits into
masterfrom
ios-index-stats
Open

Add index-only scan counters to index stats#35
NikolayS wants to merge 2752 commits into
masterfrom
ios-index-stats

Conversation

@NikolayS
Copy link
Copy Markdown
Owner

Adds cumulative index-only scan counters to pg_stat_all_indexes:

  • idx_only_scan — number of index-only searches on the index
  • idx_only_tup_read — index entries returned by index-only scans
  • idx_only_heap_fetch — index-only scan visibility-map misses requiring heap visits

Implementation notes:

  • Reuses index scan instrumentation to count IOS searches.
  • Counts IOS index entries in nodeIndexonlyscan.c alongside index_getnext_tid().
  • Counts IOS heap fetches at the existing EXPLAIN Heap Fetches hook point.
  • Adds regression coverage for both all-visible IOS (heap_fetch = 0) and VM-miss IOS (heap_fetch = 1).

Local validation:

  • ninja -C build -j2
  • meson test -C build regress/regress --print-errorlogs — 245 subtests passed
  • git diff --check

michaelpq and others added 30 commits April 6, 2026 08:51
Commit 7c64d56 has removed the isolation test providing coverage
for lock statistics due to some instability in the CI, where the
deadlock timeout may not have enough time to process, preventing the
stats data to be updated.  These also relied on a set of hardcoded
sleeps.

This commit switches the test suite to TAP, instead, that uses an
injection point with a wait to avoid the sleeps.  The injection point is
added in ProcSleep(), once we know that the deadlock timeout has fired
and that the stats have been updated.

Multiple lock patterns are checked, all rely on the same workflow, with
two sessions:
- session 1 holds a given lock type.
- session 2 attaches to the new injection point with the wait action.
- session 2 attempts to acquire a lock conflicting with the lock of
session 1, waiting for the injection point to be reached.
- session 1 releases its lock, session 2 commits.
- pg_stat_lock is polled until the counters are updated for the lock
type.

Bertrand's version of the patch introduced a new routine to
BackgroundPsql() to detect the blocked background sessions.  I have
tweaked the test so as we use the same method as some of the other tests
instead, based on some \echo commands.  This test has been run multiple
times in the CI, all passing, so I'd like to think that this is more
stable than the first version attempted.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/acNTR1lLHwQJ0o+P@ip-10-97-1-34.eu-west-3.compute.internal
If the background worker for processing databases manages to finish
before the launcher starts waiting for it, the launcher would treat
it erroneously as an error.  Fix by ensureing to check result state
in this case.  Identified on CI and synthetically reproduced during
local testing.

Also while, make sure to properly lock the shared memory structure
before updating tje result state.

Author: Daniel Gustafsson <daniel@yesql.seA
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4
Postcommit review and buildfarm/CI failures revealed a few issues in
the test code which this commit attempts to resolve.  These failures
are verified using synthetic means.

  * Wait for launcher exit in enable/disable checksum tests

    When enabling or disabling data checksums in a test with waiting
    for an end state (on or off), the test typically want to perform
    more test against the cluster immediately. Make sure to wait for
    the launcher to exit in these cases before returning in order to
    know it can immediately be acted on.  This is a more generic way
    of implementating 0036232.

  * Refactor injection point tests to use the injection_points test
    extension. Two injection points added for online checksums were
    better expressed using the injection_points extension with the
    test code embedded in datachecksum_state.c.

  * Make tests less timing dependent and allow transitions to "on"
    and not just "inprogress-on" in case a test manages to finish
    before it's checked for state.

  * When waiting on a blocking background psql keeping a temporary
    table open, the test first closed the background session abd
    then the server.  This could cause data checksums to manage to
    get enabled in the brief window between dropping the temporary
    table and closing the server.  Fix by closing the server first
    before the background session.

  * Remove a few superfluous duplicate checks and general cleanup
    of comments as well as making LSN logging consistent.

These issues were reported by Andres as well as spotted in the
buildfarm and on CI.

Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se
On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p
is not available. Use pg_integer_constant_p and add appropriate
guards. There is a similar potential hazard for the x86 path, but
for now let's get the buildfarm green.

Oversight in commit fbc57f2, per buildfarm member hoatzin.
…ation

Previously, during shutdown, walsenders always waited until all pending data
was replicated to receivers. This ensures sender and receiver stay in sync
after shutdown, which is important for physical replication switchovers,
but it can significantly delay shutdown. For example, in logical replication,
if apply workers are blocked on locks, walsenders may wait until those locks
are released, preventing shutdown from completing for a long time.

This commit introduces a new GUC, wal_sender_shutdown_timeout,
which specifies the maximum time a walsender waits during shutdown for all
pending data to be replicated. When set, shutdown completes once all data is
replicated or the timeout expires. A value of -1 (the default) disables
the timeout.

This can reduce shutdown time when replication is slow or stalled. However,
if the timeout is reached, the sender and receiver may be left out of sync,
which can be problematic for physical replication switchovers.

Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru>
Reviewed-by: Ronan Dunklau <ronan@dunklau.fr>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
When determining if it is safe to use an expression as a grouping key
for partial aggregation, eager aggregation relies on the B-tree
equalimage support function to ensure that equality implies image
equality.

Previously, the code incorrectly passed the default collation of the
expression's data type to the equalimage procedure, rather than the
expression's actual collation.  As a result, if a column used a
non-deterministic collation but the base type's default collation was
deterministic, eager aggregation would incorrectly assume that the
column was safe for byte-level grouping.  This could cause rows to be
prematurely grouped and subsequently discarded by strict join
conditions, resulting in incorrect query results.

This patch fixes the issue by passing the expression's actual
collation to the equalimage procedure.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
Pushing aggregates containing volatile functions below a join can
violate volatility semantics by changing the number of times the
function is executed.

Here we check the Aggref nodes in the targetlist and havingQual for
volatile functions and disable eager aggregation when such functions
are present.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
…ion test

Previously the stats.sql regression test used conditions like
"datname = (SELECT current_database())" to check the current database name.

The subquery is unnecessary, so this commit simplifies these expressions to
"datname = current_database()".

Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/A1535A8F-65AF-4C3D-ACBE-25891CB5D38B@gmail.com
Alexander Lakhin has noticed that it can be possible on machines with
slow storage to have the spawned workers be stuck in
initialize_worker_spi(), before they reach their main loop.  Waiting for
a flush to happen would block the interrupt attempts done by the
database commands, causing the test to fail on timeout once the number
of interrupt attempts is reached in CountOtherDBBackends().

This commit switches the test to wait for the spawned bgworkers to reach
their main loops before attempting the database commands that would
trigger the interrupts, napping for a time larger than the default, with
worker_spi.naptime set at 10 minutes.  Another thing that could be
attempted is to enforce a larger number of tries in
CountOtherDBBackends(), if what is done here is not enough.  Let's see
first if what this commit does is enough for the buildfarm members
widowbird and jay.

Analyzed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/f913fba1-da59-404c-9eb3-07c7304be637@gmail.com
Previously, one LWLock was used for each lock type, adding complexity
without an observable performance benefit as data is gathered only for
paths involving lock waits, at least currently.  This commit replaces
the per-type set of LWLocks with a single LWLock protecting the stats
data of all the lock types, like the stats kinds for SLRU or WAL.  A
good chunk of the callbacks get simpler thanks to this change.

The previous approach also had one bug in the flush callback when nowait
was called with "true": a backend iterating over all entries could
successfully flush some entries while skipping others due to contention,
then unconditionally reset the pending data.  This would cause some
stats data loss.

Oversight in 4019f72.

Reported-by: Tomas Vondra <tomas@vondra.me>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/1af63e6d-16d5-4d5b-9b03-11472ef1adf9@vondra.me
This module allows plan advice strings to be provided automatically
from an in-memory advice stash. Advice stashes are stored in dynamic
shared memory and must be recreated and repopulated after a server
restart. If pg_stash_advice.stash_name is set to the name of an advice
stash, and if query identifiers are enabled, the query identifier
for each query will be looked up in the advice stash and the
associated advice string, if any, will be used each time that query
is planned.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: http://postgr.es/m/CA+TgmoaeNuHXQ60P3ZZqJLrSjP3L1KYokW9kPfGbWDyt+1t=Ng@mail.gmail.com
Some compilers didn't like the empty initializer when compiled without
USE_INJECTION_POINTS. Per buildfarm member 'drongo', using Visual
Studio 2019.

Author: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/adNHcBVJO5gIOp1l@paquier.xyz
When freeing pending_shmem_requests we should also free the ->options.

Author: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://www.postgresql.org/message-id/CAJ7c6TN9tp8MTc0WXM0zfSWqjfBqU8gpe+o5KqHB1-cQ7409Kw@mail.gmail.com
Child processes do not need the postmaster's working memory context and
normally release it at the start of their main entry point. However,
the slotsync worker forgot to do so.

This commit makes the slotsync worker release the postmaster's working
memory context at startup, preventing unintended use.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Tiancheng Ge <getiancheng_2012@163.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwHO05JaUpgKF8FBDmPdBUJsK22axRRcgmAUc2Jyi8OK8g@mail.gmail.com
This commit updates 011_lock_stats.pl to verify log_lock_waits behavior.

The tests check that messages are emitted both when a wait occurs and
when the lock is acquired, and that the "still waiting for" message is logged
exactly once per wait, even if the backend wakes up during the wait.

The latter covers the behavior introduced by commit fd6ecbf.

Author: Hüseyin Demir <huseyin.d3r@gmail.com>
Co-authored-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAB5wL7YB1my9W5k5i=SY+=sTjeozyJ0YkvGXrVfeDNzuRkoTPg@mail.gmail.com
Previously, this logic was embedded within SplitIdentifierString,
SplitDirectoriesString, and SplitGUCList. Factoring it out saves
a bit of duplicated code, and also makes it available to extensions
that might want to do similar things without necessarily wanting to
do exactly the same thing.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
The restructuring in commit 53b8ca6 revealed an interesting
corner case: if a table needs vacuuming for wraparound prevention
and autovacuum is disabled for it, we might still choose to analyze
it.  Research seems to indicate this was an accidental addition by
commit 48188e1, and further discussion indicates there is
consensus that it is unnecessary and can be removed.

Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/adB9nSsm_S0D9708%40nathan
It would be useful to be able to tell auto_explain to set a custom
EXPLAIN option, but it would be bad if it tried to do so and the
option name or value wasn't valid, because then every query would fail
with a complaint about the EXPLAIN option. So add a guc_check_handler
that auto_explain will be able to use to only try to set option
name/value/type combinations that have been determined to be legal,
and to emit useful messages about ones that aren't.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
This code missed the need to update the combined state's
nullbitmap if state1 already had a bitmap but state2 didn't.
We need to extend the existing bitmap with 1's but didn't.
This could result in wrong output from a parallelized
array_agg(anyarray) calculation, if the input has a mix of
null and non-null elements.  The errors depended on timing
of the parallel workers, and therefore would vary from one
run to another.

Also install guards against integer overflow when calculating
the combined object's sizes, and make some trivial cosmetic
improvements.

Author: Dmytro Astapov <dastapov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
Backpatch-through: 16
contrib/pg_stash_advice and src/test/modules/test_shmem
missed these, leading to complaints from git after an
in-tree check-world run.

Use our standard boilerplate list of ignorable subdirectories,
although the two modules presently create different subsets
of that.
These columns haven't been computed yet when the filtering happens
(since we've not written the candidate tuple into the table); so
any check on them is wrong or useless.  Worse, since aa606b9 such a
reference results in an access off the end of a TupleDesc, potentially
causing a phony "generated columns are not supported in COPY FROM
WHERE conditions" error; and since c98ad08 it throws an Assert
instead.

Actually we could allow tableoid, which has been set to the OID of the
table named as the COPY target.  However, plausible uses for tests of
tableoid would involve a partitioned target table, and the user would
wish it to read as the OID of the destination partition.  There has
been some discussion of changing things to make it work like that,
but pending that happening we should just disallow tableoid along
with other system columns.

It seems best though to install this prohibition only in HEAD.
In the back branches we'll just guard the unsafe TupleDesc access,
and people will keep getting whatever semantics they got before.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com
CLUSTER is no longer the favored way to invoke this functionality, and
the code is about to shift its focus to the REPACK more ambitiously.
Rename the file to avoid leaving an unnecessary historical artifact
around.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603271635.owyhm7btgoic@alvherre.pgsql
Previously, autovacuum always disabled parallel vacuum regardless of
the table's index count or configuration. This commit enables
autovacuum workers to use parallel index vacuuming and index cleanup,
using the same parallel vacuum infrastructure as manual VACUUM.

Two new configuration options control the feature. The GUC
autovacuum_max_parallel_workers sets the maximum number of parallel
workers a single autovacuum worker may launch; it defaults to 0,
preserving existing behavior unless explicitly enabled. The per-table
storage parameter autovacuum_parallel_workers provides per-table
limits. A value of 0 disables parallel vacuum for the table, a
positive value caps the worker count (still bounded by the GUC), and
-1 (the default) defers to the GUC.

To handle cases where autovacuum workers receive a SIGHUP and update
their cost-based vacuum delay parameters mid-operation, a new
propagation mechanism is added to vacuumparallel.c. The leader stores
its effective cost parameters in a DSM segment. Parallel vacuum
workers poll for changes in vacuum_delay_point(); if an update is
detected, they apply the new values locally via VacuumUpdateCosts().

A new test module, src/test/modules/test_autovacuum, is added to
verify that parallel autovacuum workers are correctly launched and
that cost-parameter updates are propagated as expected.

The patch was originally proposed by Maxim Orlov, but the
implementation has undergone significant architectural changes
since then during the review process.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
transformCreateSchemaStmtElements has always believed that it is
supposed to re-order the subcommands of CREATE SCHEMA into a safe
execution order.  However, it is nowhere near being capable of doing
that correctly.  Nor is there reason to think that it ever will be,
or that that is a well-defined requirement.  (The SQL standard does
say that it should be possible to do foreign-key forward references
within CREATE SCHEMA, but it's not clear that the text requires
anything more than that.)  Moreover, the problem will get worse as
we add more subcommand types.  Let's just drop the whole idea and
execute the commands in the order given, which seems like a much
less astonishment-prone definition anyway.  The foreign-key issue
will be handled in a follow-up patch.

This will result in a release-note-worthy incompatibility,
which is that forward references like
	CREATE SCHEMA myschema
	    CREATE VIEW myview AS SELECT * FROM mytable
	    CREATE TABLE mytable (...);
used to work and no longer will.  Considering how many closely
related variants never worked, this isn't much of a loss.

Along the way, pass down a ParseState so that we can provide an
error cursor for "wrong schema name" and related errors, and fix
transformCreateSchemaStmtElements so that it doesn't scribble
on the parsetree passed to it.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
The previous patch simplified CREATE SCHEMA's behavior to "execute all
subcommands in the order they are written".  However, that's a bit too
simple, as the spec clearly requires forward references in foreign key
constraint clauses to work, see feature F311-01.  (Most other SQL
implementations seem to read more into the spec than that, but it's
not clear that there's justification for more in the text, and this is
the only case that doesn't introduce unresolvable issues.)  We never
implemented that before, but let's do so now.

To fix it, transform FOREIGN KEY clauses into ALTER TABLE ... ADD
FOREIGN KEY commands and append them to the end of the CREATE SCHEMA's
subcommand list.  This works because the foreign key constraints are
independent and don't affect any other DDL that might be in CREATE
SCHEMA.  For simplicity, we do this for all FOREIGN KEY clauses even
if they would have worked where they were.

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1075425.1732993688@sss.pgh.pa.us
Having rejected the principle that we should know how to re-order
the sub-commands of CREATE SCHEMA, there is not really anything
except a little coding to stop us from supporting more object types.
This patch adds support for creating functions (including procedures
and aggregates), operators, types (including domains), collations,
and text search objects.

SQL:2021 specifies that we should allow functions, procedures,
types, domains, and collations, so this moves us a great deal
closer to full SQL compatibility of CREATE SCHEMA.  What remains
missing from their list are casts, transforms, roles, and some
object types we don't support yet (e.g. CREATE CHARACTER SET).
Supporting casts or transforms would be problematic because
they don't have names at all, let alone schema-qualified names,
so it'd be quite a stretch to say that they belong to a schema.
Roles likewise are not schema-qualified, plus they are global
to a cluster, making it even less reasonable to consider them
as belonging to a schema.  So I don't see us trying to complete
the list.

User-defined aggregates and operators are outside the spec's ken,
as are text search objects, so adding them does not do anything for
spec compatibility.  But they go along with these other object types,
plus it takes no additional code to support them since they are
represented as DefineStmts like some variants of CREATE TYPE.
It would indeed take some effort to reject them.

Author: Kirill Reshke <reshkekirill@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CALdSSPh4jUSDsWu3K58hjO60wnTRR0DuO4CKRcwa8EVuOSfXxg@mail.gmail.com
The associated value should look like something that could be
part of an EXPLAIN options list, but restricted to EXPLAIN options
added by extensions.

For example, if pg_overexplain is loaded, you could set
auto_explain.log_extension_options = 'DEBUG, RANGE_TABLE'.
You can also specify arguments to these options in the same manner
as normal e.g. 'DEBUG 1, RANGE_TABLE false'.

Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Discussion: http://postgr.es/m/CA+Tgmob-0W8306mvrJX5Urtqt1AAasu8pi4yLrZ1XfwZU-Uj1w@mail.gmail.com
This function is a thin wrapper around relation_needs_vacanalyze()
that handles fetching and freeing the pgstat entry for the table.
Since all callers of relation_needs_vacanalyze() do that anyway, we
can teach that function to fetch/free the pgstat entry and use it
instead.

Suggested-by: Álvaro Herrera <alvherre@kurilemu.de>
Author: Sami Imseih <samimseih@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gmail.com
Use TupleDescInitBuiltinEntry instead of TupleDescInitEntry when building
the result tuple descriptor for the WAIT FOR command. This avoids a syscache
access that could re-establish a catalog snapshot after we've explicitly
released all snapshots before the wait.

Discussion: https://postgr.es/m/CABPTF7U%2BSUnJX_woQYGe%3D%3DR9Oz%2B-V6X0VO2stBLPGfJmH_LEhw%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
When the standby is passed as a PostgreSQL::Test::Cluster instance,
use the WAIT FOR LSN command on the standby server to implement
wait_for_catchup() for replay, write, and flush modes.  This is more
efficient than polling pg_stat_replication on the upstream, as the
WAIT FOR LSN command uses a latch-based wakeup mechanism.

The optimization applies when:
- The standby is passed as a Cluster object (not just a name string)
- The mode is 'replay', 'write', or 'flush' (not 'sent')

Rather than pre-checking pg_is_in_recovery() on the standby (which
would add an extra round-trip on every call), we issue WAIT FOR LSN
directly and handle the 'not in recovery' result as a signal to fall
back to polling.

For 'sent' mode, when the standby is passed as a string (e.g., a
subscription name for logical replication), when the standby has been
promoted, or when WAIT FOR LSN is interrupted by a recovery conflict,
the function falls back to the original polling-based approach using
pg_stat_replication on the upstream.  The recovery conflict fallback
is necessary because some conflicts are unavoidable - for example,
ResolveRecoveryConflictWithTablespace() kills all backends
unconditionally, regardless of what they are doing.

The recovery conflict detection matches the English error message
"conflict with recovery", which is reliable because the test suite
runs with LC_MESSAGES=C.

Discussion: https://postgr.es/m/CABPTF7UiArgW-sXj9CNwRzUhYOQrevLzkYcgBydmX5oDes1sjg%40mail.gmail.com
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@kurilemu.de>
petere and others added 29 commits April 23, 2026 21:52
pg_test_timing reports timing differences in nanoseconds in master, and
in microseconds in v14 through v18, but previously the backward-clock
warning incorrectly labeled the value as milliseconds.

This commit fixes the warning message to use "ns" in master and
"us" in v14 through v18, matching the actual unit being reported.

Backpatch to all supported versions.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Xiaopeng Wang <wxp_728@163.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/F780CEEB-A237-4302-9F55-60E9D8B6533D@gmail.com
Backpatch-through: 14
ExecEvalHashedScalarArrayOp(), when using a strict equality function,
performs a short-circuit when looking up NULL values.  When the function
is non-strict, the code incorrectly looked up the hash table for a
zero-valued Datum, which could have resulted in an accidental true
return if the hash table contained zero valued Datum, or could result
in a crash for non-byval types.

Here we fix this by adding an extra step when we build the hash table to
check what the result of a NULL lookup would be.  This requires looping
over the array and checking what the non-hashed version of the code
would do.  We cache the results of that in the expression so that we can
reuse the result any time we're asked to search for a NULL value.

It's important to note that non-strict equality functions are free to
treat any NULL value as equal to any non-NULL value.  For example,
someone may wish to design a type that treats an empty string and NULL
as equal.

All built-in types have strict equality functions, so this could affect
custom / user-defined types.

Author: Chengpeng Yan <chengpeng_yan@outlook.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/A16187AE-2359-4265-9F5E-71D015EC2B2D@outlook.com
Backpatch-through: 14
Commit 0b096e3 changed pg_test_timing to measure timing
differences in nanoseconds instead of microseconds, but the resulting
deltas continued to be stored in int32.

That can overflow for large gaps (for example, values greater than about
2.14 seconds in nanoseconds), leading to truncation or incorrect output.

This commit fixes the issue by storing measured timing deltas in int64.
This prevents overflow for large values and better matches
nanosecond-resolution measurements.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Xiaopeng Wang <wxp_728@163.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/F780CEEB-A237-4302-9F55-60E9D8B6533D@gmail.com
generate_queries_for_path_pattern_recurse() and
generate_setop_from_pathqueries() are recursive functions.  For a
property graph with hundreds of tables, a graph pattern with a handful
element patterns can cause stack overflow.  Fix it by calling
check_stack_depth() at the beginning of these functions.

Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAHg+QDfgK0xddH8f3eAb+UVn7sBDOnv8RvM6OkP4HtHAt6aD7w@mail.gmail.com
Reported-by: Lakshmi N <lakshmin.jhs@gmail.com>
Author: Lakshmi N <lakshmin.jhs@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA+3i_M9gpUGjH-BkJk=UFjK16jq9fEQHpmZ1cxpJO+xM4hWC+A@mail.gmail.com
GRAPH_TABLE clause is converted into a rangetable entry, which is
ignored by assign_query_collations().  Hence we assign collations
while transforming its parts.  But expressions in COLUMNS clause
missed that treatment, so fix that.

While at it, also add comments about collation assignment to the parts
of GRAPH_TABLE clause, and also fix a small grammar issue.

Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAHg+QDc4aaiufYSgrwMMPMMRTPtQ66SghcrPFbWJFZMqNaG+BA@mail.gmail.com
Expressions in GRAPH_TABLE COLUMNS list may have lateral references.
get_rule_expr() requires lateral namespaces to deparse such
references.  get_from_clause_item() does not pass them when processing
the expressions in COLUMNS list causing ERROR "bogus varlevelsup: 0
offset 0".  Fix get_from_clause_item() to pass input deparse_context
containing lateral namespaces to get_rule_expr() instead of the dummy
context.

Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHg%2BQDcLVa2iBnggkHxY4itZbXtDMfsYHEjnCUYe9hNbnxDi-w%40mail.gmail.com
We need to create top-level targets to run targets with the ninja
command like `ninja <target_name>`.

Some targets (man, html, ...) have the same target name on both
top-level and custom target.  This creates a confusion for the meson
build:

$ meson compile -C build html

```
ERROR: Can't invoke target `html`: ambiguous name. Add target type
and/or path:
- ./doc/src/sgml/html:custom
- ./doc/src/sgml/html:alias
```

Solve that problem by adding '-custom' suffix to these problematic
targets' custom target names.  Top-level targets can be called with
both meson and ninja now:

$ meson compile -C build html
$ ninja -C build html

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Suggested-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/5508e572-79ae-4b20-84d0-010a66d077f2%40eisentraut.org
British Columbia (America/Vancouver) moved to permanent UTC-07 on
2026-03-09, which will affect their clocks beginning on 2026-11-01.
For lack of any clarity on the point, assume their TZ abbreviation
will be MST from that time forward.

Moldova (Europe/Chisinau) has followed EU DST transition times since
2022.

Backpatch-through: 14
Quote pg_hosts.conf fields derived from the build directory, since
hba.c:next_token() treats a comma as a token separator.  Commit
4f43302 introduced pg_hosts.conf and
this test.  A build directory name containing a comma worked before that
commit.  A build directory name containing a quote character has not
worked, so don't handle that.

Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/20260426213252.7a@rfd.leadboat.com
These are old leaks, that can pile up if a WAL receiver stays alive,
waiting for new WAL data after the sender has switched to a new
timeline.

While this is technically a bug, the impact is minimal and would only
become noticeable if the WAL sender handles a lot of timeline switches,
so no backpatch is done.  Note that in most cases, primary_conninfo
would be updated in a standby to point to a new sender, meaning a
restart of the WAL receiver.  Let's be clean on HEAD, though.

Author: DaeMyung Kang <charsyam@gmail.com>
Discussion: https://postgr.es/m/20260426170100.847923-1-charsyam@gmail.com
Discussion: https://postgr.es/m/20260426170219.849330-1-charsyam@gmail.com
remove_self_join_rel() called adjust_relid_set() on all_result_relids
and leaf_result_relids but threw away the return value.  Since
adjust_relid_set() returns a freshly-built Relids and does not modify
the input in place, the calls did nothing.  This has been the case
since the SJE feature went in (commit fc069a3).

There has been no observable misbehavior, because the relid being
passed is guaranteed not to be a member of either set.  At the point
remove_self_join_rel() runs, those sets contain only resultRelation;
inheritance children have not been added yet, as that happens later in
query_planner(), in expand_single_inheritance_child() called from
add_other_rels_to_query().  And remove_self_joins_recurse() rejects
parse->resultRelation as an SJE candidate to preserve the EvalPlanQual
mechanism.  Even with the result assigned, the calls would be no-ops
in practice.

Rather than make the calls do the cleanup they pretend to do, replace
them with assertions of the invariant.  Any future loosening of the
SJE candidate filter -- for instance to allow eliminating a result
relation under provable conditions -- will trip the assertion and
force whoever does it to revisit this code.

Additionally, decorate adjust_relid_set() with pg_nodiscard so that
any future accidental discard of its return value is caught at compile
time.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49fYQcqJfJ_Gtn8r1GFNoYtb1=2AUab4ieuqY4Zid9ocQ@mail.gmail.com
Previously, these test cases would give internal errors or crash.  The
fix is to add some missing fields of ForPortionOfExpr to
expression_tree_walker.

Author: jian he <jian.universality@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://postgr.es/m/CACJufxHs1Hs00EqsZ4NbuAjmYzMzjJyP1sAj12Ne=cBsEVmQOA@mail.gmail.com
Similarly to logical replication, REPACK CONCURRENTLY needs to ability
to reliably locate a tuple based on an identity.  A replica identity
index is okay.  Primary keys normally also are, except when they are
deferrable, because a tuple being modified might not yet be indexed,
causing REPACK to fail.

Change the REPACK CONCURRENTLY code to use GetRelationIdentityOrPK(),
similar to what the logical replication code does.  (Though we don't yet
support locating tuples based on arbitrary indexes for replica identity
FULL.)

While at it, add a few more test cases for situations that aren't
supported by REPACK, to improve coverage.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Yuchen Li <liyuchen_xyz@163.com>
Discussion: https://postgr.es/m/10DD5E13-B45D-44F1-BE08-C63E00ABCAC0@gmail.com
Use BoolGetDatum() instead of Int32GetDatum() when storing the boolean
subretentionactive column in pg_subscription. This was an oversight in
a850be2.

Author: Lakshmi N <lakshmin.jhs@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Discussion: https://postgr.es/m/CA+3i_M98-XjE-_fw0p+8xOnw64y2_YLtJfcwvCfsVMn-z2ZjGg@mail.gmail.com
When a subscription has retain_dead_tuples enabled and maxretention is
zero (unlimited), adjust_xid_advance_interval() mistakenly caps
xid_advance_interval to zero.

This zero interval forces get_candidate_xid() to evaluate
TimestampDifferenceExceeds() as always true, causing the apply worker to
call GetOldestActiveTransactionId() for every WAL message. This
leads to unnecessary ProcArrayLock acquisitions.

Fix this by only capping the interval when maxretention > 0, allowing
the exponential back-off to function properly.

Author: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Discussion: https://postgr.es/m/CAHg+QDdKVnCLHot=AcoPpEiSyDzGz7wGYjAFHVOw57oDtmUDWQ@mail.gmail.com
Do minor comment fixes and remove implicit cast to Datum.

While here, let's prefer crashing instead of entering an infinite
loop in case of future programming mistakes when computing next_level,
suggested by ChangAo Chen.

Discussion: https://postgr.es/m/tencent_49E3F11E74D8A584A2144ED532A490CBC40A@qq.com
btestimateparallelscan neglected to add btps_arrElems[] space overhead
for skip array scan keys that were later output by nbtree preprocessing.
Skip arrays don't actually need to use this space, but a scan with a
subsequent SAOP array will need to subscript btps_arrElems[] using a
simple so->arrayKeys[]-wise offset.  so->arrayKeys[] has entries for
both kinds of arrays.

As a result of this oversight, it was possible for an index scan with a
skip array and a lower-order SAOP array to write past the allocated
shared memory boundary when storing the SAOP array's cur_elem.  In
practice the problem seems to be limited to scans with many skipped
index columns, since our general approach to estimating the amount of
shared memory that will be required is fairly conservative.

To fix, have btestimateparallelscan request an extra sizeof(int) space
for key columns that might require a skip array later on.

Oversight in commit 92fe23d, which added the nbtree skip scan
optimization.

Author: Siddharth Kothari <sidkot@google.com>
Discussion: https://postgr.es/m/CAGCUe0Lwk3C0qdkBa+OLpYc7yXwW=pbaz8Sju4xMXEQAmyp+5g@mail.gmail.com
Backpatch-through: 18
The regression tests for pg_get_role_ddl(), pg_get_database_ddl(),
and pg_get_tablespace_ddl() created databases and tablespaces, which
are heavyweight operations.  As noted by Andres Freund, this is
wasteful in the core regression suite which gets run repeatedly.

Convert the three test files (role_ddl.sql, database_ddl.sql,
tablespace_ddl.sql) into a single TAP test that runs once, covering
all the same functionality: basic DDL generation, pretty-printing,
option handling, error cases, permission checks, and edge cases like
quoted names and role memberships.

Discussion: https://postgr.es/m/5c67dc79-909a-4e17-8606-6686667da6c6@dunslane.net
The tests introduced in c529ee3 are timezone sensitive.

Pin the cluster's timezone to UTC at init time so timestamptz output
is deterministic regardless of the host's local timezone.
TidStoreSetBlockOffsets() requires its offsets array to be strictly
ascending and asserts this precondition. In test_tidstore, we were
passing random offset numbers deduplicated by a DISTINCT clause in an
array_agg() call directly to the do_set_block_offsets() test
harness. However, DISTINCT without an ORDER BY clause does not
guarantee sorted results according to the SQL standard.

Fix this by sorting the offsets in-place inside do_set_block_offsets()
before calling TidStoreSetBlockOffsets().

While this assertion failure is not observed during regular regression
tests because they use queries simple enough that the optimizer
consistently chooses plans yielding sorted results, it makes sense to
stabilize the test. The failure could theoretically occur depending on
the optimizer's plan choice, and has been reported when experimenting
with certain third-party extensions.

Backpatch to v17, where test_tidstore was introduced, to ensure
extension development on stable branches does not hit this assertion.

Reported-by: Andrei Lepikhov <lepihov@gmail.com>
Author: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/b97f1850-fc7b-43c4-9b04-4e97bb9e7dc0@gmail.com
Backpatch-through: 17
After a recent macOS update, building Postgres produces warnings
that look like this:

    ranlib: warning: 'libpgport_shlib.a(pg_cpu_x86.c.o)' has no symbols
    ranlib: warning: 'libpgport_shlib.a(pg_popcount_x86.c.o)' has no symbols

To fix, add a dummy symbol to files that may otherwise have none.
Per project policy, this is a candidate for back-patching into
out-of-support branches: it suppresses annoying compiler warnings
but changes no behavior.

Reported-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/229aaaf3-f529-44ed-8e50-00cb6909af21%40Spark
Backpatch-through: 13
"lock" a values is supported since 4019f72, but the error message
of the function used when specifying an incorrect value forgot about it.

Author: Maksim Logvinenko <logvinenko-ms@yandex.ru>
Discussion: https://postgr.es/m/433431777389005@mail.yandex.ru
This reverts portions of commit 6dcfac9, which is wrong in trying
to use a *GetDatum() that matches with the C types of the values read.
*GetDatum() should match with the output argument types of the SQL
functions.

The portions of 6dcfac9 that are right regarding this rule are:
- gistget.c, where the GiST support functions use DatumGetUInt16() to
retrieve the strategy number.
- The BRIN code for strategynum, used in syscache lookups.

The adjustments done in this commit are for pageinspect, pg_buffercache
and pg_lock_status().

While double-checking the whole state of the tree regarding non-matching
pairs of DatumGet*() and *GetDatum(), I have found much more code paths
that are incorrect, unrelated to 6dcfac9.  These may be adjusted in
the future, in a different patch (perhaps not for v19, as we are already
past feature freeze).

Reported-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/97f9375a-be61-4272-a44d-408337fe8fa6@eisentraut.org
Discussion: https://postgr.es/m/CAJ7c6TMcGu8qmRe1gZfJ-gOzVnZq-t=fwn-UuyStx1w6ZyydMw@mail.gmail.com
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.