Add comprehensive wait events coverage gap analysis#2
Open
NikolayS wants to merge 2730 commits into
Open
Conversation
NikolayS
pushed a commit
that referenced
this pull request
Dec 23, 2025
truncate_useless_pathkeys() seems to have neglected to account for PathKeys that might be useful for WindowClause evaluation. Modify it so that it properly accounts for that. Making this work required adjusting two things: 1. Change from checking query_pathkeys to check sort_pathkeys instead. 2. Add explicit check for window_pathkeys For #1, query_pathkeys gets set in standard_qp_callback() according to the sort order requirements for the first operation to be applied after the join planner is finished, so this changes depending on which upper planner operations a particular query needs. If the query has window functions and no GROUP BY, then query_pathkeys gets set to window_pathkeys. Before this change, this meant PathKeys useful for the ORDER BY were not accounted for in queries with window functions. Because of #1, #2 is now required so that we explicitly check to ensure we don't truncate away PathKeys useful for window functions. Author: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvrj3HTKmXoLMbUjTO=_MNMxM=cnuCSyBKidAVibmYPnrg@mail.gmail.com
NikolayS
pushed a commit
that referenced
this pull request
Feb 6, 2026
cost_tidrangescan() was setting the disabled_nodes value correctly, and then immediately resetting it to zero, due to poor code editing on my part. materialized_finished_plan correctly set matpath.parent to zero, but forgot to also set matpath.parallel_workers = 0, causing an access to uninitialized memory in cost_material. (This shouldn't result in any real problem, but it makes valgrind unhappy.) reparameterize_path was dereferencing a variable before verifying that it was not NULL. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1) Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1) Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1) Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2) Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3) Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz
Add a new SQL-callable function that returns the DDL statements needed to recreate a role. It takes a regrole argument and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output and memberships (boolean) to include GRANT statements for role memberships and membership options. The return is one or multiple rows where the first row is a CREATE ROLE statement and subsequent rows are ALTER ROLE statements to set some role properties. Password information is never included in the output. The caller must have SELECT privilege on pg_authid. Author: Mario Gonzalez <gonzalemario@gmail.com> Author: Bryan Green <dbryan.green@gmail.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/4c5f895e-3281-48f8-b943-9228b7da6471@gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
Add a new SQL-callable function that returns the DDL statements needed to recreate a tablespace. It takes a tablespace name or OID and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output and owner (boolean) to include OWNER. (It includes two variants because there is no regtablespace pseudotype.) The return is one or multiple rows where the first row is a CREATE TABLESPACE statement and subsequent rows are ALTER TABLESPACE statements to set some tablespace properties. The caller must have SELECT privilege on pg_tablespace. get_reloptions() in ruleutils.c is made non-static so it can be called from the new ddlutils.c file. Author: Nishant Sharma <nishant.sharma@enterprisedb.com> Author: Manni Wood <manni.wood@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAKWEB6rmnmGKUA87Zmq-s=b3Scsnj02C0kObQjnbL2ajfPWGEw@mail.gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
Add a new SQL-callable function that returns the DDL statements needed to recreate a database. It takes a regdatabase argument and an optional VARIADIC text argument for options that are specified as alternating name/value pairs. The following options are supported: pretty (boolean) for formatted output, owner (boolean) to include OWNER and tablespace (boolean) to include TABLESPACE. The return is one or multiple rows where the first row is a CREATE DATABASE statement and subsequent rows are ALTER DATABASE statements to set some database properties. The caller must have CONNECT privilege on the target database. Author: Akshay Joshi <akshay.joshi@enterprisedb.com> Co-authored-by: Andrew Dunstan <andrew@dunslane.net> Co-authored-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Quan Zongliang <quanzongliang@yeah.net> Discussion: https://postgr.es/m/CANxoLDc6FHBYJvcgOnZyS+jF0NUo3Lq_83-rttBuJgs9id_UDg@mail.gmail.com Discussion: https://postgr.es/m/e247c261-e3fb-4810-81e0-a65893170e94@dunslane.net
While working on refactoring how shmem is allocated, I made a mistake where the main LWLock array did not reserve space for the LWLocks allocated with RequestNamedLWLockTranche(), and the test still passed. Matthias van de Meent spotted that before it got committed, but in order to catch such mistakes in the future, add checks in test_lwlock_tranches that the locks allocated with RequestNamedLWLockTranche() can be acquired and released. Another change is to stop requesting multiple tranches with the same name with RequestNamedLWLockTranche(). As soon as I started to test using the locks I realized that's bogus, and the next commit will forbid it. Keep test coverage for duplicates requested with LWLockNewTrancheId() for now, but make it more clear that that's what the test does. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi Discussion: https://www.postgresql.org/message-id/CAEze2WjgCROMMXY0+j8FFdm3iFcr7By-+6Mwiz=PgGSEydiW3A@mail.gmail.com
You could request two tranches with same name, but things would get confusing when you called GetNamedLWLockTranche() to get the LWLocks allocated for them; it would always return the first tranche with the name. That doesn't make sense, so forbid duplicates. We still allow duplicates with LWLockNewTrancheId(). That works better as you don't use the name to look up the tranche ID later. It's still confusing in wait events, for example, but it's not dangerous in the same way. Reviewed-by: Sami Imseih <samimseih@gmail.com> Discussion: https://www.postgresql.org/message-id/463a28db-0c0b-4af6-bac6-3891828bbbfe@iki.fi
The two new functions allow to extract the block number and offset from a tid. There are existing ways to do so (e.g. by doing (ctid::text::point)[0]), but they are hard to remember and not pretty. tid_block() returns int8 (bigint) because BlockNumber is uint32, which exceeds the range of int4. tid_offset() returns int4 (integer) because OffsetNumber is uint16, which fits safely in int4. Bumps catversion. Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAJTYsWUzok2+mvSYkbVUwq_SWWg-GdHqCuYumN82AU97SjwjCA@mail.gmail.com
The database name was warned about when building with -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS, leading to BF and CI failures. It is somewhat confusing that the required prefix is different for databases than other object types. Also fix a pgindent violation that caused koel to start to fail. Discussion: https://postgr.es/m/ptyiexyhmtxf4lm524s7o7w64r26ra237uusv4tjav4yhpmeoo@vfwwllz7tivb
Introduce TriggerInstrumentation to capture trigger timing and firings (previously counted in "ntuples"), to aid a future refactoring that splits out all Instrumentation fields beyond timing and WAL/buffers into more specific structs. In passing, drop the "n" argument to InstrAlloc, as all remaining callers need exactly one Instrumentation struct. The duplication between InstrAlloc() and InstrInit(), as well as the conditional initialization of async_mode will be addressed in a subsequent commit. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
Previously, different places (e.g. query "total time") were repurposing the Instrumentation struct initially introduced for capturing per-node statistics during execution. This overuse of the same struct is confusing, e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated code paths, and prevents future refactorings. Instead, simplify the Instrumentation struct to only track time and WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation - these calls were added without any apparent benefit since the relevant fields were never read. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information. WorkerInstrumentation is renamed to WorkerNodeInstrumentation for clarity. In passing, clarify that InstrAggNode is expected to only run after InstrEndLoop (as it does in practice), and drop unused code. This also fixes a consequence-less bug: Previously ->async_mode was only set when a non-zero instrument_option was passed. That turns out to be harmless right now, as ->async_mode only affects a timing related field. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
A little refactoring in preparation for the next commit, to make the material changes in that commit more clear. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This replaces the [Subsystem]ShmemSize() and [Subsystem]ShmemInit()
functions called at postmaster startup with a new set of callbacks.
The new mechanism is designed to be more ergonomic. Notably, the size
of each shmem area is specified in the same ShmemRequestStruct() call,
together with its name. The same mechanism is used in extensions,
replacing the shmem_{request/startup}_hooks.
ShmemInitStruct() and ShmemInitHash() become backwards-compatibility
wrappers around the new functions. In future commits, I will replace
all ShmemInitStruct() and ShmemInitHash() calls with the new
functions, although we'll still need to keep them around for
extensions.
Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
The old ShmemInit{Struct/Hash}() functions could be used after
postmaster statup, as long as the allocation is small enough to fit in
spare shmem reserved at startup. I believe some extensions do that,
although we hadn't really documented it and had not coverage for it.
The new test module covers that after-startup usage with the new
ShmemRequestStruct() functions.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
As part of this, embed the LWLock it needs in the shared memory struct itself, so that we don't need to use RequestNamedLWLockTranche() anymore. LWLockNewTrancheId() + LWLockInitialize() is more convenient to use in extensions. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
To add a new built-in subsystem, add it to subsystemslist.h. That hooks up its shmem callbacks so that they get called at the right times during postmaster startup. For now this is unused, but will replace the current SubsystemShmemSize() and SubsystemShmemInit() calls in the next commits. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
It seems like a good candidate to convert first because it needs to initialized before any other subsystem, but other than that it's nothing special. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
These subsystems have some complicating properties, making them slightly harder to convert than most: - The initialization callbacks of some of these subsystems have dependencies, i.e. they need to be initialized in the right order. - The ProcGlobal pointer still needs to be inherited by the BackendParameters mechanism on EXEC_BACKEND builds, because ProcGlobal is required by InitProcess() to get a PGPROC entry, and the PGPROC entry is required to use LWLocks, and usually attaching to shared memory areas requires the use of LWLocks. - Similarly, ProcSignal pointer still needs to be handled by BackendParameters, because query cancellation connections access it without calling InitProcess Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This is in preparation to convert it to use the new shmem allocation
functions, making the next commit that does that smaller. This inlines
SerialInit() to the caller, and moves all the initialization steps
within PredicateLockShmemInit() to happen after all the
ShmemInit{Struct|Hash}() calls.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
I replaced the old SimpleLruInit() function without a backwards compatibility wrapper, because few extensions define their own SLRUs. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This replaces the "shmem_size" and "shmem_init" callbacks in the IO methods table with the same ShmemCallback struct that we now use in other subsystems Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
The buffer blocks, converted to use ShmemRequestStruct() in the next commit, are IO-aligned. This might come handy in other places too, so make it an explicit feature of ShmemRequestStruct(). Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This rectifies the initialization functions a little, making the "buffer strategy" stuff in freelist.c and buffer mapping hash table in buf_init.c top-level "subsystems" of their own, registered directly in subsystemlist.h. Previously they were called indirectly from BufferManagerShmemInit() and BufferManagerShmemSize() Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
This removes all remaining uses of ShmemInitStruct() and ShmemInitHash() from built-in code. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://www.postgresql.org/message-id/CAExHW5vM1bneLYfg0wGeAa=52UiJ3z4vKd3AJ72X8Fw6k3KKrg@mail.gmail.com
Commit 7c64d56 has removed the isolation test providing coverage for lock statistics due to some instability in the CI, where the deadlock timeout may not have enough time to process, preventing the stats data to be updated. These also relied on a set of hardcoded sleeps. This commit switches the test suite to TAP, instead, that uses an injection point with a wait to avoid the sleeps. The injection point is added in ProcSleep(), once we know that the deadlock timeout has fired and that the stats have been updated. Multiple lock patterns are checked, all rely on the same workflow, with two sessions: - session 1 holds a given lock type. - session 2 attaches to the new injection point with the wait action. - session 2 attempts to acquire a lock conflicting with the lock of session 1, waiting for the injection point to be reached. - session 1 releases its lock, session 2 commits. - pg_stat_lock is polled until the counters are updated for the lock type. Bertrand's version of the patch introduced a new routine to BackgroundPsql() to detect the blocked background sessions. I have tweaked the test so as we use the same method as some of the other tests instead, based on some \echo commands. This test has been run multiple times in the CI, all passing, so I'd like to think that this is more stable than the first version attempted. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/acNTR1lLHwQJ0o+P@ip-10-97-1-34.eu-west-3.compute.internal
If the background worker for processing databases manages to finish before the launcher starts waiting for it, the launcher would treat it erroneously as an error. Fix by ensureing to check result state in this case. Identified on CI and synthetically reproduced during local testing. Also while, make sure to properly lock the shared memory structure before updating tje result state. Author: Daniel Gustafsson <daniel@yesql.seA Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/4fxw37ge47v5baeozla5phymi233hxbcjbwwsfwv3mpg3kyl2z@6jk4nkf6jp4
Postcommit review and buildfarm/CI failures revealed a few issues in
the test code which this commit attempts to resolve. These failures
are verified using synthetic means.
* Wait for launcher exit in enable/disable checksum tests
When enabling or disabling data checksums in a test with waiting
for an end state (on or off), the test typically want to perform
more test against the cluster immediately. Make sure to wait for
the launcher to exit in these cases before returning in order to
know it can immediately be acted on. This is a more generic way
of implementating 0036232.
* Refactor injection point tests to use the injection_points test
extension. Two injection points added for online checksums were
better expressed using the injection_points extension with the
test code embedded in datachecksum_state.c.
* Make tests less timing dependent and allow transitions to "on"
and not just "inprogress-on" in case a test manages to finish
before it's checked for state.
* When waiting on a blocking background psql keeping a temporary
table open, the test first closed the background session abd
then the server. This could cause data checksums to manage to
get enabled in the brief window between dropping the temporary
table and closing the server. Fix by closing the server first
before the background session.
* Remove a few superfluous duplicate checks and general cleanup
of comments as well as making LSN logging consistent.
These issues were reported by Andres as well as spotted in the
buildfarm and on CI.
Author: Daniel Gustafsson <daniel@yesql.se>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/92F25C14-801E-4198-994D-D83E31FEB0D8@yesql.se
On MSVC Arm, USE_ARMV8_CRC32C is defined, but __builtin_constant_p is not available. Use pg_integer_constant_p and add appropriate guards. There is a similar potential hazard for the x86 path, but for now let's get the buildfarm green. Oversight in commit fbc57f2, per buildfarm member hoatzin.
…ation Previously, during shutdown, walsenders always waited until all pending data was replicated to receivers. This ensures sender and receiver stay in sync after shutdown, which is important for physical replication switchovers, but it can significantly delay shutdown. For example, in logical replication, if apply workers are blocked on locks, walsenders may wait until those locks are released, preventing shutdown from completing for a long time. This commit introduces a new GUC, wal_sender_shutdown_timeout, which specifies the maximum time a walsender waits during shutdown for all pending data to be replicated. When set, shutdown completes once all data is replicated or the timeout expires. A value of -1 (the default) disables the timeout. This can reduce shutdown time when replication is slow or stalled. However, if the timeout is reached, the sender and receiver may be left out of sync, which can be problematic for physical replication switchovers. Author: Andrey Silitskiy <a.silitskiy@postgrespro.ru> Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Takamichi Osumi <osumi.takamichi@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Vitaly Davydov <v.davydov@postgrespro.ru> Reviewed-by: Ronan Dunklau <ronan@dunklau.fr> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Japin Li <japinli@hotmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/TYAPR01MB586668E50FC2447AD7F92491F5E89@TYAPR01MB5866.jpnprd01.prod.outlook.com
When determining if it is safe to use an expression as a grouping key for partial aggregation, eager aggregation relies on the B-tree equalimage support function to ensure that equality implies image equality. Previously, the code incorrectly passed the default collation of the expression's data type to the equalimage procedure, rather than the expression's actual collation. As a result, if a column used a non-deterministic collation but the base type's default collation was deterministic, eager aggregation would incorrectly assume that the column was safe for byte-level grouping. This could cause rows to be prematurely grouped and subsequently discarded by strict join conditions, resulting in incorrect query results. This patch fixes the issue by passing the expression's actual collation to the equalimage procedure. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAMbWs48A53PY1Y4zoj7YhxPww9fO1hfnbdntKfA855zpXfVFRA@mail.gmail.com
If the SET or WHERE clause of an INSERT ... ON CONFLICT command references EXCLUDED.col, where col is a virtual generated column, the column was not properly expanded, leading to an "unexpected virtual generated column reference" error, or incorrect results. The problem was that expand_virtual_generated_columns() would expand virtual generated columns in both the SET and WHERE clauses and in the targetlist of the EXCLUDED pseudo-relation (exclRelTlist). Then fix_join_expr() from set_plan_refs() would turn the expanded expressions in the SET and WHERE clauses back into Vars, because they would be found to match the expression entries in the indexed tlist produced from exclRelTlist. To fix this, arrange for expand_virtual_generated_columns() to not expand virtual generated columns in exclRelTlist. This forces set_plan_refs() to resolve generation expressions in the query using non-virtual columns, as required by the executor. In addition, exclRelTlist now always contains only Vars. That was something already claimed in a couple of existing comments in the planner, which relied on that fact to skip some processing, though those did not appear to constitute active bugs. Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAHg+QDf7wTLz_vqb1wi1EJ_4Uh+Vxm75+b4c-Ky=6P+yOAHjbQ@mail.gmail.com Backpatch-through: 18
Formerly, attempting to use WHERE CURRENT OF to update or delete from a table with virtual generated columns would fail with the error "WHERE CURRENT OF on a view is not implemented". The reason was that the check preventing WHERE CURRENT OF from being used on a view was in replace_rte_variables_mutator(), which presumed that the only way it could get there was as part of rewriting a query on a view. That is no longer the case, since replace_rte_variables() is now also used to expand the virtual generated columns of a table. Fix by doing the check for WHERE CURRENT OF on a view at parse time. This is safe, since it is no longer possible for the relkind to change after the query is parsed (as of b23cd18). Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAHg+QDc_TwzSgb=B_QgNLt3mvZdmRK23rLb+RkanSQkDF40GjA@mail.gmail.com Backpatch-through: 18
When using ALTER TABLE ... MERGE PARTITIONS or ALTER TABLE ... SPLIT PARTITION, extension dependencies on partition indexes were being lost. This happened because the new partition indexes are created fresh from the parent partitioned table's indexes, while the old partition indexes (with their extension dependencies) are dropped. Fix this by collecting extension dependencies from source partition indexes before detaching them, then applying those dependencies to the corresponding new partition indexes after they're created. The mapping between old and new indexes is done via their common parent partitioned index. For MERGE operations, all source partition indexes sharing a parent partitioned index must have the same extension dependencies; if they differ, an error naming both conflicting partition indexes is raised. The check is implemented by collecting one entry per partition index, sorting by parent index OID, and comparing adjacent entries in a single pass. This is order-independent: the same set of partitions produces the same decision regardless of the order they are listed in the MERGE command, and subset mismatches are caught in both directions. For SPLIT operations, the new partition indexes simply inherit all extension dependencies from the source partition's index. The regression tests exercising this feature live under src/test/modules/test_extensions, where the test_ext3 and test_ext5 extensions are available; core regression tests cannot assume any particular extension is installed. Author: Matheus Alcantara <matheusssilv97@gmail.com> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Dmitry Koval <d.koval@postgrespro.ru> Discussion: https://www.postgresql.org/message-id/CALdSSPjXtzGM7Uk4fWRwRMXcCczge5uNirPQcYCHKPAWPkp9iQ%40mail.gmail.com
This function writes into a caller-supplied buffer of length 2 * MAXNORMLEN, which should be plenty in real-world cases. However a malicious affix file could supply an affix long enough to overrun that. Defend by just rejecting the match if it would overrun the buffer. I also inserted a check of the input word length against Affix->replen, just to be sure we won't index off the buffer, though it would be caller error for that not to be true. Also make the actual copying steps a bit more readable, and remove an unnecessary requirement for the whole input word to fit into the output buffer (even though it always will with the current caller). The lack of documentation in this code makes my head hurt, so I also reverse-engineered a basic header comment for CheckAffix. Reported-by: Xint Code Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/641711.1776792744@sss.pgh.pa.us Backpatch-through: 14
parse_affentry() and addCompoundAffixFlagValue() each collect fields from an affix file into working buffers of size BUFSIZ. They failed to defend against overlength fields, so that a malicious affix file could cause a stack smash. BUFSIZ (typically 8K) is certainly way longer than any reasonable affix field, but let's fix this while we're closing holes in this area. I chose to do this by silently truncating the input before it can overrun the buffer, using logic comparable to the existing logic in get_nextfield(). Certainly there's at least as good an argument for raising an error, but for now let's follow the existing precedent. Reported-by: Igor Stepansky <igor.stepansky@orca.security> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/864123.1776810909@sss.pgh.pa.us Backpatch-through: 14
to_char() allocates its output buffer with 8 bytes per formatting code in the pattern. If the locale's currency symbol, thousands separator, or decimal or sign symbol is more than 8 bytes long, in principle we could overrun the output buffer. No such locales exist in the real world, so it seems sufficient to truncate the symbol if we do see it's too long. Reported-by: Xint Code Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/638232.1776790821@sss.pgh.pa.us Backpatch-through: 14
Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. Most of these inconsistencies were introduced during Postgres 19 development. This commit was written with help from clang-tidy, by mechanically applying the same rules as similar clean-up commits (the earliest such commit was commit 035ce1f).
Commit 7a1f0f8 optimized the slot verification query but overlooked cases where all logical replication slots are already invalidated. In this scenario, the CTE returns no rows, causing the main query (which used a cross join) to return an empty result even when invalid slots exist. This commit fixes this by using a LEFT JOIN with the CTE, ensuring that slots are properly reported even if the CTE returns no rows. Author: Lakshmi N <lakshmin.jhs@gmail.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CA+3i_M8eT6j8_cBHkYykV-SXCxbmAxpVSKptjDVq+MFtpT-Paw@mail.gmail.com
The problem report was about setting GUCs in the startup packet for a physical replication connection. Setting the GUC required an ACL check, which performed a lookup on pg_parameter_acl.parname. The catalog cache was hardwired to use DEFAULT_COLLATION_OID for texteqfast() and texthashfast(), but the database default collation was uninitialized because it's a physical walsender and never connects to a database. In versions 18 and later, this resulted in a NULL pointer dereference, while in version 17 it resulted in an ERROR. As the comments stated, using DEFAULT_COLLATION_OID was arbitrary anyway: if the collation actually mattered, it should have used the column's actual collation. (In the catalog, some text columns are the default collation and some are "C".) Fix by using C_COLLATION_OID, which doesn't require any initialization and is always available. When any deterministic collation will do, it's best to consistently use the simplest and fastest one, so this is a good idea anyway. Another problem was raised in the thread, which this commit doesn't fix (see second discussion link). Reported-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/D18AD72A-5004-4EF8-AF80-10732AF677FA@yandex-team.ru Discussion: https://postgr.es/m/4524ed61a015d3496fc008644dcb999bb31916a7.camel%40j-davis.com Backpatch-through: 17
c3b39dc to
aeb6775
Compare
NikolayS
added a commit
that referenced
this pull request
Apr 22, 2026
The July 2025 "Requested WAL segment has already been removed" thread was authored by Japin Li, not Kukushkin. Kukushkin is a discussion participant alongside Fujii Masao et al.; his contribution was mailing-list commentary, not a patch. Fixed in four live references: - §1 "What exists today" bullet - §8.2 Future Work entry #2 (walsender restore_command integration) - §9 Reference #2 - §10.4 US-1 verdict prior-art note Changelog entries for v0.1 and v0.2 left as frozen history — their misattribution is preserved as part of the iteration record, not rewritten. Sawada (dynamic wal_level PoC, §9 ref #7) and Amul Sul (pg_waldump tarfile support, §9 ref #8) verified correctly attributed — both are thread originators / patch authors on their respective pgsql-hackers threads, no change needed.
NikolayS
pushed a commit
that referenced
this pull request
Apr 22, 2026
Generated from full postgres/postgres history (upstream/master, 64,103 commits). Bruce Momjian has 14,078 commits matching the screenshot, but ranks #2 — Tom Lane leads with 16,755. https://claude.ai/code/session_01SAWhyKXpDkFpwPiCP3EZ7T
This analysis identifies 92 specific code locations across PostgreSQL where operations may block or consume significant time without proper wait event instrumentation, causing monitoring tools to incorrectly show activity as "CPU" load. Key findings: - 35 critical I/O operations (fsync, stat, unlink, directory ops) - 20 authentication operations (LDAP, DNS, ident, RADIUS, SCRAM) - 6 compression operations (gzip, LZ4, Zstandard in base backup) - 5 cryptographic operations (SCRAM auth, SQL hash functions) - 6 executor operations (hash joins, aggregates without interruption) - 9 maintenance operations (vacuum, analyze, sorting) - 10 logical replication operations - Various other synchronization and buffer management issues The report includes: - Detailed code locations with file paths and line numbers - Impact assessment for each category - Proposed new wait events (~40-50 new events) - Implementation roadmap in 5 phases - Cost-benefit analysis Estimated impact: This work could eliminate 70-80% of "CPU*" false positives in monitoring tools, providing accurate visibility into what PostgreSQL is actually doing.
Updated the wait events analysis document to include: - Repository URL and commit hash at the top - Direct GitHub links for all file references - Line number links for all code locations - Line range links for multi-line references All links point to commit b9bcd15 to prevent code drift. Sections updated with links: - I/O Operations (fd.c, md.c, xlogrecovery.c, dsm_impl.c) - Authentication (LDAP, Ident, RADIUS, DNS lookups, SCRAM) - Compression (gzip, LZ4, Zstandard) - Cryptography (SCRAM auth, SQL hash functions, CRC) - Executor operations (Hash joins, aggregates, batch loading) This makes it easy to navigate directly to the specific code being discussed without having to manually search.
Major improvements to the analysis document: 1. Added "Important Distinction" section explaining three types: - Missing Wait Events (true blocking operations on I/O/network) - CPU-Intensive Operations Without Interrupt Checks (cancellability) - CPU-Intensive Operations That Should Have Wait Events (observability) 2. Updated analyst attribution to "@NikolayS + Claude Code Sonnet 4.5" 3. Added context about fd.c having two instrumentation layers: - High-level File*() APIs (already instrumented) - Low-level primitives (missing instrumentation) 4. Clarified compression operations (Category 3): - Added note these are CPU-bound, not blocking I/O - Explained wait events provide operational visibility 5. Clarified cryptography operations (Category 4): - Added note these are CPU-bound operations - Explained value of labeling during auth storms 6. Renamed and restructured executor operations (Category 5): - Changed from "Missing Wait Events" to "Missing Interrupt Checks" - Clearly explained these need CHECK_FOR_INTERRUPTS(), not wait events - Updated all subsections with "Issue" and "Solution" format - Changed "Impact" to "Issue" and added "Solution" recommendations This makes the document much clearer about what problems exist and what solutions are appropriate for each category.
Split the comprehensive analysis into two focused documents: 1. WAIT_EVENTS_ANALYSIS.md (86 issues): - Focus on TRUE wait events (blocking I/O, network, auth) - Mark compression/crypto as OPTIONAL (observability only) - Remove Category 5 (Executor Operations) entirely - Renumber remaining categories 6→5, 7→6, 8→7, 9→8 - Update counts from 92 to 86 issues - Add cross-reference to cancellation document 2. QUERY_CANCELLATION_ISSUES.md (4 issues - NEW): - Document CPU-intensive loops needing CHECK_FOR_INTERRUPTS() - Hash join building (serial + parallel) - Hash aggregate building - Ordered aggregate processing - Hash join batch loading - Clear focus on cancellability, not wait events Rationale: Interrupt checks and wait events are different concerns. Interrupt checks don't fix "CPU*" attribution - they enable query cancellation. Only true blocking operations need wait events to eliminate false "CPU" reporting.
…lysis Cleaned up WAIT_EVENTS_ANALYSIS.md to focus purely on wait events: - Removed Category 5 (Maintenance Operations) entirely - these were mostly about interrupt checks, not wait events - Removed all implementation phases, testing strategy, cost-benefit analysis, and monitoring impact sections - Renumbered categories: 6→5, 7→6, 8→7 - Updated counts: 86→78 issues - Updated file lists to reflect removed categories The document now focuses ONLY on true blocking operations that need wait event instrumentation to fix "CPU*" false positives.
Removed all mentions of interrupt checks from the wait events analysis: - Removed entire 'Important Distinction' section explaining interrupts - Removed CHECK_FOR_INTERRUPTS() mentions from code examples - Removed interrupt-related commentary from impact descriptions - Simplified type legend to focus only on wait events This document now focuses purely on wait event instrumentation gaps, without discussing interrupt checks or query cancellation at all.
Corrected Category 1 (I/O Operations) analysis: - Removed misleading 'two-layer' explanation of fd.c - Removed 15 items that were incorrectly listed as needing wait events (most pg_fsync() calls ARE already instrumented at call sites) - Kept only specific uninstrumented call sites: - Recovery signal file syncs in xlogrecovery.c (2 items) - Storage manager unlink() calls in md.c (3 items) - DSM fstat() calls in dsm_impl.c (2 items) Updated totals: - Category 1: 35 → 7 items (0 Critical, 5 High, 2 Medium) - Overall: 78 → 50 items (28 Critical, 12 High, 10 Medium) Removed proposed wait events for fd.c primitives that don't need them.
Updated WAIT_EVENTS_ANALYSIS.md to reflect accurate counts: - Total: 45 locations (previously incorrectly stated as 38 or 50) - Required: 30 locations (15 Critical + 13 High + 2 Medium) - Optional: 15 locations (7 Compression + 8 Crypto) Summary of corrections: - Executive summary: Updated from 38 to 45 total locations - Category table: Fixed Authentication count (23 not 20), Compression (7 not 6), Crypto (8 not 5) - Conclusion: Updated all counts to match actual findings - Appendix: Removed references to deleted categories (reorderbuffer, worker, bufmgr, lwlock) All numbers now internally consistent throughout the document. The 45 locations break down as: - Category 1 (I/O): 7 locations (required) - Category 2 (Auth): 23 locations (required) - Category 3 (Compression): 7 locations (optional) - Category 4 (Crypto): 8 locations (optional)
After user requested careful re-examination of all findings, discovered significant undercounting in authentication operations: **New Findings Added:** 1. LDAP - Added 5 missing operations (12 total, was 7): - Line 2220: ldap_sslinit() (Windows SSL init) - Line 2268: ldap_domain2hostlist() (DNS SRV lookup) - Line 2363/2365: ldap_start_tls_s() (TLS handshake) - Line 2526: ldap_simple_bind_s() (initial bind for search) - Line 2626: ldap_simple_bind_s() (user authentication bind) 2. PAM Authentication - NEW (2 operations): - Line 2115: pam_authenticate() - can invoke ANY external service - Line 2128: pam_acct_mgmt() - account management 3. GSSAPI/Kerberos - NEW (1 operation): - Line 996: gss_accept_sec_context() - may contact KDC 4. Backend Startup DNS - NEW (1 operation): - backend_startup.c lines 206-209: pg_getnameinfo_all() - Affects EVERY connection when log_hostname=on **Updated Totals:** - Total: 54 locations (was 45, +9) - Required: 39 locations (was 30, +9) - Optional: 15 locations (unchanged) - Authentication: 32 locations (was 23, +9) **Priority Breakdown:** - CRITICAL: 22 (was 15) - LDAP 12 + Ident 8 + PAM 2 - HIGH: 15 (was 13) - I/O 5 + RADIUS 5 + DNS 3 + GSSAPI 1 + Backend startup 1 - MEDIUM: 2 (unchanged) - DSM 2 Authentication is now the biggest gap with 32/39 required locations. All new wait events added to proposed events section.
After exhaustive verification, discovered COPY PROGRAM operations lack wait event instrumentation when communicating with external programs via pipes. **New Finding:** Category 1.4 - COPY FROM/TO PROGRAM (HIGH priority) **Locations:** 1. src/backend/commands/copyfromparse.c:252 - fread() from pipe when reading from external program - NO pgstat_report_wait wrapper - Blocks waiting for external program output 2. src/backend/commands/copyto.c:452-453 - fwrite() to pipe when writing to external program - NO pgstat_report_wait wrapper anywhere in copyto.c - Blocks waiting for external program to consume data **Verification performed:** - Checked OpenPipeStream() uses raw popen() - no instrumentation - Verified CopyGetData() callers have no wait event wrappers - Confirmed entire copyto.c file has zero pgstat_report_wait calls - Tested that these use stdio FILE*, NOT PostgreSQL's File* APIs - Verified no higher-level code path provides instrumentation **Impact:** - COPY FROM PROGRAM 'slow_script.sh' appears as CPU when blocked - COPY TO PROGRAM 'gzip > /nfs/file.gz' appears as CPU when blocked - file_fdw with PROGRAM mode has same issue (uses COPY infrastructure) **Updated Totals:** - Total: 56 locations (was 54, +2) - Required: 41 locations (was 39, +2) - Category 1 (I/O): 9 locations (was 7, +2) - HIGH priority: 17 (was 15, +2) **Proposed Wait Events:** - COPY_FROM_PROGRAM_READ - COPY_TO_PROGRAM_WRITE
Comprehensive analysis of pg_terminate_backend() failures across
PostgreSQL codebase, expanding from 4 to 11 locations.
**New Findings:**
**Authentication Operations (4 locations - CRITICAL/HIGH):**
5. LDAP Authentication (auth.c:2526, 2551, 2626)
- ldap_simple_bind_s() and ldap_search_s() are synchronous C calls
- NO interrupt checks - backends remain unkillable during LDAP
- Production impact: failed LDAP servers accumulate unkillable backends
6. Ident Authentication (auth.c:1659+)
- XXX comment explicitly notes missing WaitLatchOrSocket()
- Uses raw recv()/send() without interrupt handling
- Documented deficiency
7. RADIUS Authentication (auth.c:3094+)
- XXX comment explicitly notes missing WaitLatchOrSocket()
- Manual select() loop instead of proper latch handling
- Documented deficiency
8. PAM Authentication (auth.c:2115, 2128)
- pam_authenticate() and pam_acct_mgmt() are synchronous
- Can invoke ANY external service (LDAP, AD, scripts)
- Completely unkillable if PAM module blocks
**Base Backup Compression (3 locations - HIGH):**
9. Gzip (basebackup_gzip.c:176-215)
- while (zs->avail_in > 0) loop with deflate()
- NO CHECK_FOR_INTERRUPTS() - cannot cancel during compression
10. LZ4 (basebackup_lz4.c)
- LZ4F_compressUpdate() loops without interrupt checks
11. Zstandard (basebackup_zstd.c:198-224)
- ZSTD_compressStream2() loops without interrupt checks
**Updated Totals:**
- Total: 11 locations (was 4, +7)
- CRITICAL: 5 (hash join, hash agg, LDAP, PAM)
- HIGH: 6 (ordered agg, ident, RADIUS, 3x compression)
**By Category:**
- Executor: 4 locations (hash/agg operations)
- Authentication: 4 locations (all auth methods except password)
- Compression: 3 locations (all backup compression formats)
**Critical Impact - Authentication:**
Authentication issues are especially severe:
- Block BEFORE query processing starts
- pg_terminate_backend() completely ineffective
- Failed auth infrastructure causes unkillable backend accumulation
- LDAP/PAM use synchronous C APIs with no async alternatives
**Recommended Solutions:**
- Executor/Compression: Add CHECK_FOR_INTERRUPTS() to loops
- Ident/RADIUS: Use WaitLatchOrSocket() pattern (XXX comments)
- LDAP/PAM: Require async API migration or accept limitation
aeb6775 to
98f0e32
Compare
NikolayS
added a commit
that referenced
this pull request
Apr 22, 2026
The July 2025 "Requested WAL segment has already been removed" thread was authored by Japin Li, not Kukushkin. Kukushkin is a discussion participant alongside Fujii Masao et al.; his contribution was mailing-list commentary, not a patch. Fixed in four live references: - §1 "What exists today" bullet - §8.2 Future Work entry #2 (walsender restore_command integration) - §9 Reference #2 - §10.4 US-1 verdict prior-art note Changelog entries for v0.1 and v0.2 left as frozen history — their misattribution is preserved as part of the iteration record, not rewritten. Sawada (dynamic wal_level PoC, §9 ref #7) and Amul Sul (pg_waldump tarfile support, §9 ref #8) verified correctly attributed — both are thread originators / patch authors on their respective pgsql-hackers threads, no change needed.
NikolayS
pushed a commit
that referenced
this pull request
Apr 22, 2026
Generated from full postgres/postgres history (upstream/master, 64,103 commits). Bruce Momjian has 14,078 commits matching the screenshot, but ranks #2 — Tom Lane leads with 16,755. https://claude.ai/code/session_01SAWhyKXpDkFpwPiCP3EZ7T
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This analysis identifies 92 specific code locations across PostgreSQL where operations may block or consume significant time without proper wait event instrumentation, causing monitoring tools to incorrectly show activity as "CPU" load.
Key findings:
The report includes:
Estimated impact: This work could eliminate 70-80% of "CPU*" false positives in monitoring tools, providing accurate visibility into what PostgreSQL is actually doing.