Conversation

@robertmhaas (Owner)

Mandatory description

@robertmhaas force-pushed the pg_plan_advice branch 5 times, most recently from 3124131 to 30a9774 on January 28, 2026 16:49
robertmhaas and others added 3 commits January 29, 2026 08:04
cost_tidrangescan() was setting the disabled_nodes value correctly,
and then immediately resetting it to zero, due to poor code editing on
my part.

materialize_finished_plan correctly set matpath.parent to
zero, but forgot to also set matpath.parallel_workers = 0, causing
an access to uninitialized memory in cost_material. (This shouldn't
result in any real problem, but it makes valgrind unhappy.)

reparameterize_path was dereferencing a variable before verifying that
it was not NULL.
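
The third fix is the usual check-before-use reordering; schematically (not
the exact committed code):

    if (path == NULL)
        return NULL;            /* check first ... */
    switch (nodeTag(path))      /* ... dereference only afterwards */
    {
        default:
            break;
    }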

Reported-by: Tom Lane <tgl@sss.pgh.pa.us> (issue #1)
Reported-by: Michael Paquier <michael@paquier.xyz> (issue #1)
Diagnosed-by: Lukas Fittl <lukas@fittl.com> (issue #1)
Reported-by: Zsolt Parragi <zsolt.parragi@percona.com> (issue #2)
Reported-by: Richard Guo <guofenglinux@gmail.com> (issue #3)
Discussion: http://postgr.es/m/CAN4CZFPvwjNJEZ_JT9Y67yR7C=KMNa=LNefOB8ZY7TKDcmAXOA@mail.gmail.com
Discussion: http://postgr.es/m/aXrnPgrq6Gggb5TG@paquier.xyz
Use the proper constant InvalidXLogRecPtr instead of literal 0 when
assigning XLogRecPtr variables and struct fields.

This improves code clarity by making it explicit that these are
invalid LSN values rather than ambiguous zero literals.
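
For instance, the pattern being standardized looks roughly like this (the
variable name is illustrative):

    XLogRecPtr  flush_lsn;              /* illustrative variable name */

    flush_lsn = 0;                      /* before: ambiguous zero literal */
    flush_lsn = InvalidXLogRecPtr;      /* after: explicitly an invalid LSN */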

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aRtd2dw8FO1nNX7k@ip-10-97-1-34.eu-west-3.compute.internal
The leaks were hard to reach in practice and the impact was low.

The callers provide a buffer of the same number of bytes as the source
string (plus one for the NUL terminator) as a starting size, and libc
never increases the number of characters. But, if the byte length of
one of the converted characters is larger, then it might need a larger
destination buffer. Previously, in that case, the working buffers
would be leaked.

Even in that case, the call typically happens within a context that
will soon be reset. Regardless, it's worth fixing to avoid such
assumptions, and the fix is simple so it's worth backporting.

Discussion: https://postgr.es/m/e2b7a0a88aaadded7e2d19f42d5ab03c9e182ad8.camel@j-davis.com
Backpatch-through: 18
tglsfdc and others added 20 commits January 29, 2026 16:16
Commit 6ceef94 was still one brick shy of a load, because it caused
any usage at all of PGIOAlignedBlock or PGAlignedXLogBlock to fail
under older g++.  Notably, this broke "headerscheck --cplusplus".
We can permit references to these structs as abstract structs though;
only actual declaration of such a variable needs to be forbidden.
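
Roughly, the distinction looks like this (a sketch, not the committed test
case):

    /* abstract references to the struct stay legal under older g++ */
    extern void write_page(PGIOAlignedBlock *buf);  /* OK: pointer only */

    /* declaring a variable of the type is the case that must be rejected:
     *     PGIOAlignedBlock scratch;
     */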

Discussion: https://www.postgresql.org/message-id/3119480.1769189606@sss.pgh.pa.us
In fcb9c97 I included an assertion in BufferLockConditional() to detect if
a conditional lock acquisition is done on a buffer that we already have
locked. The assertion was added in the course of adding other assertions.
Unfortunately I failed to realize that some of our code relies on such lock
acquisitions to silently fail. E.g. spgist and nbtree may try to conditionally
lock an already locked buffer when acquiring an empty buffer.

LWLockConditionalAcquire(), which was previously used to implement
ConditionalLockBuffer(), does not have such an assert.

Instead of just removing the assert and relying on the lock acquisition to
fail because the buffer is already locked, this commit changes the behaviour of
conditional content lock acquisition to fail if the current backend has any
pre-existing lock on the buffer, even if the lock modes would not
conflict. The reason for that is that we currently do not have space to track
multiple lock acquisitions on a single buffer. Allowing multiple locks on the
same buffer by a backend also seems likely to lead to bugs.

There is only one non-self-exclusive conditional content lock acquisition, in
GetVictimBuffer(), but it is only used if the target buffer is not pinned and
thus can't already be locked by the current backend.
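
For reference, a sketch of the silent-failure pattern such callers rely on
(the helper name is hypothetical):

    /* hypothetical helper illustrating the caller pattern */
    static bool
    try_use_page(Buffer buf)
    {
        if (!ConditionalLockBuffer(buf))
            return false;   /* perhaps locked by ourselves; try another page */

        /* ... inspect and use the page while holding the content lock ... */

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        return true;
    }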

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/90bd2cbb-49ce-4092-9f61-5ac2ab782c94@gmail.com
Previously, the CheckXidAlive check was performed within the table_scan*next*
functions. This caused the check to be executed for every fetched tuple,
which is unnecessary overhead.

To fix, move the check to table_beginscan* so it is performed once per scan
rather than once per row.

Note: table_tuple_fetch_row_version() does not use a scan descriptor;
therefore, the CheckXidAlive check is retained in that function. The overhead
is unlikely to be relevant for the existing callers.

Reported-by: Andres Freund <andres@anarazel.de>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Suggested-by: Amit Kapila <akapila@postgresql.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/tlpltqm5jjwj7mp66dtebwwhppe4ri36vdypux2zoczrc2i3mp%40dhv4v4nikyfg
Author: Yugo Nagata <nagata@sraoss.co.jp>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/20260128120056.b2a3e8184712ab5a537879eb@sraoss.co.jp
This makes the arrays somewhat easier to read.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/202601281204.sdxbr5qvpunk@alvherre.pgsql
These changes should have been done by 2f96613, but were
overlooked.  I noticed while reviewing the code for commit b8926a5.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org
For readability. It was a slight modularity violation to have fields
in PGShmemHeader that were only used by the allocator code in
shmem.c. And it was inconsistent that ShmemLock was nevertheless not
stored there. Moving all the allocator-related fields to a separate
struct makes it more consistent and modular, and removes the need to
allocate and pass ShmemLock separately via BackendParameters.
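
A hypothetical sketch of the resulting shape (field names are illustrative,
not the committed ones):

    /* illustrative only; the committed struct may differ */
    typedef struct ShmemAllocatorState
    {
        slock_t     lock;           /* formerly the separate ShmemLock */
        Size        freeOffset;     /* offset of first unallocated byte */
        Size        totalSize;      /* total size of the shmem segment */
    } ShmemAllocatorState;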

Merge InitShmemAccess() and InitShmemAllocation() into a single
function that initializes the struct when called from postmaster, and
when called from backends in EXEC_BACKEND mode, re-establishes the
global variables. That's similar to all the *ShmemInit() functions
that we have.

Co-authored-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAExHW5uNRB9oT4pdo54qAo025MXFX4MfYrD9K15OCqe-ExnNvg@mail.gmail.com
BackgroundPsql needs to wait for all the output from an interactive
psql command to come back.  To make sure that's happened, it issues
the command, then issues \echo and \warn psql commands that echo
a "banner" string (which we assume won't appear in the command's
output), then waits for the banner strings to appear.  The hazard
in this approach is that the banner will also appear in the echoed
psql commands themselves, so we need to distinguish those echoes from
the desired output.  Commit 8b886a4 tried to do that by positing
that the desired output would be directly preceded and followed by
newlines, but it turns out that that assumption is timing-sensitive.
In particular, it tends to fail in builds made --without-readline,
wherein the command echoes will be made by the pty driver and may
be interspersed with prompts issued by psql proper.

It does seem safe to assume that the banner output we want will be
followed by a newline, since that should be the last output before
things quiesce.  Therefore, we can improve matters by putting quotes
around the banner strings in the \echo and \warn psql commands, so
that their echoes cannot include banner directly followed by newline,
and then checking for just banner-and-newline in the match pattern.

While at it, spruce up the pump() call in sub query() to look like
the neater version in wait_connect(), and don't die on timeout
until after printing whatever we got.

Reported-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Diagnosed-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com>
Discussion: https://postgr.es/m/db6fdb35a8665ad3c18be01181d44b31@postgrespro.ru
Backpatch-through: 14
Similarly to the preceding commit, 030_pager.pl was assuming
that patterns it looks for in interactive psql output would
appear by themselves on a line, but that assumption tends to
fall over in builds made --without-readline: the output we
get might have a psql prompt immediately followed by the
expected line of output.

For several of these tests, just checking for the pattern
followed by newline seems sufficient, because we could not
get a false match against the command echo, nor against the
unreplaced command output if the pager fails to be invoked
when expected.  However, that's fairly scary for the test
that was relying on information_schema.referential_constraints:
"\d+" could easily appear at the end of a line in that view.
Let's get rid of that hazard by making a custom test view
instead of using information_schema.referential_constraints.

This test script is new in v19, so no need for back-patch.

Reported-by: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Author: Oleg Tselebrovskiy <o.tselebrovskiy@postgrespro.ru>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Soumya S Murali <soumyamurali.work@gmail.com>
Discussion: https://postgr.es/m/db6fdb35a8665ad3c18be01181d44b31@postgrespro.ru
The build generates four files based on the wait event contents stored
in wait_event_names.txt:
- wait_event_types.h
- pgstat_wait_event.c
- wait_event_funcs_data.c
- wait_event_types.sgml

The SGML file is generated as part of a documentation build, with its
data stored in doc/src/sgml/ for meson and configure.  The three others
are handled differently for meson and configure:
- In configure, all the files are created in src/backend/utils/activity/.
A link to wait_event_types.h is created in src/include/utils/.
- In meson, all the files are created in src/include/utils/.

The two C files, pgstat_wait_event.c and wait_event_funcs_data.c, are
then included in wait_event.c and wait_event_funcs.c, respectively,
without the "utils/" path.

For configure, this does not present a problem.  For meson, this has to
be combined with a trick in src/backend/utils/activity/meson.build,
where include_directories needs to point to include/utils/ to make the
inclusion of the C files work properly.  This causes builds to pull in
PostgreSQL headers rather than system headers in some build paths, as
src/include/utils/ would take priority.

In order to fix this issue, this commit reworks the way the C/H files
are generated, becoming consistent with guc_tables.inc.c:
- For meson, basically nothing changes.  The files are still generated
in src/include/utils/.  The trick with include_directories is removed.
- For configure, the files are now generated in src/backend/utils/, with
links in src/include/utils/ pointing to the ones in src/backend/.  This
requires extra rules in src/backend/utils/activity/Makefile so that a
make command in this sub-directory is able to work.
- The three files now fall under header-stamp, which is actually simpler
as guc_tables.inc.c does the same.
- wait_event_funcs_data.c and pgstat_wait_event.c are now included with
"utils/" in their path.

This problem has not shown up in the buildfarm; it was noticed on AIX,
as a conflict with float.h.  It could, however, create conflicts in the
buildfarm depending on the environment, with unexpected headers pulled
in, so this fix is backpatched down to where the generation of the
wait-event files was introduced.

While on it, this commit simplifies wait_event_names.txt to mention
just the names of the files generated rather than their paths: the
paths listed there had become incorrect, and the SGML file's path was
wrong.

This change has been tested in CI, down to v17.  Locally, I have run
tests with configure (with and without VPATH), as well as meson, on the
three branches.

Combo oversight in fa88928 and 1e68e43.

Reported-by: Aditya Kamath <aditya.kamath1@ibm.com>
Discussion: https://postgr.es/m/LV8PR15MB64888765A43D229EA5D1CFE6D691A@LV8PR15MB6488.namprd15.prod.outlook.com
Backpatch-through: 17
A failing unlink() was reporting an incorrect error message, referring
to stat().
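
Schematically, the fix just makes the message match the syscall (details
are illustrative):

    if (unlink(path) < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 /* previously referred to stat(): "could not stat file" */
                 errmsg("could not remove file \"%s\": %m", path)));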

Author: Man Zeng <zengman@halodbtech.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/tencent_3BBE865C5F49D452360FF190@qq.com
Backpatch-through: 17
Up to now we've used GNU-style local labels for branch targets
in s_lock.h's assembly blocks.  But there's an alternative style,
which I for one didn't know about till recently: use regular
assembler labels, and insert a per-asm-block number in them
using %= to ensure they are distinct across multiple TAS calls
within one source file.  gcc has had %= since gcc 2.0, and
I've verified that clang knows it too.
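
Schematically (an illustration of the label style only, not the actual TAS
code):

    __asm__ __volatile__(
        "    cmpl    $0, %0         \n"
        "    jne     done_%=        \n"     /* %= expands to a unique number */
        "    movl    $1, %0         \n"
        "done_%=:                   \n"     /* instead of GNU-style "1:" */
        : "+m" (*lock)
        :
        : "memory", "cc");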

While the immediate motivation for changing this is that AIX's
assembler doesn't do local labels, it seems to me that this is a
superior solution anyway.  There is nothing mnemonic about "1:",
while a regular label can convey something useful, and at least
to me it feels less error-prone.  Therefore let's standardize on
this approach, also converting the one other usage in s_lock.h.

Discussion: https://postgr.es/m/399291.1769998688@sss.pgh.pa.us
Separate att_align_nominal() into two macros, similarly to what
was already done with att_align_datum() and att_align_pointer().
The inner macro att_nominal_alignby() is really just TYPEALIGN(),
while att_align_nominal() retains its previous API by mapping
TYPALIGN_xxx values to numbers of bytes to align to and then
calling att_nominal_alignby().  In support of this, split out
tupdesc.c's logic to do that mapping into a publicly visible
function typalign_to_alignby().

Having done that, we can replace performance-critical uses of
att_align_nominal() with att_nominal_alignby(), where the
typalign_to_alignby() mapping is done just once outside the loop.
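
Schematically, the hoisting looks like this (loop details are illustrative):

    uint8   alignby = typalign_to_alignby(typalign);    /* map once, up front */
    Size    off = 0;

    for (int i = 0; i < nelems; i++)
    {
        off = att_nominal_alignby(off, alignby);        /* now just TYPEALIGN() */
        /* ... fetch the element at 'off', then advance 'off' past it ... */
    }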

In most places I settled for doing typalign_to_alignby() once
per function.  We could in many places pass the alignby value
in from the caller if we wanted to change function APIs for this
purpose; but I'm a bit loath to do that, especially for exported
APIs that extensions might call.  Replacing a char typalign
argument by a uint8 typalignby argument would be an API change
that compilers would fail to warn about, thus silently breaking
code in hard-to-debug ways.  I did revise the APIs of array_iter_setup
and array_iter_next, moving the element type attribute arguments to
the former; if any external code uses those, the argument-count
change will cause visible compile failures.

Performance testing shows that ExecEvalScalarArrayOp is sped up by
about 10% by this change, when using a simple per-element function
such as int8eq.  I did not check any of the other loops optimized
here, but it's reasonable to expect similar gains.

Although the motivation for creating this patch was to avoid a
performance loss if we add some more typalign values, it evidently
is worth doing whether that patch lands or not.

Discussion: https://postgr.es/m/1127261.1769649624@sss.pgh.pa.us
…porary table.

The test relies on VACUUM being able to mark a page all-visible, but
this can fail when autovacuum in other sessions prevents the visibility
horizon from advancing. Making the test table temporary isolates its
horizon from other sessions, including catalog table vacuums, ensuring
reliable test behavior.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/2b09fba6-6b71-497a-96ef-a6947fcc39f6%40gmail.com
This commit introduces a new prompt escape %i for psql, which shows
whether the connected server is operating in hot standby mode. It
expands to "standby" if the server reports in_hot_standby = on, and
"primary" otherwise.

This is useful for distinguishing standby servers from primary ones
at a glance, especially when working with multiple connections in
replicated environments where libpq's multi-host connection strings
are used.

Author: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/016f6738-f9a9-4e98-bb5a-e1e4b9591d46@uni-muenster.de
…changes.

Previously, when synchronous_standby_names was changed (for example,
by reducing the number of required synchronous standbys or modifying
the standby list), backends waiting for synchronous replication were not
released immediately, even if the new configuration no longer required them
to wait. They could remain blocked until additional messages arrived from
standbys and triggered their release.

This commit improves walsender so that backends waiting for synchronous
replication are released as soon as the updated configuration takes effect and
the new settings no longer require them to wait, by calling
SyncRepReleaseWaiters() when configuration changes are processed.

As part of this change, the duplicated code that handles configuration changes
in walsender has been refactored into a new helper function, which is now used
at the three existing call sites.

Since this is an improvement rather than a bug fix, it is applied only to
the master branch.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSRii0tEYhu5cePmRcvS=ZrxTLEvxm3Kj0d7_uKGdM23g@mail.gmail.com
This routine has an option to bypass an error if a WAL summary file is
opened for read but is missing (missing_ok=true).  However, the code
incorrectly checked for EEXIST, which matters when using O_CREAT and
O_EXCL, rather than ENOENT, for this case.

There are currently only two callers of OpenWalSummaryFile() in the
tree, and both use missing_ok=false, meaning that the check based on the
errno is currently dead code.  This issue could matter for out-of-core
code or future backpatches that would like to use missing_ok set to
true.

Issue spotted while monitoring this area of the code, after
a9afa02.

Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aYAf8qDHbpBZ3Rml@paquier.xyz
Backpatch-through: 17
Two wait events are added to the COPY FROM/TO code:
* COPY_FROM_READ: reading data from a copy_file.
* COPY_TO_WRITE: writing data to a copy_file.

In the COPY code, copy_file can be set when processing a command in
pipe mode (for the non-DestRemote case), in program mode, or in file
mode; the wait events are reported when processing fread() or fwrite()
on it.
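
The reporting presumably follows the usual wait-event pattern; a sketch (the
enum spellings are assumed from the event names above):

    pgstat_report_wait_start(WAIT_EVENT_COPY_FROM_READ);   /* name assumed */
    bytesread = fread(databuf, 1, maxread, cstate->copy_file);
    pgstat_report_wait_end();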

Author: Nikolay Samokhvalov <nik@postgres.ai>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CAM527d_iDzz0Kqyi7HOfqa-Xzuq29jkR6AGXqfXLqA5PR5qsng@mail.gmail.com
This keeps run-time assertions and static assertions clearly separate.
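
For context, the two kinds of assertion look like this:

    Assert(nkeys > 0);                  /* run-time, assert-enabled builds only */

    StaticAssertDecl(sizeof(ItemIdData) == 4,
                     "size of line pointer must be 4 bytes");   /* compile time */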

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2273bc2a-045d-4a75-8584-7cd9396e5534%40eisentraut.org
alvherre and others added 23 commits February 9, 2026 19:15
They are not and never have been used by any known code -- apparently we
just cargo-culted them in commit 37484ad (or their ancestor macros
anyway, which begat these functions in commit 34694ec).  Allegedly
they're also potentially dangerous; users are better off going through
HeapTupleSetHintBits instead.

Author: Andy Fan <zhihuifan1213@163.com>
Discussion: https://postgr.es/m/87sejogt4g.fsf@163.com
Clean up a few leftovers from when the deadlock checker was called
from a signal handler. We stopped doing that in commit 6753333, in
year 2015.

- CheckDeadLock can return its result directly to the caller;
  there's no need to use a global variable for that.

- Remove outdated comments that claimed that CheckDeadLock "signals
  ProcSleep".

- It should be OK to ereport() from DeadLockCheck now. I considered
  getting rid of InitDeadLockChecking() and moving the workspace
  allocations into DeadLockCheck, but it's still good to avoid doing
  the allocations while we're holding all the partition locks. So just
  update the comment to give that as the reason we do the allocations
  up front.
For binary upgrades from v16 or newer, pg_upgrade transfers the
files for pg_largeobject_metadata from the old cluster, as opposed
to using COPY or ordinary SQL commands to reconstruct its contents.
While this approach adds complexity, it can greatly reduce
pg_upgrade's runtime when there are many large objects.

Large objects with comments or security labels are one source of
complexity for this approach.  During pg_upgrade, schema
restoration happens before files are transferred.  Comments and
security labels are transferred in the former step, but the COMMENT
and SECURITY LABEL commands will fail if their corresponding large
objects do not exist.  To deal with this, pg_upgrade first copies
only the rows of pg_largeobject_metadata that are needed to avoid
failures.  Later, pg_upgrade overwrites those rows by replacing
pg_largeobject_metadata's files with the files from the old cluster.

Unfortunately, there's a subtle problem here.  Simply put, there's
no guarantee that pg_upgrade will overwrite all of
pg_largeobject_metadata's files on the new cluster.  For example,
the new cluster's version might more aggressively extend relations
or create visibility maps, and pg_upgrade's file transfer code is
not sophisticated enough to remove files that lack counterparts in
the old cluster.  These extra files could cause problems
post-upgrade.

More fortunately, we can simultaneously fix the aforementioned
problem and further optimize binary upgrades for clusters with many
large objects.  If we teach the COMMENT and SECURITY LABEL commands
to allow nonexistent large objects during binary upgrades,
pg_upgrade no longer needs to transfer pg_largeobject_metadata's
contents beforehand.  This approach also allows us to remove the
associated dependency tracking from pg_dump, even for upgrades from
v12-v15 that use COPY to transfer pg_largeobject_metadata's
contents.

In addition to what is described in the previous paragraph, this
commit modifies the query in getLOs() to only retrieve LOs with
comments or security labels for upgrades from v12 or newer.  We
have long assumed that such usage is rare, so this should reduce
pg_upgrade's memory usage and runtime in many cases.  We might also
be able to remove the "upgrades from v12 or newer" restriction on
the recent batch of optimizations by adding special handling for
pg_largeobject_metadata's hidden OID column on older versions
(since this catalog previously used the now-removed WITH OIDS
feature), but that is left as a future exercise.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/3yd2ss6n7xywo6pmhd7jjh3bqwgvx35bflzgv3ag4cnzfkik7m%40hiyadppqxx6w
The IS DISTINCT FROM construct compares values acting as though NULL
were a normal data value, rather than "unknown".  Semantically, "x IS
DISTINCT FROM y" yields true if the values differ or if exactly one is
NULL, and false if they are equal or both NULL.  Unlike ordinary
comparison operators, it never returns NULL.

Previously, the planner only simplified this construct if all inputs
were constants, folding it to a constant boolean result.  This patch
extends the optimization to cases where inputs are non-constant but
proven to be non-nullable.  Specifically, "x IS DISTINCT FROM NULL"
folds to constant TRUE if "x" is known to be non-nullable.  For cases
where both inputs are guaranteed not to be NULL, the expression
becomes semantically equivalent to "x <> y", and the DistinctExpr is
converted into an inequality OpExpr.

This transformation provides several benefits.  It converts the
comparison into a standard operator, allowing the use of partial
indexes and constraint exclusion.  Furthermore, if the clause is
negated (i.e., "IS NOT DISTINCT FROM"), it simplifies to an equality
operator.  This enables the planner to generate better plans using
index scans, merge joins, hash joins, and EC-based qual deduction.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
The BooleanTest construct (IS [NOT] TRUE/FALSE/UNKNOWN) treats a NULL
input as the logical value "unknown".  However, when the input is
proven to be non-nullable, this special handling becomes redundant.
In such cases, the construct can be simplified directly to a boolean
expression or a constant.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
In the spirit of 8d19d0e, this patch teaches the planner about the
principle that NullTest with !argisrow is fully equivalent to SQL's IS
[NOT] DISTINCT FROM NULL.

The parser already performs this transformation for literal NULLs.
However, a DistinctExpr expression with one input evaluating to NULL
during planning (e.g., via const-folding of "1 + NULL" or parameter
substitution in custom plans) currently remains as a DistinctExpr
node.

This patch closes the gap for const-folded NULLs.  It specifically
targets the case where one input is a constant NULL and the other is a
nullable non-constant expression.  (If the other input were otherwise,
the DistinctExpr node would have already been simplified to a constant
TRUE or FALSE.)

This transformation can be beneficial because NullTest is much more
amenable to optimization than DistinctExpr, since the planner knows a
good deal about the former and next to nothing about the latter.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAMbWs49BMAOWvkdSHxpUDnniqJcEcGq3_8dd_5wTR4xrQY8urA@mail.gmail.com
This commit adds three attributes to the system view pg_stats_ext_exprs,
whose data can exist when an expression involves a range type:
range_length_histogram
range_empty_frac
range_bounds_histogram

These statistics fields have existed since 918eee0, and later became
viewable in pg_stats in bc3c8db.  This puts the definition of
pg_stats_ext_exprs on par with pg_stats.

This issue showed up during the discussion about the restore of
extended statistics for expressions, which requires being able to query
from the catalogs the stats data to restore.  Having access to this data
is useful on its own, without the restore part.

Some documentation and some tests are added, written by me.  Corey has
authored the part in system_views.sql.

Bump catalog version.

Author: Corey Huinker <corey.huinker@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aYmCUx9VvrKiZQLL@paquier.xyz
The log messages used in this file applied too much quoting logic:
- No need for quote_identifier(); it is fine not to use it in the
context of a log entry.
- The usual project style is to group the namespace and object together
in a quoted string when mentioned in a log message.  This code quoted
the namespace name and the extended statistics object name separately,
which was confusing.
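
A sketch of the preferred style (the message text is illustrative):

    /* schema and object grouped into one quoted string, no quote_identifier() */
    ereport(LOG,
            (errmsg("statistics object \"%s.%s\" has been processed",
                    nspname, stxname)));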

Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20260210.143752.1113524465620875233.horikyota.ntt@gmail.com
This helps the next commit.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
Share the same PROCSIG_RECOVERY_CONFLICT flag for all recovery
conflict reasons. To distinguish, have a bitmask in PGPROC to indicate
the reason(s).
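
A hypothetical sketch of the scheme (names are illustrative, not the
committed ones):

    /* in PGPROC: pending conflict reason(s), kept as a bitmask */
    pg_atomic_uint32 recoveryConflictReasons;

    /* startup process: record the reason, then send the one shared signal */
    pg_atomic_fetch_or_u32(&proc->recoveryConflictReasons,
                           1 << RECOVERY_CONFLICT_SNAPSHOT);    /* assumed name */
    SendProcSignal(proc->pid, PROCSIG_RECOVERY_CONFLICT, procNumber);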

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
Two changes here:

1. Introduce a separate RECOVERY_CONFLICT_BUFFERPIN_DEADLOCK flag to
indicate a suspected deadlock that involves a buffer pin. Previously
the startup process used the same flag for a deadlock involving just
regular locks, and to check for deadlocks involving the buffer
pin. The cases are handled separately in the startup process, but the
receiving backend had to deduce which one it was based on
HoldingBufferPinThatDelaysRecovery(). With a separate flag, the
receiver doesn't need to guess.

2. Rewrite the ProcessRecoveryConflictInterrupt() function to not rely
on fall-through in the switch statement. That was difficult to
read.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/4cc13ba1-4248-4884-b6ba-4805349e7f39@iki.fi
Commit 4020b37 supposed that
it would be a good idea to handle testing PGS_CONSIDER_NONPARTIAL
within cost_material to save callers the trouble, but that turns
out not to be a very good idea. One concern is that it makes
cost_material() dependent on the caller having initialized
certain fields in the MaterialPath, which is a bit awkward for
materialize_finished_plan, which wants to use a dummy path.

Another problem is that it can result in generating materialized
nested loops where the Materialize node is disabled, contrary to
the intention of joinpath.c's logic in match_unsorted_outer() and
consider_parallel_nestloop(), which aims to consider such paths
only when they would not need to be disabled. In the previous
coding, it was possible for the pgs_mask on the joinrel to have
PGS_CONSIDER_NONPARTIAL set, while the inner rel had the same
bit clear. In that case, we'd generate and then disable a
Materialize path.

That seems wrong, so instead, pull up the logic to test the
PGS_CONSIDER_NONPARTIAL bit into joinpath.c, restoring the
historical behavior that a given materialized nested loop is either
not generated at all or not disabled.

Discussion: http://postgr.es/m/CA+TgmoawzvCoZAwFS85tE5+c8vBkqgcS8ZstQ_ohjXQ9wGT9sw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com
Commit 94f3ad3 failed to do this
because I couldn't think of a use for the information, but this has
proven to be short-sighted. Best to fix it before this code is
officially released.

Now, the only argument to standard_planner that isn't passed to
planner_setup_hook is boundParams, but that is accessible via
glob->boundParams, and so doesn't need to be passed separately.

Discussion: https://www.postgresql.org/message-id/CA+TgmoYS4ZCVAF2jTce=bMP0Oq_db_srocR4cZyO0OBp9oUoGg@mail.gmail.com
Suppose that we're currently planning a query and, when that same
query was previously planned and executed, we learned something about
how a certain table within that query should be planned. We want to
take note when that same table is being planned during the current
planning cycle, but this is difficult to do, because the RTI of the
table from the previous plan won't necessarily be equal to the RTI
that we see during the current planning cycle. This is because each
subquery has a separate range table during planning, but these are
flattened into one range table when constructing the final plan,
changing RTIs.

Commit 8c49a48 allows us to match up
subqueries seen in the previous planning cycles with the subqueries
currently being planned just by comparing textual names, but that's
not quite enough to let us deduce anything about individual tables,
because we don't know where each subquery's range table appears in
the final, flattened range table.

To fix that, store a list of SubPlanRTInfo objects in the final
planned statement, each including the name of the subplan, the offset
at which it begins in the flattened range table, and whether or not
it was a dummy subplan -- if it was, some RTIs may have been dropped
from the final range table, but also there's no need to control how
a dummy subquery gets planned. The toplevel subquery has no name and
always begins at rtoffset 0, so we make no entry for it.

This commit teaches pg_overexplain's RANGE_TABLE option to make use
of this new data to display the subquery name for each range table
entry.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
An extension (or core code) might want to reconstruct the planner's
choice of join order from the final plan. To do so, it must be possible
to find all of the RTIs that were part of the join problem in that plan.
The previous commit, together with the earlier work in
8c49a48, is enough to let us match up
RTIs we see in the final plan with RTIs that we see during the planning
cycle, but we still have a problem if the planner decides to drop some
RTIs out of the final plan altogether.

To fix that, when setrefs.c removes a SubqueryScan, single-child Append,
or single-child MergeAppend from the final Plan tree, record the type of
the removed node and the RTIs that the removed node would have scanned
in the final plan tree. It would be natural to record this information
on the child of the removed plan node, but that would require adding
an additional pointer field to type Plan, which seems undesirable.
So, instead, store the information in a separate list that the
executor need never consult, and use the plan_node_id to identify
the plan node with which the removed node is logically associated.

Also, update pg_overexplain to display these details.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
An extension (or core code) might want to reconstruct the planner's
decisions about whether and where to perform partitionwise joins from
the final plan. To do so, it must be possible to find all of the RTIs
of partitioned tables appearing in the plan. But when an AppendPath
or MergeAppendPath pulls up child paths from a subordinate AppendPath
or MergeAppendPath, the RTIs of the subordinate path do not appear
in the final plan, making this kind of reconstruction impossible.

To avoid this, propagate the RTI sets that would have been present
in the 'apprelids' field of the subordinate Append or MergeAppend
nodes that would have been created into the surviving Append or
MergeAppend node, using a new 'child_append_relid_sets' field for
that purpose. The value of this field is a list of Bitmapsets,
because each relation whose append-list was pulled up had its own
set of RTIs: just one, if it was a partitionwise scan, or more than
one, if it was a partitionwise join. Since our goal is to see where
partitionwise joins were done, it is essential to avoid losing the
information about how the RTIs were grouped in the pulled-up
relations.

This commit also updates pg_overexplain so that EXPLAIN (RANGE_TABLE)
will display the saved RTI sets.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
Up until now, the only way for a loadable module to disable the use of a
particular index was to use get_relation_info_hook to remove it from the
index list. While that works, it has some disadvantages. First, the
index becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it's more desirable to fall back to the next-cheapest
plan than to be forced into a sequential scan.

Provide a facility that (1) can be used to stabilize certain plan choices
so that the planner cannot reverse course without authorization and
(2) can be used by knowledgeable users to insist on plan choices contrary
to what the planner believes best. In both cases, terrible outcomes are
possible: users should think twice and perhaps three times before
constraining the planner's ability to do as it thinks best; nevertheless,
there are problems that are much more easily solved with these facilities
than without them.

This patch takes the approach of analyzing a finished plan to produce
textual output, which we call "plan advice", that describes key
decisions made during planning; if that plan advice is provided during
future planning cycles, it will force those key decisions to be made in
the same way.  Not all planner decisions can be controlled using advice;
for example, decisions about how to perform aggregation are currently
out of scope, as is choice of sort order. Plan advice can also be edited
by the user, or even written from scratch in simple cases, making it
possible to generate outcomes that the planner would not have produced.
Partial advice can be provided to control some planner outcomes but not
others.

Currently, plan advice is focused only on specific outcomes, such as
the choice to use a sequential scan for a particular relation, and not
on estimates that might contribute to those outcomes, such as a
possibly-incorrect selectivity estimate. While it would be useful to
users to be able to provide plan advice that affects selectivity
estimates or other aspects of costing, that is out of scope for this
commit.

For more details, see contrib/pg_plan_advice/README or the SGML
documentation.

NOTE: There are still some known problems with this code, which are
called out by XXX.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Dian Fay <di@nmfay.com>
Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
Commit e222534 adjusted the logic in
add_path() to keep the path list sorted by disabled_nodes and then by
total_cost, but failed to make the corresponding adjustment to
add_partial_path. As a result, add_partial_path might sort the path list
just by total cost, which could lead to later planner misbehavior.
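
The invariant being restored, schematically (not the exact committed
comparison):

    /* partial path list order: fewer disabled nodes first, then cheaper first */
    if (path1->disabled_nodes != path2->disabled_nodes)
        path1_sorts_first = (path1->disabled_nodes < path2->disabled_nodes);
    else
        path1_sorts_first = (path1->total_cost < path2->total_cost);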

In principle, this should be back-patched to v18, but we are typically
reluctant to back-patch planner fixes for fear of destabilizing working
installations, and it is unclear to me that this has sufficiently
serious consequences to justify an exception, so for now, no back-patch.
Previously, the comments stated that there was no point in considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query, yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
The TAP test included in this new module runs the regression tests
with pg_plan_advice loaded. It arranges for each query to be planned
twice.  The first time, we generate plan advice. The second time, we
replan the query using the resulting advice string. If the tests
fail, that means that using pg_plan_advice to tell the planner to
do what it was going to do anyway breaks something, which indicates
a problem either with pg_plan_advice or with the planner.