pg_rewind: detect independent same-TLI promotions#36
Open
NikolayS wants to merge 2807 commits into
Open
Conversation
Suggested-by: vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/CALDaNm3tiKhtegx5Cawi34UjbHmNGEDNAtScGM1RgWRtV-5_0Q@mail.gmail.com
The investigation into the negative test performance impact of 7e8aeb9 lead to discovering that there are a few issues with WAIT FOR. This commit is just a minimal fix to prevent hangs in standby_flush mode, due to WAIT FOR ... 'standby_flush' seeing a 0 LSN if a newly started walreceiver does not receive any writes, because the stanby is already caught up. There are several other issues and this is isn't necessarily the best fix. But this way we get the hangs out of the way. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/zqbppucpmkeqecfy4s5kscnru4tbk6khp3ozqz6ad2zijz354k@w4bdf4z3wqoz
If pg_stash_advice.persist = true, stashed advice will be written to pg_stash_advice.tsv in the data directory, periodically and at shutdown. On restart, stash modifications are locked out until this file has been reloaded, but queries will not be, so there may be a short window after startup during which previously-stashed advice is not automatically applied. Author: Robert Haas <rhaas@postgresql.org> Co-authored-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/CA+Tgmob87qsWa-VugofU6epuV0H5XjWZGMbQas4Q-ADKmvSyBg@mail.gmail.com
When a backend is terminated via pg_terminate_backend() or an external SIGTERM, the error message now includes the sender's PID and UID as errdetail, making it easier to identify the source of unexpected terminations in multi-user environments. On platforms that support SA_SIGINFO (Linux, FreeBSD, and most modern Unix systems), the signal handler captures si_pid and si_uid from the siginfo_t structure. On platforms without SA_SIGINFO, the detail is simply omitted. Author: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Chao Li <1356863904@qq.com> Discussion: https://postgr.es/m/CAKZiRmyrOWovZSdixpLd3PGMQXuQL_zw2Ght5XhHCkQ1uDsxjw@mail.gmail.com
Checking for 'havePin' is sufficient here. An earlier version of the patch didn't have the 'havePin' variable and used 'so->hashso_bucket_buf == so->currPos.buf' as the condition when both locking and unlocking the page. The havePin variable was added later during development, but the unlocking condition wasn't fully updated. Tidy it up. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/b9de8d05-3b02-4a27-9b0b-03972fa4bfd3@iki.fi
Add a new GUC max_repack_replication_slots, which lets the user reserve some additional replication slots for concurrent repack (and only concurrent repack). With this, the user doesn't have to worry about changing the max_replication_slots in order to cater for use of concurrent repack. (We still use the same pool of bgworkers though, but that's less commonly a problem than slots.) Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com> Discussion: https://postgr.es/m/202604012148.nnnmyxxrr6nh@alvherre.pgsql
PGOAUTHDEBUG is a blunt instrument: you get all the debugging features, or none of them. The most annoying consequence during manual use is the Curl debug trace, which tends to obscure the device flow prompt entirely. The promotion of PGOAUTHCAFILE into its own feature in 9933681 improved the situation somewhat, but there's still the discomfort of knowing you have to opt into many dangerous behaviors just to get the single debug feature you wanted. Explode the PGOAUTHDEBUG syntax into a comma-separated list. The old "UNSAFE" value enables everything, like before. Any individual unsafe features still require the envvar to begin with an "UNSAFE:" prefix, to try to interrupt the flow of someone who is about to do something they should not. So now, rather than PGOAUTHDEBUG=UNSAFE # enable all the unsafe things a developer can say PGOAUTHDEBUG=call-count # only show me the call count. safe! PGOAUTHDEBUG=UNSAFE:trace # print secrets, but don't allow HTTP To avoid adding more build system scaffolding to libpq-oauth, implement this entirely in a small private header. This unfortunately can't be standalone, so it needs a headerscheck exception. Author: Zsolt Parragi <zsolt.parragi@percona.com> Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAOYmi%2B%3DfbZNJSkHVci%3DGpR8XPYObK%3DH%2B2ERRha0LDTS%2BifsWnw%40mail.gmail.com Discussion: https://postgr.es/m/CAN4CZFMmDZMH56O9vb_g7vHqAk8ryWFxBMV19C39PFghENg8kA%40mail.gmail.com
OAuth validators can already use custom GUCs to configure behavior globally. But we currently provide no ability to adjust settings for individual HBA entries, because the original design focused on a world where a provider covered a "single audience" of users for one database cluster. This assumption does not apply to multitenant use cases, where a single validator may be controlling access for wildly different user groups. To improve this use case, add two new API calls for use by validator callbacks: RegisterOAuthHBAOptions() and GetOAuthHBAOption(). Registering options "foo" and "bar" allows a user to set "validator.foo" and "validator.bar" in an oauth HBA entry. These options are stringly typed (syntax validation is solely the responsibility of the defining module), and names are restricted to a subset of ASCII to avoid tying our hands with future HBA syntax improvements. Unfortunately, we can't check the custom option names during a reload of the configuration, like we do with standard HBA options, without requiring all validators to be loaded via shared_preload_libraries. (I consider this to be a nonstarter: most validators should probably use session_preload_libraries at most, since requiring a full restart just to update authentication behavior will be unacceptable to many users.) Instead, the new validator.* options are checked against the registered list at connection time. Multiple alternatives were proposed and/or prototyped, including extending the GUC system to allow per-HBA overrides, joining forces with recent refactoring work on the reloptions subsystem, and giving the ability to customize HBA options to all PostgreSQL extensions. I personally believe per-HBA GUC overrides are the best option, because several existing GUCs like authentication_timeout and pre_auth_delay would fit there usefully. But the recent addition of SNI per-host settings in 4f43302 indicates that a more general solution is needed, and I expect that to take multiple releases' worth of discussion. This compromise patch, then, is intentionally designed to be an architectural dead end: simple to describe, cheap to maintain, and providing just enough functionality to let validators move forward for PG19. The hope is that it will be replaced in the future by a solution that can handle per-host, per-HBA, and other per-context configuration with the same functionality that GUCs provide today. In the meantime, the bulk of the code in this patch consists of strict guardrails on the simple API, to try to ensure that we don't have any reason to regret its existence during its unknown lifespan. I owe particular thanks here to Zsolt Parragi, who prototyped several approaches that guided the final design. Suggested-by: Zsolt Parragi <zsolt.parragi@percona.com> Suggested-by: VASUKI M <vasukianand0119@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAN4CZFM3b8u5uNNNsY6XCya257u%2BDofms3su9f11iMCxvCacag%40mail.gmail.com
The timing infrastructure (INSTR_* macros) measures time elapsed using clock_gettime() on POSIX systems, which returns the time as nanoseconds, and QueryPerformanceCounter() on Windows, which is a specialized timing clock source that returns a tick counter that needs to be converted to nanoseconds using the result of QueryPerformanceFrequency(). This conversion currently happens ad-hoc on Windows, e.g. when calling INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every invocation, despite the frequency being stable after program start, incurring unnecessary overhead. It also causes a fractured implementation where macros are defined differently between platforms. To ease code readability, and prepare for a future change that intends to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce new pg_ticks_to_ns() / pg_ns_to_ticks() functions that get called from INSTR_* macros on all platforms. These functions rely on a separately initialized ticks_per_ns_scaled value, that represents the conversion ratio. This value is initialized from QueryPerformanceFrequency() on Windows, and set to zero on x86-64 POSIX systems, which results in the ticks being treated as nanoseconds. Other architectures always directly return the original ticks. To support this, pg_initialize_timing() is introduced, and is now mandatory for both the backend and any frontend programs to call before utilizing INSTR_* macros. In passing, fix variable names in comment documenting INSTR_TIME_ADD_NANOSEC(). Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
This adds additional x86 specific CPUID checks for flags needed for determining whether the Time-Stamp Counter (TSC) is usable on a given system, as well as a helper function to retrieve the TSC frequency from CPUID. This is intended for a future patch that will utilize the TSC to lower the overhead of timing instrumentation. In passing, always make pg_cpuid_subleaf reset the variables used for its result, to avoid accidentally using stale results if __get_cpuid_count errors out and the caller doesn't check for it. Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: John Naylor <john.naylor@postgresql.org> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version) Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSCP instructions, instead of APIs like
clock_gettime() on POSIX systems.
This reduces the overhead of EXPLAIN with ANALYZE and TIMING ON. Tests showed
that the overhead on top of actual runtime when instrumenting queries moving
lots of rows through the plan can be reduced from 2x as slow to 1.2x as slow
compared to the actual runtime. More complex workloads such as TPCH queries
have also shown ~20% gains when instrumented compared to before.
To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when reliable, for example
when running on modern Intel CPUs, or when running on Linux and the system
clocksource is reported as "tsc". The use of the operating system clock source
can be enforced by setting "system", or on x86-64 architectures the use of TSC
can be enforced by explicitly setting "tsc".
In order to use the TSC the frequency is first determined by use of CPUID, and
if not available, by running a short calibration loop at program start,
falling back to the system clock source if TSC values are not stable.
Note, that we split TSC usage into the RDTSC CPU instruction which does not
wait for out-of-order execution (faster, less precise) and the RDTSCP
instruction, which waits for outstanding instructions to retire. RDTSCP is
deemed to have little benefit in the typical InstrStartNode() /
InstrStopNode() use case of EXPLAIN, and can be up to twice as slow. To
separate these use cases, the new macro INSTR_TIME_SET_CURRENT_FAST() is
introduced, which uses RDTSC.
The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be
used when precision is more important than performance. When the system timing
clock source is used both of these macros instead utilize the system
APIs (clock_gettime / QueryPerformanceCounter) like before.
Additional users of interval timing, such as track_io_timing and
track_wal_io_timing could also benefit from being converted to use
INSTR_TIME_SET_CURRENT_FAST() but are left for future changes.
Author: Lukas Fittl <lukas@fittl.com>
Author: Andres Freund <andres@anarazel.de>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an earlier version)
Reviewed-by: Maciek Sakrejda <m.sakrejda@gmail.com> (in an earlier version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (in an earlier version)
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> (in an earlier version)
Discussion: https://postgr.es/m/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de
UNIQUE/PRIMARY KEY ... WITHOUT OVERLAPS requires the no-overlap column to be a range or multirange, but it should allow a domain over such a type too. This requires minor adjustments in both the parser and executor. In passing, fix a nearby break-instead-of-continue thinko in transformIndexConstraint. This had the effect of disabling parse-time validation of the no-overlap column's type in the context of ALTER TABLE ADD CONSTRAINT, if it follows a dropped column. We'd still complain appropriately at runtime though. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxGoAmN_0iJ=hjTG0vGpOSOyy-vYyfE+-q0AWxrq2_p5XQ@mail.gmail.com Backpatch-through: 18
Use unaligned output for multiple EXPLAIN queries using non-text format in regression tests. With aligned output adding/removing explain fields can be very disruptive, as it often modifies the whole block because of padding. Unaligned output does not have this issue. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
Allows collecting details about AIO / prefetch for scan nodes backed by a ReadStream. This may be enabled by a new "IO" option in EXPLAIN, and it shows information about the prefetch distance and I/O requests. As of this commit this applies only to BitmapHeapScan, because that's the only scan node using a ReadStream and collecting instrumentation from workers in a parallel query. Support for SeqScan and TidRangeScan, the other scan nodes using ReadStream, will be added in subsequent commits. The stats are collected only when required by EXPLAIN ANALYZE, with the IO option (disabled by default). The amount of collected statistics is very limited, but we don't want to clutter EXPLAIN with too much data. The IOStats struct is stored in the scan descriptor as a field, next to other fields used by table AMs. A pointer to the field is passed to the ReadStream, and updated directly. It's the responsibility of the table AM to allocate the struct (e.g. in ambeginscan) whenever the flag SO_SCAN_INSTRUMENT flag is passed to the scan, so that the executor and ReadStream has access to it. The collected stats are designed for ReadStream, but are meant to be reasonably generic in case a TAM manages I/Os in different ways. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
Allows enabling the new EXPLAIN "IO" option for auto_explain. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
x86 machines lacking HAVE__CPUIDEX saw a complaint about "unused variable 'reg'", per buildfarm as well as local experience. Oversight in bcb2cf4.
Adds support for EXPLAIN (IO) instrumentation for sequential scans. This requires adding shared instrumentation, using the separate DSM approach introduced by dd78e69. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
…equency This adds support to pg_test_timing for the different timing sources added by 294520c. Author: Lukas Fittl <lukas@fittl.com> Author: David Geier <geidav.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> (in an earlier version) Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
Adds support for EXPLAIN (IO) instrumentation for TidRange scans. This requires adding shared instrumentation for parallel scans, using the separate DSM approach introduced by dd78e69. Author: Tomas Vondra <tomas@vondra.me> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/flat/a177a6dd-240b-455a-8f25-aca0b1c08c6e%40vondra.me
This moves the implementation of ExecProcNodeInstr, the ExecProcNode variant that gets used when instrumentation is on, to be defined in instrument.c instead of execProcNode.c, and marks functions it uses as inline. This allows compilers to generate an optimized implementation, and shows a 4 to 12% reduction in instrumentation overhead for queries that move lots of rows. Author: Lukas Fittl <lukas@fittl.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
Previously, on standby promotion, the startup process sent SIGUSR1 to the slotsync worker (or a backend performing slot synchronization) and waited for it to exit. This worked in most cases, but if the process was blocked waiting for a response from the primary (e.g., due to a network failure), SIGUSR1 would not interrupt the wait. As a result, the process could remain stuck, causing the startup process to wait for a long time and delaying promotion. This commit fixes the issue by introducing a new procsignal reason, PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process sends this signal, and the handler sets interrupt flags so the process exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing promotion to complete without delay. Backpatch to v17, where slotsync was introduced. Author: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com Backpatch-through: 17
Until now extensions that wanted to measure overall query execution could create QueryDesc->totaltime, which the core executor would then start and stop. That's a bit odd and composes badly, e.g. extensions always had to use INSTRUMENT_ALL, because otherwise another extension might not get what they need. Instead this introduces a new field, QueryDesc->query_instr_options, that extensions can use to indicate whether they need query level instrumentation populated, and with which instrumentation options. Extensions should take care to only add options they need, instead of replacing the options of others. The prior name of the field, totaltime, sounded like it would only measure time, but these days the instrumentation infrastructure can track more resources. The secondary benefit is that this will make it obvious to extensions that they may not create the Instrumentation struct themselves anymore (often extensions build only against a postgres build without assertions). Adjust pg_stat_statements and auto_explain to match, and lower the requested instrumentation level for auto_explain to INSTRUMENT_TIMER, since the summary instrumentation it needs is only runtime. The reason to push this now, rather in the PG 20 cycle, is that 5a79e78 already required extensions using query level instrumentations to adjust their code, and it seemed undesirable to require them to do so again for 20. Author: Lukas Fittl <lukas@fittl.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAP53Pkyqsht+exJQYRsjhSWYKu+vFGHhPub7m6PmFD6Or0=p1g@mail.gmail.com
This was useful before widespread Unicode adoption, and was based on the internal encoding Emacs used to mix multiple sub-encodings. Emacs itself has stopped using it, and our implementation hadn't been updated with modern underlying standards. It is thought to be very unlikely that anyone is still using it in the field. Since such a complex encoding comes with costs and risks, we agreed to drop support. Any existing database using this encoding would need to be dumped and restored with a new encoding to upgrade to PostgreSQL 19, most likely UTF8, since pg_upgrade would fail. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Tatsuo Ishii <ishii@postgresql.org> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com
Since we have dropped MULE_INTERNAL, add a check that all encodings used in the source cluster are still supported according to PG_ENCODING_BE_VALID(). This is done generically, in case we decide to drop another encoding some day. Suggested-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGKXDXh-FdU0orjfv%2BF08f%3DD91BhV3Ra-4zL-q%2BJmGYqTA%40mail.gmail.com
The vectorized path in commit fbc57f2 had a side effect of putting more branches in the path taken for small inputs. To reduce risk of regressions, only proceed with the vectorized path if we can guarantee that the remaining input after the alignment preamble is greater than 64 bytes. That also allows removing length checks in the alignment preamble. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/CANWCAZZ48GuLYhJCcTy8TXysjrMVJL6n1n7NP94=iG+t80YKPw@mail.gmail.com
The size of the I/O worker pool used to implement io_method=worker was previously controlled by the io_workers setting, defaulting to 3. It was hard to know how to tune it effectively. That is replaced with: io_min_workers=2 io_max_workers=8 (up to 32) io_worker_idle_timeout=60s io_worker_launch_interval=100ms The pool is automatically sized within the configured range according to recent variation in demand. It grows when existing workers detect that latency might be introduced by queuing, and shrinks when the highest-numbered worker is idle for too long. Work was already concentrated into low-numbered workers in anticipation of this logic. The logic for waking extra workers now also tries to measure and reduce the number of spurious wakeups, though they are not entirely eliminated. Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
Add a new FDW callback routine that allows importing remote statistics for a foreign table directly to the local server, instead of collecting statistics locally. The new callback routine is called at the beginning of the ANALYZE operation on the table, and if the FDW failed to import the statistics, the existing callback routine is called on the table to collect statistics locally. Also implement this for postgres_fdw. It is enabled by "restore_stats" option both at the server and table level. Currently, it is the user's responsibility to ensure remote statistics to import are up-to-date, so the default is false. Author: Corey Huinker <corey.huinker@gmail.com> Co-authored-by: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Etsuro Fujita <etsuro.fujita@gmail.com> Discussion: https://postgr.es/m/CADkLM%3DchrYAx%3DX2KUcDRST4RLaRLivYDohZrkW4LLBa0iBhb5w%40mail.gmail.com
Our RADIUS implementation supported only the deprecated RADIUS/UDP variant, without the recommended Message-Authenticator attribute to mitigate against the Blast-RADIUS vulnerability. By now, popular RADIUS servers are expected to generate loud warnings or reject our authentication attempts outright. Since there have been no user reports about this, it seems unlikely that there are users. Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Michael Banck <mbanck@gmx.net> Discussion: https://postgr.es/m/CA%2BhUKG%2BSH309V8KECU5%3DxuLP9Dks0v9f9UVS2W74fPAE5O21dg%40mail.gmail.com
Add regression coverage for several WAIT FOR LSN edge cases. First, cover fresh walreceiver shared-memory initialization after a standby restart. Restart the standby while its upstream is down, so RequestXLogStreaming() seeds writtenUpto/flushedUpto to the segment-aligned receiveStart and the walreceiver cannot immediately advance them. Verify that the seeded flush position is segment-aligned, that replay can be ahead of it, and that standby_write/standby_flush still succeed for an already-replayed LSN via the replay-position floor in GetCurrentLSNForWaitType(). Second, add fencepost checks for the target <= currentLSN predicate. With replay paused and walreceiver stopped, verify exact boundaries for standby_replay using pg_last_wal_replay_lsn(), and for standby_flush using pg_last_wal_receive_lsn(). Also verify that a waiter for current + 1 sleeps while replay is paused and wakes with success once new WAL is delivered and replay advances. Finally, add a cascading-standby timeline-switch test. Start a waiter on the downstream standby, promote its upstream, generate WAL on the new timeline, and verify that the cascade follows the new timeline and the wait completes successfully once replay reaches the target LSN. Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/1957514.1775526774%40sss.pgh.pa.us Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com>
WAIT FOR LSN compares only the numeric LSN and has no notion of which timeline a WAL record belongs to. There are many possible scenarios when timeline-switching can break read-your-writes consistency. The proper analysis and timeline support is possible in the next major release. Yet just document the current behaviour. Reported-by: Xuneng Zhou <xunengzhou@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com>
The XLogRecordPageWithFreeSpace function updates the freespace map (FSM) data while replaying data-level WAL records during the recovery. If the FSM block is updated, it needs to be marked as modified. Currently, this is done with the MarkBufferDirtyHint call (as in all other cases for modifying FSM data). However, in the recovery context, this function will actually do nothing if checksums are enabled. It's assumed that the page should not be dirtied during recovery while modifying hints to protect against torn pages, since no new WAL data can be generated at this point to store FPI. Such logic does not seem fully aligned with the FSM case, as its blocks could be simply zeroed if a checksum mismatch is detected. Currently, changes to an FSM block could be lost if each change to that block occurs infrequently enough to allow it to be evicted from the cache. To persist the change, the modification needs to be performed while the FSM block is still kept in buffers and marked as dirty after receiving its FPI. If the block has already been cleaned, the change won't be persisted, so stored FSM blocks may remain in an obsolete state. If a large number of discrepancies between the data in leaf FSM blocks and the actual data blocks accumulate on the replica server, this could cause significant delays in insert operations after switchover. Such an insert operation may need to visit many data blocks marked as having sufficient space in the FSM, only to discover that the information is incorrect and the FSM records need to be corrected. In a heavily trafficked insert-only table with many concurrent clients performing inserts, this has been observed to cause several-second stalls, causing visible application malfunction. The desire to avoid such cases was the reason behind the commit ab7dbd6, which introduced an update of FSM data during the heap_xlog_visible invocation. However, an update to the FSM data on the standby side could be lost due to a missing 'dirty' flag, so there is still a possibility that a large number of FSM records will contain incorrect data. Note that having a zeroed FSM page in such a case (due to a checksum mismatch) is preferable, as a zero value will be interpreted as an indication of full data blocks, and the inserter will be routed to the next FSM block or to the end of the table. Given that FSM is ready to handle torn page writes and XLogRecordPageWithFreeSpace is called only during the recovery, there seems to be no reason to use MarkBufferDirtyHint here instead of a regular MarkBufferDirty call. Discussion: https://postgr.es/m/596c4f1c-f966-4512-b9c9-dd8fbcaf0928%40postgrespro.ru Author: Alexey Makhmutov <a.makhmutov@postgrespro.ru> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
append_tuple_value_detail() constructed user-visible messages using separately translated fragments such as ": ", ", ", and ".",. This makes correct translation difficult or impossible in some languages. Refactor append_tuple_value_detail() to move all punctuation and sentence construction to the callers, which now use a single translatable string with a %s placeholder for the tuple data. Reported-by: David Rowley <dgrowleyml@gmail.com> Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Discussion: https://postgr.es/m/227279.1775956328%40sss.pgh.pa.us#8f3a5f50543556c60cc5a13270cb7ba4 Discussion: https://postgr.es/m/CAApHDvohYOdrvhVxXzCJNX_GYMSWBfjTTtB6hgDauEtZ8Nar2A@mail.gmail.com
The new pg_restore option --no-globals (commit 3c19983) appeared out of order in the documentation and help output. Fix that.
Even though a property graph is defined in pg_class it does not contain any rows by itself and need not have a type defined. Avoid creating a type for it. Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAExHW5ucu7ZTgYkO6rB_1ShJP3e%3DGAT2T3CP4XWN8rUVEsiJoA%40mail.gmail.com
expression_tree_mutator_impl() did not handle T_GraphPattern, T_GraphElementPattern, and T_GraphPropertyRef. The corresponding expression_tree_walker_impl() already handles all three node types. This causes an "unrecognized node type" error whenever a GRAPH_TABLE appeared in an expression tree. While at it, also update raw_expression_tree_walker() and expression_tree_walker() to handle missing nodes that may appear in GraphPattern expression trees. When raw_expression_tree_walker() is called, GraphElementPattern::labelexpr contains ColumnRefs instead of GraphLabelRefs. Hence those are not handled in raw_expression_tree_walker(). Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHg%2BQDc97WFTSkXg%3Dg_ZAH8GnY2gJrvq72cs%2BYjqEAuZgXnkAQ%40mail.gmail.com
A blunder of mine (Álvaro) in commit 28d534e. Author: Lakshmi N <lakshmin.jhs@gmail.com> Reviewed-by: Xiaopeng Wang <wxp_728@163.com> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Discussion: https://postgr.es/m/CA+3i_M9ytFufvD8Tm0rhpfxuC4XrpgQDBHxM7NJQYxv488JW7w@mail.gmail.com
This function returns some value of enum HostsFileLoadResult, but for reasons lost in the development process was declared to return "int". Fix that, for clarity and so that our typedefs collection tooling sees the typedef as used. Also fix the variable that the sole call assigns into. Move the typedef to the header file that declares load_hosts() to avoid creating header dependency problems. Discussion: https://postgr.es/m/359138.1777922557@sss.pgh.pa.us
relation_has_unique_index_for() has long had an XXX noting that it doesn't check collations when matching a unique index's columns against equality clauses. This was benign as long as all collations in play reduced to the same notion of equality, but has been incorrect since nondeterministic collations were introduced in PG 12: a unique index under a deterministic collation does not prove uniqueness under a nondeterministic collation, nor vice versa. The consequence is wrong query results for any planner optimization that consumes the faulty proof, including inner-unique join execution (which stops the inner search after the first match per outer row), useless-left-join removal, semijoin-to-innerjoin reduction, and self-join elimination. Fix by requiring the index's collation to agree on equality with the clause's input collation. Two collations agree on equality if either is InvalidOid (denoting a non-collation-sensitive operation, which cannot conflict with the other side), if they have the same OID, or if both are deterministic: by definition a deterministic collation treats two strings as equal iff they are byte-wise equal (see CREATE COLLATION), so any two deterministic collations share the same equality relation and the uniqueness proof carries over. Any mismatch involving a nondeterministic collation is rejected. Back-patch to all supported branches; the bug has existed since nondeterministic collations were introduced in PG 12. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAMbWs4_XUUSTyzCaRjUeeahWNqi=8ZOA5Q4coi8zUVEDSBkM6A@mail.gmail.com Backpatch-through: 14
rel_is_distinct_for()'s RTE_SUBQUERY branch passed only the equality operator from each join clause to query_is_distinct_for(), discarding the operator's input collation. query_is_distinct_for() then verified opfamily compatibility but never checked collations, so a DISTINCT / GROUP BY / set-op operating under one collation was trusted to prove uniqueness for a comparison performed under an unrelated collation. As with the recent fix in relation_has_unique_index_for(), this is unsound for nondeterministic collations and yields wrong query results in any optimization that consumes the proof. Fix by carrying each clause's operator input collation into query_is_distinct_for() and validating it at every check-site against the subquery target expression's collation. Back-patch to all supported branches. query_is_distinct_for() is declared in an installed header, so on stable branches the existing two-list signature is retained as a thin wrapper that forwards to a new collation-aware entry point; external callers continue to receive the historical collation-blind answer. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAMbWs4_XUUSTyzCaRjUeeahWNqi=8ZOA5Q4coi8zUVEDSBkM6A@mail.gmail.com Backpatch-through: 14
"vertexes" -> "vertices" Reported-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAJTYsWXFy1j_T82%2BM_S9kFxU414tQYnZQD-b82%3DoL_LbG_5fPQ%40mail.gmail.com
Commit 28d534e introduced reform_tuple() with a fast path that returns the source tuple verbatim when no dropped columns require fixing up. I (Álvaro) failed to realize that this broke handling of columns with a 'missingval' defined: after a VACUUM FULL, CLUSTER, or REPACK operation, the catalogued missingval is thrown away, so the tuples are no longer correct. Fix by forcing the rewrite when the tuple is shorter than the tuple descriptor. Author: Satya Narlapuram <satyanarlapuram@gmail.com> Discussion: https://postgr.es/m/CAHg+QDeoccU5CudrJpmSKZfKZ1gRMNY=5BxSC=JpHgkonzgcOw@mail.gmail.com
As connections that failed abort cleanup can't safely be further used, if a remote query tries to get such a connection, we reject it. Previously, this rejection involved dropping the connection if it was open, without accounting for the possibility of open cursors using it, causing a server crash when such an open cursor tried to use an already-dropped connection, as a cursor-handling function (create_cursor, fetch_more_data, or close_cursor) was called on a freed PGconn. To fix, delay dropping failed connections until abort cleanup of the main transaction, to ensure open cursors using such a connection can safely refer to the PGconn for it. Oversight in commit 8bf58c0. Reported-by: Zhibai Song <songzhibai1234@gmail.com> Diagnosed-by: Zhibai Song <songzhibai1234@gmail.com> Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Discussion: https://postgr.es/m/CAPmGK176y6JP017-Cn%2BhS9CEJx_6iVhRoYbAqzuLU4d8-XPPNg%40mail.gmail.com Backpatch-through: 14
Oversight in commit e2809e3. While at it, use pg_integer_constant_p in master. Discussion: https://postgr.es/m/CANWCAZbOha-x5MCreQn3TRA56VdKWNMAKMy3fAV1kJSw9Vp4pw@mail.gmail.com Backpatch-through: 18
get_tables_to_repack() and get_all_vacuum_rels() were including other sessions' temporary tables in their output work list, causing REPACK, CLUSTER and VACUUM FULL (when executed without a table list) to attempt to acquire AccessExclusiveLock on them, potentially blocking for an extended time. Fix by skipping other-session temp tables early, before they are added to the list. This issue is ancient, but there have been no complaints about it that I know of, so I'm opting for not backpatching at present. Author: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/0b555318-2bf2-46df-9377-09629a2a59db@uni-muenster.de
Commit b3cf461 renamed --wal-directory to --wal-path but retained the former as a silent alias. Per project policy, all options, including deprecated ones, should be documented to assist users transitioning between versions. This patch restores --wal-directory to the documentation and --help output. Author: Amul Sul <sulamul@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/E1w3fZp-000gje-31%40gemulon.postgresql.org
ProcessSingleRelationFork() unconditionally generated an FPI WAL record for every page of every relation when enabling checksums. Unlogged relations, which by definition never generate WAL for data changes, were not exempt which generated excessive WAL to be emitted. Fix by guarding the FPI WAL record call with RelationNeedsWAL() to avoid emitting WAL for unlogged main forks. Unlogged pages are still dirtied to ensure the checksum is written to disk at the next checkpoint. The init fork remains WAL-logged even for unlogged relations, as it's needed on the standby to materialize the relation after promotion (see ResetUnloggedRelations()). Skipping init-fork WAL would leave the standby with a stale init fork that, once copied to the main fork on promotion, would fail checksum verification on every read of the unlogged relation. A test which creates an unlogged table with an index, enables checksums, promotes the standby, and verifies that the unlogged relation and its indexes are still readable post-promotion has been added. Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAHg+QDeGrpZbNZdLjd_T4b43xKEEXZN0HGhkFm-1bkBdyzK7AQ@mail.gmail.com
The DataChecksumsWorker accepts cost_delay and cost_limit parameters from pg_enable_data_checksums() so users can throttle the I/O caused by enabling checksums. Due to the API for setting the cost parameters changing between when the code was written, and when it was committed the new cost update function call was omitted and thus the parameters were silently ignored. Fix by calling VacuumUpdateCosts() after assigning the parameters (both during worker startup and on the runtime cost-update path), and by leaving the page-cost weights at their GUC-controlled defaults. Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAHg+QDeevH6aTyWdXYBJW0wOmfoZy66gDi5TfinK_dXeCrHQLg@mail.gmail.com
The test for finding page verification failures in the logfiles were missing the /m modifier to make sure it anchors to every newline in the search space buffer, and not just the last one. Spotted while adding a test for the recently reported issue with excessive WAL for unlogged relations. Author: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Reviewed-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAHg+QDeGrpZbNZdLjd_T4b43xKEEXZN0HGhkFm-1bkBdyzK7AQ@mail.gmail.com
WAIT FOR LSN registers the current backend in shared memory before entering an interruptible wait loop. Top-level abort and backend exit already call WaitLSNCleanup(), but subtransaction abort did not. If an interrupt, such as statement_timeout, occurred while waiting inside a savepoint, rolling back to the savepoint left the backend marked as present in the WAIT FOR LSN heap. Clean up WAIT FOR LSN state from AbortSubTransaction() as well, and add a TAP test covering reuse of WAIT FOR LSN after a savepoint rollback. Reported-by: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAJTYsWXDRwo-RVRaQgwxVcXgURVFeX8BKnijQrPiPcSCkDDX9A%40mail.gmail.com Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Author: Xuneng Zhou <xunengzhou@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Two PostgreSQL standbys can independently promote to the same timeline ID if their primary stopped before either had a chance to promote. In that situation both clusters share a timeline history prefix that looks identical to pg_rewind: same TLI numbers and same begin/end LSNs. The existing same-TLI shortcut therefore treated the source as a valid rewind target and skipped the rewind entirely, leaving the target's diverged WAL intact. Fix this by embedding a UUIDv7 value in every timeline history file entry at promotion time. Each promotion generates a fresh UUID, so two independent promotions to the same TLI will carry different UUIDs even though the TLI number and begin LSN are identical. When loading the timeline history, pg_rewind uses these UUIDs in two places: 1. findCommonAncestorTimeline checks that the TLI and UUID in each entry match. A mismatch signals independent promotions and the search continues to earlier entries to find the true common ancestor. 2. The same-TLI shortcut (source and target on the same current TLI) compares the UUID stored in the last completed history entry and a mismatch forces a full rewind instead of a no-op. UUIDs are zero for clusters that predate this change, and the comparison function treats a zero UUID on either side as "unknown / compatible", so the new code is fully backward-compatible with old history files. A new test in t/005_same_timeline.pl covers the same-TLI shortcut case: two standbys independently promote to TLI 2, each with a distinct UUID.
Add a test for the case where the target has gone through three timelines (TLI1 -> TLI2 -> TLI3) while the source independently promoted from TLI1 to a numerically identical but UUID-distinct TLI2 (call it TLI2'). Without UUID detection, findCommonAncestorTimeline accepts TLI2 as the common ancestor and begins its WAL scan from the TLI2 shutdown checkpoint. That scan misses the 'x' INSERT that happened earlier in TLI2, so the source page is never copied and 'x' survives the rewind. With the UUID fix in the previous commit the algorithm detects the TLI2 / TLI2' mismatch, backs up to TLI1 as the true common ancestor, and starts the WAL scan from the last TLI1 checkpoint. The scan covers the 'x' INSERT, the source page (containing 'b') is copied, and the rewound cluster ends up with only 'b' and 'origin' as expected. The target has deliberately no insert on TLI3 to ensure the test actually exercises the UUID-based ancestor search rather than passing by coincidence (as it would if a TLI3 insert put the page into the scan range even on unpatched code).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR applies patch v2 from the PostgreSQL hackers thread about
pg_rewindfailing to detect independently promoted timelines with the same numeric TLI.The patch adds UUIDs to timeline history entries and teaches
pg_rewindto compare them, so independently promoted timelines with identical TLI numbers are not treated as the same history.Attribution
Patch authored by Mats Kindahl. I preserved the patch author metadata from the mailing list attachments; the only local adjustment was removing one trailing whitespace line so
git diff --checkis clean.Links
Local verification
Patch build directory:
/tmp/postgres-build-rewind-uuidMaster reproduction build directory:
/tmp/postgres-build-rewind-mastergit diff --check master..HEAD— passedmeson setup /tmp/postgres-build-rewind-uuid --buildtype=debug -Dcassert=true— passedninja -C /tmp/postgres-build-rewind-uuid— passed, 2102/2102 targetsmeson test -C /tmp/postgres-build-rewind-uuid --suite setup --print-errorlogs— passed, 3/3meson test -C /tmp/postgres-build-rewind-uuid 'pg_rewind/005_same_timeline' --print-errorlogs— passed, 1/1, 5 subtestsmeson test -C /tmp/postgres-build-rewind-uuid --suite pg_rewind --print-errorlogs— passed, 11/11Bug reproduction on unpatched master
I copied the patched
src/bin/pg_rewind/t/005_same_timeline.pltest scenario onto unpatchedmaster(5cdec423193) and ran:Result: the new scenario reproduces the bug on master.
pg_rewindexits successfully, but the target keeps its own divergent data instead of matching the source:The same 5-subtest scenario passes with this patch.
Review notes
Local review found one thing worth discussing upstream before treating this as commit-ready: the patch stores
ThisTimeLineUUIDinXLogCtland comments say the UUID is later embedded in theXLOG_END_OF_RECOVERYrecord, butxl_end_of_recoveryis not extended andCreateEndOfRecoveryRecord()does not copy the UUID into the record. The currentpg_rewindbehavior relies on history-file UUIDs, so the TAP coverage passes, but the WAL-record part looks incomplete/dead and should be clarified or finished.The current v2 as applied does not actually change the
xl_end_of_recoveryWAL record layout; it changes timeline history file contents by adding a UUID field before the reason string. Existing parsing accepts old history files, and old code should still parse the first three fields of new history lines. If WAL embedding is not needed, the deadXLogCtl->ThisTimeLineUUIDstate andxlog_redo()short-record compatibility code should probably be removed to keep the patch surgical.