PoC: USDT static tracepoints for wait event tracing #18
Benchmark Results: USDT Wait Event Tracepoint Observer Effect
VM: Hetzner cx43 (8 vCPUs, 16GB RAM, Ubuntu 24.04, Helsinki)
3 builds tested:
Standard OLTP (pgbench -c 16 -j 8 -T 60)
High Contention (pgbench -c 64 -j 8 -T 60)
Overhead Summary
bpftrace Wait Event Capture (validation)
During ~400s of traced benchmarking, bpftrace captured:
Interpretation
Next steps
CPU Flamegraph Analysis: USDT Wait Event Tracepoints
Generated CPU flamegraphs using
Interactive Flamegraphs (download SVG and open in browser)
Gist: https://gist.github.com/NikolayS/50fd5409729bd0ff4e44ae2d491789c6
TPS Results (pgbench -c64 -j8 -T40)
Key Flamegraph Findings
1. Stock vs. USDT (nop probes): virtually identical flamegraphs
The flamegraphs for
2. USDT-nodtrace: also identical
Compiling the USDT branch without
3. USDT + bpftrace: uprobe overhead clearly visible
This is where the flamegraph tells the real story. When bpftrace is attached (202 probes), the flamegraph shows a clear new code path: 7.2% of all postgres samples are spent in uprobe/int3 handling. The hottest spots within the uprobe path are:
Where probes fire (by parent function):
The
4. Summary
Hardware Watchpoint vs USDT: Observer Effect Comparison
Benchmarked pg_wait_tracer (hardware watchpoint approach) on the same VM and same
VM: Hetzner cx43 (8 vCPUs, 16GB RAM, Ubuntu 24.04, kernel 6.8)
pg_wait_tracer Results (hardware watchpoint, daemon mode with tracing)
c16 (pgbench -c 16 -j 8 -T 60)
c64 (pgbench -c 64 -j 8 -T 60)
Combined Comparison (all approaches)
¹ Baselines differ between test sessions due to VM performance variance. Overhead percentages are computed against each session's own baseline.
CPU Flamegraph Analysis
Flamegraph: flamegraph-hwwatch.svg (download and open in browser for interactive view)
The flamegraph reveals two distinct sources of overhead from hardware watchpoints:
1. Debug exception handling (watchpoint fires → BPF program runs)
2. Context switch overhead (debug registers saved/restored on every context switch)
CPU overhead breakdown (% of all samples):
Where the watchpoint fires (by parent function):
Key Observations
Flamegraph Gallery
Click any flamegraph to open the interactive SVG (searchable, zoomable).
1. Baseline: pg-stock (no dtrace, no tracing)
2. USDT branch, no dtrace compilation
3. USDT branch, --enable-dtrace (nop probes, idle)
4. USDT + bpftrace actively tracing
5. pg_wait_tracer (hardware watchpoints)
TL;DR
Adding
Summary of findings
We benchmarked 4 configurations on an 8-vCPU VM (pgbench, scale 100, shared_buffers=2GB, 3 runs per config):
Flamegraphs confirm: nop probes are invisible in CPU profiles. When bpftrace is attached, the overhead is clearly visible in
Thoughts on upstreaming and
Round 2 Benchmark: Upstream pg_wait_tracer (DmitryNFomin/pg_wait_tracer @
| Scenario | c16 TPS (median) | vs baseline | c64 TPS (median) | vs baseline |
|---|---|---|---|---|
| Baseline (pg-stock, no tracing) | 8,245 | — | 9,651 | — |
| USDT build (idle, no bpftrace) | 8,233 | -0.1% | 9,701 | +0.5% |
| USDT + bpftrace attached | 7,948 | -3.6% | 8,976 | -7.0% |
| pg_wait_tracer upstream (8e01ee5) | 6,947 | -15.7% | 7,673 | -20.5% |
Round 1 vs Round 2 Comparison
| Scenario | R1 c16 | R2 c16 | R1 c64 | R2 c64 |
|---|---|---|---|---|
| Baseline | 8,751 | 8,245 | 9,578 | 9,651 |
| USDT idle | 8,368 (-4.4%) | 8,233 (-0.1%) | 9,105 (-4.9%) | 9,701 (+0.5%) |
| USDT + bpftrace | 7,590 (-13.3%) | 7,948 (-3.6%) | 8,050 (-16.0%) | 8,976 (-7.0%) |
| pg_wait_tracer | 6,658 (-16.6%)¹ | 6,947 (-15.7%) | 7,129 (-17.8%)¹ | 7,673 (-20.5%) |
¹ Round 1 used the NikolayS fork (commit df49f37), Round 2 uses upstream DmitryNFomin (commit 8e01ee5).
Individual Run TPS
| File | TPS |
|---|---|
| baseline-c16-r1 | 7,979 |
| baseline-c16-r2 | 8,245 |
| baseline-c16-r3 | 8,260 |
| baseline-c64-r1 | 9,897 |
| baseline-c64-r2 | 9,213 |
| baseline-c64-r3 | 9,651 |
| pgwt-c16-r1 | 7,149 |
| pgwt-c16-r2 | 6,947 |
| pgwt-c16-r3 | 6,819 |
| pgwt-c64-r1 | 7,673 |
| pgwt-c64-r2 | 7,745 |
| pgwt-c64-r3 | 7,605 |
| usdt-idle-c16-r1 | 7,962 |
| usdt-idle-c16-r2 | 8,393 |
| usdt-idle-c16-r3 | 8,233 |
| usdt-idle-c64-r1 | 9,821 |
| usdt-idle-c64-r2 | 9,131 |
| usdt-idle-c64-r3 | 9,701 |
| usdt-bpf-c16-r1 | 7,999 |
| usdt-bpf-c16-r2 | 7,942 |
| usdt-bpf-c16-r3 | 7,948 |
| usdt-bpf-c64-r1 | 8,976 |
| usdt-bpf-c64-r2 | 9,341 |
| usdt-bpf-c64-r3 | 8,944 |
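The medians and "vs baseline" percentages in the Round 2 summary can be recomputed from the individual runs listed above. The following is a minimal sketch of that arithmetic; the short config keys (`baseline`, `usdt-idle`, `usdt-bpf`, `pgwt`) follow the file-name prefixes in the table, and the helper name `overhead_pct` is only illustrative.

```python
from statistics import median

# Individual-run TPS copied from the Round 2 table above.
runs = {
    ("baseline", "c16"): [7979, 8245, 8260],
    ("baseline", "c64"): [9897, 9213, 9651],
    ("usdt-idle", "c16"): [7962, 8393, 8233],
    ("usdt-idle", "c64"): [9821, 9131, 9701],
    ("usdt-bpf", "c16"): [7999, 7942, 7948],
    ("usdt-bpf", "c64"): [8976, 9341, 8944],
    ("pgwt", "c16"): [7149, 6947, 6819],
    ("pgwt", "c64"): [7673, 7745, 7605],
}


def overhead_pct(config, clients):
    """Percent change of a config's median TPS relative to the baseline median."""
    base = median(runs[("baseline", clients)])
    return round(100.0 * (median(runs[(config, clients)]) - base) / base, 1)


for config in ("usdt-idle", "usdt-bpf", "pgwt"):
    for clients in ("c16", "c64"):
        med = median(runs[(config, clients)])
        print(f"{config:10s} {clients}: median={med:,} ({overhead_pct(config, clients):+.1f}%)")
```

With only three runs per configuration, the median is simply the middle value, so a single noisy run does not move the reported number.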
Flamegraph: pg_wait_tracer upstream (Round 2, c64)
Analysis
- pg_wait_tracer upstream overhead remains high: 15.7% (c16) / 20.5% (c64), essentially unchanged from Round 1 (16.6% / 17.8%). The upstream commit 8e01ee5 claims "overhead reduced from 19% to ~6%", but our benchmarks do not confirm this improvement.
- USDT+bpftrace overhead dramatically improved vs Round 1: from 13-16% in Round 1 to 3.6-7.0% in Round 2. This is likely due to the VM being freshly booted vs a previously-loaded state in Round 1, highlighting sensitivity to system conditions.
- USDT idle overhead essentially zero: the compiled-in USDT probes (with --enable-dtrace) show negligible overhead when not attached (-0.1% at c16, +0.5% at c64), consistent with them being NOPs at rest.
- pg_wait_tracer is roughly 3-4× more expensive than USDT+bpftrace (15.7% vs 3.6% at c16, 20.5% vs 7.0% at c64): the hardware-watchpoint approach continues to impose significantly higher overhead than the USDT/bpftrace approach across both concurrency levels.
- Baseline variability note: R2 baseline c16 is ~6% lower than R1 (8,245 vs 8,751), while c64 is comparable (9,651 vs 9,578), suggesting some run-to-run variability in this VM environment.
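As a quick sanity check on the relative-cost and baseline-drift statements in this analysis, the ratios can be derived directly from the reported medians. This is only a sketch of the arithmetic; the function names are illustrative, and the inputs are the Round 1 and Round 2 figures quoted above.

```python
# Round 2 median overheads (percent below baseline), from the summary table.
overhead = {
    "usdt_bpftrace": {"c16": 3.6, "c64": 7.0},
    "pg_wait_tracer": {"c16": 15.7, "c64": 20.5},
}


def cost_ratio(clients):
    """How many times larger the pg_wait_tracer overhead is than USDT+bpftrace."""
    return overhead["pg_wait_tracer"][clients] / overhead["usdt_bpftrace"][clients]


def relative_drop_pct(old_tps, new_tps):
    """Percent by which new_tps is lower than old_tps."""
    return 100.0 * (old_tps - new_tps) / old_tps


print(f"c16 cost ratio: {cost_ratio('c16'):.1f}x")
print(f"c64 cost ratio: {cost_ratio('c64'):.1f}x")
print(f"baseline c16 drift, R1 -> R2: {relative_drop_pct(8751, 8245):.1f}%")
```

The baseline drift between rounds is of the same order as the USDT+bpftrace overhead itself, which is why each round's percentages are computed only against that round's own baseline.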
Clarification: Overhead is Pure CPU
An important nuance about the benchmark numbers above: all overhead is pure CPU, with zero I/O component. The flamegraphs confirm this precisely:
What this means for real workloads
pgbench is a CPU-saturated benchmark: transactions are tiny, the dataset fits in shared_buffers, and there's minimal real waiting. This is the worst case for measuring tracepoint overhead because the CPU cost of the probes is a large fraction of the total work per transaction.
In real production workloads where queries involve actual I/O waits (disk reads, network round-trips, lock contention), the same fixed CPU cost per probe fire becomes a much smaller fraction of total transaction time. A query that takes 5ms of real work won't notice 1-2μs of probe overhead.
Bottom line: the 4-5% and 13-16% numbers from pgbench represent an upper bound. Real-world overhead will be lower, proportional to how CPU-bound your workload is.
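The scaling argument above can be made concrete with a small calculation. The sketch below treats probe overhead as a fixed CPU cost per transaction and expresses it as a fraction of transaction time. The 5 ms and 1-2 μs figures come from the text; the 50 μs transaction is a hypothetical value chosen only to illustrate a CPU-bound case, and `probe_overhead_pct` is an illustrative helper.

```python
def probe_overhead_pct(txn_time_s, probe_cost_s, probes_per_txn=1):
    """Fixed per-probe CPU cost as a percentage of total transaction time."""
    return 100.0 * (probe_cost_s * probes_per_txn) / txn_time_s


# A query doing 5 ms of real work, with 1-2 microseconds of probe overhead
# (figures from the text).
for cost_us in (1, 2):
    pct = probe_overhead_pct(5e-3, cost_us * 1e-6)
    print(f"5 ms query, {cost_us} us probe cost: {pct:.2f}% overhead")

# Hypothetical CPU-bound transaction of 50 microseconds (assumed value,
# not measured): the same 2 us cost is now a visible fraction.
pct = probe_overhead_pct(50e-6, 2e-6)
print(f"50 us transaction, 2 us probe cost: {pct:.1f}% overhead")
```

The same absolute cost moves from a few hundredths of a percent to several percent purely because the denominator shrinks, which matches the pattern of pgbench showing the highest relative overhead.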
Round 3: Scale-Up Test (48 vCPUs, 192GB RAM)
Purpose: Test whether USDT nop probe overhead scales with core count due to I-cache effects.
VM: Hetzner ccx63 (48 dedicated vCPUs, 192GB RAM, AMD EPYC), Helsinki (hel1)
Results
c64 (moderate contention, 1.3 backends/core)
c256 (high contention, 5.3 backends/core)
c512 (extreme, 10.7 backends/core, I-cache stress test)
Overhead Summary Across Scales
Flamegraphs (c512, maximum I-cache stress)
Click SVG links for interactive flamegraphs.
Baseline (stock)
USDT idle (--enable-dtrace, no tracing)
USDT + bpftrace (active tracing)
pg_wait_tracer (BPF uprobe)
Key Findings
Updated Flamegraphs (Round 4, with debug symbols)
Previous flamegraphs had large
Test conditions: cx43 (8 vCPU), pgbench scale 100,
Click any link below to open the interactive SVG (searchable with Ctrl+F, zoomable by clicking).
1. Baseline (stock PostgreSQL, no tracing): 6,558 TPS
2. USDT
| Scenario | TPS | vs Baseline |
|---|---|---|
| Stock baseline | 6,558 | — |
| USDT idle (nop probes) | 5,975 | -8.9% |
| USDT + bpftrace | 5,615 | -14.4% |
| pg_wait_tracer | 6,030 | -8.1% |
Note: 8-core VM — on production 48+ core machines the overhead percentages would be different. These flamegraphs are for qualitative analysis of WHERE overhead occurs, not precise quantification.
Next idea: adjust pg_wait_tracer code to support this patched Postgres and benchmark it to compare with other options
One more idea: repeat benchmarks on ultra-lightweight transactions -- simple
Also, we should use -c/-j matching vCPU count
Round 6: pg_wait_tracer USDT Mode vs Hardware Watchpoints
Purpose: Compare pg_wait_tracer's new USDT mode (
TPC-B (standard pgbench, scale 100)
TPC-B is I/O-heavy so tracing overhead is masked by disk waits. All results within noise floor (~10% variance).
SELECT 1 (ultra-lightweight, worst case for overhead)
SELECT 1 reveals true overhead since every query triggers wait event transitions:
Flamegraphs (SELECT 1, c8)
Conclusion
USDT mode is a significant improvement over hardware watchpoints:
@DmitryNFomin proposed an alternative approach DmitryNFomin#1; we need to compare it to ours
Key questions
Round 7: Three-Way Comparison Plan
Comparing approaches from NikolayS/postgres#18 (USDT probes) vs DmitryNFomin/postgres#1 (wait-event-timing).
Key questions
6 Configurations
Workloads
VM
Hetzner cx43 (8 vCPU, 16GB RAM), Ubuntu 24.04, Helsinki
Scripts
Benchmark scripts committed to
Status updates will follow as comments below.
Round 7 - VM Setup Progress
VM: cx43, 8 vCPU, 16GB RAM, Ubuntu 24.04, kernel 6.8.0-90-generic
Round 7 - VM Setup Complete
All 3 PostgreSQL variants built and verified on
Additional tools installed:
Note: The VM is ready for benchmarking.
Round 7 - Benchmark Progress
Completed: pg-stock (baseline)
Quick results (SELECT 1, c8, TPS excluding conn time):
Cumulative progress: 1/6 configs done.
Round 7 - Benchmark Progress
Completed: pg-usdt-idle (USDT build, no tracer)
Quick results (SELECT 1, c8, TPS excluding conn time):
Cumulative progress: 2/6 configs done.
Round 7 - Benchmark Progress
Completed: pg-usdt-bpftrace (USDT build, bpftrace attached)
Quick results (SELECT 1, c8, TPS excluding conn time):
Cumulative progress: 3/6 configs done.
Round 7 - Benchmark Progress
Completed: pg-wet-off (wait-event-timing build, GUCs OFF)
Quick results (SELECT 1, c8, TPS excluding conn time):
Note: Both
Cumulative progress: 4/6 configs done.
Round 7 - Benchmark Progress
Completed: pg-wet-timing (wait-event-timing build, wait_event_timing=on)
Quick results (SELECT 1, c8, TPS excluding conn time):
Cumulative progress: 5/6 configs done.
Round 7 - Benchmark Progress
Completed: pg-wet-all (wait-event-timing build, both GUCs ON)
Quick results (SELECT 1, c8, TPS excluding conn time):
Cumulative progress: 6/6 configs done. All benchmarks complete!
edc6510 to 525975f
Round 7: Three-Way Comparison Results - Stock vs USDT vs wait-event-timing
VM Specs
Configurations Tested
All builds from the same source tree (current master). Each workload: pgbench 60s duration, 3 runs, median reported.
Raw Results - TPC-B, 8 clients
Raw Results - TPC-B, 64 clients
Raw Results - SELECT 1, 1 client
Raw Results - SELECT 1, 8 clients
Summary: Overhead vs. Stock Baseline (median TPS, % change)
Flamegraphs (SELECT 1, c=8)
Flamegraph SVGs for all 6 configurations (collected during
https://gist.github.com/NikolayS/9cd4c8a82b40e18aca506500a40390d8
To view interactively: download the SVG and open in a browser, or use GitHub's raw file view.
Qualitative Comparison: USDT vs wait-event-timing
1. Do both approaches achieve the same goal?
No, they are complementary, not competing:
2. Overhead when compiled in but not active?
Both approaches show negligible overhead when compiled in but not active: all values are within run-to-run noise (the variance between 3 runs of the same configuration is often 3-8%). The TPC-B c=64 "+10.2%" for WET GUC-off is surprising but likely noise; stock's c=64 median was on the lower side. Neither approach introduces a systematic measurable cost when dormant.
3. Overhead when actively used?
Again, all deltas are within noise margins. Neither USDT with active bpftrace nor WET with all GUCs enabled shows a consistent, systematic performance degradation beyond normal run-to-run variance on this 8-vCPU VM.
Key Observations
Benchmark scripts available on the
Round 7 Analysis: USDT Probes vs wait-event-timing — Deep Dive
Overhead Summary (% change vs stock baseline)
All deltas fall within the 3-8% run-to-run variance observed within each configuration (e.g., stock TPC-B c8 ranges from 15,797 to 16,815 TPS, a 6.4% spread). Neither approach introduces statistically significant overhead.
Flamegraph Analysis
Gist: https://gist.github.com/NikolayS/9cd4c8a82b40e18aca506500a40390d8
CPU profile breakdown for
Key observations from flamegraphs:
Answering the Three Key Questions
1. Do both approaches achieve the same goal?
No, they solve different problems and are complementary:
USDT is a low-level tracing primitive: it lets you write arbitrary BPF programs to analyze wait events in ways the kernel developers haven't anticipated. WET is a high-level monitoring feature: it gives you Oracle-style wait event dashboards out of the box, accessible from SQL. A production PostgreSQL could benefit from both: WET for always-on dashboards, USDT for deep-dive ad-hoc investigation.
2. Overhead when compiled in but not used?
Both are effectively zero-cost when dormant. The WET approach uses
3. Overhead when actively used?
Neither approach shows measurable TPS degradation on this 8-vCPU VM across any workload. The overhead of both approaches is dominated by run-to-run scheduling variance on shared infrastructure.
Comparison with Previous Rounds
The large overhead numbers from Round 5 (-6.6% idle, -14.9% bpftrace) are not reproduced in Round 7. This suggests the Round 5 results were influenced by VM-level noise (different cx43 instance, different hypervisor neighbor load). Round 7's results are more consistent across configurations.
Bottom Line
Both patches are production-safe from a performance perspective. The choice between them should be driven by use case, not overhead:
For upstream PostgreSQL, the USDT patch has a significant advantage in simplicity (8 lines, extends existing DTrace provider) and reviewability. The WET patch adds substantial value but is a larger commitment (~1500 lines, new subsystem, new GUCs, new SQL functions).
Benchmark scripts:
Round 8 - VM Setup Complete
VM: ccx33 (8 dedicated AMD EPYC vCPUs, 32GB RAM), Ubuntu 24.04, Helsinki
Build flags:
Round 8 -- Benchmark Progress: 2/6 configs done (3rd in progress)
Just completed: pg-stock, pg-usdt-idle
Early observation: USDT (idle, no bpftrace) shows no measurable overhead vs stock. bpftrace config running now -- SELECT 1 c8 shows ~280K vs 293K stock (~4% overhead from bpftrace tracing). |
Commit cfcd571 et al fell over under Valgrind testing. (It seems to be enough to #define USE_VALGRIND, you don't actually need to run it under Valgrind to see failures.) The cause is that remove_rel_from_eclass updates each EquivalenceMember's em_relids, and those can be aliases of the left_relids or right_relids of some RestrictInfo in ec_sources. If the update made em_relids empty then bms_del_member will have pfree'd the relid set, so that the subsequent attempt to clean up ec_sources accesses already-freed memory. We missed seeing ill effects before cfcd571 because (a) if the pfree happens then we will remove the EquivalenceMember altogether, making the source RestrictInfo no longer of use, and (b) the cleanup of ec_sources didn't touch left/right_relids before that. I'm unclear though on how cfcd571 managed to pass non-USE_VALGRIND testing. Apparently we managed to store another Bitmapset into the freed space before trying to access it, but you'd not think that would happen 100% of the time. I think what USE_VALGRIND changes is that it makes list.c much more memory-hungry, so that the freed space gets claimed by some List node before a Bitmapset can be put there. This failure can be seen in v16, v17, and master, but oddly enough not v18. That's because the SJE patch replaced the simple bms_del_members calls used here with adjust_relid_set, which is careful not to scribble on its input. But commit 20efbdf just recently put back the old coding and thus resurrected the problem. Discussion: https://postgr.es/m/458729.1776724816@sss.pgh.pa.us Backpatch-through: 16, 17, master
Replace the outdated DatumGetCString(DirectFunctionCall1(textout, ...)) pattern with TextDatumGetCString(). The macro is the modern, more efficient way to convert a text Datum to a C string as it avoids unnecessary function call machinery and handles detoasting internally. Since plsample serves as reference code for extension authors, it should follow current idiomatic practices. Author: Amul Sul <sulamul@gmail.com> Discussion: https://postgr.es/m/CAAJ_b95-xMvUN1PEqxv8y6g-A-8k+fSgyv20kSZc9eF1wZAUPg@mail.gmail.com
The documentation for jit_debugging_support and jit_profiling_support previously stated that these parameters can only be set at server start. However, both parameters use the PGC_SU_BACKEND context, meaning they can be set at session start by superusers or users granted the appropriate SET privilege, but cannot be changed within an active session. This commit updates the documentation to reflect the actual behavior. Backpatch to all supported versions. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAHGQGwEpMDpB-K8SSUVRRHg6L6z3pLAkekd9aviOS=ns0EC=+Q@mail.gmail.com Backpatch-through: 14
The documentation previously described the io_max_workers, io_worker_idle_timeout, and io_worker_launch_interval GUCs as type "int". However, the documentation consistently uses "integer" for parameters of this type. This commit updates these parameter descriptions to use "integer" for consistency. Author: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAHGQGwEpMDpB-K8SSUVRRHg6L6z3pLAkekd9aviOS=ns0EC=+Q@mail.gmail.com
When the startup process exits with a FATAL error during PM_STARTUP, the postmaster called ExitPostmaster() directly, assuming that no other processes are running at this stage. Since 7ff23c6, this assumption is not true, as the checkpointer, the background writer, the IO workers and bgworkers kicking in early would be around. This commit removes the startup-specific shortcut happening in process_pm_child_exit() for a failing startup process during PM_STARTUP, falling down to the existing exit() flow to signal all the started children with SIGQUIT, so that we have no risk of creating orphaned processes. This required an extra change in HandleFatalError() for v18 and newer versions, as an assertion could be triggered for PM_STARTUP. It is now incorrect. In v17 and older versions, HandleChildCrash() needs to be changed to handle PM_STARTUP so that children can be waited on. While on it, fix a comment at the top of postmaster.c. It was claiming that the checkpointer and the background writer were started after PM_RECOVERY. That is not the case. Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com> Discussion: https://postgr.es/m/CAJTYsWVoD3V9yhhqSae1_wqcnTdpFY-hDT7dPm5005ZFsL_bpA@mail.gmail.com Backpatch-through: 15
When a rule action or rule qualification references NEW.col where col is a generated column (stored or virtual), the rewriter produces incorrect results. rewriteTargetListIU removes generated columns from the query's target list, since stored generated columns are recomputed by the executor and virtual ones store nothing. However, ReplaceVarsFromTargetList then cannot find these columns when resolving NEW references during rule rewriting. For UPDATE, the REPLACEVARS_CHANGE_VARNO fallback redirects NEW.col to the original target relation, making it read the pre-update value (same as OLD.col). For INSERT, REPLACEVARS_SUBSTITUTE_NULL replaces it with NULL. Both are wrong when the generated column depends on columns being modified. Fix by building target list entries for generated columns from their generation expressions, pre-resolving the NEW.attribute references within those expressions against the query's targetlist, and passing them together with the query's targetlist to ReplaceVarsFromTargetList. Back-patch to all supported branches. Virtual generated columns were added in v18, so the back-patches in pre-v18 branches only handle stored generated columns. Reported-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CAHg+QDexGTmCZzx=73gXkY2ZADS6LRhpnU+-8Y_QmrdTS6yUhA@mail.gmail.com Backpatch-through: 14
This batch is similar to 462fe0f and addresses a variety of code style issues, including grammar mistakes, typos, inconsistent variable names in function declarations, and incorrect function names in comments and documentation. These fixes have accumulated on the community mailing lists since the commit mentioned above. Notably, Alexander Lakhin previously submitted a patch identifying many of the trivial typos and grammar issues that had been reported on pgsql-hackers. His patch covered a somewhat large portion of the issues addressed here, though not all of them. The documentation changes only affect HEAD.
We were using "select count(*) into x from generate_series(1, 1_000_000_000_000)" to waste one second waiting for a statement timeout trap. Aside from consuming CPU to little purpose, this could easily eat several hundred MB of temporary file space, which has been observed to cause out-of-disk-space errors in the buildfarm. Let's just use "pg_sleep(10)", which is far less resource-intensive. Also update the "when others" exception handler so that if it does ever again trap an error, it will tell us what error. The cause of these intermittent buildfarm failures had been obscure for a while. Discussion: https://postgr.es/m/557992.1776779694@sss.pgh.pa.us Backpatch-through: 14
We installed this in commit eea9fa9 to protect against foreseeable mistakes that would break ABI in stable branches by renumbering NodeTag enum entries. However, we now have much more thorough ABI stability checks thanks to buildfarm members using libabigail (see the .abi-compliance-history mechanism). So this incomplete, single-purpose check seems like an anachronism. I wouldn't object to keeping it were it not that it requires an additional manual step when making a new stable git branch. That seems like something easy to screw up, so let's get rid of it. This patch just removes the logic that checks for changes in the last auto-assigned NodeTag value. We still need eea9fa9's cross-check on the supplied list of header files, to prevent divergence between the makefile and meson build systems. We'll also sometimes need the nodetag_number() infrastructure for hand-assigning new NodeTags in stable branches. Discussion: https://postgr.es/m/1458883.1776143073@sss.pgh.pa.us
GetLocalPinLimit() and GetAdditionalLocalPinLimit(), currently in use only by the read stream, previously allowed a backend to pin all num_temp_buffers local buffers. This meant that the read stream could use every available local buffer for read-ahead, leaving none for other concurrent pin-holders like other read streams and related buffers like the visibility map buffer needed during on-access pruning. This became more noticeable since b46e1e5, which allows on-access pruning to set the visibility map, which meant that some scans also needed to pin a page of the VM. It caused a test in src/test/regress/sql/temp.sql to fail in some cases. Cap the local pin limit to num_temp_buffers / 4, providing some headroom. This doesn't guarantee that all needed pins will be available — for example, a backend can still open more cursors than there are buffers — but it makes it less likely that read-ahead will exhaust the pool. Note that these functions are not limited by definition to use in the read stream; however, this cap should be appropriate in other contexts. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/97529f5a-ec10-46b1-ab50-4653126c6889%40gmail.com
Since b46e1e5 allowed setting the VM on-access and 378a216 set pd_prune_xid on INSERT, the testing of generic/custom plans in src/test/regress/sql/plancache.sql was destabilized. One of the queries of test_mode could have set the pages all-visible and if autovacuum/autoanalyze ran and updated pg_class.relallvisible, it would affect whether we got an index-only or sequential scan. Preclude this by disabling autovacuum and autoanalyze for test_mode and carefully sequencing when ANALYZE is run. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/71277259-264e-4983-a201-938b404049d7%40gmail.com
The btree_gist enum test expects a bitmap heap scan. Since b46e1e5 enabled setting the VM during on-access pruning and 378a216 set pd_prune_xid on INSERT, scans of enumtmp may set pages all-visible. If autovacuum or autoanalyze then updates pg_class.relallvisible, the planner could choose an index-only scan instead. Make the enumtmp a temp table to exclude it from autovacuum/autoanalyze. Reported-by: Alexander Lakhin <exclusion@gmail.com> Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/46733d68-aec0-4d09-8120-4c66b87047a4%40gmail.com
FlushUnlockedBuffer() accepted io_object and io_context arguments but hardcoded IOOBJECT_RELATION and IOCONTEXT_NORMAL when calling FlushBuffer(). Pass them through instead. Also fix FlushBuffer() to use its io_object parameter for I/O timing stats rather than hardcoding IOOBJECT_RELATION. Not actively broken since all current callers pass IOOBJECT_RELATION and IOCONTEXT_NORMAL, so not backpatched. Author: Chao Li <lic@highgo.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/BC97546F-5C15-42F2-AD57-CFACDB9657D0@gmail.com
This neglected to set TAP_TESTS = 1, and partially compensated for that by writing duplicative hand-made rules for check and installcheck. That's not really sufficient though. The way I noticed the error was that "make distclean" didn't clean out the tmp_check subdirectory, and there might be other consequences. Do it the standard way instead.
This commit tweaks ALTER INDEX .. ATTACH PARTITION to attempt a validation of a parent index in the case where an index is already attached but the parent is not yet valid. This occurs in cases where a parent index was created invalid such as with CREATE INDEX ONLY, but was left invalid after an invalid child index was attached (partitioned indexes set indisvalid to false if at least one partition is !indisvalid, indisvalid is true in a partitioned table iff all partitions are indisvalid). This could leave a partition tree in a situation where a user could not bring the parent index back to valid after fixing the child index, as there is no built-in mechanism to do so. This commit relies on the fact that repeated ATTACH PARTITION commands on the same index silently succeed. An invalid parent index is more than just a passive issue. It causes, for example, ON CONFLICT on a partitioned table to fail if the invalid parent index is used to enforce a unique constraint. Some test cases are added to track some of the problematic patterns, using a set of partition trees with combinations of invalid indexes and ATTACH PARTITION. Reported-by: Mohamed Ali <moali.pg@gmail.com> Author: Sami Imseih <sanmimseih@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Haibo Yan <tristan.yim@gmail.com> Discussion: http://postgr.es/m/CAGnOmWqi1D9ycBgUeOGf6mOCd2Dcf=6sKhbf4sHLs5xAcKVCMQ@mail.gmail.com Backpatch-through: 14
The ri_FetchConstraintInfo() and ri_LoadConstraintInfo() functions were declared to return const RI_ConstraintInfo *, but callers sometimes need to modify the struct, requiring casts to drop the const. Remove the misapplied const qualifiers and the casts that worked around them. Reported-by: Peter Eisentraut <peter@eisentraut.org> Author: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/548600ed-8bbb-4e50-8fc3-65091b122276@eisentraut.org
If the SET or WHERE clause of an INSERT ... ON CONFLICT command references EXCLUDED.col, where col is a virtual generated column, the column was not properly expanded, leading to an "unexpected virtual generated column reference" error, or incorrect results. The problem was that expand_virtual_generated_columns() would expand virtual generated columns in both the SET and WHERE clauses and in the targetlist of the EXCLUDED pseudo-relation (exclRelTlist). Then fix_join_expr() from set_plan_refs() would turn the expanded expressions in the SET and WHERE clauses back into Vars, because they would be found to match the expression entries in the indexed tlist produced from exclRelTlist. To fix this, arrange for expand_virtual_generated_columns() to not expand virtual generated columns in exclRelTlist. This forces set_plan_refs() to resolve generation expressions in the query using non-virtual columns, as required by the executor. In addition, exclRelTlist now always contains only Vars. That was something already claimed in a couple of existing comments in the planner, which relied on that fact to skip some processing, though those did not appear to constitute active bugs. Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAHg+QDf7wTLz_vqb1wi1EJ_4Uh+Vxm75+b4c-Ky=6P+yOAHjbQ@mail.gmail.com Backpatch-through: 18
Formerly, attempting to use WHERE CURRENT OF to update or delete from a table with virtual generated columns would fail with the error "WHERE CURRENT OF on a view is not implemented". The reason was that the check preventing WHERE CURRENT OF from being used on a view was in replace_rte_variables_mutator(), which presumed that the only way it could get there was as part of rewriting a query on a view. That is no longer the case, since replace_rte_variables() is now also used to expand the virtual generated columns of a table. Fix by doing the check for WHERE CURRENT OF on a view at parse time. This is safe, since it is no longer possible for the relkind to change after the query is parsed (as of b23cd18). Reported-by: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Satyanarayana Narlapuram <satyanarlapuram@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAHg+QDc_TwzSgb=B_QgNLt3mvZdmRK23rLb+RkanSQkDF40GjA@mail.gmail.com Backpatch-through: 18
When using ALTER TABLE ... MERGE PARTITIONS or ALTER TABLE ... SPLIT PARTITION, extension dependencies on partition indexes were being lost. This happened because the new partition indexes are created fresh from the parent partitioned table's indexes, while the old partition indexes (with their extension dependencies) are dropped. Fix this by collecting extension dependencies from source partition indexes before detaching them, then applying those dependencies to the corresponding new partition indexes after they're created. The mapping between old and new indexes is done via their common parent partitioned index. For MERGE operations, all source partition indexes sharing a parent partitioned index must have the same extension dependencies; if they differ, an error naming both conflicting partition indexes is raised. The check is implemented by collecting one entry per partition index, sorting by parent index OID, and comparing adjacent entries in a single pass. This is order-independent: the same set of partitions produces the same decision regardless of the order they are listed in the MERGE command, and subset mismatches are caught in both directions. For SPLIT operations, the new partition indexes simply inherit all extension dependencies from the source partition's index. The regression tests exercising this feature live under src/test/modules/test_extensions, where the test_ext3 and test_ext5 extensions are available; core regression tests cannot assume any particular extension is installed. Author: Matheus Alcantara <matheusssilv97@gmail.com> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Reported-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Dmitry Koval <d.koval@postgrespro.ru> Discussion: https://www.postgresql.org/message-id/CALdSSPjXtzGM7Uk4fWRwRMXcCczge5uNirPQcYCHKPAWPkp9iQ%40mail.gmail.com
This function writes into a caller-supplied buffer of length 2 * MAXNORMLEN, which should be plenty in real-world cases. However a malicious affix file could supply an affix long enough to overrun that. Defend by just rejecting the match if it would overrun the buffer. I also inserted a check of the input word length against Affix->replen, just to be sure we won't index off the buffer, though it would be caller error for that not to be true. Also make the actual copying steps a bit more readable, and remove an unnecessary requirement for the whole input word to fit into the output buffer (even though it always will with the current caller). The lack of documentation in this code makes my head hurt, so I also reverse-engineered a basic header comment for CheckAffix. Reported-by: Xint Code Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/641711.1776792744@sss.pgh.pa.us Backpatch-through: 14
parse_affentry() and addCompoundAffixFlagValue() each collect fields from an affix file into working buffers of size BUFSIZ. They failed to defend against overlength fields, so that a malicious affix file could cause a stack smash. BUFSIZ (typically 8K) is certainly way longer than any reasonable affix field, but let's fix this while we're closing holes in this area. I chose to do this by silently truncating the input before it can overrun the buffer, using logic comparable to the existing logic in get_nextfield(). Certainly there's at least as good an argument for raising an error, but for now let's follow the existing precedent. Reported-by: Igor Stepansky <igor.stepansky@orca.security> Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/864123.1776810909@sss.pgh.pa.us Backpatch-through: 14
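The "silently truncate before the buffer can overrun" approach can be sketched in C as below. This is a simplified illustration rather than the actual patch: copy_field_truncated() is a hypothetical helper, it treats the input as single bytes (the real affix-file parser has to handle multibyte characters), and it assumes fields are delimited by whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Copy one whitespace-delimited field from src into dst, whose capacity
 * (including the terminating NUL) is dstsize.  Characters that do not fit
 * are consumed but dropped.  Returns a pointer just past the field in src.
 */
const char *
copy_field_truncated(const char *src, char *dst, size_t dstsize)
{
    size_t used = 0;

    assert(dstsize > 0);        /* need room for at least the NUL */

    while (*src != '\0' && !isspace((unsigned char) *src))
    {
        /* Store only while one byte remains free for the terminator. */
        if (used < dstsize - 1)
            dst[used++] = *src;
        src++;
    }
    dst[used] = '\0';
    return src;
}
```

Consuming the excess input instead of stopping at the buffer limit keeps the parser positioned at the next field, so one overlong field does not desynchronize the rest of the line.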
to_char() allocates its output buffer with 8 bytes per formatting code in the pattern. If the locale's currency symbol, thousands separator, or decimal or sign symbol is more than 8 bytes long, in principle we could overrun the output buffer. No such locales exist in the real world, so it seems sufficient to truncate the symbol if we do see it's too long. Reported-by: Xint Code Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/638232.1776790821@sss.pgh.pa.us Backpatch-through: 14
Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. Most of these inconsistencies were introduced during Postgres 19 development. This commit was written with help from clang-tidy, by mechanically applying the same rules as similar clean-up commits (the earliest such commit was commit 035ce1f).
Commit 7a1f0f8 optimized the slot verification query but overlooked cases where all logical replication slots are already invalidated. In this scenario, the CTE returns no rows, causing the main query (which used a cross join) to return an empty result even when invalid slots exist. This commit fixes this by using a LEFT JOIN with the CTE, ensuring that slots are properly reported even if the CTE returns no rows. Author: Lakshmi N <lakshmin.jhs@gmail.com> Reviewed-by: Shveta Malik <shveta.malik@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CA+3i_M8eT6j8_cBHkYykV-SXCxbmAxpVSKptjDVq+MFtpT-Paw@mail.gmail.com
The problem report was about setting GUCs in the startup packet for a physical replication connection. Setting the GUC required an ACL check, which performed a lookup on pg_parameter_acl.parname. The catalog cache was hardwired to use DEFAULT_COLLATION_OID for texteqfast() and texthashfast(), but the database default collation was uninitialized because it's a physical walsender and never connects to a database. In versions 18 and later, this resulted in a NULL pointer dereference, while in version 17 it resulted in an ERROR. As the comments stated, using DEFAULT_COLLATION_OID was arbitrary anyway: if the collation actually mattered, it should have used the column's actual collation. (In the catalog, some text columns are the default collation and some are "C".) Fix by using C_COLLATION_OID, which doesn't require any initialization and is always available. When any deterministic collation will do, it's best to consistently use the simplest and fastest one, so this is a good idea anyway. Another problem was raised in the thread, which this commit doesn't fix (see second discussion link). Reported-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/D18AD72A-5004-4EF8-AF80-10732AF677FA@yandex-team.ru Discussion: https://postgr.es/m/4524ed61a015d3496fc008644dcb999bb31916a7.camel%40j-davis.com Backpatch-through: 17
525975f to
b6f28c2
Add wait__event__start and wait__event__end probes to the DTrace
provider definition and invoke them from the static inline functions
pgstat_report_wait_start() and pgstat_report_wait_end().
Because these functions are static inline, they get inlined at every
call site (~100 locations across 36 files), leaving no function symbol
for eBPF uprobes to attach to. USDT probes solve this: the compiler
emits a nop instruction at each inlined site with ELF .note.stapsdt
metadata, allowing eBPF tools to discover and attach to all call sites
with a single probe definition.
This enables full eBPF-based wait event tracing (e.g., with bpftrace)
without requiring hardware watchpoints or PostgreSQL source patches
beyond this change.
When built without --enable-dtrace, the probes compile to do {} while(0)
with zero overhead.
PoC: covers all wait events via the two central inline functions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
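The shape of the change described in this commit message can be sketched in self-contained C. Everything prefixed with sketch_ or SKETCH_ is a hypothetical stand-in: the real probe macros are generated into probes.h, and with probes enabled they emit a nop plus .note.stapsdt metadata rather than the counter used here. The order of the store and the probe is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the generated probe macros (assumption, not probes.h). */
#ifdef SKETCH_ENABLE_PROBES
static unsigned int sketch_probe_hits = 0;
#define SKETCH_WAIT_EVENT_START(info) \
    do { (void) (info); sketch_probe_hits++; } while (0)
#define SKETCH_WAIT_EVENT_END() \
    do { sketch_probe_hits++; } while (0)
#else
/* Probes disabled: the macros expand to empty statements. */
#define SKETCH_WAIT_EVENT_START(info) do {} while (0)
#define SKETCH_WAIT_EVENT_END()       do {} while (0)
#endif

/* Hypothetical stand-in for the per-backend wait_event_info slot. */
static volatile uint32_t sketch_wait_event_info = 0;

static inline void
sketch_report_wait_start(uint32_t wait_event_info)
{
    sketch_wait_event_info = wait_event_info;   /* the existing volatile store */
    SKETCH_WAIT_EVENT_START(wait_event_info);   /* probe site, inlined into each caller */
}

static inline void
sketch_report_wait_end(void)
{
    sketch_wait_event_info = 0;
    SKETCH_WAIT_EVENT_END();
}

/* Usage: one wait bracketed by start/end, as a call site would do. */
static inline void
sketch_example_wait(void)
{
    sketch_report_wait_start(0x12345678U);      /* arbitrary illustrative value */
    assert(sketch_wait_event_info == 0x12345678U);
    sketch_report_wait_end();
    assert(sketch_wait_event_info == 0);
}
```

Because the probe macro sits inside the static inline function, it is duplicated at every call site together with the store. That is why a single probe definition can cover all inlined locations.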
b6f28c2 to
98f5105





Summary
This is a proof-of-concept patch that adds USDT (DTrace/SystemTap) static tracepoints to pgstat_report_wait_start() and pgstat_report_wait_end(), enabling complete eBPF-based wait event tracing without hardware watchpoints or additional PostgreSQL patches.
The problem
pgstat_report_wait_start() and pgstat_report_wait_end() are declared static inline in src/include/utils/wait_event.h. The compiler inlines them at every call site (~100 locations across 36 files), eliminating the function symbol from the binary. This makes standard eBPF uprobe-based tracing impossible: there is no address to attach to.
The existing DTrace probes in probes.d cover only a small subset of wait events (LWLock and heavyweight lock waits). The vast majority have no static tracepoint at all: all I/O waits (DATA_FILE_READ/WRITE, WAL_WRITE, WAL_SYNC), socket/latch waits, COMMIT_DELAY, VACUUM_DELAY, SLRU_*, replication waits, buffer lock waits, spinlock delays, io_uring waits, etc.
Why uprobes can't work here
After inlining and optimization, each call site compiles down to a single store instruction (e.g., mov [reg], imm32). Several categories of call site make this especially problematic:
- LWLockReportWaitStart() is itself static inline, wrapping pgstat_report_wait_start() (two levels of inlining, zero symbols)
- PG_WAIT_LWLOCK | lock->tranche (the value only exists in a register at runtime)
- waiteventset.c:1063 passes wait_event_info as a function argument (no argument boundary after inlining)
- bufmgr.c:5820-5834 (compiler folds the switch + store together)
- s_lock.c:148 (uprobe overhead unacceptable in tight backoff loop)
- fd.c:2083-2108 (different #ifdef branches produce different inlined layouts per platform)
The solution
USDT static tracepoints survive inlining. The compiler emits a nop instruction at each inlined call site and records its address in an ELF .note.stapsdt section. eBPF tools (bpftrace, bcc, etc.) discover the nop via ELF metadata and patch it to an int3 trap at attach time.
This patch adds two new probes to the DTrace provider definition, wait__event__start(unsigned int) and wait__event__end(), and invokes them from the static inline functions pgstat_report_wait_start() and pgstat_report_wait_end().
Usage example (bpftrace)
Zero overhead when not in use
- Without --enable-dtrace: macros compile to do {} while(0), so zero cost and zero code emitted
- With --enable-dtrace, no tracer attached: probes are nop instructions, with negligible overhead (a nop alongside the existing volatile store)
- With --enable-dtrace, tracer attached: each probe fires a software trap; this is the only configuration with measurable overhead
Benchmarking plan (proving low observer effect)
To validate production-readiness, three configurations should be benchmarked:
- ./configure (no dtrace)
- ./configure --enable-dtrace, no tracer: nop instructions at ~100 inlined sites
- ./configure --enable-dtrace, bpftrace attached
Suggested benchmark methodology
Metrics to compare: TPS, avg latency, p99 latency, CPU usage (perf stat).
Expected results:
- A nop is ~0.3ns, negligible next to the existing volatile store and the actual wait.
Prior discussion
This idea was proposed by Jeremy Schneider on pgsql-hackers:
https://www.postgresql.org/message-id/20260109202241.6d881ed0%40ardentperf.com
Related: a talk covering eBPF-based PostgreSQL wait event analysis and the challenges of inlined functions:
https://www.youtube.com/watch?v=3Gtuc2lnnsE
Changes
- src/backend/utils/probes.d: added wait__event__start(unsigned int) and wait__event__end() probe definitions
- src/include/utils/wait_event.h: added #include "utils/probes.h" and TRACE_POSTGRESQL_WAIT_EVENT_START/END calls
🤖 Generated with Claude Code
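A consumer of wait__event__start has to interpret its unsigned int argument. Assuming the probe passes the same wait_event_info value that pgstat_report_wait_start() receives (the description implies this but does not show it), the value packs a wait class in the high byte and an event id in the low 16 bits, as in PG_WAIT_LWLOCK | lock->tranche. The sketch below uses hypothetical SKETCH_/sketch_ names. The mask and class values are believed to match PostgreSQL's definitions but should be checked against the source tree being traced.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PostgreSQL's wait event encoding; verify before relying on it. */
#define SKETCH_WAIT_EVENT_CLASS_MASK 0xFF000000U
#define SKETCH_WAIT_EVENT_ID_MASK    0x0000FFFFU
#define SKETCH_PG_WAIT_LWLOCK        0x01000000U

/* Extract the wait class (high byte, left in place as PostgreSQL does). */
static inline uint32_t
sketch_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & SKETCH_WAIT_EVENT_CLASS_MASK;
}

/* Extract the event id within the class (for LWLocks, the tranche id). */
static inline uint16_t
sketch_wait_event_id(uint32_t wait_event_info)
{
    return (uint16_t) (wait_event_info & SKETCH_WAIT_EVENT_ID_MASK);
}

/* Build an LWLock wait value the way a PG_WAIT_LWLOCK | tranche call site does. */
static inline uint32_t
sketch_lwlock_wait_info(uint16_t tranche)
{
    return SKETCH_PG_WAIT_LWLOCK | tranche;
}

/* Usage: round-trip a value through the encode and decode helpers. */
static inline void
sketch_decode_example(void)
{
    uint32_t info = sketch_lwlock_wait_info(42);    /* 42 is an arbitrary tranche id */

    assert(sketch_wait_class(info) == SKETCH_PG_WAIT_LWLOCK);
    assert(sketch_wait_event_id(info) == 42);
}
```

A tracing script would apply the same two masks to the probe argument before aggregating by class or by event.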