Add wait_event_timing: Oracle-style wait event instrumentation by DmitryNFomin · Pull Request #1 · DmitryNFomin/postgres

DmitryNFomin · 2026-03-22T20:50:42Z

Summary

This patch adds opt-in per-backend instrumentation that captures the duration of every wait event reported via pgstat_report_wait_start/end. Two capture levels are exposed: aggregated per-event histograms (stats) and a 10046-style per-session trace ring (trace).

The feature is gated by a configure-time flag (--enable-wait-event-timing / meson -Dwait_event_timing=true) and a runtime GUC (wait_event_capture, SUSET, default off). When the GUC is off, the inline wait-event hot path is one global load + one predicted-not-taken branch — overhead vs vanilla is within run-to-run noise.

Motivation

PostgreSQL surfaces a rich taxonomy of wait events through pg_stat_activity.wait_event, but only as instantaneous snapshots. There is no in-tree way to answer:

"How long do my backends actually spend in each wait?"
"What was the wait sequence of the last N events in this session?"
"Do these timings differ across two query plans?"

External tools either sample at coarse resolution (pg_wait_sampling, default 10 ms) or pay a heavy debug-trap cost when reading PGPROC->wait_event_info via a hardware watchpoint (~200–300 ns per transition → ~29 % TPS on high-transition workloads).

The closest upstream analogue is Oracle's wait event interface: a clock_gettime-grade timestamp at every wait start/end, with per-event aggregates and a per-session trace ring (the "10046" trace).

What this patch adds

GUCs

GUC	type	default	summary
`wait_event_capture`	SUSET, enum	`off`	`off` / `stats` / `trace`
`wait_event_timing_max_tranches`	POSTMASTER, int	192	per-backend tranche slots for LWLock timing
`wait_event_trace_ring_size_kb`	POSTMASTER, int	4096	per-backend trace ring size, power-of-2 in 8 ... 32768

off — feature inert at runtime.
stats — accumulates per-(backend, event) (count, total_ns, max_ns, log2-bucket histogram) in a per-backend DSA slot.
trace — additionally pushes each completed wait into a fixed-size DSA-backed ring buffer per session (lossless single-writer / lossless external reader; wraps when exhausted).

SQL surface

-- per-(backend, event) stats; NULL pid = all backends
SELECT * FROM pg_stat_get_wait_event_timing(NULL);
-- columns: pid, backend_type, procnumber, wait_event_type, wait_event,
--          calls, total_time_ms, avg_time_us, max_time_us, histogram int8[]

-- trace ring for the calling backend
SELECT * FROM pg_get_backend_wait_event_trace();

-- cross-backend trace read (procnumber, not pid; survives owner exit)
SELECT * FROM pg_get_wait_event_trace(<procnumber>);

-- reset
SELECT pg_stat_reset_wait_event_timing(<pid>);
SELECT pg_stat_reset_wait_event_timing_all();

-- one-shot sweep of rings whose owning backend died ungracefully
SELECT pg_stat_clear_orphaned_wait_event_rings();

Design notes

Off-mode codegen

pgstat_report_wait_start/end are static inline in src/include/utils/wait_event.h. The inline carries a minimal gate:

*(volatile uint32 *) my_wait_event_info = wait_event_info;
#ifdef USE_WAIT_EVENT_TIMING
    if (unlikely(wait_event_capture != WAIT_EVENT_CAPTURE_OFF))
        pgstat_report_wait_start_timing(wait_event_info);
#endif

The body (writes-disabled check, lazy DSA attach, INSTR_TIME_SET_CURRENT, current-event write) lives in out-of-line pgstat_report_wait_start_timing / pgstat_report_wait_end_timing. Keeping the gate body out-of-line prevents host functions (LWLockAcquire, LockAcquireExtended, XLogInsert, ...) from being inflated by the unreachable on-mode code, which would otherwise cost register spills and worse instruction layout across the binary (~2 % off-mode penalty on the inlined variant, eliminated by this layout).

Lazy DSA attach

Backends do not touch the timing DSA until the first time they observe wait_event_capture != OFF. The DSA itself is not created until the first backend that enables capture. Invariants:

MyProc->lwWaiting check at the start of lazy_attach() — never attempt DSA work while queued on an LWLock (DSA primitives can themselves queue on an LWLock, which would assert).
wait_event_timing_writes_disabled flag raised by a before_shmem_exit callback registered at first attach, so wait events dispatched during the proc_exit cascade (e.g. by dsm_backend_shutdown follow-ons) do not touch DSA mappings that have already been torn down.
wait_event_info is re-published after lazy_attach() returns — nested wait events during the attach (dsa_attach can contend on an internal LWLock) clear it.

Trace ring lifecycle

The per-backend trace ring is a DSA segment owned by a WaitEventTraceRingHeader. When a backend exits, the segment is left behind for an external reader. Orphaned rings are swept by:

the next backend that re-attaches with the same procnumber (resets the ring), or
bgwriter's periodic sweep, or
the explicit SQL function above.

The trace ring records a fixed-size struct per wait: (seq, wait_start_ns, wait_end_ns, wait_event_info, query_id).

Lock-free reads

Per-backend storage has a single writer (the owning backend). External readers (pg_stat_get_wait_event_timing(NULL), pg_get_wait_event_trace(procno)) read counters and the trace ring without locks; they tolerate torn reads by snapshotting (seq, wait_start, wait_end) and re-checking seq. Cross-backend reset uses an atomic reset_generation; the owning backend clears its own counters at its next wait_end — no shared mutex.

Measured overhead

CCX43 (16 dedicated AMD EPYC-Milan vCPU / 64 GB RAM), pgbench -S 16c/8j/30s, 4 GB shared_buffers, 5 runs per config, mean TPS:

	BASELINE	WET / off	WET / stats	WET / trace
pgbench -S	164,859	166,840 (+1.20 %)	164,126 (−0.44 %)	165,687 (+0.50 %)
pgbench TPC-B	17,569	17,487 (−0.46 %)	17,240 (−1.87 %)	17,314 (−1.45 %)

Off-mode is within noise. Stats / trace cost is below 2 % on typical pgbench and peaks at ~4–5 % under wait-event-saturated workloads (small shared_buffers with hot-row contention).

Full per-run numbers, eviction-stress with shared_buffers = 32 MB (W1–W4), stress regression (T1–T6, E1–E7), streaming replication with wait_event_capture = trace on both primary and standby, ASAN-instrumented battery, and make check-world results are in the latest test-results comment.

Commits

8b77f0f739  Enable --enable-wait-event-timing on one Linux CI task
94a8ef68cb  Add tests for wait_event_capture instrumentation
06e3033604  Add documentation for wait_event_capture instrumentation
98b9e93243  Add per-backend wait-event timing and tracing instrumentation

The main feature commit is large because the recording path, the SQL/SRF surface, the trace ring management, and the proc init hooks are tightly coupled; splitting further would not yield commits that build or pass tests independently.

Files changed

New:

src/backend/utils/activity/wait_event_timing.c — DSA management, lazy attach, recording, SQL functions, orphan sweep
src/include/utils/wait_event_timing.h — public structs, GUC externs

Modified:

src/include/utils/wait_event.h — out-of-line gate at pgstat_report_wait_start/end
src/backend/storage/lmgr/proc.c — init/cleanup of per-backend timing slot
src/backend/storage/ipc/ipci.c — DSA registration
src/backend/utils/activity/backend_status.c — query_id pointer hookup for trace records
src/include/catalog/pg_proc.dat — SQL function registration
src/backend/utils/misc/guc_tables.c, guc_parameters.dat, postgresql.conf.sample — GUC definitions
configure.ac, meson.build, meson_options.txt, src/include/pg_config.h.in — compile-time option
src/backend/utils/activity/Makefile, meson.build — build wiring
doc/src/sgml/monitoring.sgml, config.sgml — docs
src/test/regress/... — regression test additions

Open questions for review

Configure flag vs always-on. Currently behind --enable-wait-event-timing (matching --enable-injection-points, --enable-cassert precedent). With the out-of-line gate the runtime off-mode cost is negligible; the compile flag could be dropped if there is consensus that the binary-size and feature-discoverability tradeoff is acceptable.
Trace record size. 64 bytes per record is generous (current fields fit in ~48). Reducing to 32 doubles ring depth at the cost of dropping query_id from the trace record.
Tranche hash sizing. wait_event_timing_max_tranches = 192 is a per-backend cap. Extensions that register many tranches can blow past this; today this raises a one-time WARNING and increments a per-backend overflow counter. An auto-grow path is possible but adds DSA allocation cost to a steady-state hot path.
bgwriter sweep cadence. Orphan rings are swept once per bgwriter cycle today. A long-idle bgwriter delays cleanup; alternatively the sweep could be hooked into pg_stat_reset_* callers too.

Related work

pg_wait_sampling — polling-based, coarser resolution
pg_wait_tracer — hardware-watchpoint approach, ~29 % overhead on high-transition workloads
NikolayS/postgres#18 — USDT probes (complementary, lower overhead at the cost of in-tree integration)
pgsql-hackers, Jeremy Schneider, 2026-01-09 — proposal thread

Add section to Performance documenting the proposed PostgreSQL patch (DmitryNFomin/postgres#1) that adds Oracle-style internal wait event instrumentation with near-zero overhead (<1%) vs hardware watchpoints (6-29%). Benchmark results on same environment (Hetzner cx43, 8 vCPU): - TPC-B: <1% vs 6% (hardware watchpoint) - SELECT-only worst case: <1% vs 29% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ocation The WaitEventTraceState array was 4 MB per backend allocated unconditionally at startup via ShmemInitStruct (400 MB at max_connections=100). Replace with DSA-backed lazy allocation: a small WaitEventTraceControl struct (~8 bytes per backend) lives in fixed shmem, and the 4 MB ring buffer is allocated via dsa_create/dsa_allocate only when a backend SETs wait_event_trace = on. Key changes: - Add WaitEventTraceControl with dsa_handle, LWLock, per-backend dsa_pointers - Add LWTRANCHE_WAIT_EVENT_TRACE_DSA for the control LWLock - Lazy DSA creation on first use (any backend), attach on subsequent backends - GUC assign hook (assign_wait_event_trace) triggers allocation on enable - Cleanup via before_shmem_exit callback (runs before dsm_backend_shutdown detaches DSA segments, avoiding the use-after-detach crash that would occur if dsa_free ran from ProcKill's on_shmem_exit callback) - Fix pg_stat_get_wait_event_trace proisstrict: was defaulting to true, causing NULL argument (own backend) to skip the C function entirely Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

Address all 20 issues identified by cross-checking Gemini 3.1, Opus 4.6, and GPT 5.4 reviews against the actual PR code. Each fix follows the approved plan in crosscheck_review_fix_3.md. Security / correctness fixes: Issue #1 (UAF in trace ring reads): - Remove pg_stat_get_wait_event_timing_by_query() entirely (OID 9951, view, docs, tests) -- cross-backend DSA ring scanning was the primary use-after-free surface. - Restrict pg_stat_get_wait_event_trace() to own-session only; the backend_id argument is accepted for forward compat but ignored. - Wrap DSA pointer mutations (attach/detach) in WaitEventTraceCtl->lock LW_EXCLUSIVE to protect external readers and fix 32-bit torn pointers (also resolves Issue #11). - Document the ring buffer as an external lock-free transport with a seqlock reader contract in wait_event_timing.h and monitoring.sgml. Issue #6 (cross-backend reset race): - Add per-backend LWLock to WaitEventTimingState, following the pgstat_reset_entry() pattern from pgstat_shmem.c. - Writers acquire LW_SHARED on the hot path; reset acquires LW_EXCLUSIVE. - Init and exit paths individually zero fields instead of memset on the whole struct, preserving the LWLock. - Register LWTRANCHE_WAIT_EVENT_TIMING in lwlocklist.h. Build / startup fixes: Issue #2: Replace non-existent wait_classes.h include with wait_event_types.h in wait_event_timing.h. Issue #3 (config-file trace broken): Add explicit wait_event_trace_attach() calls after ResourceOwner init in auxprocess.c (after CreateAuxProcessResourceOwner) and postinit.c (after process_settings) to handle inherited GUC values that do not re-fire assign hooks after fork. Issue #10 (non-timing build silent suppression): Emit WARNING before forcing wait_event_timing/wait_event_trace to off in check hooks for non-timing builds (source < PGC_S_INTERACTIVE path). Issue #14 (EXEC_BACKEND init order): Add comment in pgstat_set_wait_event_timing_storage() confirming SubPostmasterMain calls CreateSharedMemoryAndSemaphores() before InitProcess(). Query lifecycle / phase separation: Issue #4 (QUERY_END broken for multi-statement): - Hybrid approach: emit QUERY_END for old query and QUERY_START for new query in pgstat_report_query_id() transitions (backend_status.c). - Keep existing QUERY_END in send_ready_for_query (postgres.c) for the last-query-to-idle boundary. - Covers simple single/multi-stmt, extended, prepared, and pipeline protocols without double-emit. Issue #5 (phase separation): Add TRACE_EXEC_END marker (type=4) emitted from ExecutorEnd() in execMain.c, separating execution from post-exec cleanup (commit, WAL flush, lock release). Also resolves Issue #13 (extended protocol phase misattribution). Overflow and dependency warnings: Issue #7: One-time WARNING on first LWLock hash overflow (>192 tranches). Issue #8: WARNING in assign_wait_event_trace when track_activities is off; document the dependency in config.sgml. Issue #12: Add flat_overflow_count field + one-time WARNING when eventId exceeds per-class slot count. Header cleanup and code generation: Issue #15: Move static const arrays and internal inline functions from wait_event_timing.h to wait_event_timing.c as file-static. Convert trace marker inlines to extern functions with a shared wait_event_trace_write_marker() helper. Issue #16 (parallel array validation): Extend generate-wait_event_types.pl to emit wait_event_timing_data.c (mapping arrays) and the three sizing #defines into wait_event_types.h, generated from wait_event_names.txt. Eliminates hand-maintained parallel arrays entirely. Update Makefile, meson.build, and pgindent exclusions. Documentation: Issue #17: Document InitMaterializedSRF usage as acceptable for both the timing SRF (bounded result set) and trace SRF (own-session only, deferred max_records parameter). Issue #18: Document ring wrap limitation, polling contract, and st_query_id fallback in wait_event_timing.h and monitoring.sgml. Issue #19 nits: Remove duplicate header include, remove install/ entries from .gitignore, add tranche_id 65535 academic comment, add eventId-equals-array-index comment, verify OID range (9950/9952/9953). Testing: Issue #20: - Add trace lifecycle regression test (off/on/query/read/off cycle). - Add auxiliary process timing test (verify checkpointer/bgwriter/ walwriter accumulate non-zero wait event counts). - Add test_lwlock_hash_overflow() to test_wait_event_stress module: registers >192 custom LWLock tranches to exercise hash overflow path and verify WARNING emission. - Document that default Cirrus CI already exercises non-timing build path via wait_event_timing_1.out alternate expected output. Issues resolved by other fixes (no separate action needed): Issue #9: Resolved by Issue #1 (by_query view removed). Issue #11: Resolved by Issue #1C (LWLock around DSA pointer mutations). Issue #13: Resolved by Issue #5 (EXEC_START/EXEC_END markers). Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

Four small, independently-justified fixes from the consolidated review in review_5.md. All verified by `make check` (249/249 passing, including the wait_event_timing regression test). #1 Stale docstring on pg_stat_reset_wait_event_timing (wait_event_timing.c) Two stacked docstrings: the first described an obsolete backend_id-based API ("0 = own, -1 = all") that does not exist; the function takes a PID. Merge them into one block that uses the correct PID semantics and preserves the async-protocol + reset_count-polling guidance from the obsolete one. #2 Spurious wait_event_* includes in ipci.c Two #includes (wait_event.h, wait_event_timing.h) referenced no symbols in ipci.c. The WaitEvent*ShmemCallbacks externs come via storage/subsystems.h -> subsystemlist.h. Removing both leaves ipci.c building cleanly and shrinks its include graph. #3 Missing WARNING for wait_event_capture=trace + compute_query_id=off (wait_event_timing.c, config.sgml) assign_wait_event_capture already warns when track_activities is off; mirror the same warning for compute_query_id, since markers silently skip when query_id is 0. Verified live: setting compute_query_id=off then wait_event_capture=trace now produces the expected WARNING + HINT pointing at pg_stat_statements as a common enabler. #4 Trace doc only described QUERY markers; EXEC markers were silent (config.sgml, monitoring.sgml) The trace ring emits four marker types, not two. ExecStart/ ExecEnd bracket every executor invocation and are NOT gated on track_activities, while QueryStart/QueryEnd cover inter-statement and idle-transition boundaries that the executor cannot. Restate the prereqs in config.sgml as an itemized list per family, update monitoring.sgml's view description and consumer guidance to name ExecStart as the primary attribution signal and QueryStart as the fallback for boundaries the executor cannot mark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable from the patch, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. Two new SQL functions are added in src/backend/utils/activity/ wait_event_timing.c (oids 9962, 9963): pg_get_wait_event_trace(procnumber int4) Canonical cross-backend reader. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or slots whose generation counter kept changing during the snapshot (very rare slot reassignment race -- bounded retry, then empty). pg_get_wait_event_trace_for_pid(pid int4) Convenience wrapper that resolves pid -> procNumber via BackendPidGetProc(). Cannot read ORPHANED slots by design: a dying backend's pid is removed from the proc array on exit, so the pid -> procNumber mapping does not survive backend exit. Callers who want post-mortem readability use pg_get_wait_event_trace directly with procNumber. Both functions implement the cross-backend reader protocol documented on WaitEventTraceControl in src/include/utils/wait_event_timing.h: snapshot the slot's generation, read slot state atomically, acquire WaitEventTraceCtl->lock in LW_SHARED, resolve ring_ptr, memcpy the records[] array into a local palloc'd buffer (~4 MB upper bound), release the lock, recheck generation, retry on mismatch, then process records via the per-record seqlock protocol off the lock. Shared helper emit_wait_event_trace_for_procnumber() factors the common snapshot+emit code; the two SRFs are thin wrappers. Privilege model: REVOKE EXECUTE from PUBLIC and GRANT to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace (established in review_5.md issue #27, commit 34015c1). Wait-event traces are stats-class data and the symmetric privilege story is the least surprising to monitoring tooling. Test coverage: * src/test/regress/sql/wait_event_timing.sql: smoke tests that both SRFs are callable from SQL and produce sensible results for own session, unknown pid (empty), and out-of-range procnumbers (empty). Both .out variants (timing and stub builds) updated. * src/test/modules/test_wait_event_stress/t/ 001_orphan_roundtrip.pl: end-to-end TAP test that exercises the orphan-readable code path. A writer session enables trace, generates wait events, captures its procnumber, and disconnects; a separate long-lived reader session then calls pg_get_wait_event_trace on the writer's procnumber and asserts the records are visible after the writer has exited. Verifies pg_stat_clear_orphaned_wait_event_rings() releases the orphan and that a subsequent read returns empty. The test keeps a dedicated reader BackgroundPsql open for the entire run; this prevents the reader's own procNumber from being a recycle target for the writer's procNumber, which would otherwise trigger wait_event_trace_clear_orphan_at_init() and reclaim the orphan before the test can read it. pg_sleep calls are wrapped in a DO block with PERFORM so they do not emit empty result rows that would contaminate the captured procnumber. pg_stat_get_ wait_event_timing exposes backend_id (procNumber + 1 by PG convention), so the test subtracts 1 before passing to pg_get_wait_event_trace. Documentation: new entries in monitoring.sgml under monitoring-stats-funcs for both SRFs, plus a clarifying update to the pg_get_backend_wait_event_trace entry that points readers at the new cross-backend SRFs. Catversion bump for the two new pg_proc OIDs. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-wait-event-timing --enable-debug --enable-cassert --enable-tap-tests: full src/test/regress suite (249 tests) passes, test_wait_event_stress regress passes, and the new TAP test passes (6/6 subtests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Implements the documented snapshot pattern: sample slot->generation, atomic state read, LW_SHARED on WaitEventTraceCtl->lock, dsa_get_address, palloc only the slice actually used by write_pos and memcpy under lock, release lock, re-sample generation, retry on mismatch (bounded by WET_READ_MAX_RETRIES=3, then NOTICE + empty so callers can distinguish a stable empty result from a lost race), then per-record seqlock processing off the lock. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow both used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions (post-PG-17 procnumber introduction). Defensive fix in wait_event_trace_attach (review_6.md issue #5 collateral): The Assert(state_now != WET_TRACE_SLOT_ORPHANED) at attach time was too strict. A backend can legitimately observe its own slot in the ORPHANED state if a later before_shmem_exit callback (pgstat_io_flush_cb during proc_exit shutdown) contends on an LWLock, which emits a wait event that lands in pgstat_report_wait_end_timing -> wait_event_trace_attach after wait_event_trace_before_shmem_exit has already run and transitioned the slot. The Assert crashes parallel workers reliably on exit when capture=trace. Replaced with defensive fall-through: if ORPHANED is observed at attach time, do not re-attach (the ring is now post-mortem; the writer invariant must hold). This adds in-test coverage for the issue #5 case via the parallel-worker scenario below. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via DO block with explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Two scenarios: (1) Plain-backend orphan roundtrip: writer enables trace, emits waits, captures its procnumber, disconnects; long-lived reader (separate session, pinning its own slot to prevent procnumber recycle of the writer's slot) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases it; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation): a forced-parallel query (debug_parallel_ query=on plus zero parallel costs) runs and exits; the leader and at least one worker each leave an ORPHANED slot. The reader iterates procnumber 0..150 and asserts at least two slots have readable records. Race-hardening: max_connections=100 in the test cluster plus a permanently-OWNED reader slot minimise the window in which an unrelated new backend could recycle a just-exited writer's procnumber and trigger wait_event_trace_clear_orphan_at_init(). Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (7 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_trace_attach The Assert(state_now != WET_TRACE_SLOT_ORPHANED) inside wait_event_trace_attach assumed that ORPHANED could never be observed at attach time, on the reasoning that pgstat_set_wait_event_timing_storage() at backend init calls wait_event_trace_clear_orphan_at_init() which demotes any inherited orphan to FREE before any wait-event capture path can run. That invariant holds at backend INIT. It does NOT hold during backend EXIT: after wait_event_trace_before_shmem_exit (registered as a before_shmem_exit callback) has transitioned the slot to ORPHANED, a LATER before_shmem_exit callback can contend on an LWLock that emits a wait event. In particular, pgstat_io_flush_cb during proc_exit shutdown takes shared locks that may block under load; LWLockAcquire's pgstat_report_wait_start/end machinery then lands in pgstat_report_wait_end_timing -> wait_event_trace_attach after my_wait_event_trace has been cleared. The Assert fires and the backend aborts. This is reliably reproducible with parallel workers: a forced- parallel query under wait_event_capture = trace produces multiple workers that all exit at end-of-parallel-query, each of which hits the Assert during shmem_exit. Replace the Assert with a defensive runtime check: if state == ORPHANED at attach time, skip the re-attach. We are dying, the ring is now post-mortem data for cross-backend readers, and the single-writer invariant must hold; the wait event that triggered this attach path is not traced (a fully acceptable tradeoff -- we are in the middle of shutting down). review_6.md issue #5 asked for exactly this kind of defense-in- depth at module boundaries; that review framed it as a style nit ("Assert is debug-only, defensive runtime check is better"), but the parallel-worker shutdown sequence promotes it to a correctness bug that crashes the backend in cassert builds. Test coverage: the parallel-worker scenario in src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl (added in the follow-up commit for review_6.md issue #1) exercises this exact shutdown path; the same TAP test crashed the parallel worker before this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Implements the documented snapshot pattern: sample slot->generation, atomic state read, LW_SHARED on WaitEventTraceCtl->lock, dsa_get_address, palloc only the slice actually used by write_pos and memcpy under lock, release lock, re-sample generation, retry on mismatch (bounded by WET_READ_MAX_RETRIES = 3, then NOTICE + empty so callers can distinguish a stable empty result from a lost race), then per-record seqlock processing off the lock. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions (post-PG-17 procnumber introduction). Retry-loop buffer sizing: emit_wait_event_trace_for_procnumber pallocs local_records sized for the slice actually needed (min(write_pos, RING_SIZE)) rather than always 4 MB. On retry, if a different owner has taken the slot with a larger write_pos, the buffer is pfree'd and re-allocated at the new size; the buffer is never shrunk across retries. This avoids both the 4 MB-per-call cost on nearly-empty rings (the common case for short-lived backends and post-mortem orphan reads) and the buffer-overflow that would occur if we kept a too-small buffer across a size-growing retry. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Two scenarios: (1) Plain-backend orphan roundtrip. A writer enables trace, emits waits, captures its procnumber via the procnumber column of pg_stat_get_wait_event_timing, disconnects. A long-lived reader session (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings() releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; the leader and at least one worker each leave an ORPHANED slot. We count them via the admin sweep -- cheaper and more explicit than iterating procnumbers and pg_get_wait_event_trace'ing each one (the read path itself is already covered by scenario 1). Race-hardening: max_connections=100 in the test cluster plus a permanently-OWNED reader slot minimise the window in which an unrelated new backend could recycle a just-exited writer's procnumber and trigger wait_event_trace_clear_orphan_at_init(). Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (7 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Cheap unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration (the generation-counter retry that external readers would do becomes unnecessary for this in-tree caller). 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. Iterate every live ring index, applying the per-record seqlock protocol AGAINST SHARED MEMORY: read seq_before, barrier, structure-copy the record into a local result buffer, barrier, read seq_after. Skip records with odd seq_before (writer mid-record at start of read) or with seq_before != seq_after (writer completed a write cycle during our read; payload may be torn). 5. Release the lock. 6. Walk the local result buffer and emit rows into the tuplestore -- the only potentially I/O-expensive part, deliberately done outside the lock so spills do not extend lock-hold. Why per-record seqlock against shared memory (not against a memcpy'd local copy of the ring): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES, with the payload read in between. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time, the check degenerates to a no-op, and torn reads from concurrent writers cannot be detected. For an own-session reader (pg_get_backend_wait_event_trace) the concurrent-writer case does not arise; the cross-backend reader must do the protocol correctly to handle live OWNED slots safely. Lock-hold is O(records_in_range) in the iterator (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O; the tuplestore work runs after release. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via a DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures its procnumber from the procnumber column of pg_stat_get_wait_event_timing, disconnects. A long-lived reader (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; leader and at least one worker each leave an ORPHANED slot, verified via the admin sweep count. (3) OWNED-slot read with a concurrent live writer. A long-lived writer emits a steady stream of PgSleep events in a 1000-iteration loop while the reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type or wait_event, no negative duration); a single malformed row would indicate the seqlock failed to detect a torn write. This is the scenario the per-record seqlock protocol was rewritten to handle correctly. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (10 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pg_atomic_write_u64 has no barrier semantics (see the header comment in atomics.h: "No barrier semantics."), so on weak-memory architectures (ARM, POWER) the writer's pg_atomic_write_u64 to write_pos can be reordered after the subsequent plain stores to rec->seq. A cross-backend reader observing the new write_pos would then iterate i = pos and read rec[pos & MASK].seq while it still holds the PREVIOUS ring cycle's even value. The reader's seqlock check would falsely pass (the seq is stable across the read because the writer hasn't yet bumped it to odd from this cycle's perspective) and the reader would emit a stale-but-valid record from the previous ring cycle, mislabeled with the new ring index. On x86 this is masked by TSO; on ARM/POWER it is a real data- visibility bug that manifests as the cross-backend reader returning rows tied to the wrong ring positions. Add pg_write_barrier between the pg_atomic_write_u64(&write_pos, pos+1) and the first rec->seq = odd store at both writer sites (the wait-event hot path and the marker writer). This is the release half of an acquire/release pair; the matching pg_read_barrier on the reader side lives in emit_wait_event_trace_for_procnumber where the cross-backend SRF reads write_pos before iterating the per-record seq fields. This barrier is on the per-event hot path but adds only a CPU barrier (no atomic RMW, no cache-line trip), so the cost is within the noise of the existing seqlock writes already done by the function. No new tests: this is a weak-memory ordering fix that x86-only testing cannot exhibit. Catching it would require ARM/POWER stress testing of the cross-backend reader against an actively writing backend (the TAP scenario "OWNED-slot read with a concurrent live writer" added in the follow-up commit for review_6.md issue #1 is the closest thing in tree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. pg_read_barrier() to pair with the writer's pg_write_barrier between its pg_atomic_write_u64(&write_pos) and the rec->seq update (added in the predecessor commit). Without this acquire half of the pair, weak-memory readers could observe the new write_pos while still seeing the previous ring cycle's even seq value and emit stale-but- valid records with wrong ring indices. 5. Iterate every live ring index, applying the per-record seqlock protocol AGAINST SHARED MEMORY: read seq_before, barrier, structure-copy the record into a local result buffer, barrier, read seq_after. Skip records with odd seq_before (writer mid-record at start of read) or with seq_before != seq_after (writer completed a write cycle during our read; payload may be torn). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore -- the only potentially I/O-expensive part, deliberately done outside the lock so spills do not extend lock-hold. Why per-record seqlock against shared memory (not against a memcpy'd local copy of the ring): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES, with the payload read in between. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time, the check degenerates to a no-op, and torn reads from concurrent writers cannot be detected. For an own-session reader (pg_get_backend_wait_event_trace) the concurrent-writer case does not arise; the cross-backend reader must do the protocol correctly to handle live OWNED slots safely. Lock-hold is O(records_in_range) in the iterator (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O; the tuplestore work runs after release. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via a DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures its procnumber, disconnects. A long-lived reader (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; leader and at least one worker each leave an ORPHANED slot, verified via the admin sweep count. (3) OWNED-slot read with a concurrent live writer. A long-lived writer emits a steady stream of PgSleep events in a 1000-iteration loop while the reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type or wait_event, no negative duration); a single malformed row would indicate the seqlock failed to detect a torn write. This is the scenario the per-record seqlock protocol was written to handle correctly. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (10 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lazy-attach hot-path entry points pgstat_wait_event_timing_lazy_ attach() and wait_event_trace_attach() both allocate memory via dsa_attach / dsa_create / dsa_allocate_extended, which ultimately reach MemoryContextAlloc. MemoryContextAlloc Assert-fails on "CritSectionCount == 0 || allowInCritSection", so any backend whose very first wait event after wait_event_capture is enabled lands inside a critical section is killed by the assertion in cassert builds (or risks heap corruption in production builds). Reproducer: a parallel worker that has not yet emitted any wait events scans a heap page, hits BufferSetHintBits16 -> XLogSaveBufferForHint -> XLogInsert (which opens a critical section) -> LWLockAcquire, and the wait-end-timing hook routes to the lazy attach inside the critical section. Reliably triggered by the new "parallel-worker orphan roundtrip" TAP scenario added for review_6.md issue #1. Fix: check CritSectionCount at the top of each lazy-attach function and return early if non-zero. This silently drops the in-flight wait event (it is neither timed nor traced) but keeps the backend alive; the next wait event outside any critical section retries the attach and succeeds, after which the hot path no longer routes through the lazy-attach function. Critical sections are brief by design (a few dozen instructions typically), so the dropped wait events are themselves a few- nanosecond window of unobservable activity -- well within the quantisation noise of any wait-event-based profiler. This matches the behaviour PG already has for other "skipped under critical section" optimisations. Both timing and trace lazy attaches are fixed symmetrically; the two functions share a near-identical hot path entry shape and both need the same guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. pg_read_barrier() to pair with the writer's pg_write_barrier() between its pg_atomic_write_u64(&write_pos) and the rec->seq update (added in the predecessor commit). 5. Iterate every live ring index, applying the per-record POSITION-ENCODED IDENTITY SEQLOCK against shared memory: expected_seq = (uint32)(i * 2 + 2); seq_before = rec_shared->seq; pg_read_barrier(); if (seq_before != expected_seq) continue; local_copy = *rec_shared; pg_read_barrier(); seq_after = rec_shared->seq; if (seq_after != expected_seq) continue; The writer encodes ring position into the seq value: position pos uses (pos*2+1) for mid-write and (pos*2+2) for complete. After RING_SIZE writes the slot wraps and the same memory location gets a new numerically-distinct seq. A parity-only seqlock check (skip odd, accept stable even) is INSUFFICIENT for this layout in the cross-backend case: if the writer advanced write_pos but the rec->seq update has not yet propagated to this CPU's cache, the reader sees the previous cycle's stable even seq, both seq_before and seq_after match, and a record belonging to the previous cycle gets emitted with the new ring_index -- silent attribution corruption. The identity-against-expected check rejects all four failure modes (stale prior cycle, mid-write current cycle, wrapped- past-us next cycle, torn write completed mid-read). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore. Tuplestore work (potential disk spill) happens outside the lock so it does not extend lock-hold. Why per-record seqlock against shared memory (not a memcpy'd local copy): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time; the check degenerates to a no-op. Lock-hold is O(records_in_range) shared-memory loads (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: non-pg_read_all_stats role gets ERROR (caught via DO block with explicit handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures procnumber, disconnects. Reader reads ORPHANED ring; admin sweep releases; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). Forced-parallel query with 1M rows and ORDER BY (routes tuples through shm_mq queues, ensuring workers emit MessageQueueSend/MessageQueueReceive waits and lazily allocate their trace rings before exit). Asserts pg_stat_clear_orphaned_wait_event_rings returns >= 2 (leader + at least one worker). Includes a drain- wait loop to avoid racing with worker before_shmem_exit callbacks. (3) OWNED-slot read with a concurrent live writer. Writer emits ~1000 PgSleep waits in a tight loop; reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type, wait_event, no negative duration). WaitEventTraceControl protocol documentation in wait_event_timing.h is rewritten to describe the position-encoded identity seqlock (not the previous parity-only memcpy-then-check pattern, which was structurally broken for cross-process readers) so external authors building ASH/AWR background workers have correct guidance. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to pg_get_backend_wait_event_trace entry. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (11 subtests) pass in the timing build (10 stress reruns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lazy-attach hot-path entry points pgstat_wait_event_timing_lazy_ attach() and wait_event_trace_attach() both allocate memory via dsa_attach / dsa_create / dsa_allocate_extended, which ultimately reach MemoryContextAlloc. MemoryContextAlloc Assert-fails on "CritSectionCount == 0 || allowInCritSection", so any backend whose very first wait event after wait_event_capture is enabled lands inside a critical section is killed by the assertion in cassert builds (or risks heap corruption in production builds). Reproducer: a parallel worker that has not yet emitted any wait events scans a heap page, hits BufferSetHintBits16 -> XLogSaveBufferForHint -> XLogInsert (which opens a critical section) -> LWLockAcquire, and the wait-end-timing hook routes to the lazy attach inside the critical section. Reliably triggered by the parallel-worker scenario added in the follow-up commit for review_6.md issue #1. Fix: check CritSectionCount at the top of each lazy-attach function and return early if non-zero. This silently drops the in-flight wait event (it is neither timed nor traced) but keeps the backend alive; the next wait event outside any critical section retries the attach and succeeds, after which the hot path no longer routes through the lazy-attach function. Critical sections are brief by design, so the dropped wait events are themselves a few-nanosecond window of unobservable activity -- well within the quantisation noise of any wait- event-based profiler. Both timing and trace lazy attaches are fixed symmetrically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Allocate worst-case result buffer BEFORE taking the lock so the palloc -- which can bottom out in a glibc mmap syscall for the ~5 MB worst-case size -- runs without holding the WaitEventTraceCtl lock. 3. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 4. Resolve ring_ptr via dsa_get_address, read write_pos. 5. Iterate every live ring index, applying the per-record POSITION-ENCODED IDENTITY SEQLOCK against shared memory: expected_seq = (uint32)(i * 2 + 2); seq_before = rec_shared->seq; pg_read_barrier(); if (seq_before != expected_seq) continue; local_copy = *rec_shared; pg_read_barrier(); seq_after = rec_shared->seq; if (seq_after != expected_seq) continue; The writer encodes ring position into the seq value: position pos uses (pos*2+1) for mid-write and (pos*2+2) for complete. After RING_SIZE writes the slot wraps and gets a new numerically-distinct seq. A parity-only seqlock check (skip odd, accept stable even) is INSUFFICIENT for this layout in the cross-backend case: if the writer advanced write_pos but the rec->seq update has not yet propagated to this CPU's cache, the reader sees the previous cycle's stable even seq, both seq_before and seq_after match, and a record belonging to the previous cycle gets emitted with the new ring_index -- silent attribution corruption. The identity-against-expected check rejects all four failure modes (stale prior cycle, mid-write current cycle, wrapped- past-us next cycle, torn write completed mid-read). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore. Tuplestore work (potential disk spill) happens outside the lock so it does not extend lock-hold. Why per-record seqlock against shared memory (not a memcpy'd local copy): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time; the check degenerates to a no-op. Lock-hold is O(records_in_range) shared-memory loads (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O and no syscalls. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. A writer-side INJECTION_POINT("wait-event-trace-after-write-pos") is added between the pg_atomic_write_u64(&write_pos) and the rec->seq update; it is compiled out unless --enable-injection- points is set, and is intended as a hook for a future regression test of the position-encoded identity seqlock. Wiring a TAP test through this is fiddly enough to defer to a follow-up; the identity check is correct by construction and the protocol is fully documented on WaitEventTraceControl. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: non-pg_read_all_stats role gets ERROR (caught via DO block with explicit handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures procnumber, disconnects. Reader reads ORPHANED ring; admin sweep releases; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). Forced-parallel query with 1M rows and ORDER BY (routes tuples through shm_mq queues, ensuring workers emit MessageQueueSend/MessageQueueReceive waits and lazily allocate their trace rings before exit). Asserts pg_stat_clear_orphaned_wait_event_rings returns >= 2 (leader + at least one worker). Includes a drain- wait loop to avoid racing with worker before_shmem_exit callbacks. (3) OWNED-slot read with a concurrent live writer. Writer emits ~1000 PgSleep waits in a tight loop; reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed. WaitEventTraceControl protocol documentation in wait_event_timing.h is rewritten to describe the position-encoded identity seqlock (not the previous parity-only memcpy-then-check pattern, which was structurally broken for cross-process readers) so external authors building ASH/AWR background workers have correct guidance. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to pg_get_backend_wait_event_trace entry. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (11 subtests) pass in the timing build with 5/5 stress reruns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…raph The monitoring.sgml entry for pg_backend_wait_event_trace contained a slot-lifecycle paragraph claiming that orphan- persistence is "what makes the view above usable for parallel- worker observability." The reviewer correctly pointed out that the view is session-local: pg_backend_wait_event_trace always reads my_wait_event_trace (the calling backend's own ring) and never touches an orphan in another procnumber's slot. Orphan- persistence has zero effect on the view's behaviour. The actual beneficiaries of orphan-persistence are the cross- backend interfaces: * pg_get_wait_event_trace(procnumber) -- the in-tree SRF added for review_6 issue #1, which reads OWNED and ORPHANED slots uniformly. * External background workers (ASH/AWR/10046-style ring readers) following the cross-backend snapshot pattern documented on WaitEventTraceControl. Rewrite the paragraph to: * Drop the misleading "view above" attribution. * Add an explicit "This does not change the behaviour of this view, which always reads the calling backend's own ring and is unaffected by orphan-state slots belonging to other procnumbers." sentence, so a reader following the link from a parallel-worker observability question doesn't form the wrong mental model. * Reference pg_get_wait_event_trace by name and link to the monitoring-stats-funcs section. * Reattribute the parallel-worker-exit-fast rationale correctly: it explains *why* orphan-persistence matters in the first place (the data would otherwise be lost), not why this view is suddenly different. No code changes. Doc build (check-tabs, check-nbsp) and main regress (249 tests) still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s/comments review_6.md issue #9 flagged that documentation comments over-reference "future ASH/AWR/10046-style background workers" as justification for the cross-backend trace surface. On -hackers this reads as speculative -- naming hypothetical consumers to motivate present code. PG style prefers factual statements about what the code does, with neutral framing for consumers ("any extension that reads via the snapshot pattern"). The framing is also now doubly stale: since review_6 was written we landed pg_get_wait_event_trace (review_6.md issue #1) as the in-tree cross-backend consumer. The "future" tense is wrong as well as speculative. Sweep through wait_event_timing.h, wait_event_timing.c, and monitoring.sgml. In each speculative reference replace the "future ASH/AWR/10046-style background workers" phrasing with neutral framing that: * names the in-tree consumer (pg_get_wait_event_trace) where that is what callers should use today, * refers to "extensions following the snapshot pattern documented on WaitEventTraceControl" for the direct-shared-memory case, * drops the speculative consumer enumeration entirely. References to "10046-style trace" as a descriptive label for the technique (e.g. "Per-session wait event trace ring buffer (10046-style)" on WaitEventTraceState) are retained -- that is a recognised term for the per-session-event-trace profiling style, used here as a descriptor of the design, not as a forward-looking promise about a particular consumer. Files touched: * doc/src/sgml/monitoring.sgml: pg_backend_wait_event_trace cross-backend-tools paragraph, pg_get_wait_event_trace description, and pg_stat_clear_orphaned_wait_event_rings description. * src/include/utils/wait_event_timing.h: slot-lifecycle comment block (3 spots). * src/backend/utils/activity/wait_event_timing.c: wait_event_trace_before_shmem_exit doc, the pg_backend_wait_event_trace SRF doc, the pg_get_wait_event_trace SRF doc, and the pg_stat_clear_orphaned_wait_event_rings doc. No code change. Pure documentation rewrite. Builds clean in both --enable-wait-event-timing and stub configurations. Main regress (249 tests) and test_wait_event_stress TAP (15 subtests) pass on the timing build; main regress (249 tests) passes on the stub build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DmitryNFomin · 2026-05-15T15:33:49Z

Full regression battery for `8b77f0f`

All tests below were run against 8b77f0f739 on a dedicated benchmark machine (CCX43 — 16 dedicated AMD EPYC Milan vCPU / 64 GB RAM). Includes a new eviction-stress suite designed to maximise the wait-event transition rate.

1. pgbench `-S` (read-only, 16c/8j/30s, 4 GB shared_buffers) — 5 runs each

run	BASELINE	WET/off	WET/stats	WET/trace
1	163,656	165,540	162,655	166,045
2	162,859	167,232	164,518	165,465
3	165,870	167,471	163,137	166,766
4	165,444	167,534	165,016	167,480
5	166,466	166,425	165,303	162,679
mean TPS	164,859	166,840	164,126	165,687
vs BASELINE	—	+1.20 %	−0.44 %	+0.50 %

All three modes are within run-to-run noise of vanilla — the compile-time penalty is gone.

2. pgbench TPC-B (16c/8j/30s) — 5 runs each

run	BASELINE	WET/off	WET/stats	WET/trace
1	17,611	17,414	17,297	17,251
2	17,520	17,576	17,212	17,234
3	17,554	17,596	17,125	17,428
4	17,485	17,416	17,262	17,304
5	17,673	17,434	17,303	17,351
mean TPS	17,569	17,487	17,240	17,314
vs BASELINE	—	−0.46 %	−1.87 %	−1.45 %

3. evict_stress — new high-wait-event-transition-rate suite

shared_buffers = 32 MB versus a ~1.5 GB working set forces ~98 % buffer miss; every miss generates a BufferIO / DataFileRead wait event. Combined with high concurrency this drives the wait-event hot path at maximum rate.

test	scenario	mode	TPS	Δ vs off	alive
W1	32 MB SB + pgbench `-S` 16c	off	116,487	—	✓
		stats	116,625	+0.1 %	✓
		trace	114,844	−1.4 %	✓
W2	32 MB SB + TPC-B 16c	off	16,813	—	✓
		stats	16,766	−0.3 %	✓
		trace	16,698	−0.7 %	✓
W3	32 MB SB + pgbench `-S` 32c oversubscribed	off	162,845	—	✓
		stats	158,880	−2.4 %	✓
		trace	159,141	−2.3 %	✓
W4	32 MB SB + hot-row UPDATE 32c	off	8,138	—	✓
		stats	7,772	−4.5 %	✓
		trace	7,814	−4.0 %	✓

Sample of recorded wait events under W1/trace verifies the timing path actually captures the expected work:

 wait_event_type |    wait_event
-----------------+-------------------
 IO              | DataFileRead       (dominant — buffer-miss reads)
 Activity        | BgwriterMain
 Activity        | WalWriterMain
 IO              | WalWrite
 LWLock          | WaitEventTraceDSA
 IO              | AioIoCompletion

No crashes in any of the 12 sub-runs across the four eviction-stress scenarios. Recording overhead under maximum-pressure scenarios peaks at ~4.5 % in W4 (where every transaction contends on the same row → many lock-wait transitions per tx). This is the real cost of writing per-event statistics, bounded and proportional to the wait-event rate.

4. Stress regression (stress2.sh, T1–T6)

#	scenario	result
T1	FOR UPDATE 16c × 5s × {off, stats, trace}	alive ✓
T2	`pgbench -C` connect-storm × {off, trace}	alive ✓
T3	`pg_ctl stop -m immediate` recovery	alive ✓
T4	`pg_terminate_backend` storm	alive ✓
T5	rapid mode flips during load	alive ✓
T6	UPDATE contention × {off, stats, trace}	alive ✓

No PANIC, no segfault.

5. Additional stress (extra_stress.sh, E1–E7)

#	scenario	result
E1	50 short-lived backends + `pg_stat_clear_orphaned_wait_event_rings()`	swept 50/50 ✓
E2	5 concurrent sessions × 20 forced-parallel aggregations	alive ✓
E3	150 concurrent `pg_get_wait_event_trace()` reads vs writer	alive ✓
E4	tiny ring (8 KB = 256 records) under pgbench `-S`	alive ✓
E5	huge ring (32 MB = 1M records) under pgbench `-S`	alive ✓
E6	`VACUUM ANALYZE` × 5 concurrent with TPC-B	alive ✓
E7	mode-flip storm — flip every ~1.2 s for 30 s during load	alive ✓

6. Streaming replication (repl_test.sh)

Primary + hot-standby both with wait_event_capture = trace.

metric	value
pgbench on primary	12,257 TPS
standby `count(*)` after replay	1,000,000 rows ✓
walreceiver / startup wait events captured	yes (`WalReceiverMain`, `RecoveryConflictSnapshot`)
crash in either log	none

7. check-world (full upstream regression suite)

All tests successful.
Result: PASS

Both make check and TAP-suite tests passed end-to-end (set -e would have aborted on any failure).

8. ASAN — rebuilt with new source

inst-asan rebuilt from 8b77f0f739 with -fsanitize=address; ldd confirms libasan.so.8 linked.

phase	scenario
A	pgbench `-S` 8c × 20s × {off, stats, trace}
B	pgbench TPC-B 8c × 20s × {stats, trace}
C	pgbench `-C -S` connect-storm 8c × 15s × trace
D	SELECT FOR UPDATE 16c × 15s × trace
E	pg_terminate_backend storm

Postmaster log: no ASAN issues detected (no heap-use-after-free, no buffer-overflow, no UB).

Client-side pgbench leak warnings are pre-existing pg_strdup / lookupCreateVariable leaks in upstream pgbench code, unrelated to this patch.

PostgreSQL already exposes the *current* wait event of each backend in pg_stat_activity, but there is no in-core way to attribute waits across time: no per-backend counts, no duration histograms, no event sequence. Production diagnosis of "what was this backend waiting on?" requires sampling pg_stat_activity at high frequency, which both undercounts short waits and burdens the cluster. This patch adds opt-in instrumentation that records wait events for the calling backend in two modes: * stats -- count, total/avg/max duration, and 32-bucket log2-spaced duration histogram, per (backend, wait-event) pair. Memory cost: ~26 KB/backend (DSA-backed, lazily allocated when capture is first enabled cluster-wide). * trace -- everything in stats, plus a per-session ring buffer of individual events plus query-attribution markers (ExecStart/ExecEnd, QueryStart/QueryEnd). This is the per-session-event-trace style ("10046-style") used by monitoring extensions and on-host workload analyzers. The instrumentation is gated by a new compile-time flag --enable-wait-event-timing. Production builds without the flag pay zero cost: the pgstat_report_wait_start/end inline path's only addition is an `#ifdef USE_WAIT_EVENT_TIMING` block that is compiled out. Builds with the flag compiled in but the wait_event_capture GUC at the default off value pay one well-predicted branch per wait event (the OFF check folds to a const compare against zero, branch-predicted straight-line). New runtime knobs: * wait_event_capture (PGC_SUSET, enum off | stats | trace, default off). Per-session GUC; superusers grant SET ON PARAMETER to extend. Stepping down trace -> off immediately releases the ring; stats counters survive capture toggles until an explicit reset. * wait_event_timing_max_tranches (PGC_POSTMASTER, int, default 192). Cap on distinct LWLock tranches each backend tracks individually. Excess tranches counted against lwlock_overflow_count. * wait_event_trace_ring_size_kb (PGC_POSTMASTER, int, default 4096 KB, power-of-two, range 8..32768). Per- backend trace ring buffer size. Tuneable per deployment: lower at high max_connections to reduce worst-case DSA footprint; raise for longer retention before the FIFO wrap. New SQL surface (all REVOKE'd from PUBLIC, GRANT'd to pg_read_all_stats): * pg_stat_wait_event_timing (view), backed by pg_stat_get_wait_event_timing(pid int4) -- per-backend per-event counts/durations/histograms. * pg_stat_wait_event_timing_overflow (view), backed by pg_stat_get_wait_event_timing_overflow(pid int4) -- counts of waits dropped to overflow (LWLock-hash full, event-class flat array full) per backend. * pg_backend_wait_event_trace (view), backed by pg_get_backend_wait_event_trace() -- this session's own trace ring, materialised. * pg_get_wait_event_trace(procnumber int4) (SRF) -- the canonical cross-backend reader. Reads OWNED and ORPHANED slots uniformly using a position-encoded identity seqlock against shared memory. * pg_stat_reset_wait_event_timing(pid int4) and pg_stat_reset_wait_event_timing_all() -- reset counters for one backend or cluster-wide. Uses a generation- counter retry protocol that coalesces concurrent resets without losing in-flight wait observations. * pg_stat_clear_orphaned_wait_event_rings() (superuser) -- explicit DBA sweep that releases orphaned trace rings from backends that exited while wait_event_capture = trace. Trace-ring lifecycle and orphan semantics: At backend exit the per-session trace ring is NOT freed. The slot transitions OWNED -> ORPHANED and the ring stays allocated in DSA so cross-backend readers (the in-tree pg_get_wait_event_trace and any extension that follows the snapshot pattern documented on WaitEventTraceControl) can read the dying backend's final waits. This is the one observable scenario that motivated the lifecycle work: parallel workers exit in milliseconds at end-of-parallel- query, well below any monitoring tool's polling interval, and without orphan-persistence their final waits are lost. Orphans are reclaimed lazily in three places: (1) clear_orphan_at_init runs at every new backend's InitProcess and frees the prior orphan whose procNumber this backend inherits; this drains orphans naturally as connection churn cycles slots. (2) The session that disabled tracing releases its own ring synchronously from the GUC assign hook. (3) The DBA-driven sweep pg_stat_clear_orphaned_wait_ event_rings handles the pathological case of capture briefly enabled then disabled on a cluster with long-lived pooled connections that never recycle procNumbers. Worst-case orphan footprint is bounded by NUM_WAIT_EVENT_TIMING_SLOTS * wait_event_trace_ring_size_kb (default 4 MB; ~400 MB at max_connections=100, ~4 GB at 1000). Hot-path design: pgstat_report_wait_start/end remain inline. The new gate if (unlikely(wait_event_capture != WAIT_EVENT_CAPTURE_OFF)) steers the predictor to the no-op fall-through. Inside, the per-backend slot pointer is lazy-attached with a one- shot in_attach re-entrancy guard so a nested wait inside the DSA-attach internal LWLock cannot recurse. The trace-ring writer is lock-free for the owning backend (single-writer claim via plain read+write of write_pos, avoiding the LOCK XADD that pg_atomic_fetch_add_u64 would emit) and uses a per-record position-encoded identity seqlock (rec->seq = pos*2+1 mid-write, pos*2+2 complete) to coordinate with cross-backend readers. The identity encoding rejects all four failure modes -- stale prior cycle, mid-write current cycle, wrapped-past-us next cycle, and torn write completed mid-read -- which a parity-only seqlock would miss on weak-memory archs. The writer hot path also coordinates with the GUC assign hook's release_slot via a per-backend wait_event_trace_writes_disabled gate so a dsa_free in the release path can't UAF the ring through a nested wait event. LWLock-timing storage: Each backend's per-tranche histograms live in a small open-addressing hash table sized at server start by wait_event_timing_max_tranches. Hits are O(1) amortised; the probe count per lookup is bounded so a pathological fill never spins. Tranches encountered after the table fills are counted against lwlock_overflow_count and not individually timed. Stub build (without --enable-wait-event-timing): Backing variables for all four GUCs are defined unconditionally so the GUC framework has somewhere to store the value. The check hook for wait_event_capture rejects non-off values (warning + force to off at server start, hard error at SET) so misconfigurations are diagnosed loudly. SQL stubs of all SRFs return empty results; the reset/clear functions ereport ERRCODE_FEATURE_NOT_SUPPORTED. Marker entry points (wait_event_trace_query_start etc.) are no-op stubs so call sites in execMain.c, backend_status.c, and postgres.c do not need #ifdef guards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

monitoring.sgml gets three new view entries (pg_stat_wait_event_timing, pg_stat_wait_event_timing_overflow, pg_backend_wait_event_trace) and new function entries under monitoring-stats-funcs (pg_get_wait_event_trace, pg_stat_reset_wait_event_timing, pg_stat_reset_wait_event_ timing_all, pg_stat_clear_orphaned_wait_event_rings), plus a section on the cross-backend snapshot protocol that monitoring extensions can use to read another backend's trace ring without going through the session-local SQL view. config.sgml gets the three new GUC entries (wait_event_capture, wait_event_timing_max_tranches, wait_event_trace_ring_size_kb) in the cumulative statistics section, including memory-footprint guidance for sizing wait_event_trace_ring_size_kb against max_connections. installation.sgml documents the --enable-wait-event-timing configure flag, including the hot-path zero-cost guarantee in builds that compile the feature in but leave wait_event_capture = off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression tests ---------------- src/test/regress/sql/wait_event_timing.sql exercises the GUC, the timing/overflow/trace SQL surface, and the histogram taxonomy view. Two expected outputs are provided: * wait_event_timing.out -- for --enable-wait-event-timing builds, asserts that a SET wait_event_capture = stats followed by pg_sleep(0.1) lands a PgSleep entry with non-zero counts, durations, and a valid 32-bucket histogram whose entries sum to the visible call count (call-count invariant: count == sum(histogram)). * wait_event_timing_1.out -- for stub builds, asserts the SET fails with the expected feature-not-supported error and the SQL functions return empty results. The parallel_schedule places wait_event_timing in its own group; the test cannot share state with other tests, and the two-phase reset_count invariant breaks if a foreign session SETs the GUC between the test's two stat-snapshot queries. Stress / TAP coverage --------------------- src/test/modules/test_wait_event_stress wraps a small C helper (stress_wait_events) that exercises the LWLock-timing hash and the wait-event flat array, with deterministic counts. The module's SQL regression script checks the write_pos == sum(per_backend_calls) invariant under load. Two TAP tests cover the lifecycle paths that the regression harness cannot exercise (asynchronous backend exits, parallel workers, ring wraparound): * t/001_orphan_roundtrip.pl -- four scenarios for the trace-ring orphan lifecycle: 1. Plain-backend orphan roundtrip (writer exits while capture is trace, reader observes the ORPHANED ring, admin sweep clears it). 2. Parallel-worker orphan roundtrip (forced-parallel query with shm_mq tuple flow, asserts the admin sweep reclaims leader + worker orphans). 3. OWNED-slot read with a concurrent live writer (verifies cross-backend reads against an actively- written ring return only well-formed rows). 4. clear_orphan_at_init reclaim (writer disconnects, many new backends spawn until one inherits the writer's procNumber, verifies the inheritor's InitProcess freed the orphan). Bounded by the ProcGlobal->freeProcs FIFO depth. * t/002_ring_wrap.pl -- provisions a cluster with the smallest legal wait_event_trace_ring_size_kb (8 KB = 256 records), drives a writer through enough waits to wrap many times, and asserts the session-local and cross-backend readers both return at most ring-size well-formed records with post-wrap seq values. Verifies the writer's `pos & ring_mask` indexing and the per- record identity seqlock survive the wrap regime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the configure flag to a single Linux task in .cirrus.tasks.yml so the timing/trace code paths get buildfarm-equivalent coverage on every push. The other CI tasks continue to build without the flag, so the stub-build path (and its alternate-output regression test) remains exercised at every push as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DmitryNFomin force-pushed the wait-event-timing branch 4 times, most recently from 7069864 to 7cadcee Compare March 24, 2026 10:14

NikolayS mentioned this pull request Mar 25, 2026

PoC: USDT static tracepoints for wait event tracing NikolayS/postgres#18

Open

DmitryNFomin force-pushed the wait-event-timing branch from 46594e4 to 6497681 Compare April 25, 2026 13:32

DmitryNFomin closed this Apr 25, 2026

DmitryNFomin reopened this Apr 25, 2026

DmitryNFomin force-pushed the wait-event-timing branch from d2013fd to 545deee Compare May 12, 2026 18:13

DmitryNFomin force-pushed the wait-event-timing branch from 545deee to e4d0433 Compare May 12, 2026 18:33

DmitryNFomin force-pushed the wait-event-timing branch from e4d0433 to 7175ef1 Compare May 12, 2026 19:01

DmitryNFomin force-pushed the wait-event-timing branch from 7175ef1 to 6373a5b Compare May 12, 2026 19:58

DmitryNFomin force-pushed the wait-event-timing branch from 6373a5b to 74cbdbb Compare May 12, 2026 20:28

DmitryNFomin force-pushed the wait-event-timing branch from 74cbdbb to 08ad420 Compare May 13, 2026 11:01

DmitryNFomin force-pushed the wait-event-timing branch from 08ad420 to 8cbbf43 Compare May 13, 2026 12:33

DmitryNFomin force-pushed the wait-event-timing branch 4 times, most recently from febf636 to a6016e7 Compare May 13, 2026 15:49

DmitryNFomin force-pushed the wait-event-timing branch 4 times, most recently from 68bdde2 to d1553b9 Compare May 14, 2026 12:01

DmitryNFomin force-pushed the wait-event-timing branch 5 times, most recently from 902df44 to adc517e Compare May 15, 2026 07:10

DmitryNFomin and others added 4 commits May 15, 2026 15:42

DmitryNFomin force-pushed the wait-event-timing branch from eac2465 to 8b77f0f Compare May 15, 2026 15:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add wait_event_timing: Oracle-style wait event instrumentation#1

Add wait_event_timing: Oracle-style wait event instrumentation#1
DmitryNFomin wants to merge 4 commits into
masterfrom
wait-event-timing

DmitryNFomin commented Mar 22, 2026 •

edited

Loading

Uh oh!

DmitryNFomin commented May 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DmitryNFomin commented Mar 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

What this patch adds

GUCs

SQL surface

Design notes

Off-mode codegen

Lazy DSA attach

Trace ring lifecycle

Lock-free reads

Measured overhead

Commits

Files changed

Open questions for review

Related work

Uh oh!

DmitryNFomin commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Full regression battery for 8b77f0f

1. pgbench -S (read-only, 16c/8j/30s, 4 GB shared_buffers) — 5 runs each

2. pgbench TPC-B (16c/8j/30s) — 5 runs each

3. evict_stress — new high-wait-event-transition-rate suite

4. Stress regression (stress2.sh, T1–T6)

5. Additional stress (extra_stress.sh, E1–E7)

6. Streaming replication (repl_test.sh)

7. check-world (full upstream regression suite)

8. ASAN — rebuilt with new source

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DmitryNFomin commented Mar 22, 2026 •

edited

Loading

DmitryNFomin commented May 15, 2026 •

edited

Loading

Full regression battery for `8b77f0f`

1. pgbench `-S` (read-only, 16c/8j/30s, 4 GB shared_buffers) — 5 runs each