Add wait_event_timing: Oracle-style wait event instrumentation#1
Add wait_event_timing: Oracle-style wait event instrumentation#1DmitryNFomin wants to merge 4 commits into
Conversation
Add section to Performance documenting the proposed PostgreSQL patch (DmitryNFomin/postgres#1) that adds Oracle-style internal wait event instrumentation with near-zero overhead (<1%) vs hardware watchpoints (6-29%). Benchmark results on same environment (Hetzner cx43, 8 vCPU): - TPC-B: <1% vs 6% (hardware watchpoint) - SELECT-only worst case: <1% vs 29% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7069864 to
7cadcee
Compare
…ocation The WaitEventTraceState array was 4 MB per backend allocated unconditionally at startup via ShmemInitStruct (400 MB at max_connections=100). Replace with DSA-backed lazy allocation: a small WaitEventTraceControl struct (~8 bytes per backend) lives in fixed shmem, and the 4 MB ring buffer is allocated via dsa_create/dsa_allocate only when a backend SETs wait_event_trace = on. Key changes: - Add WaitEventTraceControl with dsa_handle, LWLock, per-backend dsa_pointers - Add LWTRANCHE_WAIT_EVENT_TRACE_DSA for the control LWLock - Lazy DSA creation on first use (any backend), attach on subsequent backends - GUC assign hook (assign_wait_event_trace) triggers allocation on enable - Cleanup via before_shmem_exit callback (runs before dsm_backend_shutdown detaches DSA segments, avoiding the use-after-detach crash that would occur if dsa_free ran from ProcKill's on_shmem_exit callback) - Fix pg_stat_get_wait_event_trace proisstrict: was defaulting to true, causing NULL argument (own backend) to skip the C function entirely Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Address all 20 issues identified by cross-checking Gemini 3.1, Opus 4.6, and GPT 5.4 reviews against the actual PR code. Each fix follows the approved plan in crosscheck_review_fix_3.md. Security / correctness fixes: Issue #1 (UAF in trace ring reads): - Remove pg_stat_get_wait_event_timing_by_query() entirely (OID 9951, view, docs, tests) -- cross-backend DSA ring scanning was the primary use-after-free surface. - Restrict pg_stat_get_wait_event_trace() to own-session only; the backend_id argument is accepted for forward compat but ignored. - Wrap DSA pointer mutations (attach/detach) in WaitEventTraceCtl->lock LW_EXCLUSIVE to protect external readers and fix 32-bit torn pointers (also resolves Issue #11). - Document the ring buffer as an external lock-free transport with a seqlock reader contract in wait_event_timing.h and monitoring.sgml. Issue #6 (cross-backend reset race): - Add per-backend LWLock to WaitEventTimingState, following the pgstat_reset_entry() pattern from pgstat_shmem.c. - Writers acquire LW_SHARED on the hot path; reset acquires LW_EXCLUSIVE. - Init and exit paths individually zero fields instead of memset on the whole struct, preserving the LWLock. - Register LWTRANCHE_WAIT_EVENT_TIMING in lwlocklist.h. Build / startup fixes: Issue #2: Replace non-existent wait_classes.h include with wait_event_types.h in wait_event_timing.h. Issue #3 (config-file trace broken): Add explicit wait_event_trace_attach() calls after ResourceOwner init in auxprocess.c (after CreateAuxProcessResourceOwner) and postinit.c (after process_settings) to handle inherited GUC values that do not re-fire assign hooks after fork. Issue #10 (non-timing build silent suppression): Emit WARNING before forcing wait_event_timing/wait_event_trace to off in check hooks for non-timing builds (source < PGC_S_INTERACTIVE path). Issue #14 (EXEC_BACKEND init order): Add comment in pgstat_set_wait_event_timing_storage() confirming SubPostmasterMain calls CreateSharedMemoryAndSemaphores() before InitProcess(). Query lifecycle / phase separation: Issue #4 (QUERY_END broken for multi-statement): - Hybrid approach: emit QUERY_END for old query and QUERY_START for new query in pgstat_report_query_id() transitions (backend_status.c). - Keep existing QUERY_END in send_ready_for_query (postgres.c) for the last-query-to-idle boundary. - Covers simple single/multi-stmt, extended, prepared, and pipeline protocols without double-emit. Issue #5 (phase separation): Add TRACE_EXEC_END marker (type=4) emitted from ExecutorEnd() in execMain.c, separating execution from post-exec cleanup (commit, WAL flush, lock release). Also resolves Issue #13 (extended protocol phase misattribution). Overflow and dependency warnings: Issue #7: One-time WARNING on first LWLock hash overflow (>192 tranches). Issue #8: WARNING in assign_wait_event_trace when track_activities is off; document the dependency in config.sgml. Issue #12: Add flat_overflow_count field + one-time WARNING when eventId exceeds per-class slot count. Header cleanup and code generation: Issue #15: Move static const arrays and internal inline functions from wait_event_timing.h to wait_event_timing.c as file-static. Convert trace marker inlines to extern functions with a shared wait_event_trace_write_marker() helper. Issue #16 (parallel array validation): Extend generate-wait_event_types.pl to emit wait_event_timing_data.c (mapping arrays) and the three sizing #defines into wait_event_types.h, generated from wait_event_names.txt. Eliminates hand-maintained parallel arrays entirely. Update Makefile, meson.build, and pgindent exclusions. Documentation: Issue #17: Document InitMaterializedSRF usage as acceptable for both the timing SRF (bounded result set) and trace SRF (own-session only, deferred max_records parameter). Issue #18: Document ring wrap limitation, polling contract, and st_query_id fallback in wait_event_timing.h and monitoring.sgml. Issue #19 nits: Remove duplicate header include, remove install/ entries from .gitignore, add tranche_id 65535 academic comment, add eventId-equals-array-index comment, verify OID range (9950/9952/9953). Testing: Issue #20: - Add trace lifecycle regression test (off/on/query/read/off cycle). - Add auxiliary process timing test (verify checkpointer/bgwriter/ walwriter accumulate non-zero wait event counts). - Add test_lwlock_hash_overflow() to test_wait_event_stress module: registers >192 custom LWLock tranches to exercise hash overflow path and verify WARNING emission. - Document that default Cirrus CI already exercises non-timing build path via wait_event_timing_1.out alternate expected output. Issues resolved by other fixes (no separate action needed): Issue #9: Resolved by Issue #1 (by_query view removed). Issue #11: Resolved by Issue #1C (LWLock around DSA pointer mutations). Issue #13: Resolved by Issue #5 (EXEC_START/EXEC_END markers). Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
46594e4 to
6497681
Compare
Four small, independently-justified fixes from the consolidated review in review_5.md. All verified by `make check` (249/249 passing, including the wait_event_timing regression test). #1 Stale docstring on pg_stat_reset_wait_event_timing (wait_event_timing.c) Two stacked docstrings: the first described an obsolete backend_id-based API ("0 = own, -1 = all") that does not exist; the function takes a PID. Merge them into one block that uses the correct PID semantics and preserves the async-protocol + reset_count-polling guidance from the obsolete one. #2 Spurious wait_event_* includes in ipci.c Two #includes (wait_event.h, wait_event_timing.h) referenced no symbols in ipci.c. The WaitEvent*ShmemCallbacks externs come via storage/subsystems.h -> subsystemlist.h. Removing both leaves ipci.c building cleanly and shrinks its include graph. #3 Missing WARNING for wait_event_capture=trace + compute_query_id=off (wait_event_timing.c, config.sgml) assign_wait_event_capture already warns when track_activities is off; mirror the same warning for compute_query_id, since markers silently skip when query_id is 0. Verified live: setting compute_query_id=off then wait_event_capture=trace now produces the expected WARNING + HINT pointing at pg_stat_statements as a common enabler. #4 Trace doc only described QUERY markers; EXEC markers were silent (config.sgml, monitoring.sgml) The trace ring emits four marker types, not two. ExecStart/ ExecEnd bracket every executor invocation and are NOT gated on track_activities, while QueryStart/QueryEnd cover inter-statement and idle-transition boundaries that the executor cannot. Restate the prereqs in config.sgml as an itemized list per family, update monitoring.sgml's view description and consumer guidance to name ExecStart as the primary attribution signal and QueryStart as the fallback for boundaries the executor cannot mark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable from the patch, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. Two new SQL functions are added in src/backend/utils/activity/ wait_event_timing.c (oids 9962, 9963): pg_get_wait_event_trace(procnumber int4) Canonical cross-backend reader. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or slots whose generation counter kept changing during the snapshot (very rare slot reassignment race -- bounded retry, then empty). pg_get_wait_event_trace_for_pid(pid int4) Convenience wrapper that resolves pid -> procNumber via BackendPidGetProc(). Cannot read ORPHANED slots by design: a dying backend's pid is removed from the proc array on exit, so the pid -> procNumber mapping does not survive backend exit. Callers who want post-mortem readability use pg_get_wait_event_trace directly with procNumber. Both functions implement the cross-backend reader protocol documented on WaitEventTraceControl in src/include/utils/wait_event_timing.h: snapshot the slot's generation, read slot state atomically, acquire WaitEventTraceCtl->lock in LW_SHARED, resolve ring_ptr, memcpy the records[] array into a local palloc'd buffer (~4 MB upper bound), release the lock, recheck generation, retry on mismatch, then process records via the per-record seqlock protocol off the lock. Shared helper emit_wait_event_trace_for_procnumber() factors the common snapshot+emit code; the two SRFs are thin wrappers. Privilege model: REVOKE EXECUTE from PUBLIC and GRANT to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace (established in review_5.md issue #27, commit 34015c1). Wait-event traces are stats-class data and the symmetric privilege story is the least surprising to monitoring tooling. Test coverage: * src/test/regress/sql/wait_event_timing.sql: smoke tests that both SRFs are callable from SQL and produce sensible results for own session, unknown pid (empty), and out-of-range procnumbers (empty). Both .out variants (timing and stub builds) updated. * src/test/modules/test_wait_event_stress/t/ 001_orphan_roundtrip.pl: end-to-end TAP test that exercises the orphan-readable code path. A writer session enables trace, generates wait events, captures its procnumber, and disconnects; a separate long-lived reader session then calls pg_get_wait_event_trace on the writer's procnumber and asserts the records are visible after the writer has exited. Verifies pg_stat_clear_orphaned_wait_event_rings() releases the orphan and that a subsequent read returns empty. The test keeps a dedicated reader BackgroundPsql open for the entire run; this prevents the reader's own procNumber from being a recycle target for the writer's procNumber, which would otherwise trigger wait_event_trace_clear_orphan_at_init() and reclaim the orphan before the test can read it. pg_sleep calls are wrapped in a DO block with PERFORM so they do not emit empty result rows that would contaminate the captured procnumber. pg_stat_get_ wait_event_timing exposes backend_id (procNumber + 1 by PG convention), so the test subtracts 1 before passing to pg_get_wait_event_trace. Documentation: new entries in monitoring.sgml under monitoring-stats-funcs for both SRFs, plus a clarifying update to the pg_get_backend_wait_event_trace entry that points readers at the new cross-backend SRFs. Catversion bump for the two new pg_proc OIDs. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-wait-event-timing --enable-debug --enable-cassert --enable-tap-tests: full src/test/regress suite (249 tests) passes, test_wait_event_stress regress passes, and the new TAP test passes (6/6 subtests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Implements the documented snapshot pattern: sample slot->generation, atomic state read, LW_SHARED on WaitEventTraceCtl->lock, dsa_get_address, palloc only the slice actually used by write_pos and memcpy under lock, release lock, re-sample generation, retry on mismatch (bounded by WET_READ_MAX_RETRIES=3, then NOTICE + empty so callers can distinguish a stable empty result from a lost race), then per-record seqlock processing off the lock. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow both used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions (post-PG-17 procnumber introduction). Defensive fix in wait_event_trace_attach (review_6.md issue #5 collateral): The Assert(state_now != WET_TRACE_SLOT_ORPHANED) at attach time was too strict. A backend can legitimately observe its own slot in the ORPHANED state if a later before_shmem_exit callback (pgstat_io_flush_cb during proc_exit shutdown) contends on an LWLock, which emits a wait event that lands in pgstat_report_wait_end_timing -> wait_event_trace_attach after wait_event_trace_before_shmem_exit has already run and transitioned the slot. The Assert crashes parallel workers reliably on exit when capture=trace. Replaced with defensive fall-through: if ORPHANED is observed at attach time, do not re-attach (the ring is now post-mortem; the writer invariant must hold). This adds in-test coverage for the issue #5 case via the parallel-worker scenario below. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via DO block with explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Two scenarios: (1) Plain-backend orphan roundtrip: writer enables trace, emits waits, captures its procnumber, disconnects; long-lived reader (separate session, pinning its own slot to prevent procnumber recycle of the writer's slot) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases it; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation): a forced-parallel query (debug_parallel_ query=on plus zero parallel costs) runs and exits; the leader and at least one worker each leave an ORPHANED slot. The reader iterates procnumber 0..150 and asserts at least two slots have readable records. Race-hardening: max_connections=100 in the test cluster plus a permanently-OWNED reader slot minimise the window in which an unrelated new backend could recycle a just-exited writer's procnumber and trigger wait_event_trace_clear_orphan_at_init(). Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (7 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d2013fd to
545deee
Compare
…_trace_attach
The Assert(state_now != WET_TRACE_SLOT_ORPHANED) inside
wait_event_trace_attach assumed that ORPHANED could never be
observed at attach time, on the reasoning that
pgstat_set_wait_event_timing_storage() at backend init calls
wait_event_trace_clear_orphan_at_init() which demotes any
inherited orphan to FREE before any wait-event capture path can
run.
That invariant holds at backend INIT. It does NOT hold during
backend EXIT: after wait_event_trace_before_shmem_exit (registered
as a before_shmem_exit callback) has transitioned the slot to
ORPHANED, a LATER before_shmem_exit callback can contend on an
LWLock that emits a wait event. In particular, pgstat_io_flush_cb
during proc_exit shutdown takes shared locks that may block under
load; LWLockAcquire's pgstat_report_wait_start/end machinery then
lands in pgstat_report_wait_end_timing -> wait_event_trace_attach
after my_wait_event_trace has been cleared. The Assert fires and
the backend aborts.
This is reliably reproducible with parallel workers: a forced-
parallel query under wait_event_capture = trace produces multiple
workers that all exit at end-of-parallel-query, each of which hits
the Assert during shmem_exit.
Replace the Assert with a defensive runtime check: if state ==
ORPHANED at attach time, skip the re-attach. We are dying, the
ring is now post-mortem data for cross-backend readers, and the
single-writer invariant must hold; the wait event that triggered
this attach path is not traced (a fully acceptable tradeoff -- we
are in the middle of shutting down).
review_6.md issue #5 asked for exactly this kind of defense-in-
depth at module boundaries; that review framed it as a style nit
("Assert is debug-only, defensive runtime check is better"), but
the parallel-worker shutdown sequence promotes it to a correctness
bug that crashes the backend in cassert builds.
Test coverage: the parallel-worker scenario in
src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl
(added in the follow-up commit for review_6.md issue #1)
exercises this exact shutdown path; the same TAP test crashed the
parallel worker before this fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Implements the documented snapshot pattern: sample slot->generation, atomic state read, LW_SHARED on WaitEventTraceCtl->lock, dsa_get_address, palloc only the slice actually used by write_pos and memcpy under lock, release lock, re-sample generation, retry on mismatch (bounded by WET_READ_MAX_RETRIES = 3, then NOTICE + empty so callers can distinguish a stable empty result from a lost race), then per-record seqlock processing off the lock. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions (post-PG-17 procnumber introduction). Retry-loop buffer sizing: emit_wait_event_trace_for_procnumber pallocs local_records sized for the slice actually needed (min(write_pos, RING_SIZE)) rather than always 4 MB. On retry, if a different owner has taken the slot with a larger write_pos, the buffer is pfree'd and re-allocated at the new size; the buffer is never shrunk across retries. This avoids both the 4 MB-per-call cost on nearly-empty rings (the common case for short-lived backends and post-mortem orphan reads) and the buffer-overflow that would occur if we kept a too-small buffer across a size-growing retry. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Two scenarios: (1) Plain-backend orphan roundtrip. A writer enables trace, emits waits, captures its procnumber via the procnumber column of pg_stat_get_wait_event_timing, disconnects. A long-lived reader session (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings() releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; the leader and at least one worker each leave an ORPHANED slot. We count them via the admin sweep -- cheaper and more explicit than iterating procnumbers and pg_get_wait_event_trace'ing each one (the read path itself is already covered by scenario 1). Race-hardening: max_connections=100 in the test cluster plus a permanently-OWNED reader slot minimise the window in which an unrelated new backend could recycle a just-exited writer's procnumber and trigger wait_event_trace_clear_orphan_at_init(). Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (7 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
545deee to
e4d0433
Compare
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session- local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Cheap unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration (the generation-counter retry that external readers would do becomes unnecessary for this in-tree caller). 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. Iterate every live ring index, applying the per-record seqlock protocol AGAINST SHARED MEMORY: read seq_before, barrier, structure-copy the record into a local result buffer, barrier, read seq_after. Skip records with odd seq_before (writer mid-record at start of read) or with seq_before != seq_after (writer completed a write cycle during our read; payload may be torn). 5. Release the lock. 6. Walk the local result buffer and emit rows into the tuplestore -- the only potentially I/O-expensive part, deliberately done outside the lock so spills do not extend lock-hold. Why per-record seqlock against shared memory (not against a memcpy'd local copy of the ring): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES, with the payload read in between. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time, the check degenerates to a no-op, and torn reads from concurrent writers cannot be detected. For an own-session reader (pg_get_backend_wait_event_trace) the concurrent-writer case does not arise; the cross-backend reader must do the protocol correctly to handle live OWNED slots safely. Lock-hold is O(records_in_range) in the iterator (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O; the tuplestore work runs after release. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via a DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures its procnumber from the procnumber column of pg_stat_get_wait_event_timing, disconnects. A long-lived reader (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; leader and at least one worker each leave an ORPHANED slot, verified via the admin sweep count. (3) OWNED-slot read with a concurrent live writer. A long-lived writer emits a steady stream of PgSleep events in a 1000-iteration loop while the reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type or wait_event, no negative duration); a single malformed row would indicate the seqlock failed to detect a torn write. This is the scenario the per-record seqlock protocol was rewritten to handle correctly. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (10 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e4d0433 to
7175ef1
Compare
pg_atomic_write_u64 has no barrier semantics (see the header comment in atomics.h: "No barrier semantics."), so on weak-memory architectures (ARM, POWER) the writer's pg_atomic_write_u64 to write_pos can be reordered after the subsequent plain stores to rec->seq. A cross-backend reader observing the new write_pos would then iterate i = pos and read rec[pos & MASK].seq while it still holds the PREVIOUS ring cycle's even value. The reader's seqlock check would falsely pass (the seq is stable across the read because the writer hasn't yet bumped it to odd from this cycle's perspective) and the reader would emit a stale-but-valid record from the previous ring cycle, mislabeled with the new ring index. On x86 this is masked by TSO; on ARM/POWER it is a real data- visibility bug that manifests as the cross-backend reader returning rows tied to the wrong ring positions. Add pg_write_barrier between the pg_atomic_write_u64(&write_pos, pos+1) and the first rec->seq = odd store at both writer sites (the wait-event hot path and the marker writer). This is the release half of an acquire/release pair; the matching pg_read_barrier on the reader side lives in emit_wait_event_trace_for_procnumber where the cross-backend SRF reads write_pos before iterating the per-record seq fields. This barrier is on the per-event hot path but adds only a CPU barrier (no atomic RMW, no cache-line trip), so the cost is within the noise of the existing seqlock writes already done by the function. No new tests: this is a weak-memory ordering fix that x86-only testing cannot exhibit. Catching it would require ARM/POWER stress testing of the cross-backend reader against an actively writing backend (the TAP scenario "OWNED-slot read with a concurrent live writer" added in the follow-up commit for review_6.md issue #1 is the closest thing in tree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. pg_read_barrier() to pair with the writer's pg_write_barrier between its pg_atomic_write_u64(&write_pos) and the rec->seq update (added in the predecessor commit). Without this acquire half of the pair, weak-memory readers could observe the new write_pos while still seeing the previous ring cycle's even seq value and emit stale-but- valid records with wrong ring indices. 5. Iterate every live ring index, applying the per-record seqlock protocol AGAINST SHARED MEMORY: read seq_before, barrier, structure-copy the record into a local result buffer, barrier, read seq_after. Skip records with odd seq_before (writer mid-record at start of read) or with seq_before != seq_after (writer completed a write cycle during our read; payload may be torn). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore -- the only potentially I/O-expensive part, deliberately done outside the lock so spills do not extend lock-hold. Why per-record seqlock against shared memory (not against a memcpy'd local copy of the ring): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES, with the payload read in between. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time, the check degenerates to a no-op, and torn reads from concurrent writers cannot be detected. For an own-session reader (pg_get_backend_wait_event_trace) the concurrent-writer case does not arise; the cross-backend reader must do the protocol correctly to handle live OWNED slots safely. Lock-hold is O(records_in_range) in the iterator (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O; the tuplestore work runs after release. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via a DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures its procnumber, disconnects. A long-lived reader (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; leader and at least one worker each leave an ORPHANED slot, verified via the admin sweep count. (3) OWNED-slot read with a concurrent live writer. A long-lived writer emits a steady stream of PgSleep events in a 1000-iteration loop while the reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type or wait_event, no negative duration); a single malformed row would indicate the seqlock failed to detect a torn write. This is the scenario the per-record seqlock protocol was written to handle correctly. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (10 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7175ef1 to
6373a5b
Compare
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. pg_read_barrier() to pair with the writer's pg_write_barrier between its pg_atomic_write_u64(&write_pos) and the rec->seq update (added in the predecessor commit). Without this acquire half of the pair, weak-memory readers could observe the new write_pos while still seeing the previous ring cycle's even seq value and emit stale-but- valid records with wrong ring indices. 5. Iterate every live ring index, applying the per-record seqlock protocol AGAINST SHARED MEMORY: read seq_before, barrier, structure-copy the record into a local result buffer, barrier, read seq_after. Skip records with odd seq_before (writer mid-record at start of read) or with seq_before != seq_after (writer completed a write cycle during our read; payload may be torn). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore -- the only potentially I/O-expensive part, deliberately done outside the lock so spills do not extend lock-hold. Why per-record seqlock against shared memory (not against a memcpy'd local copy of the ring): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES, with the payload read in between. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time, the check degenerates to a no-op, and torn reads from concurrent writers cannot be detected. For an own-session reader (pg_get_backend_wait_event_trace) the concurrent-writer case does not arise; the cross-backend reader must do the protocol correctly to handle live OWNED slots safely. Lock-hold is O(records_in_range) in the iterator (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O; the tuplestore work runs after release. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests for pg_get_wait_event_trace: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: a fresh non-pg_read_all_stats role attempting to call the function gets ERROR (caught via a DO block with an explicit insufficient_privilege handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures its procnumber, disconnects. A long-lived reader (pinning its own slot against procnumber recycle) reads the ORPHANED ring and asserts the events are present; pg_stat_clear_orphaned_wait_event_rings releases the orphan; subsequent read returns empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). A forced-parallel query (debug_parallel_query=on plus zero parallel costs) runs and exits; leader and at least one worker each leave an ORPHANED slot, verified via the admin sweep count. (3) OWNED-slot read with a concurrent live writer. A long-lived writer emits a steady stream of PgSleep events in a 1000-iteration loop while the reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type or wait_event, no negative duration); a single malformed row would indicate the seqlock failed to detect a torn write. This is the scenario the per-record seqlock protocol was written to handle correctly. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to the pg_get_backend_wait_event_trace entry pointing readers at the cross-backend SRF. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (10 subtests) pass in the timing build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6373a5b to
74cbdbb
Compare
The lazy-attach hot-path entry points pgstat_wait_event_timing_lazy_ attach() and wait_event_trace_attach() both allocate memory via dsa_attach / dsa_create / dsa_allocate_extended, which ultimately reach MemoryContextAlloc. MemoryContextAlloc Assert-fails on "CritSectionCount == 0 || allowInCritSection", so any backend whose very first wait event after wait_event_capture is enabled lands inside a critical section is killed by the assertion in cassert builds (or risks heap corruption in production builds). Reproducer: a parallel worker that has not yet emitted any wait events scans a heap page, hits BufferSetHintBits16 -> XLogSaveBufferForHint -> XLogInsert (which opens a critical section) -> LWLockAcquire, and the wait-end-timing hook routes to the lazy attach inside the critical section. Reliably triggered by the new "parallel-worker orphan roundtrip" TAP scenario added for review_6.md issue #1. Fix: check CritSectionCount at the top of each lazy-attach function and return early if non-zero. This silently drops the in-flight wait event (it is neither timed nor traced) but keeps the backend alive; the next wait event outside any critical section retries the attach and succeeds, after which the hot path no longer routes through the lazy-attach function. Critical sections are brief by design (a few dozen instructions typically), so the dropped wait events are themselves a few- nanosecond window of unobservable activity -- well within the quantisation noise of any wait-event-based profiler. This matches the behaviour PG already has for other "skipped under critical section" optimisations. Both timing and trace lazy attaches are fixed symmetrically; the two functions share a near-identical hot path entry shape and both need the same guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 3. Resolve ring_ptr via dsa_get_address, read write_pos. 4. pg_read_barrier() to pair with the writer's pg_write_barrier() between its pg_atomic_write_u64(&write_pos) and the rec->seq update (added in the predecessor commit). 5. Iterate every live ring index, applying the per-record POSITION-ENCODED IDENTITY SEQLOCK against shared memory: expected_seq = (uint32)(i * 2 + 2); seq_before = rec_shared->seq; pg_read_barrier(); if (seq_before != expected_seq) continue; local_copy = *rec_shared; pg_read_barrier(); seq_after = rec_shared->seq; if (seq_after != expected_seq) continue; The writer encodes ring position into the seq value: position pos uses (pos*2+1) for mid-write and (pos*2+2) for complete. After RING_SIZE writes the slot wraps and the same memory location gets a new numerically-distinct seq. A parity-only seqlock check (skip odd, accept stable even) is INSUFFICIENT for this layout in the cross-backend case: if the writer advanced write_pos but the rec->seq update has not yet propagated to this CPU's cache, the reader sees the previous cycle's stable even seq, both seq_before and seq_after match, and a record belonging to the previous cycle gets emitted with the new ring_index -- silent attribution corruption. The identity-against-expected check rejects all four failure modes (stale prior cycle, mid-write current cycle, wrapped- past-us next cycle, torn write completed mid-read). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore. Tuplestore work (potential disk spill) happens outside the lock so it does not extend lock-hold. Why per-record seqlock against shared memory (not a memcpy'd local copy): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time; the check degenerates to a no-op. Lock-hold is O(records_in_range) shared-memory loads (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: non-pg_read_all_stats role gets ERROR (caught via DO block with explicit handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures procnumber, disconnects. Reader reads ORPHANED ring; admin sweep releases; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). Forced-parallel query with 1M rows and ORDER BY (routes tuples through shm_mq queues, ensuring workers emit MessageQueueSend/MessageQueueReceive waits and lazily allocate their trace rings before exit). Asserts pg_stat_clear_orphaned_wait_event_rings returns >= 2 (leader + at least one worker). Includes a drain- wait loop to avoid racing with worker before_shmem_exit callbacks. (3) OWNED-slot read with a concurrent live writer. Writer emits ~1000 PgSleep waits in a tight loop; reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed (no NULL/empty wait_event_type, wait_event, no negative duration). WaitEventTraceControl protocol documentation in wait_event_timing.h is rewritten to describe the position-encoded identity seqlock (not the previous parity-only memcpy-then-check pattern, which was structurally broken for cross-process readers) so external authors building ASH/AWR background workers have correct guidance. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to pg_get_backend_wait_event_trace entry. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (11 subtests) pass in the timing build (10 stress reruns). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
74cbdbb to
08ad420
Compare
The lazy-attach hot-path entry points pgstat_wait_event_timing_lazy_ attach() and wait_event_trace_attach() both allocate memory via dsa_attach / dsa_create / dsa_allocate_extended, which ultimately reach MemoryContextAlloc. MemoryContextAlloc Assert-fails on "CritSectionCount == 0 || allowInCritSection", so any backend whose very first wait event after wait_event_capture is enabled lands inside a critical section is killed by the assertion in cassert builds (or risks heap corruption in production builds). Reproducer: a parallel worker that has not yet emitted any wait events scans a heap page, hits BufferSetHintBits16 -> XLogSaveBufferForHint -> XLogInsert (which opens a critical section) -> LWLockAcquire, and the wait-end-timing hook routes to the lazy attach inside the critical section. Reliably triggered by the parallel-worker scenario added in the follow-up commit for review_6.md issue #1. Fix: check CritSectionCount at the top of each lazy-attach function and return early if non-zero. This silently drops the in-flight wait event (it is neither timed nor traced) but keeps the backend alive; the next wait event outside any critical section retries the attach and succeeds, after which the hot path no longer routes through the lazy-attach function. Critical sections are brief by design, so the dropped wait events are themselves a few-nanosecond window of unobservable activity -- well within the quantisation noise of any wait- event-based profiler. Both timing and trace lazy attaches are fixed symmetrically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Allocate worst-case result buffer BEFORE taking the lock so the palloc -- which can bottom out in a glibc mmap syscall for the ~5 MB worst-case size -- runs without holding the WaitEventTraceCtl lock. 3. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 4. Resolve ring_ptr via dsa_get_address, read write_pos. 5. Iterate every live ring index, applying the per-record POSITION-ENCODED IDENTITY SEQLOCK against shared memory: expected_seq = (uint32)(i * 2 + 2); seq_before = rec_shared->seq; pg_read_barrier(); if (seq_before != expected_seq) continue; local_copy = *rec_shared; pg_read_barrier(); seq_after = rec_shared->seq; if (seq_after != expected_seq) continue; The writer encodes ring position into the seq value: position pos uses (pos*2+1) for mid-write and (pos*2+2) for complete. After RING_SIZE writes the slot wraps and gets a new numerically-distinct seq. A parity-only seqlock check (skip odd, accept stable even) is INSUFFICIENT for this layout in the cross-backend case: if the writer advanced write_pos but the rec->seq update has not yet propagated to this CPU's cache, the reader sees the previous cycle's stable even seq, both seq_before and seq_after match, and a record belonging to the previous cycle gets emitted with the new ring_index -- silent attribution corruption. The identity-against-expected check rejects all four failure modes (stale prior cycle, mid-write current cycle, wrapped- past-us next cycle, torn write completed mid-read). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore. Tuplestore work (potential disk spill) happens outside the lock so it does not extend lock-hold. Why per-record seqlock against shared memory (not a memcpy'd local copy): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time; the check degenerates to a no-op. Lock-hold is O(records_in_range) shared-memory loads (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O and no syscalls. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. A writer-side INJECTION_POINT("wait-event-trace-after-write-pos") is added between the pg_atomic_write_u64(&write_pos) and the rec->seq update; it is compiled out unless --enable-injection- points is set, and is intended as a hook for a future regression test of the position-encoded identity seqlock. Wiring a TAP test through this is fiddly enough to defer to a follow-up; the identity check is correct by construction and the protocol is fully documented on WaitEventTraceControl. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: non-pg_read_all_stats role gets ERROR (caught via DO block with explicit handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures procnumber, disconnects. Reader reads ORPHANED ring; admin sweep releases; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). Forced-parallel query with 1M rows and ORDER BY (routes tuples through shm_mq queues, ensuring workers emit MessageQueueSend/MessageQueueReceive waits and lazily allocate their trace rings before exit). Asserts pg_stat_clear_orphaned_wait_event_rings returns >= 2 (leader + at least one worker). Includes a drain- wait loop to avoid racing with worker before_shmem_exit callbacks. (3) OWNED-slot read with a concurrent live writer. Writer emits ~1000 PgSleep waits in a tight loop; reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed. WaitEventTraceControl protocol documentation in wait_event_timing.h is rewritten to describe the position-encoded identity seqlock (not the previous parity-only memcpy-then-check pattern, which was structurally broken for cross-process readers) so external authors building ASH/AWR background workers have correct guidance. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to pg_get_backend_wait_event_trace entry. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (11 subtests) pass in the timing build with 5/5 stress reruns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
08ad420 to
8cbbf43
Compare
…reader review_6.md raised the canonical -hackers objection against the orphan-persistence lifecycle landed in commit 0dba096: the patch preserves data (trace rings whose owning backend has exited) that no in-tree code can actually read. Without an in-tree consumer the new orphan-memory cost is paid for hypothetical external workers, the parallel-worker observability claim is unverifiable, and review_6.md issue #8 (test coverage of the orphan path) is structurally impossible because nothing in tree can exercise it. This commit ships the in-tree consumer. pg_get_wait_event_trace(procnumber int4) Cross-backend reader keyed by procnumber. Reads slots in OWNED state (live writer) and ORPHANED state (writer has exited, ring preserved post-mortem) uniformly. Returns empty for FREE slots, out-of-range procnumbers, or no records. Privileges: REVOKE'd from PUBLIC and GRANT'ed to pg_read_all_stats in system_views.sql, matching the privilege model of the session-local view pg_backend_wait_event_trace. Cross-backend reader protocol implemented in emit_wait_event_trace_for_procnumber(): 1. Unlocked state probe; FREE -> empty result. 2. Allocate worst-case result buffer BEFORE taking the lock so the palloc -- which can bottom out in a glibc mmap syscall for the ~5 MB worst-case size -- runs without holding the WaitEventTraceCtl lock. 3. Acquire WaitEventTraceCtl->lock in LW_SHARED. All slot transitions take LW_EXCLUSIVE, so the slot's identity, state, and ring_ptr are stable for the duration of the iteration. 4. Resolve ring_ptr via dsa_get_address, read write_pos. 5. Iterate every live ring index, applying the per-record POSITION-ENCODED IDENTITY SEQLOCK against shared memory: expected_seq = (uint32)(i * 2 + 2); seq_before = rec_shared->seq; pg_read_barrier(); if (seq_before != expected_seq) continue; local_copy = *rec_shared; pg_read_barrier(); seq_after = rec_shared->seq; if (seq_after != expected_seq) continue; The writer encodes ring position into the seq value: position pos uses (pos*2+1) for mid-write and (pos*2+2) for complete. After RING_SIZE writes the slot wraps and gets a new numerically-distinct seq. A parity-only seqlock check (skip odd, accept stable even) is INSUFFICIENT for this layout in the cross-backend case: if the writer advanced write_pos but the rec->seq update has not yet propagated to this CPU's cache, the reader sees the previous cycle's stable even seq, both seq_before and seq_after match, and a record belonging to the previous cycle gets emitted with the new ring_index -- silent attribution corruption. The identity-against-expected check rejects all four failure modes (stale prior cycle, mid-write current cycle, wrapped- past-us next cycle, torn write completed mid-read). 6. Release the lock. 7. Walk the local result buffer and emit rows into the tuplestore. Tuplestore work (potential disk spill) happens outside the lock so it does not extend lock-hold. Why per-record seqlock against shared memory (not a memcpy'd local copy): the protocol requires seq_before and seq_after to read THE SAME memory location at TWO DIFFERENT TIMES. A single up-front memcpy followed by two reads of seq from the local copy sees the value frozen at copy time; the check degenerates to a no-op. Lock-hold is O(records_in_range) shared-memory loads (similar wall time to a single 4 MB memcpy of the full ring, ~1 ms on modern hardware) with no I/O and no syscalls. Rename of the procnumber-exposing column on the existing SRFs: pg_stat_get_wait_event_timing and pg_stat_get_wait_event_timing_ overflow used to expose backend_id (procNumber + 1, 1-based legacy convention). This patch renames the column to procnumber and makes it 0-based so the value is directly usable as the argument to pg_get_wait_event_trace(). The 0-based form matches modern PG conventions. A writer-side INJECTION_POINT("wait-event-trace-after-write-pos") is added between the pg_atomic_write_u64(&write_pos) and the rec->seq update; it is compiled out unless --enable-injection- points is set, and is intended as a hook for a future regression test of the position-encoded identity seqlock. Wiring a TAP test through this is fiddly enough to defer to a follow-up; the identity check is correct by construction and the protocol is fully documented on WaitEventTraceControl. Test coverage: * src/test/regress/sql/wait_event_timing.sql - Smoke tests: own session via procnumber-keyed path, out-of-range procnumbers (empty), capture=off case (empty). - Permission gating: non-pg_read_all_stats role gets ERROR (caught via DO block with explicit handler). * src/test/modules/test_wait_event_stress/t/001_orphan_roundtrip.pl Three scenarios: (1) Plain-backend orphan roundtrip. Writer enables trace, emits waits, captures procnumber, disconnects. Reader reads ORPHANED ring; admin sweep releases; subsequent read empty. (2) Parallel-worker orphan roundtrip (the patch's stated motivation). Forced-parallel query with 1M rows and ORDER BY (routes tuples through shm_mq queues, ensuring workers emit MessageQueueSend/MessageQueueReceive waits and lazily allocate their trace rings before exit). Asserts pg_stat_clear_orphaned_wait_event_rings returns >= 2 (leader + at least one worker). Includes a drain- wait loop to avoid racing with worker before_shmem_exit callbacks. (3) OWNED-slot read with a concurrent live writer. Writer emits ~1000 PgSleep waits in a tight loop; reader calls pg_get_wait_event_trace(writer_procnumber) repeatedly. Every observed row must be well-formed. WaitEventTraceControl protocol documentation in wait_event_timing.h is rewritten to describe the position-encoded identity seqlock (not the previous parity-only memcpy-then-check pattern, which was structurally broken for cross-process readers) so external authors building ASH/AWR background workers have correct guidance. Documentation: monitoring.sgml entry for pg_get_wait_event_trace under monitoring-stats-funcs, including the SQL one-liner for pid-keyed lookup against live backends, plus a clarifying update to pg_get_backend_wait_event_trace entry. Catversion bump for the new pg_proc OID and the procnumber column rename. Tested on Rocky Linux 9.7 + gcc 11.5 with --enable-cassert and --enable-tap-tests in both --enable-wait-event-timing and stub build configurations: src/test/regress (249 tests) passes in both, test_wait_event_stress regress and TAP (11 subtests) pass in the timing build with 5/5 stress reruns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
febf636 to
a6016e7
Compare
…raph
The monitoring.sgml entry for pg_backend_wait_event_trace
contained a slot-lifecycle paragraph claiming that orphan-
persistence is "what makes the view above usable for parallel-
worker observability." The reviewer correctly pointed out that
the view is session-local: pg_backend_wait_event_trace always
reads my_wait_event_trace (the calling backend's own ring) and
never touches an orphan in another procnumber's slot. Orphan-
persistence has zero effect on the view's behaviour.
The actual beneficiaries of orphan-persistence are the cross-
backend interfaces:
* pg_get_wait_event_trace(procnumber) -- the in-tree SRF
added for review_6 issue #1, which reads OWNED and
ORPHANED slots uniformly.
* External background workers (ASH/AWR/10046-style ring
readers) following the cross-backend snapshot pattern
documented on WaitEventTraceControl.
Rewrite the paragraph to:
* Drop the misleading "view above" attribution.
* Add an explicit "This does not change the behaviour of this
view, which always reads the calling backend's own ring and
is unaffected by orphan-state slots belonging to other
procnumbers." sentence, so a reader following the link from
a parallel-worker observability question doesn't form the
wrong mental model.
* Reference pg_get_wait_event_trace by name and link to the
monitoring-stats-funcs section.
* Reattribute the parallel-worker-exit-fast rationale
correctly: it explains *why* orphan-persistence matters in
the first place (the data would otherwise be lost), not why
this view is suddenly different.
No code changes. Doc build (check-tabs, check-nbsp) and main
regress (249 tests) still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
68bdde2 to
d1553b9
Compare
…s/comments
review_6.md issue #9 flagged that documentation comments
over-reference "future ASH/AWR/10046-style background workers"
as justification for the cross-backend trace surface. On
-hackers this reads as speculative -- naming hypothetical
consumers to motivate present code. PG style prefers factual
statements about what the code does, with neutral framing for
consumers ("any extension that reads via the snapshot pattern").
The framing is also now doubly stale: since review_6 was
written we landed pg_get_wait_event_trace (review_6.md issue
#1) as the in-tree cross-backend consumer. The "future" tense
is wrong as well as speculative.
Sweep through wait_event_timing.h, wait_event_timing.c, and
monitoring.sgml. In each speculative reference replace the
"future ASH/AWR/10046-style background workers" phrasing with
neutral framing that:
* names the in-tree consumer (pg_get_wait_event_trace) where
that is what callers should use today,
* refers to "extensions following the snapshot pattern
documented on WaitEventTraceControl" for the
direct-shared-memory case,
* drops the speculative consumer enumeration entirely.
References to "10046-style trace" as a descriptive label for
the technique (e.g. "Per-session wait event trace ring buffer
(10046-style)" on WaitEventTraceState) are retained -- that is
a recognised term for the per-session-event-trace
profiling style, used here as a descriptor of the design, not
as a forward-looking promise about a particular consumer.
Files touched:
* doc/src/sgml/monitoring.sgml: pg_backend_wait_event_trace
cross-backend-tools paragraph, pg_get_wait_event_trace
description, and pg_stat_clear_orphaned_wait_event_rings
description.
* src/include/utils/wait_event_timing.h: slot-lifecycle
comment block (3 spots).
* src/backend/utils/activity/wait_event_timing.c:
wait_event_trace_before_shmem_exit doc, the
pg_backend_wait_event_trace SRF doc, the
pg_get_wait_event_trace SRF doc, and the
pg_stat_clear_orphaned_wait_event_rings doc.
No code change. Pure documentation rewrite.
Builds clean in both --enable-wait-event-timing and stub
configurations. Main regress (249 tests) and
test_wait_event_stress TAP (15 subtests) pass on the timing
build; main regress (249 tests) passes on the stub build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s/comments
review_6.md issue #9 flagged that documentation comments
over-reference "future ASH/AWR/10046-style background workers"
as justification for the cross-backend trace surface. On
-hackers this reads as speculative -- naming hypothetical
consumers to motivate present code. PG style prefers factual
statements about what the code does, with neutral framing for
consumers ("any extension that reads via the snapshot pattern").
The framing is also now doubly stale: since review_6 was
written we landed pg_get_wait_event_trace (review_6.md issue
#1) as the in-tree cross-backend consumer. The "future" tense
is wrong as well as speculative.
Sweep through wait_event_timing.h, wait_event_timing.c, and
monitoring.sgml. In each speculative reference replace the
"future ASH/AWR/10046-style background workers" phrasing with
neutral framing that:
* names the in-tree consumer (pg_get_wait_event_trace) where
that is what callers should use today,
* refers to "extensions following the snapshot pattern
documented on WaitEventTraceControl" for the
direct-shared-memory case,
* drops the speculative consumer enumeration entirely.
References to "10046-style trace" as a descriptive label for
the technique (e.g. "Per-session wait event trace ring buffer
(10046-style)" on WaitEventTraceState) are retained -- that is
a recognised term for the per-session-event-trace
profiling style, used here as a descriptor of the design, not
as a forward-looking promise about a particular consumer.
Files touched:
* doc/src/sgml/monitoring.sgml: pg_backend_wait_event_trace
cross-backend-tools paragraph, pg_get_wait_event_trace
description, and pg_stat_clear_orphaned_wait_event_rings
description.
* src/include/utils/wait_event_timing.h: slot-lifecycle
comment block (3 spots).
* src/backend/utils/activity/wait_event_timing.c:
wait_event_trace_before_shmem_exit doc, the
pg_backend_wait_event_trace SRF doc, the
pg_get_wait_event_trace SRF doc, and the
pg_stat_clear_orphaned_wait_event_rings doc.
No code change. Pure documentation rewrite.
Builds clean in both --enable-wait-event-timing and stub
configurations. Main regress (249 tests) and
test_wait_event_stress TAP (15 subtests) pass on the timing
build; main regress (249 tests) passes on the stub build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
902df44 to
adc517e
Compare
Full regression battery for 8b77f0fAll tests below were run against 1. pgbench
|
| run | BASELINE | WET/off | WET/stats | WET/trace |
|---|---|---|---|---|
| 1 | 163,656 | 165,540 | 162,655 | 166,045 |
| 2 | 162,859 | 167,232 | 164,518 | 165,465 |
| 3 | 165,870 | 167,471 | 163,137 | 166,766 |
| 4 | 165,444 | 167,534 | 165,016 | 167,480 |
| 5 | 166,466 | 166,425 | 165,303 | 162,679 |
| mean TPS | 164,859 | 166,840 | 164,126 | 165,687 |
| vs BASELINE | — | +1.20 % | −0.44 % | +0.50 % |
All three modes are within run-to-run noise of vanilla — the compile-time penalty is gone.
2. pgbench TPC-B (16c/8j/30s) — 5 runs each
| run | BASELINE | WET/off | WET/stats | WET/trace |
|---|---|---|---|---|
| 1 | 17,611 | 17,414 | 17,297 | 17,251 |
| 2 | 17,520 | 17,576 | 17,212 | 17,234 |
| 3 | 17,554 | 17,596 | 17,125 | 17,428 |
| 4 | 17,485 | 17,416 | 17,262 | 17,304 |
| 5 | 17,673 | 17,434 | 17,303 | 17,351 |
| mean TPS | 17,569 | 17,487 | 17,240 | 17,314 |
| vs BASELINE | — | −0.46 % | −1.87 % | −1.45 % |
3. evict_stress — new high-wait-event-transition-rate suite
shared_buffers = 32 MB versus a ~1.5 GB working set forces ~98 % buffer miss; every miss generates a BufferIO / DataFileRead wait event. Combined with high concurrency this drives the wait-event hot path at maximum rate.
| test | scenario | mode | TPS | Δ vs off | alive |
|---|---|---|---|---|---|
| W1 | 32 MB SB + pgbench -S 16c |
off | 116,487 | — | ✓ |
| stats | 116,625 | +0.1 % | ✓ | ||
| trace | 114,844 | −1.4 % | ✓ | ||
| W2 | 32 MB SB + TPC-B 16c | off | 16,813 | — | ✓ |
| stats | 16,766 | −0.3 % | ✓ | ||
| trace | 16,698 | −0.7 % | ✓ | ||
| W3 | 32 MB SB + pgbench -S 32c oversubscribed |
off | 162,845 | — | ✓ |
| stats | 158,880 | −2.4 % | ✓ | ||
| trace | 159,141 | −2.3 % | ✓ | ||
| W4 | 32 MB SB + hot-row UPDATE 32c | off | 8,138 | — | ✓ |
| stats | 7,772 | −4.5 % | ✓ | ||
| trace | 7,814 | −4.0 % | ✓ |
Sample of recorded wait events under W1/trace verifies the timing path actually captures the expected work:
wait_event_type | wait_event
-----------------+-------------------
IO | DataFileRead (dominant — buffer-miss reads)
Activity | BgwriterMain
Activity | WalWriterMain
IO | WalWrite
LWLock | WaitEventTraceDSA
IO | AioIoCompletion
No crashes in any of the 12 sub-runs across the four eviction-stress scenarios. Recording overhead under maximum-pressure scenarios peaks at ~4.5 % in W4 (where every transaction contends on the same row → many lock-wait transitions per tx). This is the real cost of writing per-event statistics, bounded and proportional to the wait-event rate.
4. Stress regression (stress2.sh, T1–T6)
| # | scenario | result |
|---|---|---|
| T1 | FOR UPDATE 16c × 5s × {off, stats, trace} | alive ✓ |
| T2 | pgbench -C connect-storm × {off, trace} |
alive ✓ |
| T3 | pg_ctl stop -m immediate recovery |
alive ✓ |
| T4 | pg_terminate_backend storm |
alive ✓ |
| T5 | rapid mode flips during load | alive ✓ |
| T6 | UPDATE contention × {off, stats, trace} | alive ✓ |
No PANIC, no segfault.
5. Additional stress (extra_stress.sh, E1–E7)
| # | scenario | result |
|---|---|---|
| E1 | 50 short-lived backends + pg_stat_clear_orphaned_wait_event_rings() |
swept 50/50 ✓ |
| E2 | 5 concurrent sessions × 20 forced-parallel aggregations | alive ✓ |
| E3 | 150 concurrent pg_get_wait_event_trace() reads vs writer |
alive ✓ |
| E4 | tiny ring (8 KB = 256 records) under pgbench -S |
alive ✓ |
| E5 | huge ring (32 MB = 1M records) under pgbench -S |
alive ✓ |
| E6 | VACUUM ANALYZE × 5 concurrent with TPC-B |
alive ✓ |
| E7 | mode-flip storm — flip every ~1.2 s for 30 s during load | alive ✓ |
6. Streaming replication (repl_test.sh)
Primary + hot-standby both with wait_event_capture = trace.
| metric | value |
|---|---|
| pgbench on primary | 12,257 TPS |
standby count(*) after replay |
1,000,000 rows ✓ |
| walreceiver / startup wait events captured | yes (WalReceiverMain, RecoveryConflictSnapshot) |
| crash in either log | none |
7. check-world (full upstream regression suite)
All tests successful.
Result: PASS
Both make check and TAP-suite tests passed end-to-end (set -e would have aborted on any failure).
8. ASAN — rebuilt with new source
inst-asan rebuilt from 8b77f0f739 with -fsanitize=address; ldd confirms libasan.so.8 linked.
| phase | scenario |
|---|---|
| A | pgbench -S 8c × 20s × {off, stats, trace} |
| B | pgbench TPC-B 8c × 20s × {stats, trace} |
| C | pgbench -C -S connect-storm 8c × 15s × trace |
| D | SELECT FOR UPDATE 16c × 15s × trace |
| E | pg_terminate_backend storm |
Postmaster log: no ASAN issues detected (no heap-use-after-free, no buffer-overflow, no UB).
Client-side pgbench leak warnings are pre-existing pg_strdup / lookupCreateVariable leaks in upstream pgbench code, unrelated to this patch.
PostgreSQL already exposes the *current* wait event of each
backend in pg_stat_activity, but there is no in-core way to
attribute waits across time: no per-backend counts, no
duration histograms, no event sequence. Production diagnosis
of "what was this backend waiting on?" requires sampling
pg_stat_activity at high frequency, which both undercounts
short waits and burdens the cluster.
This patch adds opt-in instrumentation that records wait
events for the calling backend in two modes:
* stats -- count, total/avg/max duration, and 32-bucket
log2-spaced duration histogram, per (backend, wait-event)
pair. Memory cost: ~26 KB/backend (DSA-backed, lazily
allocated when capture is first enabled cluster-wide).
* trace -- everything in stats, plus a per-session ring
buffer of individual events plus query-attribution
markers (ExecStart/ExecEnd, QueryStart/QueryEnd). This is
the per-session-event-trace style ("10046-style") used by
monitoring extensions and on-host workload analyzers.
The instrumentation is gated by a new compile-time flag
--enable-wait-event-timing. Production builds without the
flag pay zero cost: the pgstat_report_wait_start/end inline
path's only addition is an `#ifdef USE_WAIT_EVENT_TIMING`
block that is compiled out. Builds with the flag compiled in
but the wait_event_capture GUC at the default off value pay
one well-predicted branch per wait event (the OFF check
folds to a const compare against zero, branch-predicted
straight-line).
New runtime knobs:
* wait_event_capture (PGC_SUSET, enum off | stats | trace,
default off). Per-session GUC; superusers grant
SET ON PARAMETER to extend. Stepping down trace -> off
immediately releases the ring; stats counters survive
capture toggles until an explicit reset.
* wait_event_timing_max_tranches (PGC_POSTMASTER, int,
default 192). Cap on distinct LWLock tranches each
backend tracks individually. Excess tranches counted
against lwlock_overflow_count.
* wait_event_trace_ring_size_kb (PGC_POSTMASTER, int,
default 4096 KB, power-of-two, range 8..32768). Per-
backend trace ring buffer size. Tuneable per
deployment: lower at high max_connections to reduce
worst-case DSA footprint; raise for longer retention
before the FIFO wrap.
New SQL surface (all REVOKE'd from PUBLIC, GRANT'd to
pg_read_all_stats):
* pg_stat_wait_event_timing (view), backed by
pg_stat_get_wait_event_timing(pid int4) -- per-backend
per-event counts/durations/histograms.
* pg_stat_wait_event_timing_overflow (view), backed by
pg_stat_get_wait_event_timing_overflow(pid int4) --
counts of waits dropped to overflow (LWLock-hash full,
event-class flat array full) per backend.
* pg_backend_wait_event_trace (view), backed by
pg_get_backend_wait_event_trace() -- this session's
own trace ring, materialised.
* pg_get_wait_event_trace(procnumber int4) (SRF) -- the
canonical cross-backend reader. Reads OWNED and
ORPHANED slots uniformly using a position-encoded
identity seqlock against shared memory.
* pg_stat_reset_wait_event_timing(pid int4) and
pg_stat_reset_wait_event_timing_all() -- reset counters
for one backend or cluster-wide. Uses a generation-
counter retry protocol that coalesces concurrent resets
without losing in-flight wait observations.
* pg_stat_clear_orphaned_wait_event_rings() (superuser) --
explicit DBA sweep that releases orphaned trace rings
from backends that exited while
wait_event_capture = trace.
Trace-ring lifecycle and orphan semantics:
At backend exit the per-session trace ring is NOT freed.
The slot transitions OWNED -> ORPHANED and the ring stays
allocated in DSA so cross-backend readers (the in-tree
pg_get_wait_event_trace and any extension that follows
the snapshot pattern documented on WaitEventTraceControl)
can read the dying backend's final waits. This is the
one observable scenario that motivated the lifecycle work:
parallel workers exit in milliseconds at end-of-parallel-
query, well below any monitoring tool's polling interval,
and without orphan-persistence their final waits are lost.
Orphans are reclaimed lazily in three places:
(1) clear_orphan_at_init runs at every new backend's
InitProcess and frees the prior orphan whose
procNumber this backend inherits; this drains
orphans naturally as connection churn cycles slots.
(2) The session that disabled tracing releases its own
ring synchronously from the GUC assign hook.
(3) The DBA-driven sweep pg_stat_clear_orphaned_wait_
event_rings handles the pathological case of capture
briefly enabled then disabled on a cluster with
long-lived pooled connections that never recycle
procNumbers.
Worst-case orphan footprint is bounded by
NUM_WAIT_EVENT_TIMING_SLOTS * wait_event_trace_ring_size_kb
(default 4 MB; ~400 MB at max_connections=100, ~4 GB at
1000).
Hot-path design:
pgstat_report_wait_start/end remain inline. The new gate
if (unlikely(wait_event_capture != WAIT_EVENT_CAPTURE_OFF))
steers the predictor to the no-op fall-through. Inside,
the per-backend slot pointer is lazy-attached with a one-
shot in_attach re-entrancy guard so a nested wait inside
the DSA-attach internal LWLock cannot recurse.
The trace-ring writer is lock-free for the owning backend
(single-writer claim via plain read+write of write_pos,
avoiding the LOCK XADD that pg_atomic_fetch_add_u64 would
emit) and uses a per-record position-encoded identity
seqlock (rec->seq = pos*2+1 mid-write, pos*2+2 complete)
to coordinate with cross-backend readers. The identity
encoding rejects all four failure modes -- stale prior
cycle, mid-write current cycle, wrapped-past-us next
cycle, and torn write completed mid-read -- which a
parity-only seqlock would miss on weak-memory archs.
The writer hot path also coordinates with the GUC assign
hook's release_slot via a per-backend
wait_event_trace_writes_disabled gate so a dsa_free in
the release path can't UAF the ring through a nested
wait event.
LWLock-timing storage:
Each backend's per-tranche histograms live in a small
open-addressing hash table sized at server start by
wait_event_timing_max_tranches. Hits are O(1) amortised;
the probe count per lookup is bounded so a pathological
fill never spins. Tranches encountered after the table
fills are counted against lwlock_overflow_count and not
individually timed.
Stub build (without --enable-wait-event-timing):
Backing variables for all four GUCs are defined
unconditionally so the GUC framework has somewhere to
store the value. The check hook for wait_event_capture
rejects non-off values (warning + force to off at server
start, hard error at SET) so misconfigurations are
diagnosed loudly. SQL stubs of all SRFs return empty
results; the reset/clear functions ereport
ERRCODE_FEATURE_NOT_SUPPORTED. Marker entry points
(wait_event_trace_query_start etc.) are no-op stubs so
call sites in execMain.c, backend_status.c, and
postgres.c do not need #ifdef guards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
monitoring.sgml gets three new view entries (pg_stat_wait_event_timing, pg_stat_wait_event_timing_overflow, pg_backend_wait_event_trace) and new function entries under monitoring-stats-funcs (pg_get_wait_event_trace, pg_stat_reset_wait_event_timing, pg_stat_reset_wait_event_ timing_all, pg_stat_clear_orphaned_wait_event_rings), plus a section on the cross-backend snapshot protocol that monitoring extensions can use to read another backend's trace ring without going through the session-local SQL view. config.sgml gets the three new GUC entries (wait_event_capture, wait_event_timing_max_tranches, wait_event_trace_ring_size_kb) in the cumulative statistics section, including memory-footprint guidance for sizing wait_event_trace_ring_size_kb against max_connections. installation.sgml documents the --enable-wait-event-timing configure flag, including the hot-path zero-cost guarantee in builds that compile the feature in but leave wait_event_capture = off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression tests
----------------
src/test/regress/sql/wait_event_timing.sql exercises the GUC,
the timing/overflow/trace SQL surface, and the histogram
taxonomy view. Two expected outputs are provided:
* wait_event_timing.out -- for --enable-wait-event-timing
builds, asserts that a SET wait_event_capture = stats
followed by pg_sleep(0.1) lands a PgSleep entry with
non-zero counts, durations, and a valid 32-bucket
histogram whose entries sum to the visible call count
(call-count invariant: count == sum(histogram)).
* wait_event_timing_1.out -- for stub builds, asserts the
SET fails with the expected feature-not-supported error
and the SQL functions return empty results.
The parallel_schedule places wait_event_timing in its own
group; the test cannot share state with other tests, and the
two-phase reset_count invariant breaks if a foreign session
SETs the GUC between the test's two stat-snapshot queries.
Stress / TAP coverage
---------------------
src/test/modules/test_wait_event_stress wraps a small C
helper (stress_wait_events) that exercises the LWLock-timing
hash and the wait-event flat array, with deterministic counts.
The module's SQL regression script checks the
write_pos == sum(per_backend_calls) invariant under load.
Two TAP tests cover the lifecycle paths that the regression
harness cannot exercise (asynchronous backend exits, parallel
workers, ring wraparound):
* t/001_orphan_roundtrip.pl -- four scenarios for the
trace-ring orphan lifecycle:
1. Plain-backend orphan roundtrip (writer exits while
capture is trace, reader observes the ORPHANED ring,
admin sweep clears it).
2. Parallel-worker orphan roundtrip (forced-parallel
query with shm_mq tuple flow, asserts the admin
sweep reclaims leader + worker orphans).
3. OWNED-slot read with a concurrent live writer
(verifies cross-backend reads against an actively-
written ring return only well-formed rows).
4. clear_orphan_at_init reclaim (writer disconnects,
many new backends spawn until one inherits the
writer's procNumber, verifies the inheritor's
InitProcess freed the orphan). Bounded by the
ProcGlobal->freeProcs FIFO depth.
* t/002_ring_wrap.pl -- provisions a cluster with the
smallest legal wait_event_trace_ring_size_kb (8 KB =
256 records), drives a writer through enough waits to
wrap many times, and asserts the session-local and
cross-backend readers both return at most ring-size
well-formed records with post-wrap seq values. Verifies
the writer's `pos & ring_mask` indexing and the per-
record identity seqlock survive the wrap regime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the configure flag to a single Linux task in .cirrus.tasks.yml so the timing/trace code paths get buildfarm-equivalent coverage on every push. The other CI tasks continue to build without the flag, so the stub-build path (and its alternate-output regression test) remains exercised at every push as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eac2465 to
8b77f0f
Compare
Summary
This patch adds opt-in per-backend instrumentation that captures the duration of every wait event reported via
pgstat_report_wait_start/end. Two capture levels are exposed: aggregated per-event histograms (stats) and a 10046-style per-session trace ring (trace).The feature is gated by a configure-time flag (
--enable-wait-event-timing/meson -Dwait_event_timing=true) and a runtime GUC (wait_event_capture, SUSET, defaultoff). When the GUC isoff, the inline wait-event hot path is one global load + one predicted-not-taken branch — overhead vs vanilla is within run-to-run noise.Motivation
PostgreSQL surfaces a rich taxonomy of wait events through
pg_stat_activity.wait_event, but only as instantaneous snapshots. There is no in-tree way to answer:External tools either sample at coarse resolution (
pg_wait_sampling, default 10 ms) or pay a heavy debug-trap cost when readingPGPROC->wait_event_infovia a hardware watchpoint (~200–300 ns per transition → ~29 % TPS on high-transition workloads).The closest upstream analogue is Oracle's wait event interface: a
clock_gettime-grade timestamp at every wait start/end, with per-event aggregates and a per-session trace ring (the "10046" trace).What this patch adds
GUCs
wait_event_captureoffoff/stats/tracewait_event_timing_max_trancheswait_event_trace_ring_size_kboff— feature inert at runtime.stats— accumulates per-(backend, event)(count, total_ns, max_ns, log2-bucket histogram)in a per-backend DSA slot.trace— additionally pushes each completed wait into a fixed-size DSA-backed ring buffer per session (lossless single-writer / lossless external reader; wraps when exhausted).SQL surface
Design notes
Off-mode codegen
pgstat_report_wait_start/endarestatic inlineinsrc/include/utils/wait_event.h. The inline carries a minimal gate:The body (writes-disabled check, lazy DSA attach,
INSTR_TIME_SET_CURRENT, current-event write) lives in out-of-linepgstat_report_wait_start_timing/pgstat_report_wait_end_timing. Keeping the gate body out-of-line prevents host functions (LWLockAcquire,LockAcquireExtended,XLogInsert, ...) from being inflated by the unreachable on-mode code, which would otherwise cost register spills and worse instruction layout across the binary (~2 % off-mode penalty on the inlined variant, eliminated by this layout).Lazy DSA attach
Backends do not touch the timing DSA until the first time they observe
wait_event_capture != OFF. The DSA itself is not created until the first backend that enables capture. Invariants:MyProc->lwWaitingcheck at the start oflazy_attach()— never attempt DSA work while queued on an LWLock (DSA primitives can themselves queue on an LWLock, which would assert).wait_event_timing_writes_disabledflag raised by abefore_shmem_exitcallback registered at first attach, so wait events dispatched during the proc_exit cascade (e.g. bydsm_backend_shutdownfollow-ons) do not touch DSA mappings that have already been torn down.wait_event_infois re-published afterlazy_attach()returns — nested wait events during the attach (dsa_attachcan contend on an internal LWLock) clear it.Trace ring lifecycle
The per-backend trace ring is a DSA segment owned by a
WaitEventTraceRingHeader. When a backend exits, the segment is left behind for an external reader. Orphaned rings are swept by:procnumber(resets the ring), orThe trace ring records a fixed-size struct per wait:
(seq, wait_start_ns, wait_end_ns, wait_event_info, query_id).Lock-free reads
Per-backend storage has a single writer (the owning backend). External readers (
pg_stat_get_wait_event_timing(NULL),pg_get_wait_event_trace(procno)) read counters and the trace ring without locks; they tolerate torn reads by snapshotting(seq, wait_start, wait_end)and re-checkingseq. Cross-backend reset uses an atomicreset_generation; the owning backend clears its own counters at its next wait_end — no shared mutex.Measured overhead
CCX43 (16 dedicated AMD EPYC-Milan vCPU / 64 GB RAM), pgbench
-S16c/8j/30s, 4 GB shared_buffers, 5 runs per config, mean TPS:Off-mode is within noise. Stats / trace cost is below 2 % on typical pgbench and peaks at ~4–5 % under wait-event-saturated workloads (small
shared_bufferswith hot-row contention).Full per-run numbers, eviction-stress with
shared_buffers = 32 MB(W1–W4), stress regression (T1–T6, E1–E7), streaming replication withwait_event_capture = traceon both primary and standby, ASAN-instrumented battery, andmake check-worldresults are in the latest test-results comment.Commits
The main feature commit is large because the recording path, the SQL/SRF surface, the trace ring management, and the proc init hooks are tightly coupled; splitting further would not yield commits that build or pass tests independently.
Files changed
New:
src/backend/utils/activity/wait_event_timing.c— DSA management, lazy attach, recording, SQL functions, orphan sweepsrc/include/utils/wait_event_timing.h— public structs, GUC externsModified:
src/include/utils/wait_event.h— out-of-line gate atpgstat_report_wait_start/endsrc/backend/storage/lmgr/proc.c— init/cleanup of per-backend timing slotsrc/backend/storage/ipc/ipci.c— DSA registrationsrc/backend/utils/activity/backend_status.c— query_id pointer hookup for trace recordssrc/include/catalog/pg_proc.dat— SQL function registrationsrc/backend/utils/misc/guc_tables.c,guc_parameters.dat,postgresql.conf.sample— GUC definitionsconfigure.ac,meson.build,meson_options.txt,src/include/pg_config.h.in— compile-time optionsrc/backend/utils/activity/Makefile,meson.build— build wiringdoc/src/sgml/monitoring.sgml,config.sgml— docssrc/test/regress/...— regression test additionsOpen questions for review
Configure flag vs always-on. Currently behind
--enable-wait-event-timing(matching--enable-injection-points,--enable-cassertprecedent). With the out-of-line gate the runtime off-mode cost is negligible; the compile flag could be dropped if there is consensus that the binary-size and feature-discoverability tradeoff is acceptable.Trace record size. 64 bytes per record is generous (current fields fit in ~48). Reducing to 32 doubles ring depth at the cost of dropping
query_idfrom the trace record.Tranche hash sizing.
wait_event_timing_max_tranches = 192is a per-backend cap. Extensions that register many tranches can blow past this; today this raises a one-timeWARNINGand increments a per-backend overflow counter. An auto-grow path is possible but adds DSA allocation cost to a steady-state hot path.bgwriter sweep cadence. Orphan rings are swept once per bgwriter cycle today. A long-idle bgwriter delays cleanup; alternatively the sweep could be hooked into
pg_stat_reset_*callers too.Related work