From a0143a08c33cdcae88cf4d5f85fafcbc9d29c967 Mon Sep 17 00:00:00 2001 From: David W Bitner Date: Mon, 22 Jun 2026 16:07:00 -0500 Subject: [PATCH 1/2] Consolidate on the index access method; auto-provision opclasses MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make CREATE INDEX ... USING table_range the single interface and remove the SQL function interface (table_range_create/refresh/drop/summary_count), the registration table, and the staleness triggers. - Operator classes are now provisioned automatically by mirroring the types that already have a btree or range operator class, re-running on every CREATE EXTENSION. So any btree-comparable type and any range type work out of the box, and PostGIS geometry registers the moment `CREATE EXTENSION postgis` runs — no manual step. - summary_build.rs keeps only the ambuild path (scalar min/max + range/geometry extent); staleness is maintained by aminsert, recompute via REINDEX; the sql_drop event trigger still cleans up dropped indexes/tables. - Tests rewritten onto CREATE INDEX (26 pass on pg18); dropped the two function-only tests (refresh needs REINDEX which can't run in a pg_test txn; drop is covered by the AM drop test). - README/benchmark updated. Honest planning-time note: owning summaries in a real index adds per-partition index-metadata overhead during planning (~85 ms flat on a bare 1000-partition table); pruning still removes ~110 ms of child-path planning. Co-Authored-By: Claude Opus 4.8 --- README.md | 77 +++++------ bench/planning_benchmark.sql | 2 +- src/e2e_tests.rs | 172 ++++++------------------ src/index_am.rs | 90 +++++++++---- src/lib.rs | 29 +---- src/summary_build.rs | 245 ++--------------------------------- 6 files changed, 166 insertions(+), 449 deletions(-) diff --git a/README.md b/README.md index daf821c..30b1300 100644 --- a/README.md +++ b/README.md @@ -12,34 +12,20 @@ without it. 
## Quick Start -Two equivalent interfaces build and maintain the per-partition summaries. Use whichever -fits your workflow. - -### Function interface +Summaries are built and maintained through a custom index access method, so pruning +follows the normal index lifecycle (`pg_dump`/restore, `REINDEX`, `DROP INDEX`). ```sql CREATE EXTENSION pg_table_range; --- Register one or more columns of a partitioned (or plain) table and build summaries. --- Pass the relation as an OID; cast a name with ::regclass::oid. -SELECT table_range_create('events'::regclass::oid, ARRAY['val', 'created_at']); +-- Summarize one or more columns of a partitioned (or plain) table. +CREATE INDEX events_tr ON events USING table_range (val, created_at); --- Queries now prune partitions whose summarized range cannot match the predicate. +-- Queries now prune partitions whose summary cannot match the predicate. -- Verify with EXPLAIN: non-matching partitions disappear from the plan. EXPLAIN (COSTS OFF) SELECT * FROM events WHERE val >= 250; --- Recompute after bulk loads (also clears staleness); or drop registration entirely. -SELECT table_range_refresh('events'::regclass::oid); -SELECT table_range_drop('events'::regclass::oid); -``` - -### Index interface - -```sql --- Builds the same summaries via a custom index access method. -CREATE INDEX events_tr ON events USING table_range (val, created_at); - --- Pruning works immediately; REINDEX rebuilds summaries after heavy churn. +-- Recompute after heavy churn; or drop the summaries entirely. REINDEX INDEX events_tr; DROP INDEX events_tr; -- removes the summaries it built ``` @@ -47,6 +33,20 @@ DROP INDEX events_tr; -- removes the summaries it built The index is never used for scans — it exists only to build and own the summaries — so it adds no scan-time overhead and is never chosen by the planner for data access. 
+### Supported column types (no setup, including PostGIS) + +`CREATE INDEX … USING table_range` works on any **btree-comparable** type and any +**range** type out of the box. The required operator classes are provisioned +automatically by mirroring the types that already have a btree/range operator class — and +that mirror re-runs whenever an extension is installed, so **PostGIS geometry works the +moment you `CREATE EXTENSION postgis`, with no extra step**: + +```sql +CREATE EXTENSION postgis; -- geometry opclass auto-registers +CREATE INDEX places_tr ON places USING table_range (geom); +EXPLAIN (COSTS OFF) SELECT * FROM places WHERE geom && ST_MakeEnvelope(0,0,10,10); +``` + ## How it works - **Summaries.** For each leaf partition and indexed column, one row in @@ -69,12 +69,12 @@ it adds no scan-time overhead and is never chosen by the planner for data access is pruned by testing the constant against the partition's stored extent with PostgreSQL's own `&&` operator — so a partition is eliminated when its extent cannot overlap the query. -- **Automatic correctness.** Data changes mark the affected partition's summaries - *stale*, and stale summaries are never used for pruning — so a change can never cause a - missing row. The function interface installs row-level triggers - (`INSERT`/`UPDATE`/`DELETE`/`TRUNCATE`); the index interface marks stale from - `aminsert`. `table_range_refresh` (or `REINDEX`) recomputes and re-enables pruning. - A `sql_drop` event trigger removes summaries for any dropped relation or index. +- **Automatic correctness.** An insert that extends a partition marks its summary + *stale* (via the index's `aminsert`), and stale summaries are never used for pruning — + so a change can never cause a missing row. Deletes only shrink a partition's true + range, so the summary stays conservatively wide and remains safe. 
`REINDEX` recomputes + and re-enables pruning after churn, and a `sql_drop` event trigger removes a dropped + index's (or table's) summaries. ## Performance @@ -84,10 +84,15 @@ paths for every partition. On a 1000-partition table queried by a non-key column | | Planning time | Result | |---|---|---| -| pruning off | ~125 ms | 50 rows | -| pruning on | ~17 ms | 50 rows | +| pruning off | ~210 ms | 50 rows | +| pruning on | ~100 ms | 50 rows | + +Pruning removes ~110 ms of child-path planning here. Note the absolute numbers are higher +than they could be: because summaries are owned by a real index, PostgreSQL loads index +metadata for every partition during planning (a flat overhead, ~85 ms on this bare +1000-partition table — proportionally smaller when partitions already carry indexes). -Summaries are loaded **once per plan** (not per partition); the +Summaries themselves are loaded **once per plan** (not per partition); the `e2e_per_plan_cache_loads_once_regardless_of_partitions` test asserts exactly one catalog load for a 64-partition query. @@ -114,20 +119,18 @@ Everything not listed is conservatively **kept** (never mispruned): ## Catalog -- `table_range_summary` — one summary row per (owner, leaf partition, column): +- `table_range_summary` — one summary row per (index, leaf partition, column): `index_oid`, `relid`, `attnum`, `kind` (`minmax` or `overlap`), `type_name`, `min_summary`, `max_summary`, `has_nulls`, `all_nulls`, `stale`, `tuple_version`. -- `table_range_registered` — parents registered through the function interface and their - columns. 
## Project layout | File | Responsibility | |------|----------------| | `src/lib.rs` | GUCs, `_PG_init`, catalog/bootstrap SQL, test wiring | -| `src/summary_build.rs` | SPI summary build/refresh/drop, registration, staleness triggers | +| `src/summary_build.rs` | SPI summary build (scalar min/max + range/geometry extent) | | `src/prune_hook.rs` | planner + pathlist hooks, per-plan cache, typed in-memory evaluation | -| `src/index_am.rs` | `table_range` index access method and operator classes | +| `src/index_am.rs` | `table_range` index access method + automatic operator-class provisioning | | `src/e2e_tests.rs`, `src/index_am_tests.rs` | end-to-end tests | ## Building and testing @@ -149,7 +152,5 @@ range-type tests, which exercise the same code path. - `NOT IN` / `<> ALL`, `NOT (...)`, expression predicates, and parameterized prepared-statement plans are kept rather than pruned. -- Summaries are exact at build/refresh time; between changes and a refresh the affected - partitions are simply not pruned (always correct, just less selective). -- The index interface marks a partition stale on insert; recompute with `REINDEX` (or - `table_range_refresh` for the function interface) to re-enable pruning after churn. +- Summaries are exact at build time; an insert that extends a partition marks it stale + (not pruned, but still correct) until the next `REINDEX`. 
diff --git a/bench/planning_benchmark.sql b/bench/planning_benchmark.sql index 4558a5f..fac2d23 100644 --- a/bench/planning_benchmark.sql +++ b/bench/planning_benchmark.sql @@ -28,7 +28,7 @@ FROM generate_series(1, :part_count) g, ANALYZE bench_events; -SELECT table_range_create('bench_events'::regclass::oid, ARRAY['val']); +CREATE INDEX bench_events_tr ON bench_events USING table_range (val); \echo '==================== pruning OFF ====================' SET table_range.enable_pruning = off; diff --git a/src/e2e_tests.rs b/src/e2e_tests.rs index 3129926..ad2bf1c 100644 --- a/src/e2e_tests.rs +++ b/src/e2e_tests.rs @@ -3,19 +3,26 @@ // the pgrx test harness invokes. // // These exercise the full path: create a partitioned table, populate disjoint value -// ranges per partition, build summaries with `table_range_create`, then verify that -// (a) the planner eliminates non-matching partitions (via EXPLAIN) and (b) results are -// identical with pruning on and off (no false negatives). +// ranges per partition, build summaries with `CREATE INDEX ... USING table_range`, then +// verify that (a) the planner eliminates non-matching partitions (via EXPLAIN) and +// (b) results are identical with pruning on and off (no false negatives). // -// The partition key (`region`) deliberately differs from the queried data column -// (`val`), so native PostgreSQL partition pruning cannot help — only the table_range -// summaries can eliminate partitions here. +// The partition key (`region`) deliberately differs from the queried data column, so +// native PostgreSQL partition pruning cannot help — only the table_range summaries can. + +/// Build summaries for `cols` of `table` via a table_range index named `_tr`. 
+fn e2e_build(table: &str, cols: &str) { + Spi::run(&format!( + "CREATE INDEX {table}_tr ON {table} USING table_range ({cols})" + )) + .expect("create table_range index"); +} /// Build a 3-way LIST-partitioned table with disjoint `val` ranges: /// events_r1: region=1, val in [0, 99] /// events_r2: region=2, val in [100, 199] /// events_r3: region=3, val in [200, 299] -/// Then register summaries on `val`. +/// Then summarize `val`. fn e2e_setup_events() { Spi::run( "DROP TABLE IF EXISTS events CASCADE; @@ -28,11 +35,7 @@ fn e2e_setup_events() { INSERT INTO events SELECT 3, g FROM generate_series(200, 299) g;", ) .expect("setup events"); - let written = - Spi::get_one::("SELECT table_range_create('events'::regclass::oid, ARRAY['val'])") - .expect("create ok") - .expect("create returned count"); - assert_eq!(written, 3, "one summary per leaf partition"); + e2e_build("events", "val"); } fn e2e_explain(query: &str) -> String { @@ -50,6 +53,10 @@ fn e2e_explain(query: &str) -> String { .expect("explain") } +fn e2e_explain_on(table: &str, pred: &str) -> String { + e2e_explain(&format!("SELECT * FROM {table} WHERE {pred}")) +} + fn e2e_set_pruning(on: bool) { Spi::run(&format!( "SET table_range.enable_pruning = {}", @@ -126,24 +133,6 @@ fn e2e_boundary_equality_keeps_correct_partition() { assert_eq!(e2e_count_where("val = 100"), 1); } -#[pg_test] -fn e2e_refresh_picks_up_new_data_range() { - e2e_setup_events(); - e2e_set_pruning(true); - let plan_before = e2e_explain("SELECT * FROM events WHERE val = 500"); - assert!(!plan_before.contains("events_r1")); - - Spi::run("INSERT INTO events VALUES (1, 500)").expect("insert"); - let n = Spi::get_one::("SELECT table_range_refresh('events'::regclass::oid)") - .expect("refresh") - .expect("count"); - assert_eq!(n, 3); - - let plan_after = e2e_explain("SELECT * FROM events WHERE val = 500"); - assert!(plan_after.contains("events_r1"), "r1 must reappear:\n{plan_after}"); - assert_eq!(e2e_count_where("val = 500"), 1); -} - #[pg_test] 
fn e2e_disabled_pruning_scans_all_partitions() { e2e_setup_events(); @@ -154,25 +143,6 @@ fn e2e_disabled_pruning_scans_all_partitions() { assert!(plan.contains("events_r3")); } -#[pg_test] -fn e2e_drop_removes_summaries_and_pruning() { - e2e_setup_events(); - assert_eq!( - Spi::get_one::("SELECT table_range_summary_count('events'::regclass::oid)") - .unwrap() - .unwrap(), - 3 - ); - assert!(Spi::get_one::("SELECT table_range_drop('events'::regclass::oid)") - .unwrap() - .unwrap()); - e2e_set_pruning(true); - let plan = e2e_explain("SELECT * FROM events WHERE val >= 250"); - assert!(plan.contains("events_r1")); - assert!(plan.contains("events_r2")); - assert!(plan.contains("events_r3")); -} - #[pg_test] fn e2e_works_on_plain_unpartitioned_table() { Spi::run( @@ -181,9 +151,7 @@ fn e2e_works_on_plain_unpartitioned_table() { INSERT INTO plain_t SELECT g FROM generate_series(0, 99) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('plain_t'::regclass::oid, ARRAY['val'])") - .unwrap() - .unwrap(); + e2e_build("plain_t", "val"); e2e_set_pruning(true); assert_eq!( Spi::get_one::("SELECT count(*)::bigint FROM plain_t WHERE val >= 50") @@ -200,39 +168,30 @@ fn e2e_works_on_plain_unpartitioned_table() { } #[pg_test] -fn e2e_insert_without_refresh_is_still_correct() { +fn e2e_insert_keeps_results_correct_via_staleness() { e2e_setup_events(); e2e_set_pruning(true); // Sanity: before the insert, val=500 prunes everything (no partition covers it). assert_eq!(e2e_count_where("val = 500"), 0); - // Insert a value far outside r1's summarized range and DO NOT refresh. + // Insert a value far outside r1's summarized range. aminsert marks r1 stale, so the + // new row is still found — no false negative despite a now-stale summary. Spi::run("INSERT INTO events VALUES (1, 500)").expect("insert"); - - // The staleness trigger must have disabled pruning for r1, so the new row is - // still found — no false negative despite a stale summary. 
assert_eq!( e2e_count_where("val = 500"), 1, "stale summary must not prune away newly inserted matching rows" ); - // r1 must reappear in the plan (kept due to staleness). let plan = e2e_explain("SELECT * FROM events WHERE val = 500"); assert!(plan.contains("events_r1"), "r1 kept while stale:\n{plan}"); - - // After refresh, pruning becomes effective again and the row is still correct. - Spi::get_one::("SELECT table_range_refresh('events'::regclass::oid)") - .unwrap() - .unwrap(); - assert_eq!(e2e_count_where("val = 500"), 1); } #[pg_test] fn e2e_delete_keeps_results_correct() { e2e_setup_events(); e2e_set_pruning(true); - // Deleting rows can only shrink a partition's true range; a now-too-wide summary - // is conservative (safe). Results must stay correct with or without refresh. + // Deleting rows can only shrink a partition's true range; a now-too-wide summary is + // conservative (safe). Results must stay correct. Spi::run("DELETE FROM events WHERE region = 2").expect("delete"); for pred in ["val = 150", "val >= 100 AND val < 200", "val < 50"] { e2e_set_pruning(false); @@ -260,16 +219,12 @@ fn e2e_large_tree_prunes_to_single_partition() { )) .unwrap(); } - Spi::get_one::("SELECT table_range_create('big'::regclass::oid, ARRAY['val'])") - .unwrap() - .unwrap(); + e2e_build("big", "val"); e2e_set_pruning(true); // Value 1750 lives only in partition p17 (1700..1799). let plan = e2e_explain_on("big", "val = 1750"); assert!(plan.contains("big_p17"), "p17 kept:\n{plan}"); - assert!(!plan.contains("big_p0\n") && !plan.contains("big_p0 "), "p0 pruned:\n{plan}"); - // Count of "big_p" scan references should be exactly 1 (only p17). 
let scans = plan.matches("big_p").count(); assert_eq!(scans, 1, "expected a single surviving partition:\n{plan}"); @@ -284,8 +239,8 @@ fn e2e_large_tree_prunes_to_single_partition() { #[pg_test] fn e2e_per_plan_cache_loads_once_regardless_of_partitions() { - // 64 range partitions; planning a query must load summaries exactly once, not - // once per partition — this is the observable signature of the per-plan cache. + // 64 range partitions; planning a query must load summaries exactly once, not once + // per partition — the observable signature of the per-plan cache. Spi::run( "DROP TABLE IF EXISTS cache_t CASCADE; CREATE TABLE cache_t (val bigint) PARTITION BY RANGE (val);", @@ -300,13 +255,10 @@ fn e2e_per_plan_cache_loads_once_regardless_of_partitions() { )) .unwrap(); } - Spi::get_one::("SELECT table_range_create('cache_t'::regclass::oid, ARRAY['val'])") - .unwrap() - .unwrap(); + e2e_build("cache_t", "val"); e2e_set_pruning(true); Spi::run("SELECT table_range_reset_cache_load_count()").unwrap(); - // One selective query over 64 partitions. let found = Spi::get_one::("SELECT count(*)::bigint FROM cache_t WHERE val = 3333") .unwrap() .unwrap(); @@ -330,7 +282,9 @@ fn postgis_available() -> bool { #[pg_test] fn e2e_postgis_extent_pruning() { // PostGIS is not installed in every test environment (e.g. the pgrx-managed pg18); - // skip gracefully there. CI installs PostGIS so this runs for real. + // skip gracefully there. CI installs PostGIS so this runs for real. Creating the + // extension fires our event trigger, which registers the geometry opclass so + // CREATE INDEX ... USING table_range (geom) resolves with no manual step. 
if !postgis_available() { return; } @@ -346,9 +300,7 @@ fn e2e_postgis_extent_pruning() { INSERT INTO ev_g SELECT 3, ST_MakePoint(200+x, 200+y) FROM generate_series(0,10) x, generate_series(0,10) y;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('ev_g'::regclass::oid, ARRAY['geom'])") - .unwrap() - .unwrap(); + e2e_build("ev_g", "geom"); e2e_set_pruning(true); // A query box over partition 3's extent prunes partitions 1 and 2. @@ -391,9 +343,7 @@ fn e2e_range_overlap_pruning() { INSERT INTO ev_r SELECT 3, int8range(200+g*10, 200+g*10+10) FROM generate_series(0,9) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('ev_r'::regclass::oid, ARRAY['period'])") - .unwrap() - .unwrap(); + e2e_build("ev_r", "period"); e2e_set_pruning(true); let plan = e2e_explain_on("ev_r", "period && int8range(250, 260)"); @@ -401,7 +351,6 @@ fn e2e_range_overlap_pruning() { assert!(!plan.contains("ev_r_1"), "r1 pruned:\n{plan}"); assert!(!plan.contains("ev_r_2"), "r2 pruned:\n{plan}"); - // Correctness on/off for several overlap predicates, including spanning and empty. let count = |pred: &str| { Spi::get_one::(&format!("SELECT count(*)::bigint FROM ev_r WHERE {pred}")) .unwrap() @@ -409,9 +358,9 @@ fn e2e_range_overlap_pruning() { }; for pred in [ "period && int8range(250, 260)", - "period && int8range(95, 105)", // spans r1/r2 boundary + "period && int8range(95, 105)", // spans r1/r2 boundary "period && int8range(1000, 2000)", // matches nothing - "period && int8range(0, 300)", // matches everything + "period && int8range(0, 300)", // matches everything ] { e2e_set_pruning(false); let off = count(pred); @@ -438,7 +387,6 @@ fn e2e_or_pruning() { assert!(plan_wide.contains("events_r2")); assert!(plan_wide.contains("events_r3")); - // Correctness on/off for several OR shapes, including a nested AND inside the OR. 
for pred in [ "val < 50 OR val >= 250", "val = 5 OR val = 295", @@ -470,7 +418,6 @@ fn e2e_in_list_pruning() { assert!(!plan2.contains("events_r2"), "r2 pruned:\n{plan2}"); assert!(!plan2.contains("events_r3"), "r3 pruned:\n{plan2}"); - // Correctness on/off across several IN lists (incl. NULL element and no matches). for pred in [ "val IN (5, 250)", "val IN (5, 25, 75)", @@ -486,22 +433,6 @@ fn e2e_in_list_pruning() { } } -fn e2e_explain_on(table: &str, pred: &str) -> String { - Spi::connect(|client| { - let q = format!("EXPLAIN (COSTS OFF) SELECT * FROM {table} WHERE {pred}"); - let t = client.select(&q, None, &[])?; - let mut out = String::new(); - for row in t { - if let Ok(Some(line)) = row.get::(1) { - out.push_str(&line); - out.push('\n'); - } - } - Ok::(out) - }) - .expect("explain") -} - #[pg_test] fn e2e_timestamptz_pruning() { Spi::run( @@ -515,16 +446,13 @@ fn e2e_timestamptz_pruning() { INSERT INTO ev_ts SELECT 3, timestamptz '2024-03-01' + (g||' days')::interval FROM generate_series(0,27) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('ev_ts'::regclass::oid, ARRAY['ts'])") - .unwrap() - .unwrap(); + e2e_build("ev_ts", "ts"); e2e_set_pruning(true); let plan = e2e_explain_on("ev_ts", "ts >= timestamptz '2024-03-01'"); assert!(plan.contains("ev_ts_3"), "march must remain:\n{plan}"); assert!(!plan.contains("ev_ts_1"), "jan pruned:\n{plan}"); assert!(!plan.contains("ev_ts_2"), "feb pruned:\n{plan}"); - // Correctness on/off. 
let pred = "ts >= timestamptz '2024-02-15' AND ts < timestamptz '2024-03-10'"; e2e_set_pruning(false); let off = Spi::get_one::(&format!("SELECT count(*)::bigint FROM ev_ts WHERE {pred}")) @@ -550,9 +478,7 @@ fn e2e_text_pruning() { INSERT INTO ev_txt VALUES (3,'watermelon'),(3,'xigua'),(3,'zucchini');", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('ev_txt'::regclass::oid, ARRAY['name'])") - .unwrap() - .unwrap(); + e2e_build("ev_txt", "name"); e2e_set_pruning(true); let plan = e2e_explain_on("ev_txt", "name >= 'watermelon'"); assert!(plan.contains("ev_txt_3"), "third must remain:\n{plan}"); @@ -582,9 +508,7 @@ fn e2e_float_pruning() { INSERT INTO ev_f SELECT 2, 100.0 + g * 1.5 FROM generate_series(0,49) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('ev_f'::regclass::oid, ARRAY['amt'])") - .unwrap() - .unwrap(); + e2e_build("ev_f", "amt"); e2e_set_pruning(true); let plan = e2e_explain_on("ev_f", "amt > 120.0"); assert!(plan.contains("ev_f_2")); @@ -599,18 +523,12 @@ fn e2e_multicolumn_and_semantics() { CREATE TABLE mc_1 PARTITION OF mc FOR VALUES IN (1); CREATE TABLE mc_2 PARTITION OF mc FOR VALUES IN (2); CREATE TABLE mc_3 PARTITION OF mc FOR VALUES IN (3); - -- a and b ranges per partition: - -- p1: a[0..99] b[0..99] - -- p2: a[100..199] b[100..199] - -- p3: a[200..299] b[200..299] INSERT INTO mc SELECT 1, g, g FROM generate_series(0,99) g; INSERT INTO mc SELECT 2, 100+g, 100+g FROM generate_series(0,99) g; INSERT INTO mc SELECT 3, 200+g, 200+g FROM generate_series(0,99) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('mc'::regclass::oid, ARRAY['a','b'])") - .unwrap() - .unwrap(); + e2e_build("mc", "a, b"); e2e_set_pruning(true); // a >= 250 keeps only p3; b < 50 alone keeps only p1; together -> empty. 
let plan = e2e_explain_on("mc", "a >= 250 AND b < 50"); @@ -618,7 +536,6 @@ fn e2e_multicolumn_and_semantics() { assert!(!plan.contains("mc_2"), "p2 pruned by both:\n{plan}"); assert!(!plan.contains("mc_3"), "p3 pruned by b:\n{plan}"); - // Correctness across several multi-column predicates. for pred in [ "a >= 250 AND b < 50", "a < 150 AND b > 50", @@ -645,14 +562,12 @@ fn e2e_is_null_pruning() { CREATE TABLE nt_nonull PARTITION OF nt FOR VALUES IN (1); CREATE TABLE nt_allnull PARTITION OF nt FOR VALUES IN (2); CREATE TABLE nt_mixed PARTITION OF nt FOR VALUES IN (3); - INSERT INTO nt SELECT 1, g FROM generate_series(1,50) g; -- no nulls - INSERT INTO nt SELECT 2, NULL FROM generate_series(1,50) g; -- all null + INSERT INTO nt SELECT 1, g FROM generate_series(1,50) g; + INSERT INTO nt SELECT 2, NULL FROM generate_series(1,50) g; INSERT INTO nt SELECT 3, CASE WHEN g % 2 = 0 THEN g ELSE NULL END FROM generate_series(1,50) g;", ) .unwrap(); - Spi::get_one::("SELECT table_range_create('nt'::regclass::oid, ARRAY['val'])") - .unwrap() - .unwrap(); + e2e_build("nt", "val"); e2e_set_pruning(true); // IS NULL: the no-null partition can be pruned; all-null and mixed remain. @@ -667,7 +582,6 @@ fn e2e_is_null_pruning() { assert!(plan_nn.contains("nt_nonull")); assert!(plan_nn.contains("nt_mixed")); - // Correctness on/off for both null predicates. for pred in ["val IS NULL", "val IS NOT NULL"] { e2e_set_pruning(false); let off = Spi::get_one::(&format!("SELECT count(*)::bigint FROM nt WHERE {pred}")) diff --git a/src/index_am.rs b/src/index_am.rs index cbe9832..8a92a52 100644 --- a/src/index_am.rs +++ b/src/index_am.rs @@ -37,8 +37,8 @@ unsafe extern "C-unwind" fn xact_callback( // Instead, `ambuild` scans the (leaf) relation and writes one min/max/null summary per // indexed column into `table_range_summary` (keyed by the index OID), and installs // the staleness trigger that keeps the summary conservative on data changes. 
The planner -// hook then prunes partitions exactly as it does for the function interface. The index is -// never chosen for scans (no `amgettuple`/`amgetbitmap`, prohibitive cost estimate). +// hook then prunes partitions using those summaries. The index is never chosen for scans +// (no `amgettuple`/`amgetbitmap`, prohibitive cost estimate). /// V1 function-info record so PostgreSQL can call `table_range_amhandler` as a /// `LANGUAGE c` function declared in the access-method SQL below. @@ -100,7 +100,7 @@ unsafe extern "C-unwind" fn am_build( pg_sys::PushActiveSnapshot(pg_sys::GetTransactionSnapshot()); pgrx::PgTryBuilder::new(|| { if let Ok(names) = crate::summary_build::column_names_for_attnums(heap_relid, &attnums) { - let _ = crate::summary_build::build_one_leaf(index_relid, heap_relid, &names, false); + let _ = crate::summary_build::build_one_leaf(index_relid, heap_relid, &names); } }) .catch_others(|_| ()) @@ -212,27 +212,69 @@ extension_sql!( CREATE ACCESS METHOD table_range TYPE INDEX HANDLER table_range_amhandler; COMMENT ON ACCESS METHOD table_range IS - 'Conservative early partition pruning via min/max range summaries'; - - -- Minimal default operator classes so CREATE INDEX ... USING table_range resolves a - -- class for each common column type. The AM stores only summaries, so these carry no - -- operators or support procedures (amvalidate accepts them). 
- CREATE OPERATOR CLASS bool_tr_ops DEFAULT FOR TYPE boolean USING table_range AS STORAGE boolean; - CREATE OPERATOR CLASS int2_tr_ops DEFAULT FOR TYPE smallint USING table_range AS STORAGE smallint; - CREATE OPERATOR CLASS int4_tr_ops DEFAULT FOR TYPE integer USING table_range AS STORAGE integer; - CREATE OPERATOR CLASS int8_tr_ops DEFAULT FOR TYPE bigint USING table_range AS STORAGE bigint; - CREATE OPERATOR CLASS float4_tr_ops DEFAULT FOR TYPE real USING table_range AS STORAGE real; - CREATE OPERATOR CLASS float8_tr_ops DEFAULT FOR TYPE double precision USING table_range AS STORAGE double precision; - CREATE OPERATOR CLASS numeric_tr_ops DEFAULT FOR TYPE numeric USING table_range AS STORAGE numeric; - CREATE OPERATOR CLASS text_tr_ops DEFAULT FOR TYPE text USING table_range AS STORAGE text; - CREATE OPERATOR CLASS varchar_tr_ops DEFAULT FOR TYPE varchar USING table_range AS STORAGE varchar; - CREATE OPERATOR CLASS bpchar_tr_ops DEFAULT FOR TYPE bpchar USING table_range AS STORAGE bpchar; - CREATE OPERATOR CLASS date_tr_ops DEFAULT FOR TYPE date USING table_range AS STORAGE date; - CREATE OPERATOR CLASS time_tr_ops DEFAULT FOR TYPE time USING table_range AS STORAGE time; - CREATE OPERATOR CLASS timestamp_tr_ops DEFAULT FOR TYPE timestamp USING table_range AS STORAGE timestamp; - CREATE OPERATOR CLASS timestamptz_tr_ops DEFAULT FOR TYPE timestamptz USING table_range AS STORAGE timestamptz; - CREATE OPERATOR CLASS uuid_tr_ops DEFAULT FOR TYPE uuid USING table_range AS STORAGE uuid; - CREATE OPERATOR CLASS oid_tr_ops DEFAULT FOR TYPE oid USING table_range AS STORAGE oid; + 'Early partition pruning via per-partition data-range summaries'; + + -- CREATE INDEX ... USING table_range needs a default operator class for the column + -- type. Rather than hardcode a list, mirror the operator-class coverage that already + -- exists: any btree-ordered type (scalar min/max), every range type, and PostGIS + -- geometry/geography (extent). 
The AM stores only summaries, so these classes carry + -- no operators or support procedures. This runs at install and again whenever an + -- extension is created, so installing PostGIS makes geometry "just work" with no + -- manual step. + CREATE FUNCTION table_range_sync_opclasses() RETURNS void + LANGUAGE plpgsql AS $$ + DECLARE + tr_am oid; + r record; + BEGIN + SELECT oid INTO tr_am FROM pg_am WHERE amname = 'table_range'; + IF tr_am IS NULL THEN + RETURN; + END IF; + FOR r IN + SELECT DISTINCT cand.typid, format_type(cand.typid, NULL) AS typname + FROM ( + SELECT bc.opcintype AS typid + FROM pg_opclass bc JOIN pg_am am ON am.oid = bc.opcmethod + WHERE am.amname = 'btree' AND bc.opcdefault + UNION + SELECT t.oid FROM pg_type t WHERE t.typtype = 'r' + UNION + SELECT t.oid FROM pg_type t WHERE t.typname IN ('geometry', 'geography') + ) cand + WHERE cand.typid NOT IN ('anyrange'::regtype, 'anyarray'::regtype) + AND NOT EXISTS ( + SELECT 1 FROM pg_opclass tc + WHERE tc.opcmethod = tr_am AND tc.opcdefault + AND tc.opcintype = cand.typid + ) + LOOP + BEGIN + EXECUTE format( + 'CREATE OPERATOR CLASS %I DEFAULT FOR TYPE %s USING table_range AS STORAGE %s', + 'tr_' || r.typid, r.typname, r.typname); + EXCEPTION WHEN OTHERS THEN + -- Skip types that cannot host a storage-only opclass. + NULL; + END; + END LOOP; + END; + $$; + + SELECT table_range_sync_opclasses(); + + CREATE FUNCTION table_range_opclass_sync_evt() RETURNS event_trigger + LANGUAGE plpgsql AS $$ + BEGIN + PERFORM table_range_sync_opclasses(); + END; + $$; + + -- Re-sync when any extension is installed (e.g. PostGIS), so new types that gain a + -- btree/geometry opclass automatically become usable with table_range. 
+ CREATE EVENT TRIGGER table_range_opclass_sync_trg + ON ddl_command_end WHEN TAG IN ('CREATE EXTENSION') + EXECUTE FUNCTION table_range_opclass_sync_evt(); "#, name = "table_range_access_method", requires = ["table_range_bootstrap_sql"] diff --git a/src/lib.rs b/src/lib.rs index 351a8f2..945d75b 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -43,8 +43,8 @@ pub extern "C-unwind" fn _PG_init() { extension_sql!( r#" - -- One summary per (registered relation, leaf partition, column). - -- index_oid: the registering parent relation OID (or index OID via the AM). + -- One summary per (index, leaf partition, column), built by the index AM. + -- index_oid: the (leaf) index relation OID. -- relid: the leaf partition (heap) OID the planner sees. -- attnum: the leaf partition's attnum for the column (resolved by name). -- kind: 'minmax' -> min_summary/max_summary hold the column's btree min/max; @@ -74,29 +74,8 @@ extension_sql!( ON table_range_summary (relid) WHERE stale; - -- Registration of parent relations whose partitions are summarized for pruning. - CREATE TABLE IF NOT EXISTS table_range_registered ( - parent_relid oid PRIMARY KEY, - columns text[] NOT NULL, - created_at timestamptz NOT NULL DEFAULT now(), - refreshed_at timestamptz NOT NULL DEFAULT now() - ); - - -- Row/statement trigger that marks a partition's summaries stale when its data - -- changes. The planner hook ignores stale summaries (treats them as KEEP), so any - -- modification automatically and conservatively disables pruning for that partition - -- until table_range_refresh() recomputes it. This is the correctness safety net: - -- no INSERT/UPDATE/DELETE/TRUNCATE can ever cause a false negative. 
- CREATE OR REPLACE FUNCTION table_range_stale_trigger() RETURNS trigger - LANGUAGE plpgsql AS $$ - BEGIN - UPDATE table_range_summary SET stale = true WHERE relid = TG_RELID; - RETURN NULL; - END; - $$; - - -- Drop summaries/registration for any relation that is dropped, so a dropped - -- table_range index can never leave behind a summary that nothing keeps stale. + -- Drop summaries for any relation that is dropped, so a dropped table_range index + -- (or its table) can never leave behind a summary that nothing keeps stale. CREATE FUNCTION table_range_drop_cleanup() RETURNS event_trigger LANGUAGE c AS 'MODULE_PATHNAME', 'table_range_drop_cleanup'; CREATE EVENT TRIGGER table_range_drop_trg ON sql_drop diff --git a/src/summary_build.rs b/src/summary_build.rs index ca164f8..5731ec5 100644 --- a/src/summary_build.rs +++ b/src/summary_build.rs @@ -2,16 +2,14 @@ use pgrx::prelude::*; use pgrx::spi::SpiError; use std::sync::OnceLock; -// Real, SPI-driven summary maintenance for the table_range pruning extension. +// SPI-driven summary maintenance for the table_range pruning extension. // -// Scans a registered parent relation's leaf partitions and persists one -// per-(partition, column) min/max/null summary into `table_range_summary`. -// -// Summaries are keyed by: -// - `index_oid` = the registered parent relation OID (synthetic key, no real index), -// - `relid` = the leaf partition OID, -// - `attnum` = the leaf partition's attnum for the column (resolved by name so -// differing physical column order across partitions is handled). +// Summaries are built by the index access method's `ambuild` (see `index_am.rs`): for +// each leaf partition it scans the column's real data and persists one summary row into +// `table_range_summary`, keyed by: +// - `index_oid` = the (leaf) index relation OID, +// - `relid` = the leaf partition (heap) OID the planner sees, +// - `attnum` = the leaf partition's attnum for the column. 
// // Correctness: a missing or `stale` summary means "do not prune". We never persist a // summary that could cause a false negative; on any failure we leave the partition @@ -42,99 +40,6 @@ pub(crate) fn summary_table() -> String { format!("{}.table_range_summary", schema()) } -/// Schema-qualified name of the registration table. -pub(crate) fn registered_table() -> String { - format!("{}.table_range_registered", schema()) -} - -/// Register a parent relation and build summaries for the named columns. -#[pg_extern] -fn table_range_create(parent: pg_sys::Oid, columns: Vec<String>) -> i64 { - if columns.is_empty() { - error!("table_range_create: at least one column is required"); - } - validate_columns_exist(parent, &columns); - - // Persist registration (idempotent). - let cols_literal = pg_array_text_literal(&columns); - let reg = format!( - "INSERT INTO table_range_registered (parent_relid, columns, refreshed_at) \ - VALUES ({}::oid, {}, now()) \ - ON CONFLICT (parent_relid) DO UPDATE SET columns = EXCLUDED.columns, refreshed_at = now()", - oid_u32(parent), - cols_literal - ); - Spi::run(&reg).unwrap_or_else(|e| error!("table_range_create: failed to register parent: {e}")); - - build_summaries(parent, &columns) - .unwrap_or_else(|e| error!("table_range_create: summary build failed: {e}")) -} - -/// Recompute summaries for an already-registered parent relation.
-#[pg_extern] -fn table_range_refresh(parent: pg_sys::Oid) -> i64 { - let columns = registered_columns(parent) - .unwrap_or_else(|e| error!("table_range_refresh: lookup failed: {e}")); - let columns = match columns { - Some(c) => c, - None => error!( - "table_range_refresh: parent {} is not registered", - oid_u32(parent) - ), - }; - let written = build_summaries(parent, &columns) - .unwrap_or_else(|e| error!("table_range_refresh: summary build failed: {e}")); - Spi::run(&format!( - "UPDATE table_range_registered SET refreshed_at = now() WHERE parent_relid = {}::oid", - oid_u32(parent) - )) - .ok(); - written -} - -/// Unregister a parent relation, drop its summaries, and remove its triggers. -#[pg_extern] -fn table_range_drop(parent: pg_sys::Oid) -> bool { - // Remove staleness triggers from every leaf first (best-effort). - if let Ok(leaves) = leaf_partitions(parent) { - for leaf in leaves { - if let Ok(Some(name)) = relation_name(leaf) { - let _ = Spi::run(&format!( - "DROP TRIGGER IF EXISTS {trg} ON {tbl}; \ - DROP TRIGGER IF EXISTS {trg}_trunc ON {tbl}", - trg = STALE_TRIGGER_NAME, - tbl = name - )); - } - } - } - - let p = oid_u32(parent); - Spi::run(&format!( - "DELETE FROM {tbl} WHERE index_oid = {p}::oid", - tbl = summary_table() - )) - .and_then(|_| { - Spi::run(&format!( - "DELETE FROM table_range_registered WHERE parent_relid = {p}::oid" - )) - }) - .is_ok() -} - -/// Number of leaf partitions currently summarized for a parent. -#[pg_extern] -fn table_range_summary_count(parent: pg_sys::Oid) -> i64 { - Spi::get_one::<i64>(&format!( - "SELECT count(DISTINCT relid)::bigint FROM {tbl} WHERE index_oid = {p}::oid", - tbl = summary_table(), - p = oid_u32(parent) - )) - .ok() - .flatten() - .unwrap_or(0) -} - /// V1 record for the `sql_drop` event-trigger cleanup function.
#[no_mangle] pub extern "C" fn pg_finfo_table_range_drop_cleanup() -> &'static pg_sys::Pg_finfo_record { @@ -142,8 +47,8 @@ pub extern "C" fn pg_finfo_table_range_drop_cleanup() -> &'static pg_sys::Pg_fin &V1_API } -/// Event-trigger handler: when any relation is dropped, remove the summaries and -/// registration that referenced it. This closes the correctness gap where a dropped +/// Event-trigger handler: when any relation is dropped, remove the summaries that +/// referenced it (by index OID or leaf OID). This closes the gap where a dropped /// `table_range` index would leave summaries behind that nothing keeps stale anymore. #[no_mangle] #[pg_guard] @@ -156,138 +61,23 @@ pub unsafe extern "C-unwind" fn table_range_drop_cleanup( WHERE s.index_oid = d.objid OR s.relid = d.objid", tbl = summary_table() )); - let _ = Spi::run(&format!( - "DELETE FROM {reg} r USING pg_event_trigger_dropped_objects() d \ - WHERE r.parent_relid = d.objid", - reg = registered_table() - )); }) .catch_others(|_| ()) .execute(); pg_sys::Datum::from(0) } -fn validate_columns_exist(parent: pg_sys::Oid, columns: &[String]) { - for col in columns { - let found = Spi::get_one::<bool>(&format!( - "SELECT EXISTS (SELECT 1 FROM pg_attribute \ - WHERE attrelid = {}::oid AND attname = {} AND attnum > 0 AND NOT attisdropped)", - oid_u32(parent), - quote_literal(col) - )) - .ok() - .flatten() - .unwrap_or(false); - if !found { - error!( - "table_range_create: column {:?} does not exist on relation {}", - col, - oid_u32(parent) - ); - } - } -} - -fn registered_columns(parent: pg_sys::Oid) -> Result<Option<Vec<String>>, SpiError> { - let mut out: Vec<String> = Vec::new(); - let mut registered = false; - Spi::connect(|client| { - let table = client.select( - &format!( - "SELECT unnest(columns) AS c FROM table_range_registered WHERE parent_relid = {}::oid", - oid_u32(parent) - ), - None, - &[], - )?; - for row in table { - registered = true; - if let Ok(Some(c)) = row.get::<String>(1) { - out.push(c); - } - } - Ok::<(), SpiError>(()) - })?; - 
if !registered { - Ok(None) - } else { - Ok(Some(out)) - } -} - -/// Enumerate leaf partitions of `parent`. For a non-partitioned table this returns -/// the table itself, so summaries work for plain tables too. -fn leaf_partitions(parent: pg_sys::Oid) -> Result<Vec<pg_sys::Oid>, SpiError> { - let mut leaves: Vec<pg_sys::Oid> = Vec::new(); - Spi::connect(|client| { - let table = client.select( - &format!( - "SELECT relid::oid FROM pg_partition_tree({}::oid::regclass) WHERE isleaf", - oid_u32(parent) - ), - None, - &[], - )?; - for row in table { - if let Ok(Some(oid)) = row.get::<pg_sys::Oid>(1) { - leaves.push(oid); - } - } - Ok::<(), SpiError>(()) - })?; - Ok(leaves) -} - -/// Trigger name installed on each leaf to mark its summaries stale on data change. -const STALE_TRIGGER_NAME: &str = "table_range_stale_trg"; - -/// Install (idempotently) the staleness triggers on a leaf partition. -/// -/// A row-level trigger is required for INSERT/UPDATE/DELETE because statement-level -/// triggers on a leaf do not fire for tuples routed through the partitioned parent; -/// row-level triggers do. TRUNCATE cannot be row-level, so it gets a statement -/// trigger. Both mark only this leaf's summaries stale (precise, not global). -fn install_stale_trigger(leaf_name: &str) -> Result<(), SpiError> { - Spi::run(&format!( - "DROP TRIGGER IF EXISTS {trg} ON {tbl}; \ - CREATE TRIGGER {trg} AFTER INSERT OR UPDATE OR DELETE ON {tbl} \ - FOR EACH ROW EXECUTE FUNCTION table_range_stale_trigger(); \ - DROP TRIGGER IF EXISTS {trg}_trunc ON {tbl}; \ - CREATE TRIGGER {trg}_trunc AFTER TRUNCATE ON {tbl} \ - FOR EACH STATEMENT EXECUTE FUNCTION table_range_stale_trigger();", - trg = STALE_TRIGGER_NAME, - tbl = leaf_name - )) -} - -/// Returns the number of summary rows written. 
-fn build_summaries(parent: pg_sys::Oid, columns: &[String]) -> Result<i64, SpiError> { - let leaves = leaf_partitions(parent)?; - let mut written = 0i64; - for leaf in &leaves { - written += build_one_leaf(parent, *leaf, columns, true)?; - } - Ok(written) -} - -/// Build summaries for a single leaf relation's named columns and install the -/// correctness safety-net trigger on it. Used by both `table_range_create` (keyed by -/// parent OID) and the index AM's `ambuild` (keyed by index OID). Returns the number -/// of summary rows written. +/// Build summaries for a single leaf relation's named columns. Called by `ambuild` +/// (keyed by the index OID). Returns the number of summary rows written. pub(crate) fn build_one_leaf( index_oid: pg_sys::Oid, leaf: pg_sys::Oid, columns: &[String], - install_trigger: bool, ) -> Result<i64, SpiError> { let leaf_name = match relation_name(leaf)? { Some(n) => n, None => return Ok(0), }; - // Ensure the correctness safety-net trigger exists before (re)building. - if install_trigger { - install_stale_trigger(&leaf_name)?; - } let mut written = 0i64; for col in columns { @@ -439,7 +229,7 @@ pub(crate) fn column_names_for_attnums( #[allow(clippy::too_many_arguments)] fn upsert_summary( - parent: pg_sys::Oid, + index_oid: pg_sys::Oid, leaf: pg_sys::Oid, attnum: i16, kind: &str, @@ -467,7 +257,7 @@ fn upsert_summary( stale = false, \ tuple_version = s.tuple_version + 1", tbl = summary_table(), - p = oid_u32(parent), + p = oid_u32(index_oid), r = oid_u32(leaf), a = attnum, kind = quote_literal(kind), @@ -502,12 +292,3 @@ fn quote_literal(s: &str) -> String { fn quote_ident(s: &str) -> String { format!("\"{}\"", s.replace('"', "\"\"")) } - -fn pg_array_text_literal(items: &[String]) -> String { - let inner = items - .iter() - .map(|s| quote_literal(s)) - .collect::<Vec<_>>() - .join(", "); - format!("ARRAY[{}]::text[]", inner) -} From 09fb46e2db4638752b6cf6b158237b1cef8e29fd Mon Sep 17 00:00:00 2001 From: David W Bitner Date: Mon, 22 Jun 2026 16:32:44 -0500 Subject: [PATCH 
2/2] Correct the performance story; the index overhead was a measurement artifact While investigating the "per-partition index planning overhead," controlled same-session A/B measurements showed my earlier numbers were confounded by relcache warmth and cross-session machine load: - The index-validity effect on planning is ~3 ms on 500 partitions (warm), not the ~85 ms I previously reported. So the invalid-index event trigger I added is not worth its cost (a confusing INVALID display) and is reverted, along with the ineffective get_relation_info_hook. - The real, clean benefit is at EXECUTION: on 100 partitions x 30k rows (3M rows, warm), a selective non-key predicate runs in ~18 ms with pruning vs ~125 ms without (~7x). Pruning adds a small per-plan overhead and can be a slight net cost on tiny partitions; it wins when eliminated partitions are large. - Rewrote bench/planning_benchmark.sql to measure total (plan+exec) time, warm, on realistic partition sizes, and rewrote the README performance section to match. - Gated the test-only cache-load diagnostics so production builds are warning-free. Co-Authored-By: Claude Opus 4.8 --- README.md | 32 +++++++++++---------- bench/planning_benchmark.sql | 56 +++++++++++++++++++----------------- src/prune_hook.rs | 2 ++ 3 files changed, 48 insertions(+), 42 deletions(-) diff --git a/README.md b/README.md index 30b1300..fe1a6b1 100644 --- a/README.md +++ b/README.md @@ -78,21 +78,23 @@ EXPLAIN (COSTS OFF) SELECT * FROM places WHERE geom && ST_MakeEnvelope(0,0,10,10 ## Performance -The win is at **planning time** on wide trees, where the planner would otherwise build -paths for every partition. On a 1000-partition table queried by a non-key column -(`bench/planning_benchmark.sql`, PostgreSQL 18): - -| | Planning time | Result | -|---|---|---| -| pruning off | ~210 ms | 50 rows | -| pruning on | ~100 ms | 50 rows | - -Pruning removes ~110 ms of child-path planning here. 
Note the absolute numbers are higher -than they could be: because summaries are owned by a real index, PostgreSQL loads index -metadata for every partition during planning (a flat overhead, ~85 ms on this bare -1000-partition table — proportionally smaller when partitions already carry indexes). - -Summaries themselves are loaded **once per plan** (not per partition); the +The benefit is at **execution**: a selective predicate on a non-key column scans only the +matching partition instead of every partition. On 100 partitions × 30k rows = 3M rows +(`bench/planning_benchmark.sql`, PostgreSQL 18, warm): + +| | Total query time (plan + exec) | +|---|---| +| pruning off (scans all 100 partitions) | ~125 ms | +| pruning on (scans 1 partition) | ~18 ms | + +Pruning is **not** a free planning-time win: it adds a small per-plan overhead (loading +summaries once, then evaluating each partition — single-digit to low-tens of ms on +hundreds of partitions). It pays off when the partitions it eliminates are large enough +that avoiding their scan outweighs that overhead — so it helps most on **large +partitions with a selective non-key predicate**, and can be a slight net cost on tiny +partitions. Use `table_range.enable_pruning` to measure both ways on your workload. + +Summaries are loaded **once per plan** (not per partition); the `e2e_per_plan_cache_loads_once_regardless_of_partitions` test asserts exactly one catalog load for a 64-partition query. diff --git a/bench/planning_benchmark.sql b/bench/planning_benchmark.sql index fac2d23..a2463e8 100644 --- a/bench/planning_benchmark.sql +++ b/bench/planning_benchmark.sql @@ -1,49 +1,51 @@ --- Planning-time benchmark for table_range pruning. +-- Benchmark for table_range pruning. -- --- Builds a wide partition tree where the queried column is NOT the partition key, so --- native PostgreSQL pruning cannot help, and compares planning time + plan size with --- table_range pruning off vs. on. 
Run with: +-- Measures end-to-end query time (planning + execution, warm) for a selective predicate +-- on a NON-partition-key column, with table_range pruning on vs. off. Native PostgreSQL +-- cannot prune on a non-key column, so without pruning the query scans every partition. -- -- cargo pgrx run pg18 -- \i bench/planning_benchmark.sql -- --- Look at the "Planning Time" line and the number of child plans in each EXPLAIN. +-- Pruning trades a small per-plan overhead (loading summaries + evaluating each +-- partition) for skipping the scan of non-matching partitions, so it wins when the +-- partitions it eliminates are large enough to outweigh that overhead. -\set part_count 1000 +\set part_count 100 +\set rows_per_part 30000 DROP TABLE IF EXISTS bench_events CASCADE; -CREATE TABLE bench_events (region int, val bigint) PARTITION BY LIST (region); +CREATE TABLE bench_events (region int, val bigint, pad text) PARTITION BY LIST (region); --- One partition per region; each holds a disjoint 1000-wide band of `val`. SELECT format( - 'CREATE TABLE bench_events_%s PARTITION OF bench_events FOR VALUES IN (%s);', - g, g -) + 'CREATE TABLE bench_events_%s PARTITION OF bench_events FOR VALUES IN (%s);', g, g) FROM generate_series(1, :part_count) g \gexec +-- region is the partition key; val is a disjoint band per partition (the queried, +-- non-key column). 
INSERT INTO bench_events -SELECT g, (g * 1000) + s -FROM generate_series(1, :part_count) g, - generate_series(0, 49) s; - -ANALYZE bench_events; +SELECT g, g * 1000000 + s, repeat('x', 50) +FROM generate_series(1, :part_count) g, generate_series(0, :rows_per_part - 1) s; +VACUUM ANALYZE bench_events; CREATE INDEX bench_events_tr ON bench_events USING table_range (val); -\echo '==================== pruning OFF ====================' -SET table_range.enable_pruning = off; -EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY ON) -SELECT * FROM bench_events WHERE val BETWEEN 500000 AND 500049; +\timing on -\echo '==================== pruning ON ====================' +-- Warm the relation cache first so the numbers reflect steady state, not first-touch +-- partition-metadata loading (which both modes pay equally). SET table_range.enable_pruning = on; -EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY ON) -SELECT * FROM bench_events WHERE val BETWEEN 500000 AND 500049; +SELECT count(*) FROM bench_events WHERE val = 50000000; + +\echo '==================== pruning ON (warm) ====================' +SELECT count(*) FROM bench_events WHERE val = 50000000; +SELECT count(*) FROM bench_events WHERE val = 50000000; -\echo '==================== correctness check (must match) ====================' SET table_range.enable_pruning = off; -SELECT count(*) AS off_count FROM bench_events WHERE val BETWEEN 500000 AND 500049; -SET table_range.enable_pruning = on; -SELECT count(*) AS on_count FROM bench_events WHERE val BETWEEN 500000 AND 500049; +SELECT count(*) FROM bench_events WHERE val = 50000000; + +\echo '==================== pruning OFF (warm) ====================' +SELECT count(*) FROM bench_events WHERE val = 50000000; +SELECT count(*) FROM bench_events WHERE val = 50000000; DROP TABLE bench_events CASCADE; diff --git a/src/prune_hook.rs b/src/prune_hook.rs index 715051d..5da946b 100644 --- a/src/prune_hook.rs +++ b/src/prune_hook.rs @@ -11,10 +11,12 @@ use 
crate::{TABLE_RANGE_ENABLE_PRUNING, TABLE_RANGE_LOG_PRUNING_DEBUG}; /// One load per top-level plan (regardless of partition count) proves the per-plan cache. static CACHE_LOADS: AtomicU64 = AtomicU64::new(0); +#[cfg(any(test, feature = "pg_test"))] pub fn cache_load_count() -> u64 { CACHE_LOADS.load(Ordering::Relaxed) } +#[cfg(any(test, feature = "pg_test"))] pub fn reset_cache_load_count() { CACHE_LOADS.store(0, Ordering::Relaxed); }