Subtree counts by whilo · Pull Request #4 · replikativ/persistent-sorted-set

whilo · 2026-02-05T12:06:54Z

No description provided.

Implements Phase 1 of subtree statistics: - Add ISubtreeCount interface for nodes tracking subtree element counts - Update Leaf to implement ISubtreeCount (subtreeCount == len) - Update Branch with _subtreeCount field, maintained during add/remove - Add countSlice(from, to) method for O(log n) range counting - Add count-slice to Clojure API - Add comprehensive tests for count-slice - Include design document in doc/SUBTREE_STATS_DESIGN.md This enables O(1) count() and O(log n) range counting for query planning in Datahike and range aggregations in Sagitta columnar indices.

- Add $subtree-count to INode protocol - Add subtree-count field to Branch deftype - Maintain subtree counts in add/remove/merge operations - Implement $count-slice with O(log n) performance - Add count-slice to public CLJS API - Fix reader conditional for js/window in stress.cljc

- Add IStats interface for user-defined aggregations over tree nodes - Implement NumericStats/NumericStatsOps as example for count/sum/min/max - Add stats field to Leaf and Branch nodes with lazy computation - Update from-sorted-array to compute stats during tree construction - Fix lazy set count by propagating unknown counts (-1) instead of eager loading - Add cross-platform generative tests (CLJ/CLJS) for: - Model-based testing against Clojure's sorted-set - Lazy set count correctness after operations - Transient/persistent roundtrip - Tree rebalancing with small branching factors - count-slice correctness - Structural sharing with derived sets - Slice/rslice operations

CLJS stats implementation: - Add IStats protocol in impl/stats.cljs (identity-stats, extract, merge-stats, remove-stats) - Add _stats field to Leaf and Branch types - Add $stats and $compute-stats methods to INode protocol - Update from-sorted-array to compute stats during tree construction - Add stats API function to main namespace - Update storage to store/restore stats Lazy set fixes: - Fix $conjoin/$disjoin to compute count before inc/dec when count is unknown (-1) - Fix Branch.merge/merge-split to handle null children and addresses for lazy branches - Add test.check dependency to :node-tests alias for CLJS generative tests All 37 CLJS tests pass (1276 assertions) All 44 CLJ tests pass (52127 assertions)

When modifying lazy sets where count is unknown (-1), keep it as -1 instead of computing O(n) count on every modification. The count will be computed lazily when actually needed via $count. This matches the Java implementation behavior where: newCount = _subtreeCount >= 0 ? _subtreeCount + 1 : -1

- Fix stats initialization when _stats is nil in CLJS Leaf and Branch - Fix stats computation for split nodes during add operations - Compute stats from children when creating new Branch roots - Add README documentation for count-slice (O(log n) range counting) - Add README documentation for aggregate statistics with IStats protocol - Document monoid requirements and non-invertible stats handling

- Moved doc/SUBTREE_STATS_DESIGN.md to .internal/ - Keep internal design documents out of the main repository

- Fix unmatched parentheses in branch.cljs, btset.cljs from merge conflicts - Update deps.edn to run all tests (not just core, small, replace-and-lookup) Tests passing: - CLJ: 59 tests, 53,931 assertions (includes 22 generative tests) - CLJS node-tests: 39 tests, 1,435 assertions - CLJS stress: 5 tests, 13,401 assertions

Adds zero-cost extension point for post-processing leaf entries after add/remove operations. Enables applications like Stratum to: - Merge small chunks in same leaf (compaction) - Split oversized chunks (prevent overflow) - Custom rebalancing based on application-specific metrics Key features: - Zero overhead when processor is null (default) - shouldProcess() pre-check avoids allocation if no work needed - Processor can split/merge entries, PSS tree handles rebalancing - Works with both add and remove operations - Maintains all PSS invariants (sort order, branching factor, etc.) Based on subtree-counts branch for integration with Stratum columnar index.

- Update map->settings to accept :leaf-processor option - Update settings->map to include leaf-processor field - Enables Stratum to pass processor through sorted-set* API

IStats.weight() extracts element count from stats, enabling getNth to navigate by accumulated data counts rather than entry counts. Used by Stratum for O(log chunks) point access without absolute positions.

Implements weighted rank-based navigation for O(log n) access by accumulated element counts, matching the Java implementation added in commit 09f29ed. Changes: - Add weight protocol method to IStats (impl/stats.cljs) - Implement weight in numeric-stats-ops returning :cnt - Add $get-nth to btset.cljs with async+sync support - Add public get-nth wrapper in persistent_sorted_set.cljs Usage: (get-nth set n) returns [entry local-offset] or nil. Requires stats with weight() configured on the set. Tests: CLJ 59 tests/53,110 assertions, CLJS 39 tests/1,435 assertions ✓

Copilot

Pull request overview

This PR introduces subtree-count tracking and optional aggregate statistics to persistent-sorted-set, enabling efficient range counting (count-slice) and range/whole-set statistics (stats, stats-slice) across CLJ and CLJS, with accompanying tests and documentation.

Changes:

Add subtree count support and count-slice API (CLJ/CLJS + Java core).
Add pluggable stats aggregation (IStats, NumericStats*) and new stats / stats-slice APIs.
Expand/adjust test suite (new stats tests, large generative rewrite) and update docs/README.

Reviewed changes

Copilot reviewed 23 out of 25 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
`test-clojure/me/tonsky/persistent_sorted_set/test/stress.cljc`	CLJS-only debug globals guarded with reader conditional.
`test-clojure/me/tonsky/persistent_sorted_set/test/storage/util.cljs`	Persist/restore `:subtree-count` and `:stats` in CLJS test storage.
`test-clojure/me/tonsky/persistent_sorted_set/test/stats.clj`	New JVM stats test coverage for `NumericStats`.
`test-clojure/me/tonsky/persistent_sorted_set/test/generative.cljc`	Major rewrite/expansion of property tests (count-slice, lazy ops, stats).
`test-clojure/me/tonsky/persistent_sorted_set/test/core.cljc`	Adds deterministic unit tests for `count-slice`.
`src-java/me/tonsky/persistent_sorted_set/Settings.java`	Adds `IStats` to settings and threads it through editable transitions.
`src-java/me/tonsky/persistent_sorted_set/PersistentSortedSet.java`	Adds `countSlice` and root stats/subtree-count initialization logic.
`src-java/me/tonsky/persistent_sorted_set/NumericStatsOps.java`	New numeric `IStats` implementation for JVM.
`src-java/me/tonsky/persistent_sorted_set/NumericStats.java`	New immutable numeric stats value type for JVM.
`src-java/me/tonsky/persistent_sorted_set/Leaf.java`	Implements `ISubtreeCount` + incremental stats updates and lazy compute.
`src-java/me/tonsky/persistent_sorted_set/ISubtreeCount.java`	New interface for subtree element counts.
`src-java/me/tonsky/persistent_sorted_set/IStats.java`	New interface for monoidal stats + removal support.
`src-java/me/tonsky/persistent_sorted_set/Branch.java`	Implements `ISubtreeCount`, lazy subtree-count compute, stats compute/updates.
`src-java/me/tonsky/persistent_sorted_set/ANode.java`	Adds `_stats` storage and `computeStats` abstract method.
`src-clojure/me/tonsky/persistent_sorted_set/leaf.cljs`	CLJS leaf now tracks subtree-count + optional stats with incremental updates.
`src-clojure/me/tonsky/persistent_sorted_set/impl/stats.cljs`	New CLJS `IStats` protocol.
`src-clojure/me/tonsky/persistent_sorted_set/impl/numeric_stats.cljs`	New CLJS numeric stats implementation.
`src-clojure/me/tonsky/persistent_sorted_set/impl/node.cljs`	Extends CLJS node protocol to expose subtree-count/stats/compute-stats.
`src-clojure/me/tonsky/persistent_sorted_set/btset.cljs`	Adds CLJS `count-slice`, `stats`, `stats-slice` plumbing and root aggregation.
`src-clojure/me/tonsky/persistent_sorted_set/branch.cljs`	CLJS branch now carries subtree-count + stats and updates them on edits.
`src-clojure/me/tonsky/persistent_sorted_set.cljs`	Public CLJS API additions: `count-slice`, `stats`, `stats-slice`.
`src-clojure/me/tonsky/persistent_sorted_set.clj`	Public CLJ API additions: `count-slice`, `stats`, `stats-slice`, settings carry `:stats`.
`deps.edn`	Test runner patterns broadened to include all test namespaces.
`README.md`	Documentation for `count-slice` + stats API and examples.
`.gitignore`	Formatting tweak (no functional impact).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src-java/me/tonsky/persistent_sorted_set/Branch.java

src-clojure/me/tonsky/persistent_sorted_set/btset.cljs

src-java/me/tonsky/persistent_sorted_set/Branch.java

src-clojure/me/tonsky/persistent_sorted_set/btset.cljs

src-clojure/me/tonsky/persistent_sorted_set.clj

test-clojure/org/replikativ/persistent_sorted_set/test/generative.cljc

src-clojure/me/tonsky/persistent_sorted_set/leaf.cljs

src-java/org/replikativ/persistent_sorted_set/PersistentSortedSet.java

src-java/me/tonsky/persistent_sorted_set/PersistentSortedSet.java

Added comprehensive documentation for rank-based access and statistical use cases enabled by getNth. Changes: - README.md: Added "Rank-Based Access (getNth)" section with examples - Basic percentile queries (median, quartiles, deciles) - ClojureScript async/sync support - Use cases: outlier detection, quantile regression, weighted sampling - doc/statistical-queries.md: Complete guide with runnable examples - Percentile computation and five-number summaries - Robust statistics (median, IQR) vs traditional (mean, stddev) - Outlier detection using Tukey's fences - Distribution analysis (empirical CDF, inverse CDF) - Advanced techniques (quantile binning, rolling median) - Performance characteristics and complexity analysis Position: Added after "Efficient Range Counting" and before "Aggregate Statistics" in README, following the logical progression of O(log n) query operations.

- Branch.$replace (CLJS): add missing subtree-count and _stats args - Leaf.$replace (CLJS): add missing _stats arg - auto_removal.cljs, bench/cps/util.cljs: fix Leaf/Branch arity - NumericStatsOps.weight(): override default (was returning 1 instead of stats.count, making getNth only ever return the first element) - Add regression tests for replace with stats, count-slice, and get-nth

…sing await - Eagerly maintain stats during replace (Java & CLJS Branch/Leaf) using remove-stats + merge-stats(extract) pattern, matching add/remove behavior - Fix sorted-set* in CLJ to accept :comparator key (matching CLJS API) - Fix missing await on $keys-for in btset.cljs async seek paths - Preserve subtreeCount in Java Branch.replace persistent path (8-arg ctor)

This commit addresses 6 bugs identified in code review: 1. Subtree count sentinel propagation (Java + CLJS) - Don't sum negative subtree counts; propagate -1 sentinel when any child is unknown - Prevents invalid states like summing -1 + 5 = 4 2. Branch.computeStatsFromChildren skips References (Java) - Now properly dereferences SoftReference/WeakReference before accessing child stats 3. Root stats aggregation with nil children (CLJS) - Only compute root stats when ALL children have stats (not just some) 4. Stats reads _root directly without dereferencing (CLJ) - Use .root method instead of .-_root field to properly handle References 5. Leaf $add incremental stats with nil _stats (CLJS) - Only do incremental stats update when _stats exists (not nil) 6. Stats maintenance during replace operations (Java + CLJS) - Compute stats from final state (after replacement) rather than incremental updates - Maintains O(1) complexity: O(leaf-size) for leaves, O(branching-factor) for branches All tests pass: - CLJS: 30 tests, 1061 assertions - Java: 4 replace tests, 133 assertions

…tion Implements Option B (try/force naming) to handle stats computation differently for operations vs user queries: **tryComputeStats()**: Used during tree operations (add/remove/replace) - Returns null if any child has unavailable stats (postpones computation) - Avoids sudden O(n) traversals on old trees without stats - Stats propagate gradually bottom-up as operations touch nodes **forceComputeStats()**: Used for explicit user queries (stats, getNth, countSlice) - Recursively descends to compute stats if children have null stats - Acceptable O(n) cost when user explicitly requests stats on old tree - Only computes when needed (user invokes the query) This ensures: - New trees: Always have stats bottom-up, operations maintain in O(1) - Old trees: No sudden O(n) during operations, stats filled in gradually - User queries: Work correctly on both old and new trees Changes: - ANode: Added abstract tryComputeStats/forceComputeStats, deprecated computeStats - Leaf: Both methods do same thing (no children to check) - Branch: tryComputeStats postpones if any child null, force descends recursively - Operations: Use tryComputeStatsFromChildren (postpone if unavailable) - Queries: Use forceComputeStats via deprecated computeStats (descend if needed) All tests pass: - Java: 4 replace tests, 133 assertions - CLJS: test-stats shows correct counts after replace

…methods Implements the same lazy vs eager stats computation pattern in CLJS as in Java: **try-compute-stats**: Used during operations (add/remove/replace) - Returns nil if any child has unavailable stats (postpones computation) - Prevents sudden O(n) traversals on old trees without stats - Stats propagate gradually as operations touch nodes **force-compute-stats**: Used for user queries (stats, getNth, count-slice) - Recursively descends to compute stats if children have null stats - Acceptable O(n) cost when user explicitly requests stats Changes: - Protocol: Added try-compute-stats and force-compute-stats, removed $compute-stats - Leaf: Both methods delegate to same implementation (no children to check) - Branch: try postpones if any child null, force descends recursively - Operations: Use try-compute-stats (add/remove/replace) - User queries: Use force-compute-stats (stats, getNth, count-slice) - Removed deprecated computeStats from Java (no backward compat needed) All tests pass: - Java: 4 replace tests, 133 assertions (no reflection warnings) - CLJS: 30 tests, 1061 assertions

The overloaded 3-arity that dispatched on fn? to accept either not-found or comparator was inconsistent with Java/CLJ where 3-arity is always (lookup set key cmp). Since the set can't store nil, nil is an unambiguous not-found sentinel.

The stats framework is a monoid-valued measure in the measure theory sense. Renaming IStats/stats to IMeasure/measure better reflects the general-purpose nature of the interface — it's not limited to statistics, any commutative monoid works. Breaking change: settings key :stats → :measure, public API functions stats → measure and stats-slice → measure-slice.

Remove test_stats.cljs (println-based debug script, not a proper test) and the test-stats shadow-cljs build target that compiled it.

- Fix silent data corruption: branch $add used identity fallback when _measure was nil, producing wrong aggregates. Now consistent with leaf pattern (skip when nil). - Fix eager tree traversal: tryComputeMeasure loaded children from storage instead of checking in-memory only. Now postpones when children not loaded. - Fix laziness violations: join paths forced O(n) computeSubtreeCount when counts unknown. Now propagates -1 sentinel. - Fix settings lost on restore: btset.cljs select-keys now includes :measure so restored sets keep their measure ops. - Add _measure != null guard to all EARLY_EXIT paths and remove/replace paths in both Java and CLJS for consistency. - Guard Branch.address() against null _children to prevent NPE. - Remove dead sumChildCounts method, eliminate throwaway Branch in replace. - Add comprehensive cross-platform test suite (measure_test.cljc).

- Guard all measure computation paths with _measure != null check to preserve laziness (Leaf.java borrow-from-right, processSingleLeaf, processLeafNodes; branch.cljs $add split, $replace; leaf.cljs $add split) - Cache forceComputeMeasure result in Branch.java - Remove count() caching restriction in Branch.java (safe for immutable nodes) - Add nil safety to btset.cljs $get-nth for force-compute-measure result - Fix measure-slice using ._root instead of .root (NPE on lazy sets) - Add 6 generative tests: replace+measure, get-nth, rebalancing+measure, transient+measure, stats-after-roundtrip, measure-slice-after-roundtrip - Add storage-with-settings helper for measure-aware roundtrip tests

- Leaf.java: forceComputeMeasure now caches _measure (was recomputing every call) - Branch.java: forceComputeMeasure returns null on missing child measure instead of silently undercounting - btset.cljs: -empty clears storage/address to nil (was preserving stale address) - btset.cljs: $$_next-path uses .-keys instead of .-children (prevents TypeError on storage-backed branches where children may be nil) - branch.cljs/btset.cljs: measure reduce during split returns nil when any child measure is unavailable instead of silently skipping - Formatter whitespace fixes across CLJS files

…uite New invariants.cljc validates 8 tree invariants (balance, node sizes, key ordering, separator keys, subtree counts, root shape, level consistency, sibling ordering) after every operation. Integrated validate-tree into 7 existing generative tests for broader coverage.

Branch splits and borrows were unconditionally setting subtreeCount=-1, poisoning the entire tree lineage permanently. Now compute counts from in-memory children instead. Also cache computed counts in child branches during computeSubtreeCount to avoid redundant traversals. New diagnostics.cljc library namespace provides validate, validate-full, validate-counts-known, validate-measures-known, validate-content, and tree-stats for structural integrity checking (like pg amcheck / SQLite PRAGMA integrity_check). Non-deterioration generative tests verify fresh trees never degrade to lazy recomputation.

This library has diverged significantly from the upstream and is now maintained as part of the replikativ ecosystem. Rename Java packages, Clojure/CLJS namespaces, directory structures, build configs, scripts, and documentation to reflect the new org.replikativ ownership.

- compact: rebuild tree with optimal fill factors from current elements, preserving comparator/settings/metadata (CLJ + CLJS) - Fill factor histogram: extend tree-stats with percentile distribution (p25/p50/p75/p90) for all nodes, leaves, and branches separately - validate-navigation: root-descend verification inspired by PostgreSQL amcheck — re-looks up every element from root to catch transitive inconsistencies across cousin leaf pages - Add generative tests for compact and navigation validation

Remove unnecessary $ prefixes from CLJS function names that don't clash with cljs.core, making the convention self-documenting: $ now signals "this name would shadow a cljs.core var". Renamed: add, lookup, subtree-count, measure, store, walk-addresses, child, conjoin, disjoin, slice, rslice, seek, count-slice, get-nth, measure-slice, equivalent?, equivalent-sequential? Kept $: $count, $contains?, $replace, $remove, $reduce, $transduce, $into, $seq, $rseq

Since PSS cannot store nil, nil is an unambiguous not-found sentinel. Remove the not-found parameter from btset/lookup and accept cmp as the third positional arg instead. ILookup.-lookup handles not-found at the protocol level. Also add test failure policy to CLAUDE.md: correctness errors in this data structure library must always be investigated, never skipped.

Branch.add() and Branch.replace() shared the _keys array between the original persistent branch and newly created copies when maxKey didn't change. When a transient later modified _keys[i] in place via the editable() path, it corrupted the original persistent branch's separator keys, causing silent element loss. Also removed dead code path where _children/_addresses arrays were shared (the node == child() condition was unreachable on non-editable paths). Bug also reported upstream: tonsky#19

validate-full now re-looks up every key from the root after structural checks, catching separator key corruption and navigation bugs that structural checks alone miss. This runs in all generative tests via validate-tree.

CLJS branch.cljs $replace and util.cljs check-n-splice had the same copy-on-write violation as Java: sharing keys/addrs arrays between persistent copies and transient-editable branches. Fixed by always cloning arrays when creating persistent branch copies. Added cross-platform generative test transient-does-not-corrupt-persistent that exercises the exact failure pattern: transient ops on base-set first, then persistent ops on the same base-set, verifying the persistent result matches a reference sorted-set.

…heck, update docs - btset.cljs: Add :early-exit handling in $replace (latent bug if transients enabled) - Branch.java: Remove always-true maxKey == check in editable add path (line 366) - doc/btree-operations.md: Comprehensive rewrite with ILeafProcessor section, platform-specific remove conventions, borrow preference differences, test coverage

- make-array: fix return tag from "[[Ljava.lang.Object;" (Object[][]) to "[Ljava.lang.Object;" (Object[]) — the function returns a 1D array - array macro: same tag fix + use gensym with proper metadata instead of auto-gensym with inline metadata, use object-array - amap: use object-array instead of clojure.core/make-array (avoids reflection)

Copilot

Pull request overview

Copilot reviewed 70 out of 72 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

test-clojure/org/replikativ/persistent_sorted_set/test/storage/util.cljs:93

Same issue here: the Leaf constructor call needs 4 arguments to match the Leaf.java constructor signature. The constructor should be (Leaf. (count keys) keys measure settings) to match the 4-arg constructor in Leaf.java, not (Leaf. keys settings measure).
test-clojure/org/replikativ/persistent_sorted_set/test/storage/util.cljs:51
The Leaf constructor is being called with 3 arguments (keys, settings, measure) but the Leaf class has two constructors: one with 2 args (keys, settings) and one with 3 args (len, keys, settings). The call should likely be (Leaf. keys settings measure) but needs to match the 4-argument constructor (len, keys, measure, settings) that exists in the Leaf.java file. This will cause a runtime error when restoring from storage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

whilo added 13 commits February 5, 2026 03:24

Move design doc to .internal directory

f95d9f8

- Moved doc/SUBTREE_STATS_DESIGN.md to .internal/ - Keep internal design documents out of the main repository

Remove subtree doc.

3c4234e

Fix format.

31b3c76

Add Clojure wrapper support for ILeafProcessor

c5a656d

- Update map->settings to accept :leaf-processor option - Update settings->map to include leaf-processor field - Enables Stratum to pass processor through sorted-set* API

Add weighted getNth for O(log n) rank-based navigation

09f29ed

IStats.weight() extracts element count from stats, enabling getNth to navigate by accumulated data counts rather than entry counts. Used by Stratum for O(log chunks) point access without absolute positions.

whilo requested a review from Copilot February 9, 2026 20:58

Copilot started reviewing on behalf of whilo February 9, 2026 20:58 View session

Copilot AI reviewed Feb 9, 2026

View reviewed changes

whilo added 13 commits February 9, 2026 14:40

Fix format.

1ea60b0

Remove ad-hoc debug script and its build target

caa7e61

Remove test_stats.cljs (println-based debug script, not a proper test) and the test-stats shadow-cljs build target that compiled it.

whilo added 12 commits February 13, 2026 04:02

Document compact memory behavior for storage-backed sets

b97a197

Add element-wise navigation check to validate-full

c456b86

validate-full now re-looks up every key from the root after structural checks, catching separator key corruption and navigation bugs that structural checks alone miss. This runs in all generative tests via validate-tree.

whilo requested a review from Copilot February 14, 2026 09:32

Copilot started reviewing on behalf of whilo February 14, 2026 09:32 View session

Copilot AI reviewed Feb 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subtree counts#4

Subtree counts#4
whilo wants to merge 39 commits intomainfrom
subtree-counts

whilo commented Feb 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

whilo commented Feb 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant