Skip to content

Subtree counts#4

Open
whilo wants to merge 39 commits intomainfrom
subtree-counts
Open

Subtree counts#4
whilo wants to merge 39 commits intomainfrom
subtree-counts

Conversation

@whilo
Copy link
Member

@whilo whilo commented Feb 5, 2026

No description provided.

whilo added 13 commits February 5, 2026 03:24
Implements Phase 1 of subtree statistics:

- Add ISubtreeCount interface for nodes tracking subtree element counts
- Update Leaf to implement ISubtreeCount (subtreeCount == len)
- Update Branch with _subtreeCount field, maintained during add/remove
- Add countSlice(from, to) method for O(log n) range counting
- Add count-slice to Clojure API
- Add comprehensive tests for count-slice
- Include design document in doc/SUBTREE_STATS_DESIGN.md

This enables O(1) count() and O(log n) range counting for query planning
in Datahike and range aggregations in Sagitta columnar indices.
- Add $subtree-count to INode protocol
- Add subtree-count field to Branch deftype
- Maintain subtree counts in add/remove/merge operations
- Implement $count-slice with O(log n) performance
- Add count-slice to public CLJS API
- Fix reader conditional for js/window in stress.cljc
- Add IStats interface for user-defined aggregations over tree nodes
- Implement NumericStats/NumericStatsOps as example for count/sum/min/max
- Add stats field to Leaf and Branch nodes with lazy computation
- Update from-sorted-array to compute stats during tree construction
- Fix lazy set count by propagating unknown counts (-1) instead of eager loading
- Add cross-platform generative tests (CLJ/CLJS) for:
  - Model-based testing against Clojure's sorted-set
  - Lazy set count correctness after operations
  - Transient/persistent roundtrip
  - Tree rebalancing with small branching factors
  - count-slice correctness
  - Structural sharing with derived sets
  - Slice/rslice operations
CLJS stats implementation:
- Add IStats protocol in impl/stats.cljs (identity-stats, extract, merge-stats, remove-stats)
- Add _stats field to Leaf and Branch types
- Add $stats and $compute-stats methods to INode protocol
- Update from-sorted-array to compute stats during tree construction
- Add stats API function to main namespace
- Update storage to store/restore stats

Lazy set fixes:
- Fix $conjoin/$disjoin to compute count before inc/dec when count is unknown (-1)
- Fix Branch.merge/merge-split to handle null children and addresses for lazy branches
- Add test.check dependency to :node-tests alias for CLJS generative tests

All 37 CLJS tests pass (1276 assertions)
All 44 CLJ tests pass (52127 assertions)
When modifying lazy sets where count is unknown (-1), keep it as -1
instead of computing O(n) count on every modification. The count
will be computed lazily when actually needed via $count.

This matches the Java implementation behavior where:
  newCount = _subtreeCount >= 0 ? _subtreeCount + 1 : -1
- Fix stats initialization when _stats is nil in CLJS Leaf and Branch
- Fix stats computation for split nodes during add operations
- Compute stats from children when creating new Branch roots
- Add README documentation for count-slice (O(log n) range counting)
- Add README documentation for aggregate statistics with IStats protocol
- Document monoid requirements and non-invertible stats handling
- Moved doc/SUBTREE_STATS_DESIGN.md to .internal/
- Keep internal design documents out of the main repository
- Fix unmatched parentheses in branch.cljs, btset.cljs from merge conflicts
- Update deps.edn to run all tests (not just core, small, replace-and-lookup)

Tests passing:
- CLJ: 59 tests, 53,931 assertions (includes 22 generative tests)
- CLJS node-tests: 39 tests, 1,435 assertions
- CLJS stress: 5 tests, 13,401 assertions
Adds zero-cost extension point for post-processing leaf entries after
add/remove operations. Enables applications like Stratum to:
- Merge small chunks in same leaf (compaction)
- Split oversized chunks (prevent overflow)
- Custom rebalancing based on application-specific metrics

Key features:
- Zero overhead when processor is null (default)
- shouldProcess() pre-check avoids allocation if no work needed
- Processor can split/merge entries, PSS tree handles rebalancing
- Works with both add and remove operations
- Maintains all PSS invariants (sort order, branching factor, etc.)

Based on subtree-counts branch for integration with Stratum columnar index.
- Update map->settings to accept :leaf-processor option
- Update settings->map to include leaf-processor field
- Enables Stratum to pass processor through sorted-set* API
IStats.weight() extracts element count from stats, enabling getNth
to navigate by accumulated data counts rather than entry counts.
Used by Stratum for O(log chunks) point access without absolute positions.
Implements weighted rank-based navigation for O(log n) access by
accumulated element counts, matching the Java implementation added
in commit 09f29ed.

Changes:
- Add weight protocol method to IStats (impl/stats.cljs)
- Implement weight in numeric-stats-ops returning :cnt
- Add $get-nth to btset.cljs with async+sync support
- Add public get-nth wrapper in persistent_sorted_set.cljs

Usage: (get-nth set n) returns [entry local-offset] or nil.
Requires stats with weight() configured on the set.

Tests: CLJ 59 tests/53,110 assertions, CLJS 39 tests/1,435 assertions ✓
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces subtree-count tracking and optional aggregate statistics to persistent-sorted-set, enabling efficient range counting (count-slice) and range/whole-set statistics (stats, stats-slice) across CLJ and CLJS, with accompanying tests and documentation.

Changes:

  • Add subtree count support and count-slice API (CLJ/CLJS + Java core).
  • Add pluggable stats aggregation (IStats, NumericStats*) and new stats / stats-slice APIs.
  • Expand/adjust test suite (new stats tests, large generative rewrite) and update docs/README.

Reviewed changes

Copilot reviewed 23 out of 25 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
test-clojure/me/tonsky/persistent_sorted_set/test/stress.cljc CLJS-only debug globals guarded with reader conditional.
test-clojure/me/tonsky/persistent_sorted_set/test/storage/util.cljs Persist/restore :subtree-count and :stats in CLJS test storage.
test-clojure/me/tonsky/persistent_sorted_set/test/stats.clj New JVM stats test coverage for NumericStats.
test-clojure/me/tonsky/persistent_sorted_set/test/generative.cljc Major rewrite/expansion of property tests (count-slice, lazy ops, stats).
test-clojure/me/tonsky/persistent_sorted_set/test/core.cljc Adds deterministic unit tests for count-slice.
src-java/me/tonsky/persistent_sorted_set/Settings.java Adds IStats to settings and threads it through editable transitions.
src-java/me/tonsky/persistent_sorted_set/PersistentSortedSet.java Adds countSlice and root stats/subtree-count initialization logic.
src-java/me/tonsky/persistent_sorted_set/NumericStatsOps.java New numeric IStats implementation for JVM.
src-java/me/tonsky/persistent_sorted_set/NumericStats.java New immutable numeric stats value type for JVM.
src-java/me/tonsky/persistent_sorted_set/Leaf.java Implements ISubtreeCount + incremental stats updates and lazy compute.
src-java/me/tonsky/persistent_sorted_set/ISubtreeCount.java New interface for subtree element counts.
src-java/me/tonsky/persistent_sorted_set/IStats.java New interface for monoidal stats + removal support.
src-java/me/tonsky/persistent_sorted_set/Branch.java Implements ISubtreeCount, lazy subtree-count compute, stats compute/updates.
src-java/me/tonsky/persistent_sorted_set/ANode.java Adds _stats storage and computeStats abstract method.
src-clojure/me/tonsky/persistent_sorted_set/leaf.cljs CLJS leaf now tracks subtree-count + optional stats with incremental updates.
src-clojure/me/tonsky/persistent_sorted_set/impl/stats.cljs New CLJS IStats protocol.
src-clojure/me/tonsky/persistent_sorted_set/impl/numeric_stats.cljs New CLJS numeric stats implementation.
src-clojure/me/tonsky/persistent_sorted_set/impl/node.cljs Extends CLJS node protocol to expose subtree-count/stats/compute-stats.
src-clojure/me/tonsky/persistent_sorted_set/btset.cljs Adds CLJS count-slice, stats, stats-slice plumbing and root aggregation.
src-clojure/me/tonsky/persistent_sorted_set/branch.cljs CLJS branch now carries subtree-count + stats and updates them on edits.
src-clojure/me/tonsky/persistent_sorted_set.cljs Public CLJS API additions: count-slice, stats, stats-slice.
src-clojure/me/tonsky/persistent_sorted_set.clj Public CLJ API additions: count-slice, stats, stats-slice, settings carry :stats.
deps.edn Test runner patterns broadened to include all test namespaces.
README.md Documentation for count-slice + stats API and examples.
.gitignore Formatting tweak (no functional impact).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

whilo added 13 commits February 9, 2026 14:40
Added comprehensive documentation for rank-based access and statistical
use cases enabled by getNth.

Changes:
- README.md: Added "Rank-Based Access (getNth)" section with examples
  - Basic percentile queries (median, quartiles, deciles)
  - ClojureScript async/sync support
  - Use cases: outlier detection, quantile regression, weighted sampling
- doc/statistical-queries.md: Complete guide with runnable examples
  - Percentile computation and five-number summaries
  - Robust statistics (median, IQR) vs traditional (mean, stddev)
  - Outlier detection using Tukey's fences
  - Distribution analysis (empirical CDF, inverse CDF)
  - Advanced techniques (quantile binning, rolling median)
  - Performance characteristics and complexity analysis

Position: Added after "Efficient Range Counting" and before "Aggregate
Statistics" in README, following the logical progression of O(log n)
query operations.
- Branch.$replace (CLJS): add missing subtree-count and _stats args
- Leaf.$replace (CLJS): add missing _stats arg
- auto_removal.cljs, bench/cps/util.cljs: fix Leaf/Branch arity
- NumericStatsOps.weight(): override default (was returning 1 instead
  of stats.count, making getNth only ever return the first element)
- Add regression tests for replace with stats, count-slice, and get-nth
…sing await

- Eagerly maintain stats during replace (Java & CLJS Branch/Leaf) using
  remove-stats + merge-stats(extract) pattern, matching add/remove behavior
- Fix sorted-set* in CLJ to accept :comparator key (matching CLJS API)
- Fix missing await on $keys-for in btset.cljs async seek paths
- Preserve subtreeCount in Java Branch.replace persistent path (8-arg ctor)
This commit addresses 6 bugs identified in code review:

1. Subtree count sentinel propagation (Java + CLJS)
   - Don't sum negative subtree counts; propagate -1 sentinel when any child is unknown
   - Prevents invalid states like summing -1 + 5 = 4

2. Branch.computeStatsFromChildren skips References (Java)
   - Now properly dereferences SoftReference/WeakReference before accessing child stats

3. Root stats aggregation with nil children (CLJS)
   - Only compute root stats when ALL children have stats (not just some)

4. Stats reads _root directly without dereferencing (CLJ)
   - Use .root method instead of .-_root field to properly handle References

5. Leaf $add incremental stats with nil _stats (CLJS)
   - Only do incremental stats update when _stats exists (not nil)

6. Stats maintenance during replace operations (Java + CLJS)
   - Compute stats from final state (after replacement) rather than incremental updates
   - Maintains O(1) complexity: O(leaf-size) for leaves, O(branching-factor) for branches

All tests pass:
- CLJS: 30 tests, 1061 assertions
- Java: 4 replace tests, 133 assertions
…tion

Implements Option B (try/force naming) to handle stats computation differently
for operations vs user queries:

**tryComputeStats()**: Used during tree operations (add/remove/replace)
- Returns null if any child has unavailable stats (postpones computation)
- Avoids sudden O(n) traversals on old trees without stats
- Stats propagate gradually bottom-up as operations touch nodes

**forceComputeStats()**: Used for explicit user queries (stats, getNth, countSlice)
- Recursively descends to compute stats if children have null stats
- Acceptable O(n) cost when user explicitly requests stats on old tree
- Only computes when needed (user invokes the query)

This ensures:
- New trees: Always have stats bottom-up, operations maintain in O(1)
- Old trees: No sudden O(n) during operations, stats filled in gradually
- User queries: Work correctly on both old and new trees

Changes:
- ANode: Added abstract tryComputeStats/forceComputeStats, deprecated computeStats
- Leaf: Both methods do same thing (no children to check)
- Branch: tryComputeStats postpones if any child null, force descends recursively
- Operations: Use tryComputeStatsFromChildren (postpone if unavailable)
- Queries: Use forceComputeStats via deprecated computeStats (descend if needed)

All tests pass:
- Java: 4 replace tests, 133 assertions
- CLJS: test-stats shows correct counts after replace
…methods

Implements the same lazy vs eager stats computation pattern in CLJS as in Java:

**try-compute-stats**: Used during operations (add/remove/replace)
- Returns nil if any child has unavailable stats (postpones computation)
- Prevents sudden O(n) traversals on old trees without stats
- Stats propagate gradually as operations touch nodes

**force-compute-stats**: Used for user queries (stats, getNth, count-slice)
- Recursively descends to compute stats if children have null stats
- Acceptable O(n) cost when user explicitly requests stats

Changes:
- Protocol: Added try-compute-stats and force-compute-stats, removed $compute-stats
- Leaf: Both methods delegate to same implementation (no children to check)
- Branch: try postpones if any child null, force descends recursively
- Operations: Use try-compute-stats (add/remove/replace)
- User queries: Use force-compute-stats (stats, getNth, count-slice)
- Removed deprecated computeStats from Java (no backward compat needed)

All tests pass:
- Java: 4 replace tests, 133 assertions (no reflection warnings)
- CLJS: 30 tests, 1061 assertions
The overloaded 3-arity that dispatched on fn? to accept either
not-found or comparator was inconsistent with Java/CLJ where
3-arity is always (lookup set key cmp). Since the set can't store
nil, nil is an unambiguous not-found sentinel.
The stats framework is a monoid-valued measure in the measure theory
sense. Renaming IStats/stats to IMeasure/measure better reflects the
general-purpose nature of the interface — it's not limited to
statistics, any commutative monoid works.

Breaking change: settings key :stats → :measure, public API functions
stats → measure and stats-slice → measure-slice.
Remove test_stats.cljs (println-based debug script, not a proper test)
and the test-stats shadow-cljs build target that compiled it.
- Fix silent data corruption: branch $add used identity fallback when
  _measure was nil, producing wrong aggregates. Now consistent with
  leaf pattern (skip when nil).
- Fix eager tree traversal: tryComputeMeasure loaded children from
  storage instead of checking in-memory only. Now postpones when
  children not loaded.
- Fix laziness violations: join paths forced O(n) computeSubtreeCount
  when counts unknown. Now propagates -1 sentinel.
- Fix settings lost on restore: btset.cljs select-keys now includes
  :measure so restored sets keep their measure ops.
- Add _measure != null guard to all EARLY_EXIT paths and remove/replace
  paths in both Java and CLJS for consistency.
- Guard Branch.address() against null _children to prevent NPE.
- Remove dead sumChildCounts method, eliminate throwaway Branch in replace.
- Add comprehensive cross-platform test suite (measure_test.cljc).
- Guard all measure computation paths with _measure != null check
  to preserve laziness (Leaf.java borrow-from-right, processSingleLeaf,
  processLeafNodes; branch.cljs $add split, $replace; leaf.cljs $add split)
- Cache forceComputeMeasure result in Branch.java
- Remove count() caching restriction in Branch.java (safe for immutable nodes)
- Add nil safety to btset.cljs $get-nth for force-compute-measure result
- Fix measure-slice using ._root instead of .root (NPE on lazy sets)
- Add 6 generative tests: replace+measure, get-nth, rebalancing+measure,
  transient+measure, stats-after-roundtrip, measure-slice-after-roundtrip
- Add storage-with-settings helper for measure-aware roundtrip tests
- Leaf.java: forceComputeMeasure now caches _measure (was recomputing every call)
- Branch.java: forceComputeMeasure returns null on missing child measure
  instead of silently undercounting
- btset.cljs: -empty clears storage/address to nil (was preserving stale address)
- btset.cljs: $$_next-path uses .-keys instead of .-children (prevents
  TypeError on storage-backed branches where children may be nil)
- branch.cljs/btset.cljs: measure reduce during split returns nil when any
  child measure is unavailable instead of silently skipping
- Formatter whitespace fixes across CLJS files
whilo added 12 commits February 13, 2026 04:02
…uite

New invariants.cljc validates 8 tree invariants (balance, node sizes,
key ordering, separator keys, subtree counts, root shape, level
consistency, sibling ordering) after every operation. Integrated
validate-tree into 7 existing generative tests for broader coverage.
Branch splits and borrows were unconditionally setting subtreeCount=-1,
poisoning the entire tree lineage permanently. Now compute counts from
in-memory children instead. Also cache computed counts in child branches
during computeSubtreeCount to avoid redundant traversals.

New diagnostics.cljc library namespace provides validate, validate-full,
validate-counts-known, validate-measures-known, validate-content, and
tree-stats for structural integrity checking (like pg amcheck / SQLite
PRAGMA integrity_check). Non-deterioration generative tests verify fresh
trees never degrade to lazy recomputation.
This library has diverged significantly from the upstream and is now
maintained as part of the replikativ ecosystem. Rename Java packages,
Clojure/CLJS namespaces, directory structures, build configs, scripts,
and documentation to reflect the new org.replikativ ownership.
- compact: rebuild tree with optimal fill factors from current elements,
  preserving comparator/settings/metadata (CLJ + CLJS)
- Fill factor histogram: extend tree-stats with percentile distribution
  (p25/p50/p75/p90) for all nodes, leaves, and branches separately
- validate-navigation: root-descend verification inspired by PostgreSQL
  amcheck — re-looks up every element from root to catch transitive
  inconsistencies across cousin leaf pages
- Add generative tests for compact and navigation validation
Remove unnecessary $ prefixes from CLJS function names that don't
clash with cljs.core, making the convention self-documenting: $ now
signals "this name would shadow a cljs.core var".

Renamed: add, lookup, subtree-count, measure, store, walk-addresses,
child, conjoin, disjoin, slice, rslice, seek, count-slice, get-nth,
measure-slice, equivalent?, equivalent-sequential?

Kept $: $count, $contains?, $replace, $remove, $reduce, $transduce,
$into, $seq, $rseq
Since PSS cannot store nil, nil is an unambiguous not-found sentinel.
Remove the not-found parameter from btset/lookup and accept cmp as
the third positional arg instead. ILookup.-lookup handles not-found
at the protocol level.

Also add test failure policy to CLAUDE.md: correctness errors in this
data structure library must always be investigated, never skipped.
Branch.add() and Branch.replace() shared the _keys array between the
original persistent branch and newly created copies when maxKey didn't
change. When a transient later modified _keys[i] in place via the
editable() path, it corrupted the original persistent branch's separator
keys, causing silent element loss.

Also removed dead code path where _children/_addresses arrays were
shared (the node == child() condition was unreachable on non-editable
paths).

Bug also reported upstream: tonsky#19
validate-full now re-looks up every key from the root after structural
checks, catching separator key corruption and navigation bugs that
structural checks alone miss. This runs in all generative tests via
validate-tree.
CLJS branch.cljs $replace and util.cljs check-n-splice had the same
copy-on-write violation as Java: sharing keys/addrs arrays between
persistent copies and transient-editable branches. Fixed by always
cloning arrays when creating persistent branch copies.

Added cross-platform generative test transient-does-not-corrupt-persistent
that exercises the exact failure pattern: transient ops on base-set
first, then persistent ops on the same base-set, verifying the
persistent result matches a reference sorted-set.
…heck, update docs

- btset.cljs: Add :early-exit handling in $replace (latent bug if transients enabled)
- Branch.java: Remove always-true maxKey == check in editable add path (line 366)
- doc/btree-operations.md: Comprehensive rewrite with ILeafProcessor section,
  platform-specific remove conventions, borrow preference differences, test coverage
- make-array: fix return tag from "[[Ljava.lang.Object;" (Object[][]) to
  "[Ljava.lang.Object;" (Object[]) — the function returns a 1D array
- array macro: same tag fix + use gensym with proper metadata instead of
  auto-gensym with inline metadata, use object-array
- amap: use object-array instead of clojure.core/make-array (avoids reflection)
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 70 out of 72 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

test-clojure/org/replikativ/persistent_sorted_set/test/storage/util.cljs:93

  • Same issue here: the Leaf constructor call needs 4 arguments to match the Leaf.java constructor signature. The constructor should be (Leaf. (count keys) keys measure settings) to match the 4-arg constructor in Leaf.java, not (Leaf. keys settings measure).
    test-clojure/org/replikativ/persistent_sorted_set/test/storage/util.cljs:51
  • The Leaf constructor is being called with 3 arguments (keys, settings, measure) but the Leaf class has two constructors: one with 2 args (keys, settings) and one with 3 args (len, keys, settings). The call should likely be (Leaf. keys settings measure) but needs to match the 4-argument constructor (len, keys, measure, settings) that exists in the Leaf.java file. This will cause a runtime error when restoring from storage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant