Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation by eolivelli · Pull Request #5 · eolivelli/jvector

eolivelli · 2026-04-25T21:47:44Z

Problem

updateCentroidsUnweighted() allocates a fresh VectorFloat<?> for each of the k centroids on every Lloyd's iteration:

var centroid = centroidNums[i].copy();   // k allocations per pass
scale(centroid, 1.0f / centroidDenoms[i]);
centroids.copyFrom(centroid, 0, i * centroid.length(), centroid.length());

With k = 256, dim/M = 8 floats per subspace centroid, 6 iterations, and 32 subspaces, a single ProductQuantization.compute() call produces roughly 256 × 6 × 32 = 49 152 short-lived VectorFloat objects.

In the HerdDB vector indexing service this appeared as a 35 %+ allocation share in async-profiler flamegraphs during the 500 M bigann benchmark (HerdDB issue datastax#283).

Why doesn't JIT escape-analysis fix this automatically?

HotSpot's C2 compiler can eliminate allocations when the object doesn't escape its stack frame, and it does so in long-running micro-benchmarks. However, in the actual indexing service:

PQ retraining happens every ~100 segments, so updateCentroidsUnweighted() accumulates only retrainings × K_MEANS_ITERATIONS × subspaces invocations over a shard's lifetime — typically fewer than 10 000.
This puts the method below C2's compilation threshold, keeping it at C1 level where escape analysis is not performed.
The 35 %+ flamegraph spike was observed in a real production profiling session, not a warm micro-benchmark, confirming that the allocation is real at the C1 tier.

Fix

Add a private final VectorFloat<?> scratchCentroid field (allocated once in the 3-arg constructor alongside the existing centroidNums[] entries). In updateCentroidsUnweighted() replace copy() with:

scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim);
scale(scratchCentroid, 1.0f / denom);
centroids.copyFrom(scratchCentroid, 0, i * dim, dim);

Thread safety: KMeansPlusPlusClusterer instances are single-threaded by construction — ProductQuantization creates one per subspace inside a parallel stream, and the scratch field is final.

Tests

New test class TestKMeansCentroidScratchBuffer:

Test	What it verifies
`centroidsConvergeOnWellSeparatedClusters`	4 well-separated cluster centers; every training vector must encode to the correct centroid after `compute()`
`singleLloydIterationReducesLoss`	One `clusterOnceUnweighted()` call must not increase the quantisation loss relative to the initial assignment

Both tests exercise updateCentroidsUnweighted() directly (via the package-private clusterOnceUnweighted() accessor) and prove the scratch buffer does not perturb the centroid arithmetic.

🤖 Implemented by the pr-worker agent (HerdDB issue datastax#283).

Issue #3: HNSW graph build OOMs on large datasets (100M+ vectors) — the ConcurrentNeighborMap's upper-layer SparseIntMap pays a boxed Integer per key (~16 B) and a ConcurrentHashMap.Node per entry (~32 B), accounting for ~3.5 GiB of heap per IndexingService pod at 50M vectors. Three changes: 1. Replace SparseIntMap's ConcurrentHashMap<Integer,T> with 32 striped shards of Agrona's primitive Int2ObjectHashMap, each guarded by its own StampedLock. Optimistic-read fast path keeps reads competitive with CHM's volatile-read; writes are serialised per-shard. Per-entry footprint drops from ~136 B to ~42 B (69% reduction, measured by the new SparseIntMapMemoryBenchmark). 2. Expose OnHeapGraphIndex.estimatedBytesPerNode(int maxDegree, float overflowRatio) — a static helper external indexing services can use to pre-size their memory budget (e.g. HerdDB's PROPERTY_MEMORY_VECTOR_LIMIT) without a built graph instance. 3. Promote forEachKey(IntConsumer) to the IntMap interface so iteration over an upper layer no longer requires a SparseIntMap downcast in OnHeapGraphIndex.nodeStream(). Default delegates to forEach for backward compatibility; DenseIntMap and SparseIntMap both override for zero allocation. Tests: - TestIntMap extended with forEach/forEachKey/keysStream/contract cases (parameterised over both impls). - New TestIntMapConcurrency: linearizability of compareAndPut on a hot key, forEachKey-during-mutation, reentrant forEachKey, size accuracy under concurrent insert/remove, happens-before via successful CAS, stale-expected handling, 16-thread mixed-workload stress. - New TestSparseIntMapShards: white-box check that the avalanche-mix spreads sequential and random keys evenly across shards. - New TestOnHeapGraphIndexEstimator: validates the static estimator matches the instance ramBytesUsedOneNode, scales monotonically with M and overflow. Benchmarks (mirror the existing LegacyDenseIntMap pattern): - LegacySparseIntMap kept in benchmarks-jmh for apples-to-apples comparison against the legacy CHM-backed impl. - SparseIntMapConcurrentBenchmark covers getHot/casChurn/forEachKey/ insertSequential/mixed90r10w at 1 and 8 threads, dense and sparse keys, both impls. - SparseIntMapMemoryBenchmark surfaces per-entry byte cost via JMH AuxCounters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r-centroid allocation Before this change, every call to updateCentroidsUnweighted() allocated a fresh VectorFloat<?> for each of the k centroids: var centroid = centroidNums[i].copy(); // k allocations per Lloyd's pass With k=256, dim/M=8 floats per subspace centroid, 6 iterations, and 32 subspaces a single ProductQuantization.compute() call produced roughly 256×6×32 = 49 152 short-lived VectorFloat objects. In HerdDB's vector indexing service these training calls were visible as a 35%+ allocation share in async-profiler flamegraphs during the 500M bigann benchmark (issue herddb#283). Fix: add a private final VectorFloat<?> scratchCentroid field (one per KMeansPlusPlusClusterer instance, allocated once in the 3-arg constructor alongside the existing centroidNums[] entries). In updateCentroidsUnweighted() replace copy() with scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim), then reuse scratchCentroid for the scale() and centroids.copyFrom() calls. Thread safety: KMeansPlusPlusClusterer instances are single-threaded by construction (each subspace in ProductQuantization gets its own instance in the parallel stream); the scratch field is never shared. Note: HotSpot's escape analysis can eliminate the copy() allocation when the method is fully JIT-compiled (the object doesn't escape the stack frame), so the improvement is most pronounced during training warm-up, under GC pressure that defeats escape analysis, and in GraalVM/AOT builds that apply weaker escape analysis. Add TestKMeansCentroidScratchBuffer with two correctness tests: - centroidsConvergeOnWellSeparatedClusters: confirms that training on 4 well-separated cluster centers assigns every point to the correct centroid. - singleLloydIterationReducesLoss: confirms that one clusterOnceUnweighted() call strictly improves (or maintains) the quantisation loss. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eolivelli and others added 2 commits April 23, 2026 07:12

eolivelli force-pushed the fix/kmeans-centroid-scratch-buffer branch from cc71f4f to 0572139 Compare April 25, 2026 21:52

eolivelli merged commit 17cf5d9 into main Apr 26, 2026
3 of 6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation#5

Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation#5
eolivelli merged 2 commits into
mainfrom
fix/kmeans-centroid-scratch-buffer

eolivelli commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eolivelli commented Apr 25, 2026

Problem

Why doesn't JIT escape-analysis fix this automatically?

Fix

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant