Skip to content

Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation#5

Merged
eolivelli merged 2 commits into
mainfrom
fix/kmeans-centroid-scratch-buffer
Apr 26, 2026
Merged

Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation#5
eolivelli merged 2 commits into
mainfrom
fix/kmeans-centroid-scratch-buffer

Conversation

@eolivelli
Copy link
Copy Markdown
Owner

Problem

updateCentroidsUnweighted() allocates a fresh VectorFloat<?> for each of the k centroids on every Lloyd's iteration:

var centroid = centroidNums[i].copy();   // k allocations per pass
scale(centroid, 1.0f / centroidDenoms[i]);
centroids.copyFrom(centroid, 0, i * centroid.length(), centroid.length());

With k = 256, dim/M = 8 floats per subspace centroid, 6 iterations, and 32 subspaces, a single ProductQuantization.compute() call produces roughly 256 × 6 × 32 = 49 152 short-lived VectorFloat objects.

In the HerdDB vector indexing service this appeared as a 35 %+ allocation share in async-profiler flamegraphs during the 500 M bigann benchmark (HerdDB issue datastax#283).

Why doesn't JIT escape-analysis fix this automatically?

HotSpot's C2 compiler can eliminate allocations when the object doesn't escape its stack frame, and it does so in long-running micro-benchmarks. However, in the actual indexing service:

  • PQ retraining happens every ~100 segments, so updateCentroidsUnweighted() accumulates only retrainings × K_MEANS_ITERATIONS × subspaces invocations over a shard's lifetime — typically fewer than 10 000.
  • This puts the method below C2's compilation threshold, keeping it at C1 level where escape analysis is not performed.
  • The 35 %+ flamegraph spike was observed in a real production profiling session, not a warm micro-benchmark, confirming that the allocation is real at the C1 tier.

Fix

Add a private final VectorFloat<?> scratchCentroid field (allocated once in the 3-arg constructor alongside the existing centroidNums[] entries). In updateCentroidsUnweighted() replace copy() with:

scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim);
scale(scratchCentroid, 1.0f / denom);
centroids.copyFrom(scratchCentroid, 0, i * dim, dim);

Thread safety: KMeansPlusPlusClusterer instances are single-threaded by construction — ProductQuantization creates one per subspace inside a parallel stream, and the scratch field is final.

Tests

New test class TestKMeansCentroidScratchBuffer:

Test What it verifies
centroidsConvergeOnWellSeparatedClusters 4 well-separated cluster centers; every training vector must encode to the correct centroid after compute()
singleLloydIterationReducesLoss One clusterOnceUnweighted() call must not increase the quantisation loss relative to the initial assignment

Both tests exercise updateCentroidsUnweighted() directly (via the package-private clusterOnceUnweighted() accessor) and prove the scratch buffer does not perturb the centroid arithmetic.

🤖 Implemented by the pr-worker agent (HerdDB issue datastax#283).

eolivelli and others added 2 commits April 23, 2026 07:12
Issue #3: HNSW graph build OOMs on large datasets (100M+ vectors) — the
ConcurrentNeighborMap's upper-layer SparseIntMap pays a boxed Integer per
key (~16 B) and a ConcurrentHashMap.Node per entry (~32 B), accounting for
~3.5 GiB of heap per IndexingService pod at 50M vectors.

Three changes:

1. Replace SparseIntMap's ConcurrentHashMap<Integer,T> with 32 striped
   shards of Agrona's primitive Int2ObjectHashMap, each guarded by its own
   StampedLock. Optimistic-read fast path keeps reads competitive with
   CHM's volatile-read; writes are serialised per-shard. Per-entry
   footprint drops from ~136 B to ~42 B (69% reduction, measured by the
   new SparseIntMapMemoryBenchmark).

2. Expose OnHeapGraphIndex.estimatedBytesPerNode(int maxDegree, float
   overflowRatio) — a static helper external indexing services can use to
   pre-size their memory budget (e.g. HerdDB's PROPERTY_MEMORY_VECTOR_LIMIT)
   without a built graph instance.

3. Promote forEachKey(IntConsumer) to the IntMap interface so iteration
   over an upper layer no longer requires a SparseIntMap downcast in
   OnHeapGraphIndex.nodeStream(). Default delegates to forEach for
   backward compatibility; DenseIntMap and SparseIntMap both override for
   zero allocation.

Tests:
 - TestIntMap extended with forEach/forEachKey/keysStream/contract cases
   (parameterised over both impls).
 - New TestIntMapConcurrency: linearizability of compareAndPut on a hot
   key, forEachKey-during-mutation, reentrant forEachKey, size accuracy
   under concurrent insert/remove, happens-before via successful CAS,
   stale-expected handling, 16-thread mixed-workload stress.
 - New TestSparseIntMapShards: white-box check that the avalanche-mix
   spreads sequential and random keys evenly across shards.
 - New TestOnHeapGraphIndexEstimator: validates the static estimator
   matches the instance ramBytesUsedOneNode, scales monotonically with
   M and overflow.

Benchmarks (mirror the existing LegacyDenseIntMap pattern):
 - LegacySparseIntMap kept in benchmarks-jmh for apples-to-apples
   comparison against the legacy CHM-backed impl.
 - SparseIntMapConcurrentBenchmark covers getHot/casChurn/forEachKey/
   insertSequential/mixed90r10w at 1 and 8 threads, dense and sparse
   keys, both impls.
 - SparseIntMapMemoryBenchmark surfaces per-entry byte cost via JMH
   AuxCounters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r-centroid allocation

Before this change, every call to updateCentroidsUnweighted() allocated a fresh
VectorFloat<?> for each of the k centroids:

    var centroid = centroidNums[i].copy();   // k allocations per Lloyd's pass

With k=256, dim/M=8 floats per subspace centroid, 6 iterations, and 32 subspaces
a single ProductQuantization.compute() call produced roughly 256×6×32 = 49 152
short-lived VectorFloat objects. In HerdDB's vector indexing service these
training calls were visible as a 35%+ allocation share in async-profiler
flamegraphs during the 500M bigann benchmark (issue herddb#283).

Fix: add a private final VectorFloat<?> scratchCentroid field (one per
KMeansPlusPlusClusterer instance, allocated once in the 3-arg constructor
alongside the existing centroidNums[] entries). In updateCentroidsUnweighted()
replace copy() with scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim), then
reuse scratchCentroid for the scale() and centroids.copyFrom() calls.

Thread safety: KMeansPlusPlusClusterer instances are single-threaded by
construction (each subspace in ProductQuantization gets its own instance in the
parallel stream); the scratch field is never shared.

Note: HotSpot's escape analysis can eliminate the copy() allocation when the
method is fully JIT-compiled (the object doesn't escape the stack frame), so
the improvement is most pronounced during training warm-up, under GC pressure
that defeats escape analysis, and in GraalVM/AOT builds that apply weaker
escape analysis.

Add TestKMeansCentroidScratchBuffer with two correctness tests:
  - centroidsConvergeOnWellSeparatedClusters: confirms that training on 4
    well-separated cluster centers assigns every point to the correct centroid.
  - singleLloydIterationReducesLoss: confirms that one clusterOnceUnweighted()
    call strictly improves (or maintains) the quantisation loss.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eolivelli eolivelli force-pushed the fix/kmeans-centroid-scratch-buffer branch from cc71f4f to 0572139 Compare April 25, 2026 21:52
@eolivelli eolivelli merged commit 17cf5d9 into main Apr 26, 2026
3 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant