Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation #5
Merged
Issue #3: HNSW graph build OOMs on large datasets (100M+ vectors)

The ConcurrentNeighborMap's upper-layer SparseIntMap pays a boxed Integer per key (~16 B) and a ConcurrentHashMap.Node per entry (~32 B), accounting for ~3.5 GiB of heap per IndexingService pod at 50M vectors.

Three changes:

1. Replace SparseIntMap's ConcurrentHashMap<Integer,T> with 32 striped shards of Agrona's primitive Int2ObjectHashMap, each guarded by its own StampedLock. The optimistic-read fast path keeps reads competitive with CHM's volatile read; writes are serialised per shard. Per-entry footprint drops from ~136 B to ~42 B (a 69% reduction, measured by the new SparseIntMapMemoryBenchmark).

2. Expose OnHeapGraphIndex.estimatedBytesPerNode(int maxDegree, float overflowRatio): a static helper that external indexing services can use to pre-size their memory budget (e.g. HerdDB's PROPERTY_MEMORY_VECTOR_LIMIT) without a built graph instance.

3. Promote forEachKey(IntConsumer) to the IntMap interface so iteration over an upper layer no longer requires a SparseIntMap downcast in OnHeapGraphIndex.nodeStream(). The default implementation delegates to forEach for backward compatibility; DenseIntMap and SparseIntMap both override it for zero allocation.

Tests:
- TestIntMap extended with forEach/forEachKey/keysStream contract cases (parameterised over both impls).
- New TestIntMapConcurrency: linearizability of compareAndPut on a hot key, forEachKey during mutation, reentrant forEachKey, size accuracy under concurrent insert/remove, happens-before via successful CAS, stale-expected handling, and a 16-thread mixed-workload stress test.
- New TestSparseIntMapShards: white-box check that the avalanche mix spreads sequential and random keys evenly across shards.
- New TestOnHeapGraphIndexEstimator: validates that the static estimator matches the instance ramBytesUsedOneNode and scales monotonically with M and overflow.

Benchmarks (mirroring the existing LegacyDenseIntMap pattern):
- LegacySparseIntMap kept in benchmarks-jmh for an apples-to-apples comparison against the legacy CHM-backed impl.
- SparseIntMapConcurrentBenchmark covers getHot/casChurn/forEachKey/insertSequential/mixed90r10w at 1 and 8 threads, dense and sparse keys, both impls.
- SparseIntMapMemoryBenchmark surfaces per-entry byte cost via JMH AuxCounters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
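For context, here is a minimal sketch of the striped-shard idea from change 1. The class and method names (StripedIntMap, shardOf), the shard count, and the mixing constant are illustrative assumptions, not the actual SparseIntMap code.

import java.util.concurrent.locks.StampedLock;
import org.agrona.collections.Int2ObjectHashMap;

// Illustrative sketch only: 32 primitive-keyed shards, each guarded by its own StampedLock.
final class StripedIntMap<T> {
    private static final int SHARDS = 32;                 // power of two so shardOf() can mask
    private final Int2ObjectHashMap<T>[] shards;
    private final StampedLock[] locks;

    @SuppressWarnings("unchecked")
    StripedIntMap() {
        shards = (Int2ObjectHashMap<T>[]) new Int2ObjectHashMap[SHARDS];
        locks = new StampedLock[SHARDS];
        for (int i = 0; i < SHARDS; i++) {
            shards[i] = new Int2ObjectHashMap<>();
            locks[i] = new StampedLock();
        }
    }

    // Avalanche-style mix so sequential node ids spread evenly across shards.
    private static int shardOf(int key) {
        int h = key * 0x9E3779B9;
        return (h ^ (h >>> 16)) & (SHARDS - 1);
    }

    public T get(int key) {
        int s = shardOf(key);
        StampedLock lock = locks[s];
        long stamp = lock.tryOptimisticRead();            // fast path: no lock acquisition on reads
        T value = shards[s].get(key);                      // get() does not mutate the Agrona map
        if (!lock.validate(stamp)) {                       // a writer raced us; retry under the read lock
            stamp = lock.readLock();
            try {
                value = shards[s].get(key);
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return value;
    }

    public void put(int key, T value) {
        int s = shardOf(key);
        long stamp = locks[s].writeLock();                 // writes are serialised per shard
        try {
            shards[s].put(key, value);
        } finally {
            locks[s].unlockWrite(stamp);
        }
    }
}

Striping bounds write contention to one shard at a time, while the optimistic read keeps the common lookup path free of locks and CAS traffic; a failed validate() simply falls back to the shard's read lock.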
Reuse scratchCentroid buffer in updateCentroidsUnweighted to avoid per-centroid allocation
Before this change, every call to updateCentroidsUnweighted() allocated a fresh
VectorFloat<?> for each of the k centroids:
var centroid = centroidNums[i].copy(); // k allocations per Lloyd's pass
With k=256, dim/M=8 floats per subspace centroid, 6 iterations, and 32 subspaces,
a single ProductQuantization.compute() call produced roughly 256×6×32 = 49 152
short-lived VectorFloat objects. In HerdDB's vector indexing service these
training calls were visible as a 35%+ allocation share in async-profiler
flamegraphs during the 500M bigann benchmark (issue herddb#283).
Fix: add a private final VectorFloat<?> scratchCentroid field (one per
KMeansPlusPlusClusterer instance, allocated once in the 3-arg constructor
alongside the existing centroidNums[] entries). In updateCentroidsUnweighted()
replace copy() with scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim), then
reuse scratchCentroid for the scale() and centroids.copyFrom() calls.
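A rough before/after illustration of that change (not the verbatim jvector code): the loop shape, the centroidDenoms name, and the VectorUtil.scale()/copyFrom() signatures below are assumptions; only scratchCentroid and the copyFrom(centroidNums[i], 0, 0, dim) call come from the description above. Zero-count guards are omitted.

// Before (illustrative): one short-lived VectorFloat per centroid per Lloyd's pass.
for (int i = 0; i < k; i++) {
    var centroid = centroidNums[i].copy();                  // fresh allocation
    VectorUtil.scale(centroid, 1.0f / centroidDenoms[i]);   // normalise the running sum
    centroids.copyFrom(centroid, 0, i * dim, dim);          // write into the packed centroid block
}

// After (illustrative): a single scratch buffer, allocated once in the constructor,
// is overwritten in place for every centroid.
for (int i = 0; i < k; i++) {
    scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim);   // reuse, no allocation
    VectorUtil.scale(scratchCentroid, 1.0f / centroidDenoms[i]);
    centroids.copyFrom(scratchCentroid, 0, i * dim, dim);
}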
Thread safety: KMeansPlusPlusClusterer instances are single-threaded by
construction (each subspace in ProductQuantization gets its own instance in the
parallel stream); the scratch field is never shared.
Note: HotSpot's escape analysis can eliminate the copy() allocation when the
method is fully JIT-compiled (the object doesn't escape the stack frame), so
the improvement is most pronounced during training warm-up, under GC pressure
that defeats escape analysis, and in GraalVM/AOT builds that apply weaker
escape analysis.
Add TestKMeansCentroidScratchBuffer with two correctness tests:
- centroidsConvergeOnWellSeparatedClusters: confirms that training on 4
well-separated cluster centers assigns every point to the correct centroid.
- singleLloydIterationReducesLoss: confirms that one clusterOnceUnweighted()
call strictly improves (or maintains) the quantisation loss.
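For reference, a self-contained sketch of the quantisation loss these tests compare, written with plain float[] vectors. The real tests use jvector's VectorFloat and the package-private clusterOnceUnweighted() accessor; the helper name and signature below are illustrative only.

// Sum of squared distances from each point to its nearest centroid.
static double quantisationLoss(float[][] points, float[][] centroids) {
    double loss = 0;
    for (float[] p : points) {
        double best = Double.MAX_VALUE;
        for (float[] c : centroids) {                       // nearest centroid by squared L2 distance
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - c[j];
                d += diff * diff;
            }
            best = Math.min(best, d);
        }
        loss += best;                                        // each point contributes its smallest distance
    }
    return loss;
}
// singleLloydIterationReducesLoss asserts loss(after one clusterOnceUnweighted()) <= loss(before).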
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
updateCentroidsUnweighted() allocates a fresh VectorFloat<?> for each of the k centroids on every Lloyd's iteration:

var centroid = centroidNums[i].copy(); // k allocations per Lloyd's pass

With k = 256, dim/M = 8 floats per subspace centroid, 6 iterations, and 32 subspaces, a single ProductQuantization.compute() call produces roughly 256 × 6 × 32 = 49 152 short-lived VectorFloat objects.

In the HerdDB vector indexing service this appeared as a 35%+ allocation share in async-profiler flamegraphs during the 500M bigann benchmark (HerdDB issue datastax#283).
Why doesn't JIT escape-analysis fix this automatically?
HotSpot's C2 compiler can eliminate allocations when the object doesn't escape its stack frame, and it does so in long-running micro-benchmarks. However, in the actual indexing service, updateCentroidsUnweighted() accumulates only retrainings × K_MEANS_ITERATIONS × subspaces invocations over a shard's lifetime, typically fewer than 10 000, which is not enough to reliably reach C2 compilation (and therefore scalar replacement) before training completes.
Fix
Add a private final VectorFloat<?> scratchCentroid field (allocated once in the 3-arg constructor alongside the existing centroidNums[] entries). In updateCentroidsUnweighted(), replace copy() with:

scratchCentroid.copyFrom(centroidNums[i], 0, 0, dim);

and reuse the scratch buffer for the subsequent scale() and centroids.copyFrom() calls.

Thread safety: KMeansPlusPlusClusterer instances are single-threaded by construction (ProductQuantization creates one per subspace inside a parallel stream), and the scratch field is final.
Tests
New test class TestKMeansCentroidScratchBuffer:
- centroidsConvergeOnWellSeparatedClusters: a full compute() run on 4 well-separated cluster centers must assign every point to the correct centroid.
- singleLloydIterationReducesLoss: a single clusterOnceUnweighted() call must not increase the quantisation loss relative to the initial assignment.

Both tests exercise updateCentroidsUnweighted() directly (via the package-private clusterOnceUnweighted() accessor) and prove the scratch buffer does not perturb the centroid arithmetic.

🤖 Implemented by the pr-worker agent (HerdDB issue datastax#283).