Zero-copy ByteBuffer-backed vectors, no float[] materialization #1
Merged
Add a zero-copy path from a caller-owned ByteBuffer to a jvector index
build or search, without the per-vector float[] allocation and copy that
integrators have to perform today. The public API already uses
VectorFloat<?>, so the changes are a targeted set of additions at the
abstraction boundary plus polymorphic dispatch in the SIMD backends so
they operate on ByteBuffer-backed vectors without materializing them
to float[].
New types
- BufferVectorFloat (jvector-base): zero-copy VectorFloat<ByteBuffer>
view over a caller-owned buffer. Slices once at construction so
subsequent element access, Panama SIMD dispatch, and mutation of
the caller's buffer position/limit never need to allocate.
- ByteBufferRandomAccessVectorValues (jvector-base): RAVV over a
single concatenated ByteBuffer of N×dimension floats.
- VectorTypeSupport.wrapFloatVector(ByteBuffer[, floatOffset, floatLength]):
typed factory producing the zero-copy view.
- MemorySegmentVectorFloat.wrap(ByteBuffer): zero-copy static factory
that complements the legacy copying constructor.
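The zero-copy idea behind these types can be sketched in plain java.nio. This is an illustrative miniature, not jvector's actual BufferVectorFloat: the class name, little-endian assumption, and method shapes here are assumptions; the real class implements VectorFloat<ByteBuffer>.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a zero-copy float view over a caller-owned buffer:
// slice once at construction, then use absolute getFloat/putFloat so later
// position()/limit() changes on the caller's buffer cannot disturb the view.
public final class FloatBufferView {
    private final ByteBuffer slice;   // aliases the caller's storage, owns none
    private final int length;         // logical length in floats

    public FloatBufferView(ByteBuffer source, int floatOffset, int floatLength) {
        // One slice at construction; the absolute accessors below never allocate.
        this.slice = source.slice(floatOffset * Float.BYTES, floatLength * Float.BYTES)
                           .order(ByteOrder.LITTLE_ENDIAN);
        this.length = floatLength;
    }

    public float get(int i)          { return slice.getFloat(i * Float.BYTES); }
    public void  set(int i, float v) { slice.putFloat(i * Float.BYTES, v); }
    public int   length()            { return length; }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(4 * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) bb.putFloat(i * Float.BYTES, i + 0.5f);
        FloatBufferView view = new FloatBufferView(bb, 1, 2);  // views floats 1..2
        // Mutation through the view is visible in the caller's buffer: zero copy.
        view.set(0, 42f);
        System.out.println(view.get(0) + " " + bb.getFloat(1 * Float.BYTES)); // 42.0 42.0
    }
}
```

Because the view holds only a slice, wrapping costs one small object regardless of vector dimension, which is the per-vector allocation the PR is eliminating.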
SIMD preservation
- PanamaVectorUtilSupport detects BufferVectorFloat and dispatches to
FloatVector.fromMemorySegment(MemorySegment.ofBuffer(bb), ...) —
full SIMD with no float[] materialization. ArrayVectorFloat still
uses fromArray.
- NativeVectorUtilSupport's four protected helpers now fall through
to super for non-MemorySegment vectors, so BufferVectorFloat works
under native dispatch too.
- DefaultVectorUtilSupport's scalar kernels gain a polymorphic entry
that uses VectorFloat.get(i) for non-ArrayVectorFloat inputs.
- jvector-twenty release bumped 20 -> 22 so MemorySegment is stable
(matches jvector-native). Preview-locked class files were already
being avoided by the project; this removes the last blocker.
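The dispatch pattern described above, a fast path when both operands are array-backed and a generic element-access fallback otherwise, can be sketched without the Vector API. All names here are illustrative stand-ins (not jvector's ArrayVectorFloat or DefaultVectorUtilSupport), and the scalar loops stand in for the SIMD kernels:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of polymorphic kernel dispatch: the unrolled float[] path runs when
// both operands are array-backed; otherwise a generic get(i) loop handles
// buffer-backed vectors without materializing them to float[].
public final class DispatchSketch {
    interface Vec { float get(int i); int length(); }

    record ArrayVec(float[] a) implements Vec {
        public float get(int i) { return a[i]; }
        public int length()     { return a.length; }
    }

    record BufferVec(ByteBuffer bb, int len) implements Vec {
        public float get(int i) { return bb.getFloat(i * Float.BYTES); }
        public int length()     { return len; }
    }

    static float dot(Vec x, Vec y) {
        // Fast path: both operands expose a raw float[] (pattern-matching instanceof).
        if (x instanceof ArrayVec ax && y instanceof ArrayVec ay) {
            float sum = 0f;
            float[] a = ax.a(), b = ay.a();
            for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
            return sum;
        }
        // Generic path: element access through the interface, no copying.
        float sum = 0f;
        for (int i = 0; i < x.length(); i++) sum += x.get(i) * y.get(i);
        return sum;
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f};
        ByteBuffer bb = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 3; i++) bb.putFloat(i * 4, data[i]);
        System.out.println(dot(new ArrayVec(data), new ArrayVec(data)));   // 14.0
        System.out.println(dot(new ArrayVec(data), new BufferVec(bb, 3))); // 14.0
    }
}
```

In the real PanamaVectorUtilSupport the buffer branch loads lanes via FloatVector.fromMemorySegment(MemorySegment.ofBuffer(bb), ...) instead of a scalar loop; the dispatch shape is the same.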
High-level API
- GraphSearcher.search(ByteBuffer, ...) and
GraphIndexBuilder.addGraphNode(int, ByteBuffer) overloads — thin
wrappers that call wrapFloatVector internally.
- MMapRandomAccessVectorValues rewritten to delegate to
ByteBufferRandomAccessVectorValues over a MappedByteBuffer. Drops
the per-getVector float[dimension] scratch allocation that the
old implementation performed.
- MemorySegmentVectorFloat.get/set gain an off-heap fallback via
segment.getAtIndex/setAtIndex, so wrap(direct ByteBuffer) works
correctly (the on-heap fast path remains).
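The heap/off-heap split that motivates the get/set fallback can be shown with ByteBuffer alone: a heap buffer exposes a backing array, a direct buffer does not, so element access must go through the buffer itself (in jvector's case, through segment.getAtIndex). This is a minimal sketch with assumed names, not the MemorySegmentVectorFloat implementation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: branch on whether a backing array exists, mirroring the
// on-heap fast path vs the off-heap fallback described above.
public final class HeapOrDirect {
    static float read(ByteBuffer bb, int floatIndex) {
        if (bb.hasArray()) {
            // On-heap fast path: absolute byte offset into the backing array.
            int base = bb.arrayOffset() + floatIndex * Float.BYTES;
            return ByteBuffer.wrap(bb.array()).order(bb.order()).getFloat(base);
        }
        // Off-heap fallback: absolute read through the buffer (no heap array).
        return bb.getFloat(floatIndex * Float.BYTES);
    }

    public static void main(String[] args) {
        for (ByteBuffer bb : new ByteBuffer[]{
                ByteBuffer.allocate(8), ByteBuffer.allocateDirect(8)}) {
            bb.order(ByteOrder.LITTLE_ENDIAN);
            bb.putFloat(4, 7.25f);
            System.out.println(bb.isDirect() + " -> " + read(bb, 1)); // both read 7.25
        }
    }
}
```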
Polymorphic copyFrom
- ArrayVectorFloat.copyFrom, MemorySegmentVectorFloat.copyFrom, and
BufferVectorFloat.copyFrom handle any VectorFloat source instead
of requiring the class-strict cast that was there before.
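The shape of such a polymorphic copyFrom, a System.arraycopy fast path when the source type matches and a get/set element loop otherwise, can be sketched as follows. Class and method names are illustrative, not jvector's:

```java
// Sketch of a copyFrom that accepts any source vector type instead of
// requiring a class-strict cast: arraycopy when both sides are
// array-backed, else a generic element loop.
public final class CopyFromSketch {
    interface Vec { float get(int i); void set(int i, float v); int length(); }

    static final class ArrayVec implements Vec {
        final float[] a;
        ArrayVec(float[] a) { this.a = a; }
        public float get(int i)          { return a[i]; }
        public void  set(int i, float v) { a[i] = v; }
        public int   length()            { return a.length; }

        void copyFrom(Vec src, int srcOff, int dstOff, int n) {
            if (src instanceof ArrayVec s) {          // pattern-matching instanceof
                System.arraycopy(s.a, srcOff, a, dstOff, n);
            } else {                                  // works for any Vec source
                for (int i = 0; i < n; i++) a[dstOff + i] = src.get(srcOff + i);
            }
        }
    }

    public static void main(String[] args) {
        ArrayVec dst = new ArrayVec(new float[4]);
        dst.copyFrom(new ArrayVec(new float[]{1, 2, 3, 4}), 1, 0, 2);
        System.out.println(dst.get(0) + " " + dst.get(1)); // 2.0 3.0
    }
}
```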
Tests (all green under SIMD and scalar profiles)
- BufferVectorFloatTest: element access, endianness, zero-copy on
direct buffers, position/limit independence, slice correctness.
- VectorTypeSupportByteBufferTest: typed factory across active +
Default provider, zero-copy proof, subrange views.
- ByteBufferRandomAccessVectorValuesTest: parity with ListRAVV,
concurrent threadLocalSupplier correctness, bounds.
- BuildFromByteBufferEquivalenceTest: builds the same graph from
float[] and from ByteBuffer and asserts structural + search
equivalence across EUCLIDEAN / DOT_PRODUCT / COSINE at multiple
dimensions.
- TestSearchWithByteBufferQuery: ByteBuffer overloads produce same
results as VectorFloat<?> overloads.
- MMapRandomAccessVectorValuesTest: round-trip across rewritten mmap
RAVV.
- MemorySegmentVectorFloatWrapTest: wrap vs legacy copying ctor,
big-endian rejection, alignment validation.
- SearchAllocationProfileTest: per-query allocation stays within a
small multiple of the float[] baseline (guard against accidental
regression to a naive float[] materialization).
HerdDB integration becomes, end-to-end:
ByteBuffer source = herdDBVector;
builder.addGraphNode(ordinal, source); // zero copy
GraphSearcher.search(source, topK, ravv, vsf, graph, Bits.ALL);
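The concatenated-buffer layout behind ByteBufferRandomAccessVectorValues can be sketched in isolation: N vectors of `dim` floats stored back to back, with getVector returning a zero-copy slice. This miniature uses an assumed class name and returns a raw ByteBuffer slice rather than a VectorFloat:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a RAVV-style accessor over one concatenated buffer of
// N x dimension floats; getVector(ord) aliases the backing storage.
public final class ConcatenatedVectors {
    private final ByteBuffer data;
    private final int dim;

    ConcatenatedVectors(ByteBuffer data, int dim) {
        this.data = data;
        this.dim = dim;
    }

    int size() { return data.capacity() / (dim * Float.BYTES); }

    // Zero-copy: slice(ord * dim * 4, dim * 4) shares the underlying bytes.
    ByteBuffer getVector(int ord) {
        return data.slice(ord * dim * Float.BYTES, dim * Float.BYTES).order(data.order());
    }

    public static void main(String[] args) {
        int dim = 2, n = 3;
        ByteBuffer bb = ByteBuffer.allocate(n * dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n * dim; i++) bb.putFloat(i * Float.BYTES, i);
        ConcatenatedVectors ravv = new ConcatenatedVectors(bb, dim);
        ByteBuffer v1 = ravv.getVector(1);
        System.out.println(v1.getFloat(0) + " " + v1.getFloat(4)); // 2.0 3.0
    }
}
```

The same layout over a MappedByteBuffer is what lets the rewritten MMapRandomAccessVectorValues drop its per-call scratch array.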
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tags the build with a -herddb suffix so HerdDB can pin to this jvector fork's artifacts without colliding with upstream 4.0.0-rc.9-SNAPSHOT in a shared local/remote Maven repository.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loat CI on JDK 20 failed with "release version 22 not supported": the earlier release bump of jvector-twenty from 20 -> 22 assumed a higher minimum JDK than the project's CI matrix supports. Revert to release 20 and teach PanamaVectorUtilSupport's three protected SIMD helpers to handle BufferVectorFloat via a small SPEC-length float[] scratch instead of FloatVector.fromMemorySegment (which needs java.lang.foreign, still preview in Java 20).
Functional behavior and SIMD on ArrayVectorFloat are unchanged. The scratch is <= SPEC.length() floats (typically 8 or 16, i.e. 32-64 B), allocated inside the hot helper so escape analysis can usually elide it. The native backend (jvector-native), which targets Java 22 and has stable MemorySegment, remains fully zero-copy and full-SIMD for BufferVectorFloat via FloatVector.fromMemorySegment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HerdDB runs on JDK 25+ and only JDK 22+ can use the stable java.lang.foreign.MemorySegment APIs this branch depends on. Drop the old-JDK scaffolding so the code expresses its real minimum.
- parent pom release 11 -> 22; jvector-twenty restored to release 22 with the FloatVector.fromMemorySegment path for BufferVectorFloat (reverts the scalar-fallback workaround from the prior commit)
- GitHub Actions unit-tests.yaml: build matrix [11,20,22] -> [22]; build-avx512 matrix [20,24] -> [24]; remove JDK-20-specific "Verify Panama Vector Support" / "Test Panama Support" steps
- Drop now-unused jdk11 / jdk20 Maven profiles in jvector-tests and jvector-examples poms (jdk21 in tests and jdk22 in examples remain the active-by-default profiles)
Modernize to JDK 22 pattern matching + ByteBuffer.slice(int, int):
- BufferVectorFloat constructor uses ByteBuffer.slice(start, length) directly (one allocation instead of duplicate+position+limit+slice)
- BufferVectorFloat.copy / copyFrom use slice(int, int) and pattern-matching instanceof
- ArrayVectorFloat.copyFrom, MemorySegmentVectorFloat.copyFrom, PanamaVectorUtilSupport helpers, NativeVectorUtilSupport helpers, MemorySegmentVectorProvider.wrapFloatVector, ByteBufferRandomAccessVectorValues constructor: all simplified with `instanceof T x` and slice(int, int)
Local verification: mvn verify (full reactor) green; jvector-tests, jvector-native, jvector-examples test suites green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
async-profiler flamegraphs of HerdDB's indexing service showed
float[] allocations traceable to ProductQuantization.getSubVector —
one fresh float[dim/M] per (training_vector × subspace) during
PQ codebook training. For 100k training vectors × M=8 that's
~800k small allocations and tens of MB of GC pressure at every
index rebuild.
Eliminate them with zero-copy views:
- New VectorFloat.subview(int floatOffset, int floatLength) default
method (materializes via VectorTypeSupport.createFloatVector +
copyFrom as a fallback).
- Zero-copy overrides:
* ArrayVectorFloat -> ArraySliceVectorFloat (new)
* BufferVectorFloat -> another BufferVectorFloat over the same buffer
* MemorySegmentVectorFloat -> a MemorySegmentVectorFloat over
segment.asSlice(...)
- New ArraySliceVectorFloat — a VectorFloat<float[]> that references
an underlying float[] at arrayOffset with its own logical length.
Companion to the existing ArraySliceByteSequence.
- SIMD dispatch awareness:
* PanamaVectorUtilSupport — FloatVector.fromArray(SPEC, asv.get(),
asv.arrayOffset() + offset) and the same for intoArray + the
gather variant. SIMD performance is preserved.
* NativeVectorUtilSupport — falls through to super for
ArraySliceVectorFloat (already did for non-MemorySegment types).
* DefaultVectorUtilSupport — the generic .get(i)-based fallback
path already handles arbitrary VectorFloat<?>.
- ArrayVectorFloat.copyFrom fast-pathed for ArraySliceVectorFloat
source (System.arraycopy with adjusted offset).
- ProductQuantization.getSubVector rewritten to
vector.subview(offset, length).
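The subview mechanism, a default method that materializes a copy plus zero-copy overrides in concrete types, can be sketched as below. All names are illustrative stand-ins for VectorFloat.subview and ArraySliceVectorFloat, not the actual jvector classes:

```java
// Sketch of subview: the interface default materializes via an element
// loop; an array-backed type overrides it to return an aliasing slice
// (the role ArraySliceVectorFloat plays for ArrayVectorFloat).
public final class SubviewSketch {
    interface Vec {
        float get(int i);
        int length();
        default Vec subview(int off, int len) {        // fallback: materialize a copy
            float[] out = new float[len];
            for (int i = 0; i < len; i++) out[i] = get(off + i);
            return new ArrayVec(out, 0, len);
        }
    }

    static final class ArrayVec implements Vec {
        final float[] a; final int off; final int len;
        ArrayVec(float[] a, int off, int len) { this.a = a; this.off = off; this.len = len; }
        public float get(int i) { return a[off + i]; }
        public int length()     { return len; }
        @Override public Vec subview(int o, int l) {   // zero-copy: alias the same array
            return new ArrayVec(a, off + o, l);
        }
    }

    public static void main(String[] args) {
        float[] backing = {0, 1, 2, 3, 4, 5};
        Vec whole = new ArrayVec(backing, 0, 6);
        Vec sub = whole.subview(2, 2);                 // views floats 2..3
        backing[2] = 99f;                              // mutation visible through the view
        System.out.println(sub.get(0) + " " + sub.get(1)); // 99.0 3.0
    }
}
```

Nesting composes naturally: a subview of a subview just adds offsets, which is why the per-subspace extraction in PQ training becomes allocation-light.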
Tests:
- VectorFloatSubviewTest (7 cases) — subview aliases source,
nested subview, SIMD distance/dot/cosine equivalence with
materialized copies.
- PQTrainingAllocationTest — asserts that getSubVector returns a
live view (mutation visible through subview) and that per-
training-vector allocation during ProductQuantization.compute
stays under a small bound (measured: 37 B/vector on dim=64 M=8
K=16, down from the ~448 B/vector the materialization path cost).
- TestProductQuantization (9 existing cases) all green — codebooks
produced through the view path match the previous materializing
path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original problem
I'm using jvector from HerdDB (https://github.com/diennea/herddb). HerdDB already holds its vectors as ByteBuffer (or byte[]) in memory and on disk: it's how the storage layer naturally represents fixed-width columnar data. But to feed a vector to jvector today you have to go through float[]. For a per-row insertion or a per-query vector, that's a wasted allocation + copy every single time. Multiply by millions of rows or thousands of queries and it becomes a real pressure point: GC churn, dirty cache lines, extra memory bandwidth.
The ask: let HerdDB (and any integrator in the same position) feed vectors to jvector as ByteBuffer directly, with no float[] materialization on the hot path, while preserving SIMD everywhere jvector already uses it.
Why this fits jvector cleanly
jvector's public API is already VectorFloat<?>-based, not float[]-based. GraphIndexBuilder.addGraphNode, GraphSearcher.search, RandomAccessVectorValues.getVector, VectorSimilarityFunction.compare, and every VectorUtil distance kernel already speak VectorFloat<?>. The work is therefore not a rewrite of the API; it's a ByteBuffer-backed VectorFloat<?> implementation alongside ArrayVectorFloat, plus overloads on the high-level API.
(MMapRandomAccessVectorValues used to allocate a float[dim] per getVector; PQ training's ProductQuantization.getSubVector allocated a float[dim/M] per training-vector × subspace.)
Baseline: JDK 22+
HerdDB runs on JDK 25+, so this fork baselines on JDK 22. That's where java.lang.foreign.MemorySegment became stable (the API we need to make Panama SIMD work on BufferVectorFloat views), and it matches jvector-native's existing target. CI now tests exactly JDK 22 (and JDK 24 for AVX-512); the old jdk11/jdk20 Maven profiles are removed. Version is bumped to 4.0.0-rc.9-herddb-SNAPSHOT.
Changes
Index build + search path (the original ByteBuffer zero-copy work)
- BufferVectorFloat (new, jvector-base): VectorFloat<ByteBuffer> view over a caller-owned buffer. Slices once at construction via ByteBuffer.slice(int, int) so subsequent element access and SIMD dispatch are allocation-free, and caller position()/limit() mutation doesn't disturb the view.
- ByteBufferRandomAccessVectorValues (new, jvector-base): bulk-input RAVV over a single concatenated ByteBuffer of N × dimension × 4 bytes.
- VectorTypeSupport.wrapFloatVector(ByteBuffer[, floatOffset, floatLength]): typed zero-copy factory; MemorySegmentVectorProvider overrides it to return a MemorySegment-backed view directly so the native SIMD path stays zero-copy.
- MemorySegmentVectorFloat.wrap(ByteBuffer): zero-copy static factory, companion to the legacy copying ctor. Also fixes a latent bug in MemorySegmentVectorFloat.get(int) that threw on off-heap segments (now falls back to segment.getAtIndex when heapBase() is empty).
- PanamaVectorUtilSupport: the four protected fromVectorFloat/intoVectorFloat helpers gain polymorphic dispatch. BufferVectorFloat goes through FloatVector.fromMemorySegment(MemorySegment.ofBuffer(bb), byteOffset, order): full SIMD, no float[] materialization.
- NativeVectorUtilSupport: falls through to super for non-MemorySegment vectors, so BufferVectorFloat works under the native backend too.
- DefaultVectorUtilSupport: scalar kernels made polymorphic. If both operands are ArrayVectorFloat, the existing unrolled float[] fast path runs; otherwise the generic .get(i) loop.
- GraphSearcher.search(ByteBuffer, …) and GraphIndexBuilder.addGraphNode(int, ByteBuffer) overloads: thin wrappers that call wrapFloatVector internally.
- MMapRandomAccessVectorValues: rewritten to delegate to ByteBufferRandomAccessVectorValues over a MappedByteBuffer. Drops the per-call float[dimension] scratch.
PQ codebook training (follow-up)
Flamegraphs of HerdDB's indexing service showed float[] allocations in ProductQuantization.getSubVector: one fresh float[dim/M] per training-vector × subspace during codebook construction. For 100k training vectors × M=8 that's ~800k small allocations.
- VectorFloat.subview(int floatOffset, int floatLength): new default method on the interface; the fallback materializes, concrete types override for zero-copy.
- ArraySliceVectorFloat (new, jvector-base): VectorFloat<float[]> that references a root float[] at arrayOffset with its own logical length. Companion to ArraySliceByteSequence.
- subview overrides on ArrayVectorFloat (→ ArraySliceVectorFloat), BufferVectorFloat (→ another BufferVectorFloat over the same ByteBuffer), and MemorySegmentVectorFloat (→ segment.asSlice(...)).
- PanamaVectorUtilSupport handles ArraySliceVectorFloat via FloatVector.fromArray(SPEC, asv.get(), asv.arrayOffset() + offset); SIMD preserved.
- ArrayVectorFloat.copyFrom fast-pathed for ArraySliceVectorFloat source.
- ProductQuantization.getSubVector rewritten to vector.subview(offset, length): a one-liner.
Measured on dim=64, M=8, K=16, N=2000 training vectors: 37 B allocated per training vector (down from ~448 B when subvectors were materialized). KMeans centroid storage and distance arrays, the algorithmic remainder, are unchanged.
Verification
All tests green locally, on all relevant test targets:
- mvn test -pl jvector-tests (SIMD profile): 225 tests, 0 failures, 2 skipped
- mvn test -pl jvector-native: 6 tests green (incl. new MemorySegmentVectorFloatWrapTest)
- mvn test -pl jvector-examples: 109 tests green (incl. new MMapRandomAccessVectorValuesTest)
- mvn -B verify (full reactor): green
New test classes:
- BufferVectorFloatTest: element access, endianness, zero-copy on direct buffers, position/limit independence, slice correctness (18 cases)
- VectorTypeSupportByteBufferTest: typed factory across active + Default provider (6)
- ByteBufferRandomAccessVectorValuesTest: parity with ListRAVV, concurrent threadLocalSupplier correctness (6)
- BuildFromByteBufferEquivalenceTest: builds the same graph from float[] vs ByteBuffer, asserts structural + search equivalence across EUCLIDEAN / DOT_PRODUCT / COSINE (4)
- TestSearchWithByteBufferQuery: ByteBuffer overloads produce same results as VectorFloat<?> overloads (2)
- MMapRandomAccessVectorValuesTest: rewritten mmap RAVV round-trip (1)
- MemorySegmentVectorFloatWrapTest: wrap vs legacy copying ctor, endianness, alignment (4)
- SearchAllocationProfileTest: per-query allocation comparison float[] vs ByteBuffer (1)
- VectorFloatSubviewTest: subview aliases source, nested subview, SIMD distance/dot/cosine equivalence with materialized copies (7)
- PQTrainingAllocationTest: getSubVector returns a live view; PQ training per-vector allocation bounded (2)
Existing tests pass; notably TestProductQuantization (9 cases) confirms the view-based subvector extraction produces codebooks identical to the prior materialization path.
HerdDB integration, before and after
Before:
After:
The HerdDB-side upgrade instructions are tracked in
eolivelli/herddb#174.
🤖 Generated with Claude Code