Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#623
Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#623MarkWolters wants to merge 2 commits intomainfrom
Conversation
|
Before you submit for review:
If you did not complete any of these, then please explain below. |
jshook
left a comment
There was a problem hiding this comment.
This looks reasonable to me. It would be nice to see the results laid out a little better for before and after, but it still looks like a solid improvement.
|
@MarkWolters, Thank you for creating this pull request. Please let me know if you have any concerns about the current implementation or need any additional data. |
r-devulap
left a comment
There was a problem hiding this comment.
First‑pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.
|
|
||
| /*-------------------- Score functions--------------------*/ | ||
| //adding SPECIES_64 & SPECIES_128 for completeness, will it get there? | ||
| float pqScoreEuclidean_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) { |
There was a problem hiding this comment.
This name is mis-leading, probably better to rename this to pqScoreEuclidean_preferred?
| int length1 = centroidIndex1 * centroidLength; | ||
| int length2 = centroidIndex2 * centroidLength; | ||
|
|
||
| if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) { |
There was a problem hiding this comment.
Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?
| else if (subvectorSizesAndOffsets[0][0] >= FloatVector.SPECIES_128.length()) { | ||
| return pqScoreEuclidean_128( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); | ||
| } | ||
| return pqScoreEuclidean_64( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); |
There was a problem hiding this comment.
We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.
Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:
pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)pqScoreEuclidean(...)selectsspecies(SPECIES_PREFERRED,SPECIES_256,SPECIES_128,SPECIES_64) once and dispatches to the shared impl.
This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.
This pull request introduces improvement to Euclidean similarity function in PQVectors.diversityFunctionFor. From flamegraph, it is observed that considerable amount of time is spent in jdk/incubator/vector/FloatVector.reducelanesTemplate. This is mainly because FloatVector.reducelanes() is expensive and it is being called inside a for loop (via VectorUtil.squareL2Distance). Modification in this pull request moves call to reduceLanes() outside the for loop.
Change proposed here was tested with the benchmark, PQDistanceCalculationBenchmark.diversityCalculation
With this benchmark, ~18% reduction in time was observed when M=64 and ~22% when M=192.
Code modifications:
Added a new function pqDiversityEuclidean in VectorUtilSupport and its corresponding implementations
Removed for loop in PQVectors.diversityFunctionFor and moved it into pqDiversityEuclidean
Moved FloatVector.reducelanes() outside the for loop
Test setup:
Jvector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation
New changes
I have modified the code to include dot product & cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows ~30% reduction in latency & cosine shows ~43% reduction.
I can add data points for other subspace counts, if required.
Code modifications:
Added new functions pqScoreEuclidean, pqScoreDotProduct, pqScoreCosine in VectorUtilSupport and its corresponding implementations for diversityFunctionFor
Added overloaded version of above functions for scoreFunctionFor
Removed for loop in PQVectors.diversityFunctionFor and PQVectors.scoreFunctionFor and moved them into respective functions in PanamaVectorUtilSupport
Moved FloatVector.reducelanes() outside the for loop
Added a new benchmark which uses MutablePQVectors to test this
Test setup:
Jvector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark : PQDistanceCalculationMutableVectorBenchmark (Added this by replicating PQDistanceCalculationBenchmark to measure performance for MutablePQVectors)