Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function by MarkWolters · Pull Request #623 · datastax/jvector

MarkWolters · 2026-02-12T17:52:49Z

This pull request introduces improvement to Euclidean similarity function in PQVectors.diversityFunctionFor. From flamegraph, it is observed that considerable amount of time is spent in jdk/incubator/vector/FloatVector.reducelanesTemplate. This is mainly because FloatVector.reducelanes() is expensive and it is being called inside a for loop (via VectorUtil.squareL2Distance). Modification in this pull request moves call to reduceLanes() outside the for loop.

Change proposed here was tested with the benchmark, PQDistanceCalculationBenchmark.diversityCalculation
With this benchmark, ~18% reduction in time was observed when M=64 and ~22% when M=192.

Code modifications:

Added a new function pqDiversityEuclidean in VectorUtilSupport and its corresponding implementations
Removed for loop in PQVectors.diversityFunctionFor and moved it into pqDiversityEuclidean
Moved FloatVector.reducelanes() outside the for loop
Test setup:
Jvector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation

New changes
I have modified the code to include dot product & cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows ~30% reduction in latency & cosine shows ~43% reduction.
I can add data points for other subspace counts, if required.

Code modifications:

Added new functions pqScoreEuclidean, pqScoreDotProduct, pqScoreCosine in VectorUtilSupport and its corresponding implementations for diversityFunctionFor
Added overloaded version of above functions for scoreFunctionFor
Removed for loop in PQVectors.diversityFunctionFor and PQVectors.scoreFunctionFor and moved them into respective functions in PanamaVectorUtilSupport
Moved FloatVector.reducelanes() outside the for loop
Added a new benchmark which uses MutablePQVectors to test this
Test setup:
Jvector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark : PQDistanceCalculationMutableVectorBenchmark (Added this by replicating PQDistanceCalculationBenchmark to measure performance for MutablePQVectors)

github-actions · 2026-02-12T17:53:02Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jshook

This looks reasonable to me. It would be nice to see the results laid out a little better for before and after, but it still looks like a solid improvement.

shyla226 · 2026-03-02T17:26:17Z

@MarkWolters, Thank you for creating this pull request. Please let me know if you have any concerns about the current implementation or need any additional data.
@jshook, I collected benchmark data using PQDistanceCalculationMutableVectorBenchmark. Would you like to see tabulated results from the runs, or do you have what you need already?

r-devulap

First‑pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.

r-devulap · 2026-03-05T12:39:59Z

+
+    /*-------------------- Score functions--------------------*/
+    //adding SPECIES_64 & SPECIES_128 for completeness, will it get there?
+    float pqScoreEuclidean_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) {


This name is mis-leading, probably better to rename this to pqScoreEuclidean_preferred?

I have renamed the functions, code available here

I rearranged the code a little bit for better clarity.

r-devulap · 2026-03-05T12:40:15Z

+            int length1 = centroidIndex1 * centroidLength;
+            int length2 = centroidIndex2 * centroidLength;
+
+            if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) {


Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?

I did it this way, so we don't process the tail when it is not needed, similar to dotPreferred & dotProductPreferred functions in the file. Not processing the tail seemed to yield some performance benefit in my previous experiments.
Let me do a few more tests and get back to you on this one

r-devulap · 2026-03-05T12:41:07Z

+        else if (subvectorSizesAndOffsets[0][0]  >=  FloatVector.SPECIES_128.length()) {
+            return pqScoreEuclidean_128( codebooks,  subvectorSizesAndOffsets, node1Chunk, node1Offset,  node2Chunk, node2Offset, subspaceCount);
+        }
+        return pqScoreEuclidean_64( codebooks,  subvectorSizesAndOffsets, node1Chunk, node1Offset,  node2Chunk, node2Offset, subspaceCount);


We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.

Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:

pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)

pqScoreEuclidean(...) selects species (SPECIES_PREFERRED, SPECIES_256, SPECIES_128, SPECIES_64) once and dispatches to the shared impl.

This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.

shyla226 · 2026-03-10T23:28:31Z

First‑pass review: there’s quite a bit of duplicated code, and I think we can consolidate it significantly.

@r-devulap, Thank you very much for your feedback. I agree there is quite a bit of duplication of code, but I had tried creating a generic function, providing VectorSpecies as an argument.
For cosine function, something like this
float pqScoreCosine_generic(final VectorSpecies<Float> SPEC, VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> encodedChunk, int encodedOffset, VectorFloat<?> centeredQuery, int subspaceCount)

What I found was that, if the SPECIES is not set at compile-time, we don't see the performance benefit.

I tried it again after I saw your feedback, I see the same behavior as I did before unfortunately!
I am open to other thoughts you have.

r-devulap · 2026-03-12T07:21:34Z

@shyla226 Quick question regarding this patch.

It updates diversityFunctionFor in the abstract PQVectors class, but I’m not convinced the benchmark PQDistanceCalculationBenchmark.diversityCalculation would show any improvement from this change
as you claimed.

In the benchmark setup, pqVectors is populated via pq.encodeAll(ravv), which produces an ImmutablePQVectors. As a result, the call to diversityFunctionFor is dispatched to ImmutablePQVectors.diversityFunctionFor — the overridden implementation that uses precomputed codebook partial sums — rather than the base implementation in PQVectors that this patch modifies. You can confirm this in the attached flame graph: the class method MutablePQVectors.diversityFunctionFor never appears in the benchmark.

That code path is only exercised when the instance is a MutablePQVectors, which does not override the method. The new microbenchmarks you added for MutablePQVectors may indeed show improvements, since they actually hit the modified code.

Would you be able to show whether this change yields improvements in a high‑level index‑construction benchmark rather than only the microbenchmarks?. My understanding is that MutablePQVectors is used during incremental (larger‑than‑memory) index construction, but I don't believe we currently have benchmarks covering that. @jshook or @MarkWolters can likely confirm.

For the standard in‑memory (on‑heap) index build, the code path ImmutablePQVectors.diversityFunctionFor isn’t on the hot path, and I'm not sure whether incremental index construction would behave differently.

shyla226 · 2026-03-12T17:53:04Z

@r-devulap, you have a very good point, if it is not in the critical path there is no point in adding complexity.
Are there any high-level benchmarks, which use MutablePQVectors, that you would like me to try?
Let me know if you have any other thoughts.

shyla226 · 2026-04-06T20:45:38Z

@r-devulap, DataStax fork of Apache Cassandra seems to use the latest JVector with MutablePQVectors. I tested my changes with cql_vector2_fvec schema available in NoSQLBench and I see ~5% drop in execution time, when I run the scenario astra_vectors.train available in cql_vector2_fvec.yaml.

Configuration used:
Platform : Intel(R) Xeon(R) 6979P

Java :
Version : openjdk version "25.0.2" 2026-01-20
Heap Size : 31G

Cassandra :
Version : https://github.com/datastax/cassandra (commit 433048473cdd7e24245cbe9898e7a9e07096cdd8)

NoSQLBench :
Version : 5.25.7-SNAPSHOT
Schema : cql_vector2_fvec
Datafile : text-embedding-3-large_1536_100000_(base_vectors/indices_query/query_vectors).fvec
cycles : 5000000
scenario : astra_vectors.train

Experiment setup :
Cassandra process was bound to a numa node of Socket0 & nb5.jar bound to a numa node of Socket1. After running astra_vectors.train scenario, session_time_ms reported in status.yaml was used as execution time.

r-devulap · 2026-04-08T06:18:55Z

@shyla226 could you generate a flame graph for the workload and post it here? Curious to see where the hotspots are and if the reduce add shows up in there.

shyla226 · 2026-04-08T17:13:51Z

@r-devulap, attaching the flamegraph segment for baseline & modified versions, showing MutablePQVector call stack in the Cassandra version I used. Similarity function appears to be Cosine.

r-devulap · 2026-04-13T02:42:52Z

@shyla226 thanks for sharing those. I am not familiar with these workloads yet, please allow sometime for me to verify these benchmarks.

MarkWolters and others added 2 commits February 2, 2026 10:19

added space to force branch creation

8e85edb

Rebased reduce_lanes changes with latest main (#611)

5969a42

MarkWolters requested review from jshook and tlwillke as code owners February 12, 2026 17:52

jshook approved these changes Feb 12, 2026

View reviewed changes

r-devulap mentioned this pull request Mar 5, 2026

Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function #514

Open

r-devulap reviewed Mar 5, 2026

View reviewed changes

Conversation

MarkWolters commented Feb 12, 2026

Uh oh!

github-actions Bot commented Feb 12, 2026

Uh oh!

jshook left a comment

Choose a reason for hiding this comment

Uh oh!

shyla226 commented Mar 2, 2026

Uh oh!

r-devulap left a comment

Choose a reason for hiding this comment

Uh oh!

r-devulap Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shyla226 Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

r-devulap Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

shyla226 Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

r-devulap Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

shyla226 commented Mar 10, 2026

Uh oh!

r-devulap commented Mar 12, 2026

Uh oh!

shyla226 commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shyla226 commented Apr 6, 2026

Uh oh!

r-devulap commented Apr 8, 2026

Uh oh!

shyla226 commented Apr 8, 2026

Uh oh!

r-devulap commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

r-devulap Mar 5, 2026 •

edited

Loading

shyla226 commented Mar 12, 2026 •

edited

Loading