[vector] Add unified vector index integration #8174
Integrate apache/paimon-vector-index (pure Rust IVF-PQ) into Paimon's GlobalIndex SPI framework. Follows the paimon-tantivy two-level module pattern: paimon-ivfpq-jni for Java JNI bindings and NativeLoader, paimon-ivfpq-index for Paimon GlobalIndexer integration.

Key features:
- IVF-PQ vector index with identifier "ivfpq"
- Native Roaring bitmap filter pushdown (byte[] format)
- Direct stream I/O via JNI (no adapter classes needed)
- Reservoir sampling for training with configurable sample ratio
- Batched vector insertion for memory efficiency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
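For readers unfamiliar with the "reservoir sampling for training with configurable sample ratio" feature named in the commit message, here is a minimal sketch of the general technique (Algorithm R). It is not the PR's code: the class name, method names, and the way the reservoir capacity is derived from a sample ratio are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not taken from the PR: Algorithm R over a stream of vectors. */
public class ReservoirSampleSketch {

    /** Keeps a uniform random sample of at most {@code capacity} vectors. */
    static List<float[]> sample(Iterable<float[]> vectors, int capacity, Random rnd) {
        List<float[]> reservoir = new ArrayList<>();
        long seen = 0;
        for (float[] v : vectors) {
            seen++;
            if (reservoir.size() < capacity) {
                // Fill phase: the first `capacity` vectors are always kept.
                reservoir.add(v);
            } else {
                // Replace phase: keep the new vector with probability capacity / seen.
                long j = (long) (rnd.nextDouble() * seen);
                if (j < capacity) {
                    reservoir.set((int) j, v);
                }
            }
        }
        return reservoir;
    }

    /** Assumed mapping from a sample ratio in (0, 1] to a reservoir capacity. */
    static int capacityFor(long expectedRows, double sampleRatio) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sample ratio must be in (0, 1]");
        }
        long wanted = Math.round(expectedRows * sampleRatio);
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, wanted));
    }

    public static void main(String[] args) {
        List<float[]> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(new float[] {i, i + 1});
        }
        int capacity = capacityFor(data.size(), 0.5);
        List<float[]> picked = sample(data, capacity, new Random(42));
        System.out.println("training sample size: " + picked.size());
    }
}
```

The sample is built in one pass with memory bounded by the reservoir capacity, which is presumably why the technique suits index training over large inputs.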
leaves12138
left a comment
Thanks for adding the IVF-PQ GlobalIndex integration. I found one blocking issue: the new SPI service file is included in the RAT scan but does not carry the ASF license header. This makes both the new ivfpq_test workflow and a local mvn -pl paimon-ivfpq/paimon-ivfpq-jni,paimon-ivfpq/paimon-ivfpq-index -am -DskipTests -DfailIfNoTests=false verify fail with Too many files with unapproved license: 1.
I also noticed that this PR adds the new JNI facade and GlobalIndex implementation without any src/test coverage under paimon-ivfpq. After fixing the RAT blocker, please add at least basic coverage for the SPI wiring / JNI facade behavior, or an executable test that exercises building and searching an IVF-PQ index with the native library available.
@@ -0,0 +1 @@
+org.apache.paimon.ivfpq.index.IvfpqVectorGlobalIndexerFactory
This service file needs the standard ASF license header. It is currently reported by RAT as an unapproved file, so mvn ... verify fails before the new modules can be tested. Existing service files in the repository, such as the Lumina and Tantivy GlobalIndexerFactory service files, include the # Licensed to the Apache Software Foundation ... header.
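As a rough illustration of the requested fix, the sketch below embeds a service file that starts with the standard ASF source header as `#` comments and parses it with the rules `java.util.ServiceLoader` documents for provider-configuration files (text after `#` is ignored, as are blank lines and surrounding whitespace). The class and method names are hypothetical, and the header wording in the repository's existing service files may differ slightly in layout.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a licensed service file still yields exactly one provider name. */
public class ServiceFileSketch {

    static final String LICENSED_SERVICE_FILE = String.join("\n",
            "# Licensed to the Apache Software Foundation (ASF) under one or more",
            "# contributor license agreements.  See the NOTICE file distributed with",
            "# this work for additional information regarding copyright ownership.",
            "# The ASF licenses this file to You under the Apache License, Version 2.0",
            "# (the \"License\"); you may not use this file except in compliance with",
            "# the License.  You may obtain a copy of the License at",
            "#",
            "#     http://www.apache.org/licenses/LICENSE-2.0",
            "#",
            "# Unless required by applicable law or agreed to in writing, software",
            "# distributed under the License is distributed on an \"AS IS\" BASIS,",
            "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.",
            "# See the License for the specific language governing permissions and",
            "# limitations under the License.",
            "",
            "org.apache.paimon.ivfpq.index.IvfpqVectorGlobalIndexerFactory");

    /** Mimics the documented ServiceLoader parsing rules for a provider-configuration file. */
    static List<String> providerNames(String content) {
        List<String> names = new ArrayList<>();
        for (String line : content.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop the comment part
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(providerNames(LICENSED_SERVICE_FILE));
    }
}
```

Because the header consists only of comment lines, adding it should satisfy RAT without changing which factory class is discovered.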
leaves12138
left a comment
Thanks for the quick update. The previous RAT blocker is fixed, and the main IVF-PQ modules compile locally with -Pfast-build. However, the newly added tests do not compile yet. Options does not provide setBoolean(String, boolean) or setDouble(String, double), so the latest ivfpq_test CI fails during paimon-ivfpq-index test compilation. Please switch these calls to options.set(<ConfigOption>, value) (for example options.set(IvfpqVectorIndexOptions.USE_OPQ, true) and options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5)) or use supported string setters before re-running the CI.
options.setString("ivfpq.distance.metric", "l2");
options.setInteger("ivfpq.nlist", 128);
options.setInteger("ivfpq.m", 8);
options.setBoolean("ivfpq.use_opq", true);
This test does not compile because Options has no setBoolean(String, boolean) method. The same applies to the setDouble calls below and the setBoolean call in IvfpqVectorGlobalIndexTest. Please use options.set(IvfpqVectorIndexOptions.USE_OPQ, true) / options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5) or supported string setters instead.
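The suggested change can be illustrated with a self-contained sketch. To keep it runnable without Paimon on the classpath, `Options` and `ConfigOption` below are simplified stand-ins, not the real Paimon classes, and the `ivfpq.train.sample_ratio` key is a hypothetical placeholder (the review names the `TRAIN_SAMPLE_RATIO` constant but not its key). The other keys come from the quoted test lines.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of typed option setters versus string-keyed setters. */
public class TypedOptionsSketch {

    /** Stand-in for a typed option: the type parameter ties a key to its value type. */
    static final class ConfigOption<T> {
        final String key;

        ConfigOption(String key) {
            this.key = key;
        }
    }

    /** Stand-in for an options bag that stores every value as a string. */
    static final class Options {
        private final Map<String, String> data = new HashMap<>();

        <T> Options set(ConfigOption<T> option, T value) {
            data.put(option.key, String.valueOf(value));
            return this;
        }

        Options setString(String key, String value) {
            data.put(key, value);
            return this;
        }

        Options setInteger(String key, int value) {
            data.put(key, Integer.toString(value));
            return this;
        }

        String get(String key) {
            return data.get(key);
        }
    }

    static final ConfigOption<Boolean> USE_OPQ = new ConfigOption<>("ivfpq.use_opq");
    // Hypothetical key: the review does not state the real one.
    static final ConfigOption<Double> TRAIN_SAMPLE_RATIO =
            new ConfigOption<>("ivfpq.train.sample_ratio");

    static Options build() {
        Options options = new Options();
        options.setString("ivfpq.distance.metric", "l2");
        options.setInteger("ivfpq.nlist", 128);
        options.setInteger("ivfpq.m", 8);
        // Typed setters replace the non-existent setBoolean / setDouble calls.
        options.set(USE_OPQ, true);
        options.set(TRAIN_SAMPLE_RATIO, 0.5);
        return options;
    }

    public static void main(String[] args) {
        Options options = build();
        System.out.println(options.get("ivfpq.use_opq"));
    }
}
```

With typed options, passing a value of the wrong type (for example a string to `USE_OPQ`) fails at compile time rather than when the index is built.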
leaves12138
left a comment
Thanks for the updates. I re-reviewed the latest revision and the previous blockers look resolved:
- the new GlobalIndexer SPI service file now carries the ASF license header;
- the tests no longer use unsupported Options#setBoolean/setDouble APIs;
- the vector-index modules compile cleanly locally;
- native end-to-end vector index tests pass after building paimon-vector-index JNI and copying libpaimon_vindex_jni.so into the test resources, matching the new workflow.
Local checks I ran:
mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false -Dcheckstyle.skip=true -Dspotless.check.skip=true -Drat.skip=false verify
cargo build --release -p paimon-vindex-jni
mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -Dcheckstyle.skip=true -Dspotless.check.skip=true test
The module test run completed with Tests run: 21, Failures: 0, Errors: 0, Skipped: 0 once the native library was available. LGTM.
Summary
Integrate apache/paimon-vector-index with Paimon GlobalIndex SPI as the new paimon-vector module. The PR follows the latest upstream unified Java/JNI vector index API and supports multiple ANN algorithms instead of IVF-PQ only.

Changes
- Add paimon-vector with two submodules:
  - paimon-vector-jni: Java JNI bindings and native library loading.
  - paimon-vector-index: Paimon GlobalIndex reader/writer integration.
- [...]
  - ivf-flat
  - ivf-pq
  - ivf-hnsw-flat
  - ivf-hnsw-sq
- Make VectorGlobalIndexerFactory the abstract base class and let each concrete factory provide the native vector index type.
- [...] index_type; no Paimon-facing vector.index.type option is introduced.
- [...] paimon-vector-index API:
  - [...] org.apache.paimon.index.vector, matching current native JNI symbols.
  - VectorIndexWriter accepts upstream-style Map<String, String> options.
  - VectorIndexReader opens directly from VectorIndexInput and exposes string metadata for index type and metric.
- [...] VectorIndexOptions, VectorMetric, and JNI-owned enum wrappers. Vector build/search parameters are documented as vector.* dynamic options and are read directly from Options where needed.
- [...] nprobe and ef_search; dimension and metric are read from the native vector index metadata.
- Document vector.* options, and update multimodal/Flink procedure docs to use the new vector index names.
- Add utcase-vector-index CI workflow and focused tests for SPI registration, metadata, validation, and search behavior.

Documented Vector Options

| Option | Default | Description |
| --- | --- | --- |
| vector.index.dimension | 128 | [...] ARRAY<FLOAT> columns. VECTOR<FLOAT> columns use the type dimension. |
| vector.distance.metric | inner_product | [...] l2, cosine, or inner_product. |
| vector.nlist | 256 | |
| vector.pq.m | 16 | [...] ivf-pq; the vector dimension must be divisible by this value. |
| vector.pq.use-opq | false | [...] ivf-pq. |
| vector.hnsw.m | 20 | [...] ivf-hnsw-flat and ivf-hnsw-sq. |
| vector.hnsw.ef-construction | 150 | |
| vector.hnsw.max-level | 7 | |
| vector.nprobe | 16 | |
| vector.hnsw.ef-search | 0 | 0 uses the native library default. |
| vector.train.sample-ratio | 1.0 | |
| vector.add.batch-size | 10000 | |

Notes
- [...] ivfpq.* options, old module names, old JNI package names, or the previous JNI config-object API because this integration has not been released.
- [...] vector naming.

Testing
- mvn -pl paimon-vector/paimon-vector-index spotless:apply -DfailIfNoTests=false
- mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
- mvn -pl paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false compile
- mvn -pl paimon-vector/paimon-vector-index -am -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
- mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest clean test
- git diff --check

Native-library-dependent test cases are skipped locally when the native library is unavailable. The dedicated vector CI workflow builds the latest native library and should exercise those tests on Linux.
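The "skip when the native library is unavailable" behavior can be sketched with a plain availability probe. This is only an assumption about how such a guard might look: the class and method names are hypothetical, the library name is inferred from the libpaimon_vindex_jni.so file mentioned in the review, and the PR's own loader may extract the library from resources instead of using java.library.path.

```java
/** Hypothetical sketch: probe for a JNI library before running native-dependent tests. */
public class NativeLibraryCheckSketch {

    /** Returns true if the named library can be loaded from java.library.path. */
    static boolean isNativeAvailable(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            // Library missing or loading not permitted: treat as unavailable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Name inferred from libpaimon_vindex_jni.so; an assumption, not confirmed by the PR.
        String lib = "paimon_vindex_jni";
        if (!isNativeAvailable(lib)) {
            System.out.println("skipping native tests: " + lib + " not found");
            return;
        }
        System.out.println("native library loaded; running native tests");
    }
}
```

In a JUnit 5 test, the same boolean could be passed to `Assumptions.assumeTrue(...)` so that the case is reported as skipped rather than failed.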