[vector] Add unified vector index integration#8174

Open
JingsongLi wants to merge 9 commits into apache:master from JingsongLi:ivfpq-integration

@JingsongLi JingsongLi commented Jun 9, 2026

Contributor

Summary

Integrate apache/paimon-vector-index with Paimon GlobalIndex SPI as the new paimon-vector module. The PR follows the latest upstream unified Java/JNI vector index API and supports multiple ANN algorithms instead of IVF-PQ only.

Changes

  • Add paimon-vector with two submodules:
    • paimon-vector-jni: Java JNI bindings and native library loading.
    • paimon-vector-index: Paimon GlobalIndex reader/writer integration.
  • Register concrete vector GlobalIndex factories for:
    • ivf-flat
    • ivf-pq
    • ivf-hnsw-flat
    • ivf-hnsw-sq
  • Make VectorGlobalIndexerFactory the abstract base class and let each concrete factory provide the native vector index type.
  • Treat the GlobalIndex index_type as the single source of truth for the selected algorithm; no Paimon-facing vector.index.type option is introduced.
  • Align the JNI wrapper with the latest upstream paimon-vector-index API:
    • Java package is org.apache.paimon.index.vector, matching current native JNI symbols.
    • VectorIndexWriter accepts upstream-style Map<String, String> options.
    • VectorIndexReader opens directly from VectorIndexInput and exposes string metadata for index type and metric.
  • Remove Paimon-side VectorIndexOptions, VectorMetric, and JNI-owned enum wrappers. Vector build/search parameters are documented as vector.* dynamic options and are read directly from Options where needed.
  • Minimize Paimon vector index metadata to only the Paimon-owned search parameters nprobe and ef_search; dimension and metric are read from the native vector index metadata.
  • Add vector search filter pushdown with Roaring include-row-id filtering.
  • Add vector index docs for supported index types and documented vector.* options, and update multimodal/Flink procedure docs to use the new vector index names.
  • Add a dedicated utcase-vector-index CI workflow and focused tests for SPI registration, metadata, validation, and search behavior.
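The factory arrangement described in the list above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: the class names, the in-memory registry (the PR uses SPI service registration), and the `index.type` native option key are all assumptions made for the example. It only shows the idea that each concrete factory supplies the native index type, so a user-supplied option can never disagree with the registered identifier.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an abstract vector indexer factory. */
abstract class SketchVectorIndexerFactory {
    /** Identifier the factory is registered under (the GlobalIndex index_type). */
    abstract String identifier();

    /** Native index type handed to the writer; in this sketch it equals the identifier. */
    String nativeIndexType() {
        return identifier();
    }

    /** Copies user options and forces the index type chosen by the factory. */
    Map<String, String> writerOptions(Map<String, String> userOptions) {
        Map<String, String> merged = new LinkedHashMap<>(userOptions);
        merged.put("index.type", nativeIndexType()); // "index.type" is an assumed key
        return merged;
    }
}

final class IvfFlatSketchFactory extends SketchVectorIndexerFactory {
    @Override String identifier() { return "ivf-flat"; }
}

final class IvfPqSketchFactory extends SketchVectorIndexerFactory {
    @Override String identifier() { return "ivf-pq"; }
}

final class IvfHnswFlatSketchFactory extends SketchVectorIndexerFactory {
    @Override String identifier() { return "ivf-hnsw-flat"; }
}

final class IvfHnswSqSketchFactory extends SketchVectorIndexerFactory {
    @Override String identifier() { return "ivf-hnsw-sq"; }
}

/** In-memory registry standing in for service-file based discovery. */
final class SketchFactoryRegistry {
    private static final Map<String, SketchVectorIndexerFactory> FACTORIES = new LinkedHashMap<>();

    static {
        SketchVectorIndexerFactory[] all = {
            new IvfFlatSketchFactory(), new IvfPqSketchFactory(),
            new IvfHnswFlatSketchFactory(), new IvfHnswSqSketchFactory()
        };
        for (SketchVectorIndexerFactory factory : all) {
            FACTORIES.put(factory.identifier(), factory);
        }
    }

    private SketchFactoryRegistry() {}

    /** Resolves a factory by index type and rejects unknown identifiers. */
    static SketchVectorIndexerFactory lookup(String indexType) {
        SketchVectorIndexerFactory factory = FACTORIES.get(indexType);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown vector index type: " + indexType);
        }
        return factory;
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("vector.nlist", "128");
        System.out.println(lookup("ivf-pq").writerOptions(user));
    }
}
```

Because the index type is written last in `writerOptions`, a stray user-supplied value for the same key is overwritten rather than honored, which is one simple way to keep a single source of truth.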

Documented Vector Options

| Option | Default | Description |
|---|---|---|
| vector.index.dimension | 128 | Vector dimension for ARRAY<FLOAT> columns. VECTOR<FLOAT> columns use the type dimension. |
| vector.distance.metric | inner_product | Distance metric: l2, cosine, or inner_product. |
| vector.nlist | 256 | Number of IVF clusters used during index build. |
| vector.pq.m | 16 | Number of PQ sub-vectors for ivf-pq; the vector dimension must be divisible by this value. |
| vector.pq.use-opq | false | Whether to enable OPQ for ivf-pq. |
| vector.hnsw.m | 20 | HNSW graph out-degree for ivf-hnsw-flat and ivf-hnsw-sq. |
| vector.hnsw.ef-construction | 150 | HNSW construction search width. |
| vector.hnsw.max-level | 7 | Maximum HNSW level. |
| vector.nprobe | 16 | Number of IVF clusters to probe during search. |
| vector.hnsw.ef-search | 0 | HNSW search width during search; 0 uses the native library default. |
| vector.train.sample-ratio | 1.0 | Ratio of vectors sampled for index training. |
| vector.add.batch-size | 10000 | Batch size used when adding vectors to the native index writer. |
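As an illustration of how the documented options and defaults might be read from a plain string map and checked before a build, here is a small sketch. The option keys and defaults come from the list above; the `VectorOptionsSketch` class, its method names, and the exact error messages are assumptions for the example and are not taken from the PR.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that reads documented vector.* options from a string map. */
final class VectorOptionsSketch {
    private static final List<String> METRICS = Arrays.asList("l2", "cosine", "inner_product");

    private VectorOptionsSketch() {}

    /** Returns the integer value for key, or the documented default when the key is absent. */
    static int intOption(Map<String, String> options, String key, int defaultValue) {
        String raw = options.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    /** Returns the distance metric, defaulting to inner_product, and rejects unknown names. */
    static String metric(Map<String, String> options) {
        String value = options.getOrDefault("vector.distance.metric", "inner_product");
        if (!METRICS.contains(value)) {
            throw new IllegalArgumentException("Unsupported metric: " + value);
        }
        return value;
    }

    /** Checks the ivf-pq constraint that the dimension is divisible by vector.pq.m. */
    static void validateIvfPq(Map<String, String> options) {
        int dimension = intOption(options, "vector.index.dimension", 128);
        int m = intOption(options, "vector.pq.m", 16);
        if (dimension <= 0 || m <= 0) {
            throw new IllegalArgumentException("dimension and vector.pq.m must be positive");
        }
        if (dimension % m != 0) {
            throw new IllegalArgumentException(
                    "dimension " + dimension + " is not divisible by vector.pq.m " + m);
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("vector.index.dimension", "96");
        options.put("vector.pq.m", "8");
        validateIvfPq(options); // 96 is divisible by 8, so no exception is thrown
        System.out.println("metric=" + metric(options));
    }
}
```

With the documented defaults (dimension 128, vector.pq.m 16) the divisibility check passes; a dimension of 100 with the default vector.pq.m would be rejected.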

Notes

  • No compatibility path is kept for old ivfpq.* options, old module names, old JNI package names, or the previous JNI config-object API because this integration has not been released.
  • The Paimon-facing module, artifact, workflow, integration package, and JNI wrapper package use vector naming.

Testing

  • mvn -pl paimon-vector/paimon-vector-index spotless:apply -DfailIfNoTests=false
  • mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
  • mvn -pl paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false compile
  • mvn -pl paimon-vector/paimon-vector-index -am -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest test
  • mvn -pl paimon-vector/paimon-vector-index -am -Pfast-build -DfailIfNoTests=false -Dtest=VectorGlobalIndexerFactoryTest,VectorGlobalIndexTest clean test
  • git diff --check

Native-library-dependent test cases are skipped locally when the native library is unavailable. The dedicated vector CI workflow builds the latest native library and should exercise those tests on Linux.

Integrate apache/paimon-vector-index (pure Rust IVF-PQ) into Paimon's
GlobalIndex SPI framework. Follows the paimon-tantivy two-level module
pattern: paimon-ivfpq-jni for Java JNI bindings and NativeLoader,
paimon-ivfpq-index for Paimon GlobalIndexer integration.

Key features:
- IVF-PQ vector index with identifier "ivfpq"
- Native Roaring bitmap filter pushdown (byte[] format)
- Direct stream I/O via JNI (no adapter classes needed)
- Reservoir sampling for training with configurable sample ratio
- Batched vector insertion for memory efficiency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
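The "reservoir sampling for training" item in the commit message above refers to a standard technique for drawing a fixed-size uniform sample from a stream in one pass. The sketch below shows the classic Algorithm R form in plain Java. It is not the PR's implementation: the class and method names are hypothetical, and deriving the reservoir capacity from a sample ratio and a known row count is an assumption about how the ratio might be applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical one-pass uniform sampler (Algorithm R). */
final class ReservoirSketch {
    private ReservoirSketch() {}

    /** Assumed mapping from a sample ratio in (0, 1] to a reservoir capacity. */
    static int capacityFor(long totalRows, double sampleRatio) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sample ratio must be in (0, 1]");
        }
        double wanted = Math.ceil(totalRows * sampleRatio);
        return (int) Math.max(1.0, Math.min((double) Integer.MAX_VALUE, wanted));
    }

    /** Keeps at most capacity items; each streamed item ends up retained with equal probability. */
    static <T> List<T> sample(Iterable<T> items, int capacity, Random rng) {
        List<T> reservoir = new ArrayList<>(capacity);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < capacity) {
                reservoir.add(item); // fill phase: keep the first capacity items
            } else {
                long slot = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (slot < capacity) {
                    reservoir.set((int) slot, item); // replace with probability capacity / seen
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        int capacity = capacityFor(rows.size(), 0.05);
        System.out.println(sample(rows, capacity, new Random(42)).size());
    }
}
```

The appeal of this approach for index training is that memory use is bounded by the reservoir capacity regardless of how many vectors are streamed through the writer.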

@leaves12138 leaves12138 left a comment

Contributor

Thanks for adding the IVF-PQ GlobalIndex integration. I found one blocking issue: the new SPI service file is included in the RAT scan but does not carry the ASF license header. This makes both the new ivfpq_test workflow and a local mvn -pl paimon-ivfpq/paimon-ivfpq-jni,paimon-ivfpq/paimon-ivfpq-index -am -DskipTests -DfailIfNoTests=false verify fail with Too many files with unapproved license: 1.

I also noticed that this PR adds the new JNI facade and GlobalIndex implementation without any src/test coverage under paimon-ivfpq. After fixing the RAT blocker, please add at least basic coverage for the SPI wiring / JNI facade behavior, or an executable test that exercises building and searching an IVF-PQ index with the native library available.

@@ -0,0 +1 @@
org.apache.paimon.ivfpq.index.IvfpqVectorGlobalIndexerFactory

Contributor

This service file needs the standard ASF license header. It is currently reported by RAT as an unapproved file, so mvn ... verify fails before the new modules can be tested. Existing service files in the repository, such as the Lumina and Tantivy GlobalIndexerFactory service files, include the # Licensed to the Apache Software Foundation ... header.

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the quick update. The previous RAT blocker is fixed, and the main IVF-PQ modules compile locally with -Pfast-build. However, the newly added tests do not compile yet. Options does not provide setBoolean(String, boolean) or setDouble(String, double), so the latest ivfpq_test CI fails during paimon-ivfpq-index test compilation. Please switch these calls to options.set(<ConfigOption>, value) (for example options.set(IvfpqVectorIndexOptions.USE_OPQ, true) and options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5)) or use supported string setters before re-running the CI.

options.setString("ivfpq.distance.metric", "l2");
options.setInteger("ivfpq.nlist", 128);
options.setInteger("ivfpq.m", 8);
options.setBoolean("ivfpq.use_opq", true);

Contributor

This test does not compile because Options has no setBoolean(String, boolean) method. The same applies to the setDouble calls below and the setBoolean call in IvfpqVectorGlobalIndexTest. Please use options.set(IvfpqVectorIndexOptions.USE_OPQ, true) / options.set(IvfpqVectorIndexOptions.TRAIN_SAMPLE_RATIO, 0.5) or supported string setters instead.

@JingsongLi JingsongLi changed the title [core] Add paimon-ivfpq module for IVF-PQ vector index integration [WIP][core] Add paimon-ivfpq module for IVF-PQ vector index integration Jun 9, 2026
@JingsongLi JingsongLi changed the title [WIP][core] Add paimon-ivfpq module for IVF-PQ vector index integration [core] Add vector index integration Jun 10, 2026
@JingsongLi JingsongLi changed the title [core] Add vector index integration [vector] Add vector index integration Jun 10, 2026

@leaves12138 leaves12138 left a comment

Contributor

Thanks for the updates. I re-reviewed the latest revision and the previous blockers look resolved:

  • the new GlobalIndexer SPI service file now carries the ASF license header;
  • the tests no longer use unsupported Options#setBoolean / setDouble APIs;
  • the vector-index modules compile cleanly locally;
  • native end-to-end vector index tests pass after building paimon-vector-index JNI and copying libpaimon_vindex_jni.so into the test resources, matching the new workflow.

Local checks I ran:

mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -am -DskipTests -DfailIfNoTests=false -Dcheckstyle.skip=true -Dspotless.check.skip=true -Drat.skip=false verify
cargo build --release -p paimon-vindex-jni
mvn -B -ntp -pl paimon-vector/paimon-vector-jni,paimon-vector/paimon-vector-index -Dcheckstyle.skip=true -Dspotless.check.skip=true test

The module test run completed with Tests run: 21, Failures: 0, Errors: 0, Skipped: 0 once the native library was available. LGTM.

@JingsongLi JingsongLi changed the title [vector] Add vector index integration [vector] Add unified vector index integration Jun 10, 2026
2 participants