feat(index): support distributed LabelList scalar index builds #7223

Open
jackye1995 wants to merge 1 commit into lance-format:main from jackye1995:jack/label-list-distributed-build
@jackye1995

Contributor

Summary

LabelListIndexPlugin::train_index unconditionally rejected fragment-scoped
training (fragment_ids.is_some()), so a LabelList index could not be built
through the distributed / segmented path (per-fragment execute_uncommitted +
merge_existing_index_segments) the way BTree, Inverted, Bitmap, and FM indexes
already can. This PR makes LabelList a first-class distributed scalar index.

Changes

  • lance-index/src/scalar/label_list.rs: train_index now ignores fragment_ids and
    builds over the already fragment-scoped stream (mirrors FMIndexPlugin). A partial
    index over a fragment subset is correct because it covers exactly those rows. Add
    merge_label_list_indices, which unions the per-segment bitmap states and the
    list_nulls row sets. LabelList wraps a BitmapIndex plus a null-row set, and
    distributed segments cover disjoint rows, so the merge is a cheap union (the same
    operation LabelListIndex::update performs) with no source re-scan. It mirrors
    merge_bitmap_indices but also carries list_nulls.
  • lance/src/index/scalar/label_list.rs (new): merge_segments opens each source
    segment as a LabelListIndex and calls merge_label_list_indices (mirrors
    scalar/bitmap.rs).
  • lance/src/index.rs: merge_existing_index_segments routes all_label_list
    segments (details LabelListIndexDetails) to the new merge; add
    segment_has_label_list_details.
  • lance/src/index/scalar.rs: declare the label_list module.
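
For readers who want the shape of the merge without opening the diff, here is a minimal, std-only sketch of the idea described above. It is not the code in this PR: LabelListSegment, train_segment, and merge_segments are hypothetical names, and a BTreeSet of row ids stands in for whatever bitmap type the real index uses. It only assumes what the description states: each segment holds per-label row sets plus a null-row set, training ignores fragment_ids, and segments cover disjoint rows.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy stand-in for one LabelList index segment (hypothetical, not Lance's type).
#[derive(Debug, Default, Clone, PartialEq)]
struct LabelListSegment {
    /// label value -> ids of rows whose list contains that label
    bitmaps: BTreeMap<String, BTreeSet<u64>>,
    /// ids of rows whose list value is null
    list_nulls: BTreeSet<u64>,
}

/// Build a segment from an already fragment-scoped stream of (row_id, list) pairs.
/// `_fragment_ids` is accepted but ignored: the stream is assumed to be scoped
/// by the caller, so the segment covers exactly the rows it is given.
fn train_segment<'a>(
    rows: impl IntoIterator<Item = (u64, Option<Vec<&'a str>>)>,
    _fragment_ids: Option<&[u32]>,
) -> LabelListSegment {
    let mut seg = LabelListSegment::default();
    for (row_id, list) in rows {
        match list {
            // A null list is tracked separately from the per-label row sets.
            None => {
                seg.list_nulls.insert(row_id);
            }
            Some(labels) => {
                for label in labels {
                    seg.bitmaps
                        .entry(label.to_string())
                        .or_default()
                        .insert(row_id);
                }
            }
        }
    }
    seg
}

/// Merge segments by unioning per-label row sets and the null-row sets.
/// No source data is re-read; only the segment states are combined.
fn merge_segments(segments: &[LabelListSegment]) -> LabelListSegment {
    let mut merged = LabelListSegment::default();
    for seg in segments {
        for (label, rows) in &seg.bitmaps {
            merged
                .bitmaps
                .entry(label.clone())
                .or_default()
                .extend(rows.iter().copied());
        }
        merged.list_nulls.extend(seg.list_nulls.iter().copied());
    }
    merged
}
```

In this toy model, training one segment per fragment and merging them gives the same state as training once over all rows, which is the property the distributed path relies on.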

Covered by test_label_list_merge_existing_index_segments: one LabelList segment per
fragment, merged via merge_existing_index_segments, answers array_has_any across
both fragments with the same row count as a pre-index full scan.
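
The property that test checks can be illustrated with a small std-only model. This is a sketch under assumptions, not the test in the PR: Row, Segment, build, merge, indexed_has_any, and scan_has_any are hypothetical names, null lists are left out for brevity, and array_has_any is modeled as "the row's list shares at least one label with the query".

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (row id, labels in that row's list); hypothetical model of a dataset row.
type Row = (u64, Vec<&'static str>);
/// label -> ids of rows containing it; stands in for one index segment.
type Segment = BTreeMap<String, BTreeSet<u64>>;

/// Build one segment over the rows of a single fragment.
fn build(rows: &[Row]) -> Segment {
    let mut seg = Segment::new();
    for (id, labels) in rows {
        for label in labels {
            seg.entry(label.to_string()).or_default().insert(*id);
        }
    }
    seg
}

/// Union per-fragment segments into one merged segment.
fn merge(segments: Vec<Segment>) -> Segment {
    let mut out = Segment::new();
    for seg in segments {
        for (label, ids) in seg {
            out.entry(label).or_default().extend(ids);
        }
    }
    out
}

/// Answer an any-of query from the index: union the row sets of the queried labels.
fn indexed_has_any(index: &Segment, query: &[&str]) -> BTreeSet<u64> {
    query
        .iter()
        .filter_map(|label| index.get(*label))
        .flatten()
        .copied()
        .collect()
}

/// Answer the same query by scanning every row, as a pre-index baseline.
fn scan_has_any(rows: &[Row], query: &[&str]) -> usize {
    rows.iter()
        .filter(|(_, labels)| labels.iter().any(|l| query.contains(l)))
        .count()
}
```

A check in the spirit of the test builds one segment per fragment, merges them, and asserts that indexed_has_any(&merged, &query).len() equals scan_has_any over the concatenated rows.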

LabelListIndexPlugin::train_index rejected fragment-scoped training, so a
LabelList index could not be built through the distributed / segmented path
(per-fragment execute_uncommitted + merge_existing_index_segments) the way
BTree, Inverted, Bitmap, and FM already are. Build over the already
fragment-scoped data stream (mirroring FMIndexPlugin) and add
merge_label_list_indices, which unions the per-segment bitmap states and
list_nulls row sets. Route LabelList segments through merge_existing_index_segments.
@github-actions github-actions Bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels Jun 11, 2026
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 82.25806% with 33 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/label_list.rs | 69.81% | 8 Missing and 8 partials ⚠️ |
| rust/lance/src/index/scalar/label_list.rs | 67.34% | 9 Missing and 7 partials ⚠️ |
| rust/lance/src/index/create.rs | 98.71% | 1 Missing ⚠️ |
