fix: evaluate all list-element docs in FTS prefilter walk-the-allowlist branch#7246

Open
Ar-maan05 wants to merge 1 commit into
lance-format:mainfrom
Ar-maan05:fix/fts-list-prefilter-drops-matches

@Ar-maan05

Problem

FTS search() combined with a where(...) prefilter on a list<string> / large_list<large_string> column silently drops matches when the query token sits at any position other than the last in a row's list. .postfilter() (FTS first, then filter) returns the correct rows.

Reported as lancedb#3352 with a runnable Python repro. The plan is MatchQuery > ScalarIndexQuery, and the bug only surfaces when the planner picks the small-allowlist prefilter path (index_comparisons ≈ allowlist size):

| Target row keywords | prefilter (default) | postfilter |
|---|---|---|
| ["needle", "synonym"] | 0 rows (bug) | 2 rows |
| ["synonym", "needle"] | 2 rows | 2 rows |

Root cause

A list column indexes every element as its own document, so one row_id owns several doc_ids: DocSet.inv (a Vec<(row_id, doc_id)> sorted by row_id) holds multiple entries per row.

DocSet::doc_id(row_id) resolved a row to a single doc_id via binary_search_by_key. Its only caller is Wand::flat_search, the walk-the-allowlist prefilter branch, which therefore evaluated just one of the row's documents against the posting lists; when the query token lived in any other element, the row became a false negative.
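
To illustrate the failure mode, here is a minimal standalone sketch, not the lance-index code. It assumes a simplified stand-in for DocSet.inv (a slice of (row_id, doc_id) pairs sorted by row_id); the helper name single_doc_id and the concrete integer types are hypothetical. The standard library documents that binary_search_by_key may return any one of several equal-key matches, so a row that owns several documents resolves to only one of them.

```rust
// Hypothetical stand-in for DocSet.inv: (row_id, doc_id) pairs sorted by row_id.
// Mirrors the single-document resolution described above; not the actual lance-index code.
fn single_doc_id(inv: &[(u64, u32)], row_id: u64) -> Option<u32> {
    inv.binary_search_by_key(&row_id, |&(r, _)| r)
        .ok()
        .map(|idx| inv[idx].1)
}

fn main() {
    // Row 7 holds a two-element list, so it owns doc_ids 0 and 1.
    let inv = vec![(7u64, 0u32), (7, 1), (9, 2)];

    // Only one of row 7's two documents comes back; which one is unspecified.
    let hit = single_doc_id(&inv, 7);
    assert!(matches!(hit, Some(0) | Some(1)));

    // If the query token lives in the other document, the row is never scored.
    println!("row 7 resolved to {hit:?}");
}
```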

The regular WAND path is forward-driven (document -> row_id, with a per-document mask check), so it was always correct. Only flat_search was affected, which is why the bug is specific to the prefilter branch.

Fix

  • Replace DocSet::doc_id with DocSet::doc_ids(row_id) -> impl Iterator, which yields every doc_id in the contiguous equal-key run in inv (the legacy row_id == doc_id shape still resolves to a single document).
  • flat_search now expands each allow-listed row_id to all of its documents (flat_map over doc_ids) before sorting into doc-id order.
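
The two changes above can be sketched as follows. This is an illustrative approximation under stated assumptions rather than the code in this PR: inv is modelled as a plain slice of (row_id, doc_id) pairs sorted by row_id, the free functions doc_ids and expand_allowlist are hypothetical names, and locating the start of the equal-key run with partition_point is one possible way to do it.

```rust
// Hypothetical stand-in for DocSet.inv: (row_id, doc_id) pairs sorted by row_id.
// Yields every doc_id in the contiguous run of entries whose key equals row_id.
fn doc_ids(inv: &[(u64, u32)], row_id: u64) -> impl Iterator<Item = u32> + '_ {
    // First index whose row_id is not less than the target (start of the run).
    let start = inv.partition_point(|&(r, _)| r < row_id);
    inv[start..]
        .iter()
        .take_while(move |&&(r, _)| r == row_id)
        .map(|&(_, d)| d)
}

// Expand each allow-listed row to all of its documents, then sort into doc-id order.
fn expand_allowlist(inv: &[(u64, u32)], allowed: &[u64]) -> Vec<u32> {
    let mut docs: Vec<u32> = allowed.iter().flat_map(|&row| doc_ids(inv, row)).collect();
    docs.sort_unstable();
    docs
}

fn main() {
    // List shape: row 7 owns two documents.
    let inv = vec![(3u64, 0u32), (7, 1), (7, 2), (9, 3)];
    assert_eq!(doc_ids(&inv, 7).collect::<Vec<_>>(), vec![1, 2]);
    // A missing row yields nothing.
    assert_eq!(doc_ids(&inv, 8).count(), 0);
    // Allow-list order does not matter; the result is in doc-id order.
    assert_eq!(expand_allowlist(&inv, &[9, 7]), vec![1, 2, 3]);

    // Legacy shape (row_id == doc_id) still resolves to a single document.
    let legacy = vec![(0u64, 0u32), (1, 1)];
    assert_eq!(doc_ids(&legacy, 1).collect::<Vec<_>>(), vec![1]);
}
```

Returning an iterator rather than a Vec keeps the single-document legacy case allocation-free in this sketch; whether the real implementation makes the same trade-off is not stated here.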

This brings flat_search to parity with the WAND path, so it introduces no new duplicate-row behaviour: only documents actually present in the posting lists score.

Tests

  • test_doc_ids_resolves_every_document_a_row_owns: unit coverage of the multi-valued resolution (list shape, legacy shape, and a missing row).
  • test_flat_search_finds_list_row_with_match_at_non_last_position (rstest, compressed + plain): reproduces the bug; it fails on the previous single-doc_id resolution and passes with the fix.

All 143 scalar::inverted tests pass; cargo fmt --all --check and cargo clippy -p lance-index --tests -- -D warnings are clean.

Closes lancedb#3352

A list<string> column indexes each element as its own document sharing
the row's id, but flat_search resolved each allow-listed row_id to a
single doc_id, dropping matches at non-last list positions.

Closes lancedb#3352
@github-actions (bot) added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels on Jun 12, 2026
@codecov

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/inverted/wand.rs | 95.00% | 1 Missing ⚠️ |
