fix: evaluate all list-element docs in FTS prefilter walk-the-allowlist branch#7246
Open
Ar-maan05 wants to merge 1 commit into
Open
fix: evaluate all list-element docs in FTS prefilter walk-the-allowlist branch#7246Ar-maan05 wants to merge 1 commit into
Ar-maan05 wants to merge 1 commit into
Conversation
A list<string> column indexes each element as its own document sharing the row's id, but flat_search resolved each allow-listed row_id to a single doc_id, dropping matches at non-last list positions. Closes lancedb#3352
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
FTS
search()combined with awhere(...)prefilter on alist<string>/large_list<large_string>column silently drops matches when the query token sits at any position other than the last in a row's list..postfilter()(FTS first, then filter) returns the correct rows.Reported as lancedb#3352 with a runnable Python repro. The plan is
MatchQuery > ScalarIndexQuery, and the bug only surfaces when the planner picks the small-allowlist prefilter path (index_comparisons ≈ allowlist size):keywords["needle", "synonym"]["synonym", "needle"]Root cause
A list column indexes every element as its own document, so one
row_idowns severaldoc_ids:DocSet.inv(aVec<(row_id, doc_id)>sorted byrow_id) holds multiple entries per row.DocSet::doc_id(row_id)resolved a row to a singledoc_idviabinary_search_by_key, and its only caller isWand::flat_search: the walk-the-allowlist prefilter branch. It therefore evaluated just one of the row'sdocuments against the posting lists; when the query token lived in any other element, the row became a false negative.
The regular WAND path is forward-driven (document ->
row_id, with a per-document mask check), so it was always correct, onlyflat_searchwas affected, which is why the bug is specific to the prefilter branch.Fix
DocSet::doc_idwithDocSet::doc_ids(row_id) -> impl Iterator, which yields everydoc_idin the contiguous equal-key run ininv(the legacyrow_id == doc_idshape still resolves to a single document).flat_searchnow expands each allow-listedrow_idto all of its documents (flat_mapoverdoc_ids) before sorting into doc-id order.This brings
flat_searchto parity with the WAND path, so it introduces no new duplicate-row behaviour: only documents actually present in the posting lists score.Tests
test_doc_ids_resolves_every_document_a_row_owns: unit coverage of the multi-valued resolution (list shape, legacy shape, and a missing row).test_flat_search_finds_list_row_with_match_at_non_last_position(rstest, compressed + plain): reproduces the bug; it fails on the previous single-doc_idresolution and passes with the fix.All 143
scalar::invertedtests pass;cargo fmt --all --checkandcargo clippy -p lance-index --tests -- -D warningsare clean.Closes lancedb#3352