perf: optimize filtered vector search pre-scan and low-selectivity path #12
Merged
anoop-narang merged 2 commits into main on Mar 24, 2026
The pre-scan previously projected all columns except the vector column, reading data (e.g. id, url, title, sha, raw) that it never uses. It only needs _key (to collect valid keys) and the columns referenced by the WHERE clause filters. Compute referenced columns from filter expressions, build a minimal projection, and compile separate physical filters against the projected schema so column indices match the narrower batch layout.
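The projection step described above can be sketched roughly as follows. This is a minimal, std-only illustration, not the code from this PR: the `Expr` enum, `minimal_projection`, and `projected_index` are hypothetical stand-ins for the planner's real expression and schema types.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified filter expression (stand-in for the planner's real type).
enum Expr {
    Column(String),
    Literal(String),
    Binary(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Collect the name of every column this expression references.
    fn column_refs<'a>(&'a self, out: &mut BTreeSet<&'a str>) {
        match self {
            Expr::Column(name) => {
                out.insert(name.as_str());
            }
            Expr::Literal(_) => {}
            Expr::Binary(left, right) => {
                left.column_refs(out);
                right.column_refs(out);
            }
        }
    }
}

/// Build a minimal projection: the key column plus every column referenced by
/// a filter, as ascending indices into `schema`. Fails early if the key column
/// is missing instead of silently dropping it.
fn minimal_projection(
    schema: &[&str],
    key_col: &str,
    filters: &[Expr],
) -> Result<Vec<usize>, String> {
    let key_idx = schema
        .iter()
        .position(|c| *c == key_col)
        .ok_or_else(|| format!("key column '{}' not found in scan schema", key_col))?;

    let mut referenced = BTreeSet::new();
    for filter in filters {
        filter.column_refs(&mut referenced);
    }

    Ok(schema
        .iter()
        .enumerate()
        .filter(|(i, name)| *i == key_idx || referenced.contains(*name))
        .map(|(i, _)| i)
        .collect())
}

/// Map an index in the full schema to its position in the projected batch.
/// Filters compiled against the projected schema must use this narrower index.
fn projected_index(projection: &[usize], original_idx: usize) -> Option<usize> {
    projection.iter().position(|&i| i == original_idx)
}
```

With a schema of `id, url, title, filename, _key, vector` and a filter on `filename`, the projection is `[3, 4]`, so `filename` sits at index 0 and `_key` at index 1 of the projected batch. That shift is why the filters have to be compiled against the projected schema rather than the full one.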
2b5f39b to 526b5a2
src/planner.rs
Outdated
    .flat_map(|f| f.column_refs())
    .map(|c| c.name.as_str())
    .collect();
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();
Using .ok() here silently converts a missing key column into None, which means _key would be excluded from the projection. The error surfaces later at execution time as a confusing column-not-found or wrong-column panic rather than a clear planning error. Suggest being explicit:
Suggested change
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();
Or, for an earlier, clearer failure:
let key_col_idx = scan_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "USearchExec: key column '{}' not found in scan schema",
        registered.key_col
    )))?;
(Leaving the .ok() is fine too; the failure will still surface at execution time. Just a suggestion for debuggability.)
Collaborator
Author
Good catch — adopted the explicit error. Now fails at planning time with a clear message instead of silently excluding _key from the projection.
Replace the parquet-native full scan with direct vector retrieval from the USearch index. The index stores vectors alongside the HNSW graph, so index.get(key) retrieves them in O(1) per key. Previously, the low-selectivity path scanned the entire Parquet file including the vector column (e.g. 6.95GB for 1.2M rows) just to compute distances for the few rows matching the WHERE clause. Now it retrieves vectors only for valid_keys collected during the pre-scan, computes distances, maintains a top-k heap, then fetches result rows from the lookup provider. This eliminates the full_scan DataSourceExec at runtime for filtered queries. The parquet-native code is retained but unused, pending removal after production validation.
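The index-get path described above might look roughly like the sketch below. It is a std-only illustration under stated assumptions: a `HashMap<u64, Vec<f32>>` stands in for the USearch index (whose real `get` signature differs), squared L2 is used as an example metric, and `Scored`, `l2_sq`, and `top_k_for_valid_keys` are hypothetical names, not identifiers from this PR.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// Heap entry ordered by distance, so `BinaryHeap` keeps the worst candidate on top.
#[derive(Debug, PartialEq)]
struct Scored {
    dist: f32,
    key: u64,
}

impl Eq for Scored {}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order; ties are broken by key.
        self.dist
            .total_cmp(&other.dist)
            .then(self.key.cmp(&other.key))
    }
}

impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Squared L2 distance (assumed metric for this sketch).
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Fetch vectors only for `valid_keys`, score them against `query`, and keep
/// the `k` nearest in a bounded max-heap. Returns (key, distance), nearest first.
fn top_k_for_valid_keys(
    index: &HashMap<u64, Vec<f32>>, // stand-in for the vector index
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        // Skip keys that have no stored vector.
        let Some(vector) = index.get(&key) else { continue };
        heap.push(Scored { dist: l2_sq(vector, query), key });
        if heap.len() > k {
            heap.pop(); // drop the current farthest candidate
        }
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.key, s.dist))
        .collect()
}
```

The work here scales with the number of valid keys rather than the number of rows in the file, which is the point of the change: only the keys that survived the pre-scan are looked up and scored. Fetching the result rows for the winning keys from the lookup provider is left out of the sketch.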
526b5a2 to c49f52d
Summary
- Pre-scan now projects only _key + columns referenced by filter expressions, instead of all scalar columns. Reduces bytes scanned from ~49MB to ~2.6MB for a 1.2M row dataset.
- Low-selectivity path now uses index.get(key) to retrieve vectors directly from USearch by key. Computes distances for only the valid keys, then fetches result rows from the lookup provider. Eliminates the 7GB vector column scan entirely.
- Removes parquet_native_execute, ScoredRow, compute_distance_for_row, and full_scan from SearchParams, all superseded by the index-get path.
Performance impact (1.2M row dataset, 768-dim vectors)
- WHERE filename = '...' (pre-scan)
- WHERE title LIKE '%..%' (low-sel)
Test plan
- WHERE filename = '...' filtered vector search returns correct results
- WHERE title LIKE '%..%' filtered vector search returns correct results and is fast