perf: optimize filtered vector search pre-scan and low-selectivity path #12

Merged
anoop-narang merged 2 commits into main from perf/minimal-prescan-projection on Mar 24, 2026

@anoop-narang
Collaborator

Summary

  • Narrow pre-scan projection: Pre-scan now projects only _key + columns referenced by filter expressions, instead of all scalar columns. Reduces bytes scanned from ~49MB to ~2.6MB for a 1.2M row dataset.
  • Index-get low-selectivity path: Replace the full Parquet scan (which read all columns including vectors) with index.get(key) to retrieve vectors directly from USearch by key. Computes distances for only the valid keys, then fetches result rows from the lookup provider. Eliminates the ~7GB vector column scan entirely.
  • Dead code cleanup: Remove parquet_native_execute, ScoredRow, compute_distance_for_row, and full_scan from SearchParams — all superseded by the index-get path.

Performance impact (1.2M row dataset, 768-dim vectors)

| Query type | Before (bytes scanned / time) | After (bytes scanned / time) |
|---|---|---|
| WHERE filename = '...' (pre-scan) | 49MB / 54ms | 2.6MB / 8ms |
| WHERE title LIKE '%..%' (low-sel) | 6.95GB / 8s | ~0ms (40 index.get calls) |

Test plan

  • All 40 existing tests pass (optimizer rule, execution, parquet-native paths)
  • Verify WHERE filename = '...' filtered vector search returns correct results
  • Verify WHERE title LIKE '%..%' filtered vector search returns correct results and is fast
  • Verify unfiltered vector search is unaffected

The pre-scan previously projected all columns except the vector column,
reading data (e.g. id, url, title, sha, raw) that it never uses. It
only needs _key (to collect valid keys) and the columns referenced by
the WHERE clause filters.

Compute referenced columns from filter expressions, build a minimal
projection, and compile separate physical filters against the projected
schema so column indices match the narrower batch layout.
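
The projection step described above can be sketched in plain Rust. This is only an illustration under stated assumptions: `Filter`, `minimal_projection`, and `remap_index` are hypothetical names, the schema is modeled as a slice of column names, and the actual planner works with DataFusion expressions and Arrow schemas rather than these stand-ins.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a filter expression: only the column names it references.
struct Filter {
    column_refs: Vec<String>,
}

/// Returns the schema indices to project: the key column plus every column
/// referenced by a filter, deduplicated and in schema order.
/// Fails early if any referenced name is missing from the schema.
fn minimal_projection(
    schema: &[&str],
    key_col: &str,
    filters: &[Filter],
) -> Result<Vec<usize>, String> {
    let find = |name: &str| -> Result<usize, String> {
        schema
            .iter()
            .position(|c| *c == name)
            .ok_or_else(|| format!("column '{}' not found in scan schema", name))
    };

    let mut indices = BTreeSet::new();
    indices.insert(find(key_col)?);
    for filter in filters {
        for name in &filter.column_refs {
            indices.insert(find(name)?);
        }
    }
    Ok(indices.into_iter().collect())
}

/// Maps a column index in the full schema to its position in the projected
/// batch. Filters compiled against the projected schema need this narrower
/// index, not the original one.
fn remap_index(projection: &[usize], full_idx: usize) -> Option<usize> {
    projection.iter().position(|&i| i == full_idx)
}
```

With a schema of `_key, id, url, title, filename, sha, raw` and a single filter on `filename`, this sketch projects indices 0 and 4, and `filename` moves from index 4 in the full schema to index 1 in the projected batch, which is why the filters have to be compiled separately against the projected schema.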
claude bot previously approved these changes Mar 24, 2026
@anoop-narang force-pushed the perf/minimal-prescan-projection branch 3 times, most recently from 2b5f39b to 526b5a2 on March 24, 2026 11:04
src/planner.rs Outdated
.flat_map(|f| f.column_refs())
.map(|c| c.name.as_str())
.collect();
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();

Using .ok() here silently converts a missing key column into None, which means _key would be excluded from the projection. The error surfaces later at execution time as a confusing column-not-found or wrong-column panic rather than a clear planning error. Suggest being explicit:

Suggested change
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();
let key_col_idx = scan_schema.index_of(&registered.key_col).ok();

Or, for an earlier, clearer failure:

let key_col_idx = scan_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "USearchExec: key column '{}' not found in scan schema",
        registered.key_col
    )))?;

(Leaving the .ok() is fine too — the failure will still surface at execution time. Just a suggestion for debuggability.)

Collaborator Author

Good catch — adopted the explicit error. Now fails at planning time with a clear message instead of silently excluding _key from the projection.

claude bot previously approved these changes Mar 24, 2026
Replace the parquet-native full scan with direct vector retrieval from
the USearch index. The index stores vectors alongside the HNSW graph,
so index.get(key) retrieves them in O(1) per key.

Previously, the low-selectivity path scanned the entire Parquet file
including the vector column (e.g. 6.95GB for 1.2M rows) just to
compute distances for the few rows matching the WHERE clause. Now it
retrieves vectors only for valid_keys collected during the pre-scan,
computes distances, maintains a top-k heap, then fetches result rows
from the lookup provider.

This eliminates the full_scan DataSourceExec at runtime for filtered
queries. The parquet-native code is retained but unused, pending
removal after production validation.
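
A minimal sketch of the key-lookup path described in this commit message, under loud assumptions: a `HashMap` stands in for the USearch index (the real code calls index.get(key) on USearch), squared L2 is used as the distance although the PR does not state the metric, and `Candidate`, `l2_squared`, and `top_k_by_key_lookup` are hypothetical names.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// A scored key. Ordered by distance so the max-heap root is the worst kept hit.
struct Candidate {
    distance: f32,
    key: u64,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then(self.key.cmp(&other.key))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Squared Euclidean distance (assumed metric, for illustration only).
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Fetches vectors only for `valid_keys`, scores them against `query`,
/// and returns the k nearest as (key, distance), nearest first.
fn top_k_by_key_lookup(
    index: &HashMap<u64, Vec<f32>>, // stand-in for the vector index
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        // Keys absent from the index are skipped rather than treated as errors.
        if let Some(vector) = index.get(&key) {
            heap.push(Candidate { distance: l2_squared(query, vector), key });
            if heap.len() > k {
                heap.pop(); // drop the current farthest candidate
            }
        }
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|c| (c.key, c.distance))
        .collect()
}
```

The work here scales with the number of keys that survive the pre-scan rather than with the size of the table, which matches the table above where the low-selectivity query needs only 40 lookups. Fetching the full result rows for the returned keys is left out of the sketch.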
@anoop-narang merged commit 4397342 into main on Mar 24, 2026
5 checks passed
@anoop-narang deleted the perf/minimal-prescan-projection branch on March 24, 2026 11:17