perf: optimize filtered vector search pre-scan and low-selectivity path by anoop-narang · Pull Request #12 · hotdata-dev/datafusion-vector-search-ext

anoop-narang · 2026-03-24T10:51:51Z

Summary

Narrow pre-scan projection: Pre-scan now projects only _key + columns referenced by filter expressions, instead of all scalar columns. Reduces bytes scanned from ~49MB to ~2.6MB for a 1.2M row dataset.
Index-get low-selectivity path: Replace the full Parquet scan (which read all columns including vectors) with index.get(key) to retrieve vectors directly from USearch by key. Computes distances for only the valid keys, then fetches result rows from the lookup provider. Eliminates the 7GB vector column scan entirely.
Dead code cleanup: Remove parquet_native_execute, ScoredRow, compute_distance_for_row, and full_scan from SearchParams — all superseded by the index-get path.

Performance impact (1.2M row dataset, 768-dim vectors)

Query type	Before (bytes scanned / latency)	After (bytes scanned / latency)
`WHERE filename = '...'` (pre-scan)	49MB / 54ms	2.6MB / 8ms
`WHERE title LIKE '%..%'` (low-sel)	6.95GB / 8s	~0ms (40 index.get calls)

Test plan

All 40 existing tests pass (optimizer rule, execution, parquet-native paths)
Verify WHERE filename = '...' filtered vector search returns correct results
Verify WHERE title LIKE '%..%' filtered vector search returns correct results and is fast
Verify unfiltered vector search is unaffected

The pre-scan previously projected every column except the vector column, so it read data it never used (e.g. id, url, title, sha, raw). It needs only _key (to collect valid keys) and the columns referenced by the WHERE clause filters. This change computes the referenced columns from the filter expressions, builds a minimal projection, and compiles separate physical filters against the projected schema so that column indices match the narrower batch layout.
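The projection logic described above can be sketched without any DataFusion types. This is a minimal illustration, not the code from this PR: the schema is modelled as a slice of column names, and `minimal_projection` and `projected_index` are hypothetical helper names. It shows why filters have to be compiled against the projected schema: a column's position in the narrow batch differs from its position in the full scan schema.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the scan-schema indices the pre-scan must
/// read, i.e. the key column plus every column named by a filter.
fn minimal_projection(
    schema: &[&str],
    key_col: &str,
    filter_columns: &[&str],
) -> Result<Vec<usize>, String> {
    let index_of = |name: &str| {
        schema
            .iter()
            .position(|c| *c == name)
            .ok_or_else(|| format!("column '{}' not found in scan schema", name))
    };

    // BTreeSet deduplicates and keeps the indices in schema order.
    let mut indices = BTreeSet::new();
    indices.insert(index_of(key_col)?);
    for &name in filter_columns {
        indices.insert(index_of(name)?);
    }
    Ok(indices.into_iter().collect())
}

/// Position of `name` inside the projected batch. A filter compiled against
/// the projected schema must use this index, not the full-schema index.
fn projected_index(schema: &[&str], projection: &[usize], name: &str) -> Option<usize> {
    projection.iter().position(|&i| schema[i] == name)
}
```

For a schema of `_key, id, url, title, filename` and a filter on `filename`, the projection is `[0, 4]`, and `filename` sits at index 1 of the projected batch rather than index 4.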

claude · 2026-03-24T11:10:58Z

src/planner.rs

+                .flat_map(|f| f.column_refs())
+                .map(|c| c.name.as_str())
+                .collect();
+            let key_col_idx = scan_schema.index_of(&registered.key_col).ok();


Using .ok() here silently converts a missing key column into None, which means _key would be excluded from the projection. The error surfaces later at execution time as a confusing column-not-found or wrong-column panic rather than a clear planning error. Suggest being explicit:

Suggested change

let key_col_idx = scan_schema.index_of(&registered.key_col).ok();

let key_col_idx = scan_schema.index_of(&registered.key_col).ok();

Or, for an earlier, clearer failure:

let key_col_idx = scan_schema
    .index_of(&registered.key_col)
    .map_err(|_| DataFusionError::Execution(format!(
        "USearchExec: key column '{}' not found in scan schema",
        registered.key_col
    )))?;

(Leaving the .ok() is fine too — the failure will still surface at execution time. Just a suggestion for debuggability.)

Good catch — adopted the explicit error. Now fails at planning time with a clear message instead of silently excluding _key from the projection.

Replace the parquet-native full scan with direct vector retrieval from the USearch index. The index stores vectors alongside the HNSW graph, so index.get(key) retrieves them in O(1) per key. Previously, the low-selectivity path scanned the entire Parquet file including the vector column (e.g. 6.95GB for 1.2M rows) just to compute distances for the few rows matching the WHERE clause. Now it retrieves vectors only for valid_keys collected during the pre-scan, computes distances, maintains a top-k heap, then fetches result rows from the lookup provider. This eliminates the full_scan DataSourceExec at runtime for filtered queries. The parquet-native code is retained but unused, pending removal after production validation.
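The get-by-key path described above can be sketched as a bounded top-k selection. This is an illustration under stated assumptions, not the PR's implementation: a `HashMap` stands in for the index's get-by-key lookup, squared L2 is assumed as the distance metric, and `Scored` and `top_k_by_key` are hypothetical names. The final fetch of result rows from the lookup provider is omitted.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// Heap entry ordered by distance, so the max-heap root is the worst candidate.
#[derive(Debug)]
struct Scored {
    distance: f32,
    key: u64,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then_with(|| self.key.cmp(&other.key))
    }
}

impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Scored {}

/// Squared L2 distance; the metric is an assumption made for this sketch.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Scores only the keys that passed the pre-scan filter and keeps the k
/// nearest. `index` is a stand-in for the vector index's get-by-key lookup.
fn top_k_by_key(
    index: &HashMap<u64, Vec<f32>>,
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<Scored> = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        // Keys missing from the index are skipped in this sketch.
        if let Some(vector) = index.get(&key) {
            heap.push(Scored { distance: l2_squared(vector, query), key });
            if heap.len() > k {
                heap.pop(); // drop the current worst candidate
            }
        }
    }
    // into_sorted_vec is ascending, so the nearest key comes first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.key, s.distance))
        .collect()
}
```

The work is proportional to the number of valid keys rather than the table size, which is why this path pays off when the filter matches few rows.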

claude bot previously approved these changes Mar 24, 2026

anoop-narang dismissed claude[bot]’s stale review via 0a5e8ad March 24, 2026 10:58

anoop-narang force-pushed the perf/minimal-prescan-projection branch 3 times, most recently from 2b5f39b to 526b5a2 Compare March 24, 2026 11:04

claude bot reviewed Mar 24, 2026

claude bot previously approved these changes Mar 24, 2026

anoop-narang dismissed claude[bot]’s stale review via c49f52d March 24, 2026 11:13

anoop-narang force-pushed the perf/minimal-prescan-projection branch from 526b5a2 to c49f52d Compare March 24, 2026 11:13

claude bot approved these changes Mar 24, 2026

anoop-narang merged commit 4397342 into main Mar 24, 2026
5 checks passed

anoop-narang deleted the perf/minimal-prescan-projection branch March 24, 2026 11:17

anoop-narang mentioned this pull request Mar 24, 2026

docs: update README for index-get low-selectivity path #13

Merged

anoop-narang merged 2 commits into main from perf/minimal-prescan-projection
