
fix: scan all partitions and use FilterExec for pre-scan #14

Merged
anoop-narang merged 1 commit into main from fix/prescan-all-partitions-filterexec on Mar 24, 2026


@anoop-narang
Collaborator

Summary

  • Correctness fix: Pre-scan previously called execute(0) on the DataSourceExec, reading only the first partition's file group and missing valid keys from the rest of the dataset. Selectivity calculations and valid_key collection were based on partial data.
  • FilterExec wrapping: The pre-scan plan is now CoalescePartitionsExec → FilterExec → DataSourceExec. DataFusion's physical optimizer pushes the predicate from FilterExec into the Parquet reader for pruning, so the manual evaluate_filters and prescan_filters steps are no longer needed.
  • No duplicate predicates: Filters are no longer passed to scan_provider.scan(), because FilterExec now carries the predicate and the optimizer handles the pushdown.
  • Code cleanup: Remove prescan_filters, physical_filters, evaluate_filters, and manual physical filter compilation from SearchParams.
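
To illustrate the correctness bug, here is a minimal stdlib-only sketch, not the project's code. `PartitionedScan`, `Row`, `matches_filter`, and `sample_scan` are hypothetical stand-ins: each inner `Vec` models one partition's file group, and `execute(i)` only mimics the shape of executing a single partition.

```rust
use std::collections::HashSet;

/// Hypothetical row: a key plus the column the filter inspects.
#[derive(Clone, Debug)]
struct Row {
    key: u64,
    filename: String,
}

/// Hypothetical stand-in for a partitioned scan; one inner Vec per file group.
struct PartitionedScan {
    partitions: Vec<Vec<Row>>,
}

impl PartitionedScan {
    /// Yields rows from partition `partition` only.
    fn execute(&self, partition: usize) -> std::slice::Iter<'_, Row> {
        self.partitions[partition].iter()
    }

    fn total_rows(&self) -> usize {
        self.partitions.iter().map(Vec::len).sum()
    }
}

/// Hypothetical predicate, roughly `filename LIKE 'docs/%'`.
fn matches_filter(row: &Row) -> bool {
    row.filename.starts_with("docs/")
}

/// Old behaviour: only partition 0 is read, so keys elsewhere are missed.
fn valid_keys_first_partition(scan: &PartitionedScan, pred: impl Fn(&Row) -> bool) -> HashSet<u64> {
    scan.execute(0).filter(|r| pred(*r)).map(|r| r.key).collect()
}

/// Fixed behaviour: every partition is read.
fn valid_keys_all_partitions(scan: &PartitionedScan, pred: impl Fn(&Row) -> bool) -> HashSet<u64> {
    let mut keys = HashSet::new();
    for p in 0..scan.partitions.len() {
        keys.extend(scan.execute(p).filter(|r| pred(*r)).map(|r| r.key));
    }
    keys
}

/// Fraction of all rows that passed the filter.
fn selectivity(valid: &HashSet<u64>, total_rows: usize) -> f64 {
    if total_rows == 0 {
        0.0
    } else {
        valid.len() as f64 / total_rows as f64
    }
}

/// Three partitions, six rows, four of which match `matches_filter`.
fn sample_scan() -> PartitionedScan {
    let row = |key: u64, filename: &str| Row { key, filename: filename.to_string() };
    PartitionedScan {
        partitions: vec![
            vec![row(1, "docs/a.parquet"), row(2, "logs/b.parquet")],
            vec![row(3, "docs/c.parquet"), row(4, "docs/d.parquet")],
            vec![row(5, "logs/e.parquet"), row(6, "docs/f.parquet")],
        ],
    }
}
```

With `sample_scan()`, the first-partition variant finds one valid key while the all-partitions variant finds four, so a selectivity computed from the partial set is an underestimate for this data.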

Test plan

  • All 40 existing tests pass
  • Verify LIKE filtered vector search returns correct results across all partitions
  • Verify filename filtered vector search returns correct results
  • Verify unfiltered vector search is unaffected

The pre-scan previously called execute(0) on the DataSourceExec,
reading only the first partition's file group and missing valid keys
from the rest of the dataset. This was a correctness bug — selectivity
calculations and valid_key collection were based on partial data.

Wrap the pre-scan as CoalescePartitionsExec → FilterExec → DataSourceExec:
- CoalescePartitionsExec merges all partitions into a single stream
- FilterExec evaluates the predicate per partition (DataFusion's
  physical optimizer pushes it into the Parquet reader for pruning)
- The stream yields only matching rows — no manual evaluate_filters

This also removes prescan_filters, evaluate_filters, and the manual
physical filter compilation from SearchParams, simplifying the code.
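
The three-node plan described in this commit message can be modelled in plain Rust to show why executing partition 0 of the outer node now covers the whole dataset. This is a sketch under assumptions, not DataFusion's API: `Plan`, `Source`, `Filter`, `Coalesce`, `prescan_plan`, and `is_positive` are hypothetical names. The toy coalesce drains partitions one after another, whereas a real coalescing operator may interleave them, so row order should not be relied on.

```rust
/// Hypothetical stand-in for a physical plan node.
trait Plan {
    fn partition_count(&self) -> usize;
    fn execute(&self, partition: usize) -> Box<dyn Iterator<Item = i64> + '_>;
}

/// Leaf node: models the data source, one Vec per file group.
struct Source {
    file_groups: Vec<Vec<i64>>,
}

impl Plan for Source {
    fn partition_count(&self) -> usize {
        self.file_groups.len()
    }
    fn execute(&self, partition: usize) -> Box<dyn Iterator<Item = i64> + '_> {
        Box::new(self.file_groups[partition].iter().copied())
    }
}

/// Models the filter node: keeps the input partitioning and applies the
/// predicate to each partition independently.
struct Filter<P: Plan> {
    input: P,
    predicate: fn(i64) -> bool,
}

impl<P: Plan> Plan for Filter<P> {
    fn partition_count(&self) -> usize {
        self.input.partition_count()
    }
    fn execute(&self, partition: usize) -> Box<dyn Iterator<Item = i64> + '_> {
        let predicate = self.predicate;
        Box::new(self.input.execute(partition).filter(move |v| predicate(*v)))
    }
}

/// Models the coalesce node: exposes exactly one output partition that
/// drains every input partition.
struct Coalesce<P: Plan> {
    input: P,
}

impl<P: Plan> Plan for Coalesce<P> {
    fn partition_count(&self) -> usize {
        1
    }
    fn execute(&self, partition: usize) -> Box<dyn Iterator<Item = i64> + '_> {
        assert_eq!(partition, 0, "coalesced plan has a single output partition");
        Box::new((0..self.input.partition_count()).flat_map(move |p| self.input.execute(p)))
    }
}

/// Builds Coalesce -> Filter -> Source, mirroring the plan shape above.
fn prescan_plan(file_groups: Vec<Vec<i64>>, predicate: fn(i64) -> bool) -> Coalesce<Filter<Source>> {
    Coalesce {
        input: Filter {
            input: Source { file_groups },
            predicate,
        },
    }
}

/// Hypothetical predicate used for illustration.
fn is_positive(v: i64) -> bool {
    v > 0
}
```

In this model, calling `execute(0)` on the bare `Source` would return only the first file group, while `execute(0)` on the plan built by `prescan_plan` returns the matching values from every file group.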
@anoop-narang anoop-narang merged commit b0fdce3 into main Mar 24, 2026
5 checks passed
@anoop-narang anoop-narang deleted the fix/prescan-all-partitions-filterexec branch March 24, 2026 15:02