Skip to content

fix: gate scan-size on access-plan estimate, not full file size#6

Open
shefeek-jinnah wants to merge 2 commits into
mainfrom
shefeek/scan-gate-access-plan-estimate
Open

fix: gate scan-size on access-plan estimate, not full file size#6
shefeek-jinnah wants to merge 2 commits into
mainfrom
shefeek/scan-gate-access-plan-estimate

Conversation

@shefeek-jinnah

Copy link
Copy Markdown
Collaborator

The max_scan_bytes gate in try_optimize_parquet_source summed each file's full object_meta.size, so point-fetches that read only a few row groups (via a pre-attached ParquetAccessPlan) were judged by the whole file's size and passed through as uncached vanilla DataFusion reads.

Add estimated_scan_bytes(&PartitionedFile): when a precise ParquetAccessPlan is attached to file.extensions, scale the file size by the selected (non-Skip) row-group fraction; otherwise fall back to the full object_meta.size so normal/full scans are unaffected. Use a u128 intermediate to avoid overflow on large files. Wire it into the gate's size computation.

Add a focused unit test covering the targeted vs. full-scan estimate and the cap-crossing regression.

The max_scan_bytes gate in try_optimize_parquet_source summed each file's
full object_meta.size, so point-fetches that read only a few row groups
(via a pre-attached ParquetAccessPlan) were judged by the whole file's
size and passed through as uncached vanilla DataFusion reads.

Add estimated_scan_bytes(&PartitionedFile): when a precise ParquetAccessPlan
is attached to file.extensions, scale the file size by the selected
(non-Skip) row-group fraction; otherwise fall back to the full
object_meta.size so normal/full scans are unaffected. Use a u128
intermediate to avoid overflow on large files. Wire it into the gate's
size computation.

Add a focused unit test covering the targeted vs. full-scan estimate and
the cap-crossing regression.
claude[bot]
claude Bot previously approved these changes Jun 15, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, focused fix. The estimate scales file size by the selected (non-Skip) row-group fraction using plan.len() (validated as total row groups in create_initial_plan) and row_group_indexes(), consistent with existing usage in opener.rs/row_group_filter.rs. Overflow is handled via the u128 intermediate, and edge cases (no plan, total == 0, all-skip) are sound. Test meaningfully covers the cap-crossing regression.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focused, well-tested fix. The access-plan-aware estimate is correct (safe fallback to full size, zero-guard, u128 overflow guard, conservative row-group-count ratio) and the unit test covers the targeted estimate, full-scan fallback, and cap-crossing regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant