fix: gate scan-size on access-plan estimate, not full file size #6
shefeek-jinnah wants to merge 2 commits into
The max_scan_bytes gate in try_optimize_parquet_source summed each file's full object_meta.size, so point-fetches that read only a few row groups (via a pre-attached ParquetAccessPlan) were judged by the whole file's size and passed through as uncached vanilla DataFusion reads.

Add estimated_scan_bytes(&PartitionedFile): when a precise ParquetAccessPlan is attached to file.extensions, scale the file size by the selected (non-Skip) row-group fraction; otherwise fall back to the full object_meta.size so normal/full scans are unaffected. Use a u128 intermediate to avoid overflow on large files. Wire it into the gate's size computation.

Add a focused unit test covering the targeted vs. full-scan estimate and the cap-crossing regression.
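For readers without the diff at hand, the estimate can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `RowGroupAccess` is a hypothetical two-variant stand-in for the real per-row-group access type, the function takes a plain size and an optional slice instead of a `PartitionedFile`, and returning the full size when the plan has zero row groups is an assumption.

```rust
/// Hypothetical stand-in for a per-row-group access decision.
/// The real ParquetAccessPlan carries richer variants than this.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RowGroupAccess {
    Skip,
    Scan,
}

/// Estimate how many bytes a scan will read from one file.
///
/// With no access plan attached, the whole file is assumed to be read.
/// With a plan, the file size is scaled by the fraction of row groups
/// that are not `Skip`.
fn estimated_scan_bytes(file_size: u64, plan: Option<&[RowGroupAccess]>) -> u64 {
    // No plan: fall back to the full file size so full scans are unaffected.
    let Some(plan) = plan else {
        return file_size;
    };

    let total = plan.len() as u128;
    if total == 0 {
        // Assumption: an empty plan carries no information, so stay conservative.
        return file_size;
    }

    let selected = plan
        .iter()
        .filter(|access| **access != RowGroupAccess::Skip)
        .count() as u128;

    // Multiply in u128 so `file_size * selected` cannot overflow u64.
    // Because selected <= total, the quotient always fits back into u64.
    ((file_size as u128 * selected) / total) as u64
}
```

Under these assumptions, a 1000-byte file with one of ten row groups selected is estimated at 100 bytes, and an all-Skip plan is estimated at 0. The ratio is by row-group count, not by actual row-group byte size, so it is only an approximation when row groups are unevenly sized.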
Correct, focused fix. The estimate scales file size by the selected (non-Skip) row-group fraction using plan.len() (validated as total row groups in create_initial_plan) and row_group_indexes(), consistent with existing usage in opener.rs/row_group_filter.rs. Overflow is handled via the u128 intermediate, and edge cases (no plan, total == 0, all-skip) are sound. Test meaningfully covers the cap-crossing regression.
Focused, well-tested fix. The access-plan-aware estimate is correct (safe fallback to full size, zero-guard, u128 overflow guard, conservative row-group-count ratio) and the unit test covers the targeted estimate, full-scan fallback, and cap-crossing regression.
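The cap-crossing regression mentioned in the reviews can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `FileScan`, `estimate`, and `within_scan_cap` are illustrative names, and the gate direction (a total above `max_scan_bytes` means the scan is passed through uncached) is inferred from the PR description, not copied from the code.

```rust
/// Hypothetical summary of one file in a scan.
/// Assumes `selected <= total`.
struct FileScan {
    size: u64,       // full object size in bytes
    selected: usize, // row groups the access plan will read
    total: usize,    // row groups in the file
}

/// Row-group-fraction estimate, computed in u128 to avoid overflow.
fn estimate(file: &FileScan) -> u64 {
    if file.total == 0 {
        // Assumption: no row-group information, so use the full size.
        return file.size;
    }
    ((file.size as u128 * file.selected as u128) / file.total as u128) as u64
}

/// Returns true when the summed scan size stays within the cap.
/// `use_estimate = false` reproduces the old behaviour (full file sizes).
fn within_scan_cap(files: &[FileScan], max_scan_bytes: u64, use_estimate: bool) -> bool {
    let total: u128 = files
        .iter()
        .map(|file| {
            let bytes = if use_estimate { estimate(file) } else { file.size };
            bytes as u128
        })
        .sum();
    total <= max_scan_bytes as u128
}
```

With a 1 GiB file where 2 of 100 row groups are selected and a 64 MiB cap, the full-size check fails while the estimate (about 21 MB) stays under the cap, which is the behaviour change the fix targets.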