Skip to content

fix: Correct the number of pruned/matched Parquet pages#22031

Open
nuno-faria wants to merge 1 commit intoapache:mainfrom
nuno-faria:fix_parquet_matched_pages
Open

fix: Correct the number of pruned/matched Parquet pages#22031
nuno-faria wants to merge 1 commit intoapache:mainfrom
nuno-faria:fix_parquet_matched_pages

Conversation

@nuno-faria
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Show the correct metrics in the execution plan.

What changes are included in this PR?

  • Keep track of the pages matched by each pruning predicate to avoid double counting them.
  • Added unit test and updated existing ones.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels May 5, 2026
Plan with Metrics
01)SortExec: TopK(fetch=3), expr=[species@0 ASC NULLS LAST], preserve_partitioning=[false], filter=[species@0 < Nlpine Sheep], metrics=[output_rows=3]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=6 total → 6 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=3 total → 3 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]
Copy link
Copy Markdown
Contributor Author

@nuno-faria nuno-faria May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file used contains 4 row groups (1 data page each) and the query matched only 3. So the previous 6 was due to double counting by having two predicates.

Also the same reasoning applies to the remaining changes.

Copy link
Copy Markdown
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

For performance considerations: it does not appear to have any performance issues for now. However, if we want to reduce the overhead of metric updates in the future, we could change this metric to a ratio (for rows passing page pruning), which can be cheaply derived from overall_selection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EXPLAIN ANALYZE returns a wrong number of pruned and matched Parquet pages

2 participants