Fix elapsed_compute metric for Parquet DataSourceExec #20767

Open
ernestprovo23 wants to merge 1 commit into apache:main from ernestprovo23:fix-parquet-elapsed-compute
Conversation

@ernestprovo23

Which issue does this PR close?

Closes part of #18195 — specifically the elapsed_compute baseline metric sub-item for Parquet scans.

Rationale

`EXPLAIN ANALYZE` on Parquet scans reports `elapsed_compute` values like 14ns for full table scans, which is misleading: the metric was never populated because no timer wrapped the per-batch compute work in the Parquet scan path.

What changes are included in this PR?

Follows the same pattern established in PR #18901 (CSV fix):

  1. Added a `BaselineMetrics` instantiation in `ParquetOpener::open()` using the existing `metrics` and `partition_index` fields
  2. Wrapped the per-batch stream `.map()` closure with an `elapsed_compute` timer that measures the projection, schema replacement, and metrics-copy work

Single file changed: `datafusion/datasource-parquet/src/opener.rs` (+7, -3 lines)
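The timing pattern described above can be sketched with a standalone analogue. The struct and method names below are illustrative stand-ins, not DataFusion's actual `BaselineMetrics` API: a running nanosecond total plays the role of the `elapsed_compute` metric, and `time()` plays the role of the timer guard wrapped around the per-batch closure.

```rust
use std::time::Instant;

/// Minimal stand-in for an `elapsed_compute`-style metric: a running
/// total of nanoseconds spent inside timed sections. (Illustrative only;
/// DataFusion's real type is `BaselineMetrics`.)
#[derive(Default)]
struct ElapsedCompute {
    nanos: u64,
}

impl ElapsedCompute {
    /// Time one unit of per-batch work and fold its duration into the
    /// total, mirroring how a timer wraps the per-batch `.map()` closure.
    fn time<T>(&mut self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.nanos += start.elapsed().as_nanos() as u64;
        out
    }
}

fn main() {
    let mut metric = ElapsedCompute::default();
    // Simulate per-batch compute (projection, schema replacement, ...).
    let total: u64 = metric.time(|| (0..100_000u64).sum());
    println!("work result = {total}, elapsed_compute = {}ns", metric.nanos);
}
```

The key point the fix makes is that without such a wrapper the metric only ever records the cost of stream setup, which is why full scans showed nanosecond-scale totals.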

Are these changes tested?

  • All 81 existing tests in `datafusion-datasource-parquet` pass
  • Metric correctness is verified by observing realistic `elapsed_compute` values in `EXPLAIN ANALYZE` output (no longer nanosecond-level values for real scans)
  • Per maintainer guidance from @2010YOUY01: "Testing if we have the time measured correct is tricky, I don't think there is a good way to do it. But for a large parquet file scan, several nanoseconds is definitely not reasonable."

Are there any user-facing changes?

`EXPLAIN ANALYZE` output for Parquet scans will now show accurate `elapsed_compute` values reflecting the actual CPU time spent on per-batch processing.

The `elapsed_compute` baseline metric for Parquet scans previously
reported unrealistically low values (e.g. 14ns for a full table scan)
because no timer was wrapping the per-batch compute work.

This follows the same pattern used in PR apache#18901 for CSV: instantiate
`BaselineMetrics` in `ParquetOpener::open()` and wrap the stream's
per-batch processing (projection, schema replacement, metrics copy)
with an `elapsed_compute` timer.

Closes part of apache#18195.
@github-actions github-actions bot added the datasource (Changes to the datasource crate) label Mar 7, 2026