Fix elapsed_compute metric for Parquet DataSourceExec #20767

Open
ernestprovo23 wants to merge 1 commit into apache:main from ernestprovo23:fix-parquet-elapsed-compute
Conversation

@ernestprovo23

Which issue does this PR close?

Closes part of #18195 — specifically the elapsed_compute baseline metric sub-item for Parquet scans.

Rationale

`EXPLAIN ANALYZE` on Parquet scans reports `elapsed_compute` values like 14ns for full table scans, which is misleading: the metric was never populated because no timer wrapped the per-batch compute work in the Parquet scan path.

What changes are included in this PR?

Follows the same pattern established in PR #18901 (CSV fix):

  1. Added a `BaselineMetrics` instantiation in `ParquetOpener::open()` using the existing `metrics` and `partition_index` fields
  2. Wrapped the per-batch stream `.map()` closure with an `elapsed_compute` timer that measures the projection, schema replacement, and metrics-copy work

Single file changed: `datafusion/datasource-parquet/src/opener.rs` (+7, -3 lines)
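The timing pattern described above can be sketched with a standalone analogue. The struct and method names below are illustrative stand-ins, not DataFusion's actual `BaselineMetrics` API: a running nanosecond total plays the role of the `elapsed_compute` metric, and `time()` plays the role of the timer guard wrapped around the per-batch closure.

```rust
use std::time::Instant;

/// Minimal stand-in for an `elapsed_compute`-style metric: a running
/// total of nanoseconds spent inside timed sections. (Illustrative only;
/// DataFusion's real type is `BaselineMetrics`.)
#[derive(Default)]
struct ElapsedCompute {
    nanos: u64,
}

impl ElapsedCompute {
    /// Time one unit of per-batch work and fold its duration into the
    /// total, mirroring how a timer wraps the per-batch `.map()` closure.
    fn time<T>(&mut self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.nanos += start.elapsed().as_nanos() as u64;
        out
    }
}

fn main() {
    let mut metric = ElapsedCompute::default();
    // Simulate per-batch compute (projection, schema replacement, ...).
    let total: u64 = metric.time(|| (0..100_000u64).sum());
    println!("work result = {total}, elapsed_compute = {}ns", metric.nanos);
}
```

The key point the fix makes is that without such a wrapper the metric only ever records the cost of stream setup, which is why full scans showed nanosecond-scale totals.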

Are these changes tested?

  • All 81 existing tests in `datafusion-datasource-parquet` pass
  • Metric correctness is verified by observing realistic `elapsed_compute` values in `EXPLAIN ANALYZE` output (no longer nanosecond-level values for real scans)
  • Per maintainer guidance from @2010YOUY01: "Testing if we have the time measured correct is tricky, I don't think there is a good way to do it. But for a large parquet file scan, several nanoseconds is definitely not reasonable."

Are there any user-facing changes?

`EXPLAIN ANALYZE` output for Parquet scans will now show accurate `elapsed_compute` values reflecting the actual CPU time spent on per-batch processing.

The `elapsed_compute` baseline metric for Parquet scans previously
reported unrealistically low values (e.g. 14ns for a full table scan)
because no timer was wrapping the per-batch compute work.

This follows the same pattern used in PR apache#18901 for CSV: instantiate
`BaselineMetrics` in `ParquetOpener::open()` and wrap the stream's
per-batch processing (projection, schema replacement, metrics copy)
with an `elapsed_compute` timer.

Closes part of apache#18195.
@github-actions github-actions bot added the datasource (Changes to the datasource crate) label Mar 7, 2026