chore: Extract FilePartitionPlanner to decouple CometNativeScanExec from CometScanExec #3729

Open
andygrove wants to merge 3 commits into apache:main from andygrove:refactor/file-partition-planner

andygrove commented Mar 18, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

This PR refactors the scan operators so that CometNativeScanExec no longer depends on CometScanExec. It is a step towards removing the current behavior where CometScanRule creates a CometScanExec for native_datafusion, which CometExecRule then replaces with a CometNativeScanExec.

CometNativeScanExec currently holds a runtime reference to CometScanExec (@transient scan: CometScanExec) solely to access file partition computation and driver metrics. This creates an unnecessary coupling between the two plan nodes. The file partition logic (partition listing, dynamic pruning, bucketed/non-bucketed splitting, driver metrics) is general-purpose and can be shared without one node depending on the other.

What changes are included in this PR?

Extracts file partition computation into a standalone FilePartitionPlanner utility class:

  • New FilePartitionPlanner — owns partition listing, dynamic partition pruning, bucketed/non-bucketed file splitting, and driver metric accumulation. Used by both CometScanExec (hybrid scans) and CometNativeScanExec (native scans).
  • New ShimFilePartitionPlanner traits (Spark 3.4/3.5/4.0) — version-specific splitFiles, getPartitionedFile, isNeededForSchema methods moved from ShimCometScanExec.
  • CometScanExec — adds planner field, delegates selectedPartitions and getFilePartitions() to it. Removes ~250 lines of inline partition logic.
  • CometNativeScanExec — replaces scan: CometScanExec with planner: FilePartitionPlanner. Uses originalPlan.driverMetrics directly for metrics.
  • CometNativeScan serde — passes op.planner to CometNativeScanExec instead of the scan itself.
  • ShimCometScanExec (all 3 versions) — removes methods that moved to ShimFilePartitionPlanner. Retains newFileScanRDD, getPushedDownFilters, and version-specific helpers.

Net: 500 lines added, 404 removed (the new utility class accounts for most of the additions).
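
For reviewers who want a picture of the resulting dependency shape, here is a minimal sketch. It is not the code in this PR: every class name below is a hypothetical stand-in, no Spark or Comet types are used, and the splitting rule (one byte-range slice per partition) is a deliberate simplification of the real bucketed/non-bucketed logic. It only illustrates that both scan nodes can hold the same planner instead of one node holding the other.

```scala
// Hypothetical stand-ins; none of these are real Comet or Spark classes.
final case class FileSlice(path: String, start: Long, length: Long)
final case class Partition(index: Int, files: Seq[FileSlice])

// Owns partition computation and driver-side metrics (simplified).
class PartitionPlannerSketch(files: Seq[(String, Long)], maxSplitBytes: Long) {
  require(maxSplitBytes > 0, "maxSplitBytes must be positive")

  // Computed once on first access, then reused by every consumer.
  lazy val partitions: Seq[Partition] = {
    val slices = for {
      (path, size) <- files
      start <- 0L until size by maxSplitBytes
    } yield FileSlice(path, start, math.min(maxSplitBytes, size - start))
    // Simplification: one slice per partition, no bin-packing.
    slices.zipWithIndex.map { case (s, i) => Partition(i, Seq(s)) }
  }

  // Metric names are illustrative only.
  def driverMetrics: Map[String, Long] =
    Map("numFiles" -> files.size.toLong, "filesSize" -> files.map(_._2).sum)
}

// The hybrid scan exposes its planner and delegates to it.
class HybridScanSketch(val planner: PartitionPlannerSketch) {
  def getFilePartitions(): Seq[Partition] = planner.partitions
}

// The native scan depends on the planner only, not on the hybrid scan.
class NativeScanSketch(planner: PartitionPlannerSketch) {
  def getFilePartitions(): Seq[Partition] = planner.partitions
}
```

In this sketch a native node would be built as `new NativeScanSketch(hybrid.planner)`, mirroring how the serde is described as passing `op.planner` rather than `op`.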

How are these changes tested?

  • Existing ParquetReadV1Suite and CometExecSuite tests pass, exercising both hybrid and native scan paths.
  • Updated ParquetReadFromFakeHadoopFsSuite to use the new planner field.
  • This is a pure refactoring with no behavioral changes — all existing tests provide coverage.

…raits

Create ShimFilePartitionPlanner traits for Spark 3.4, 3.5, and 4.0 that
provide version-specific file operations (isNeededForSchema,
getPartitionedFile, splitFiles). Create FilePartitionPlanner utility class
that encapsulates partition listing, dynamic pruning, bucketed/non-bucketed
splitting, and driver metric accumulation for file-based scans.
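
As a rough illustration of the shim pattern this commit describes, the sketch below shows version-independent planning code calling methods supplied by a version-specific trait. All names and signatures are assumptions made for the example, and the difference between the two shims is invented. In the real build each Spark profile would presumably compile its own source directory containing a trait with the same name, rather than two differently named traits living side by side.

```scala
// Illustrative stand-in for a partitioned file; not Spark's PartitionedFile.
final case class FileSliceSketch(path: String, start: Long, length: Long)

// Version-neutral contract with hypothetical signatures.
trait ShimPlannerContract {
  def getPartitionedFile(path: String, size: Long): FileSliceSketch

  // Cuts one file into byte ranges of at most maxSplitBytes each.
  def splitFiles(path: String, size: Long, maxSplitBytes: Long): Seq[FileSliceSketch] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    val whole = getPartitionedFile(path, size)
    (0L until size by maxSplitBytes).map { start =>
      whole.copy(start = start, length = math.min(maxSplitBytes, size - start))
    }
  }
}

// Stand-in for the shim compiled against one Spark version.
trait ShimVersionA extends ShimPlannerContract {
  def getPartitionedFile(path: String, size: Long): FileSliceSketch =
    FileSliceSketch(path, 0L, size)
}

// Stand-in for another version. The "file:" prefix is an invented difference.
trait ShimVersionB extends ShimPlannerContract {
  def getPartitionedFile(path: String, size: Long): FileSliceSketch =
    FileSliceSketch("file:" + path, 0L, size)
}

// Version-independent logic; the self-type requires some shim to be mixed in.
class PlannerWithShim(maxSplitBytes: Long) { self: ShimPlannerContract =>
  def plan(files: Seq[(String, Long)]): Seq[FileSliceSketch] =
    files.flatMap { case (path, size) => splitFiles(path, size, maxSplitBytes) }
}
```

A caller would pick the shim at construction time, for example `new PlannerWithShim(100L) with ShimVersionA`, while the body of `plan` stays identical across versions.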
…nPlanner

Remove duplicated splitFiles/getPartitionedFile/isNeededForSchema methods
from all three ShimCometScanExec variants (now in ShimFilePartitionPlanner).

CometScanExec delegates selectedPartitions and getFilePartitions() to a new
@transient lazy val planner field, removing internal partition computation
methods (dynamicallySelectedPartitions, createFilePartitionsForBucketedScan,
createFilePartitionsForNonBucketedScan, createBucketedReadRDD, createReadRDD,
setFilesNumAndSizeMetric, driverMetrics, sendDriverMetrics).

CometNativeScanExec takes FilePartitionPlanner instead of CometScanExec,
breaking the direct dependency. CometNativeScan serde passes op.planner
instead of op when creating CometNativeScanExec.
@andygrove andygrove changed the title Extract FilePartitionPlanner to decouple CometNativeScanExec from CometScanExec chore: Extract FilePartitionPlanner to decouple CometNativeScanExec from CometScanExec Mar 18, 2026
@andygrove andygrove marked this pull request as ready for review March 18, 2026 16:24