chore: Extract FilePartitionPlanner to decouple CometNativeScanExec from CometScanExec #3729
Open
andygrove wants to merge 3 commits into apache:main
…raits Create ShimFilePartitionPlanner traits for Spark 3.4, 3.5, and 4.0 that provide version-specific file operations (isNeededForSchema, getPartitionedFile, splitFiles). Create FilePartitionPlanner utility class that encapsulates partition listing, dynamic pruning, bucketed/non-bucketed splitting, and driver metric accumulation for file-based scans.
…nPlanner Remove duplicated splitFiles/getPartitionedFile/isNeededForSchema methods from all three ShimCometScanExec variants (now in ShimFilePartitionPlanner). CometScanExec delegates selectedPartitions and getFilePartitions() to a new @transient lazy val planner field, removing internal partition computation methods (dynamicallySelectedPartitions, createFilePartitionsForBucketedScan, createFilePartitionsForNonBucketedScan, createBucketedReadRDD, createReadRDD, setFilesNumAndSizeMetric, driverMetrics, sendDriverMetrics). CometNativeScanExec takes FilePartitionPlanner instead of CometScanExec, breaking the direct dependency. CometNativeScan serde passes op.planner instead of op when creating CometNativeScanExec.
Which issue does this PR close?
Closes #.
Rationale for this change
This PR is a refactor to remove the dependency between CometNativeScanExec and CometScanExec. This is a step towards removing the current behavior where CometScanRule creates a CometScanExec for native_datafusion, which then gets replaced with CometNativeScanExec in CometExecRule.
CometNativeScanExec currently holds a runtime reference to CometScanExec (@transient scan: CometScanExec) solely to access file partition computation and driver metrics. This creates an unnecessary coupling between the two plan nodes. The file partition logic (partition listing, dynamic pruning, bucketed/non-bucketed splitting, driver metrics) is general-purpose and can be shared without one node depending on the other.
What changes are included in this PR?
Extracts file partition computation into a standalone FilePartitionPlanner utility class:
- FilePartitionPlanner: owns partition listing, dynamic partition pruning, bucketed/non-bucketed file splitting, and driver metric accumulation. Used by both CometScanExec (hybrid scans) and CometNativeScanExec (native scans).
- ShimFilePartitionPlanner traits (Spark 3.4/3.5/4.0): version-specific splitFiles, getPartitionedFile, and isNeededForSchema methods moved from ShimCometScanExec.
- CometScanExec: adds a planner field and delegates selectedPartitions and getFilePartitions() to it. Removes ~250 lines of inline partition logic.
- CometNativeScanExec: replaces scan: CometScanExec with planner: FilePartitionPlanner. Uses originalPlan.driverMetrics directly for metrics.
- CometNativeScan serde: passes op.planner to CometNativeScanExec instead of the scan itself.
- ShimCometScanExec (all 3 versions): removes methods that moved to ShimFilePartitionPlanner. Retains newFileScanRDD, getPushedDownFilters, and version-specific helpers.
Net: +500 lines added, -404 removed (new utility class accounts for most additions).
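For reviewers who want the shape of the change at a glance, here is a minimal, self-contained sketch of the decoupling. It is not the code in this PR: FileSlice, FilePartition, HybridScanNode, NativeScanNode, and the constructor parameters are hypothetical stand-ins with no Spark dependency, and the sketch only models size-based splitting and two driver metrics. Dynamic pruning, bucketing, and the version shims are left out.

```scala
// Simplified stand-ins; none of these are the real Spark or Comet types.
final case class FileSlice(path: String, start: Long, length: Long)
final case class FilePartition(index: Int, files: Seq[FileSlice])

// Hypothetical planner: owns file splitting and driver metrics so that
// neither scan node has to reach into the other to get them.
class FilePartitionPlanner(files: Seq[(String, Long)], maxSplitBytes: Long) {
  require(maxSplitBytes > 0, "maxSplitBytes must be positive")

  // Computed once, on first access, and shared by every holder of this planner.
  lazy val filePartitions: Seq[FilePartition] = {
    val slices = for {
      (path, size) <- files
      offset <- 0L until size by maxSplitBytes
    } yield FileSlice(path, offset, math.min(maxSplitBytes, size - offset))
    // One slice per partition keeps the sketch simple (no bin packing).
    slices.zipWithIndex.map { case (slice, i) => FilePartition(i, Seq(slice)) }
  }

  // Driver-side metrics derived from the file listing.
  def driverMetrics: Map[String, Long] = Map(
    "numFiles" -> files.size.toLong,
    "filesSize" -> files.map(_._2).sum)
}

// Stand-in for the hybrid scan node: builds the planner lazily and delegates.
class HybridScanNode(files: Seq[(String, Long)], maxSplitBytes: Long) {
  @transient lazy val planner = new FilePartitionPlanner(files, maxSplitBytes)
  def getFilePartitions(): Seq[FilePartition] = planner.filePartitions
}

// Stand-in for the native scan node: depends only on the planner.
class NativeScanNode(planner: FilePartitionPlanner) {
  def getFilePartitions(): Seq[FilePartition] = planner.filePartitions
}
```

Constructing new NativeScanNode(hybrid.planner) in this sketch plays the role of the serde passing op.planner: both nodes see the same partitions, and the native node never references the hybrid one.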
How are these changes tested?
- ParquetReadV1Suite and CometExecSuite tests pass, exercising both hybrid and native scan paths.
- Updated ParquetReadFromFakeHadoopFsSuite to use the new planner field.