-
Notifications
You must be signed in to change notification settings - Fork 270
perf: [WIP] optimize CometColumnarToRow for complex types #3213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
andygrove
wants to merge
6
commits into
apache:main
Choose a base branch
from
andygrove:perf/optimize-complex-type-codegen
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
perf: [WIP] optimize CometColumnarToRow for complex types #3213
andygrove
wants to merge
6
commits into
apache:main
from
andygrove:perf/optimize-complex-type-codegen
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is a work-in-progress optimization for complex types (arrays and maps) in CometColumnarToRowExec. The changes include: Phase 1: Cache offset buffer addresses in Comet vectors - CometListVector and CometMapVector now cache the offset buffer memory address at construction time - getArray()/getMap() use Platform.getInt() for direct memory access instead of getOffsetBuffer().getInt() - This eliminates virtual method calls per-row Phase 2: Custom code generation for complex types - CometColumnarToRowExec now generates optimized code for ArrayType and MapType columns - Per-batch caching of offset buffer addresses and child vectors - Uses Platform.getInt() directly in generated code for offset access TODO: - Phase 3: Reusable wrapper objects (MutableColumnarArray/Map) - Benchmarking to measure performance improvements - Additional test coverage Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 3 of CometColumnarToRow optimization: - Add CometColumnarArray: mutable ArrayData impl that allows updating offset/length without creating new objects - Add CometColumnarMap: mutable MapData impl with same reusability - Update code generation to create wrapper once per batch and reuse across all rows via update() calls This eliminates object allocation per-row for array and map columns, reducing GC pressure. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 of CometColumnarToRow optimization: - Add CometColumnarRow: mutable InternalRow impl that allows updating the rowId without creating new objects - Update code generation to detect StructType columns and use CometColumnarRow wrappers created once per batch This eliminates object allocation per-row for struct columns, reducing GC pressure. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3213 +/- ##
============================================
+ Coverage 56.12% 60.25% +4.12%
- Complexity 976 1448 +472
============================================
Files 119 171 +52
Lines 11743 15890 +4147
Branches 2251 2603 +352
============================================
+ Hits 6591 9574 +2983
- Misses 4012 4998 +986
- Partials 1140 1318 +178 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add getVariant() method to CometColumnarArray and CometColumnarRow without @OverRide annotation to maintain compatibility with both Spark 3.x (which doesn't have this method) and Spark 4.x (which requires it). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… dirs Move these classes to spark-3.x and spark-4.0 directories to handle the getVariant() method difference: - Spark 3.x: SpecializedGetters doesn't have getVariant() - Spark 4.x: SpecializedGetters.getVariant() returns VariantVal This fixes the CI build failures for Spark 4.x. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
This is a work-in-progress PR - not ready for review yet.
This PR optimizes
CometColumnarToRowExecfor complex types (arrays, maps, and structs) through a series of optimizations:Phase 1: Cache offset buffer addresses in Comet vectors
CometListVectorandCometMapVectornow cache the offset buffer memory address at construction timegetArray()/getMap()usePlatform.getInt()for direct memory access instead ofgetOffsetBuffer().getInt()getArray()/getMap()call)Phase 2: Custom code generation for complex types
CometColumnarToRowExecnow generates optimized code forArrayTypeandMapTypecolumnsPlatform.getInt()directly in generated code for offset accessPhase 3: Reusable wrapper objects for arrays and maps
CometColumnarArray: mutableArrayDataimplementation that allows updating offset/length without object allocationCometColumnarMap: mutableMapDataimplementation with same reusabilityupdate()calls per rowPhase 4: Reusable row wrapper for structs
CometColumnarRow: mutableInternalRowimplementation that allows updating the rowIdStructTypecolumns and uses reusable row wrappersBefore (per row):
After (per row):
Test plan
CometArrayExpressionSuitepasses (32 tests)CometMapExpressionSuitepasses (2 tests)CometHashExpressionSuitepasses (37 tests, includes struct hashing)CometExecSuitepasses (89 tests)TODO
🤖 Generated with Claude Code