Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Jan 18, 2026

Summary

This is a work-in-progress PR - not ready for review yet.

This PR optimizes CometColumnarToRowExec for complex types (arrays, maps, and structs) through a series of optimizations:

Phase 1: Cache offset buffer addresses in Comet vectors

  • CometListVector and CometMapVector now cache the offset buffer memory address at construction time
  • getArray()/getMap() use Platform.getInt() for direct memory access instead of getOffsetBuffer().getInt()
  • This eliminates 2 virtual method calls per offset read (4 total per getArray()/getMap() call)

Phase 2: Custom code generation for complex types

  • CometColumnarToRowExec now generates optimized code for ArrayType and MapType columns
  • Per-batch caching of offset buffer addresses and child vectors in generated code
  • Uses Platform.getInt() directly in generated code for offset access

Phase 3: Reusable wrapper objects for arrays and maps

  • Added CometColumnarArray: mutable ArrayData implementation that allows updating offset/length without object allocation
  • Added CometColumnarMap: mutable MapData implementation with same reusability
  • Code generation creates wrappers once per batch and reuses via update() calls per row
  • Eliminates per-row object allocation for array and map columns

Phase 4: Reusable row wrapper for structs

  • Added CometColumnarRow: mutable InternalRow implementation that allows updating the rowId
  • Code generation detects StructType columns and uses reusable row wrappers
  • Eliminates per-row object allocation for struct columns

Before (per row):

// Array/Map: 2 virtual calls + new object allocation
listVector.getOffsetBuffer().getInt(i * OFFSET_WIDTH)
new ColumnarArray(dataVector, start, length)

// Struct: new object allocation
new ColumnarRow(structVector, rowId)

After (per row):

// Array/Map: JIT intrinsified memory access + reuse existing object
Platform.getInt(null, offsetBufferAddress + (long) i * OFFSET_WIDTH)
reusableArray.update(start, length)

// Struct: reuse existing object
reusableRow.update(rowId)

Test plan

  • CometArrayExpressionSuite passes (32 tests)
  • CometMapExpressionSuite passes (2 tests)
  • CometHashExpressionSuite passes (37 tests, includes struct hashing)
  • CometExecSuite passes (89 tests)
  • Full test suite
  • Performance benchmarks with production workloads

TODO

  • Benchmarking with actual Parquet I/O to measure improvements
  • Consider similar optimizations for nested complex types

🤖 Generated with Claude Code

andygrove and others added 4 commits January 16, 2026 12:48
This is a work-in-progress optimization for complex types (arrays and maps)
in CometColumnarToRowExec. The changes include:

Phase 1: Cache offset buffer addresses in Comet vectors
- CometListVector and CometMapVector now cache the offset buffer memory
  address at construction time
- getArray()/getMap() use Platform.getInt() for direct memory access
  instead of getOffsetBuffer().getInt()
- This eliminates virtual method calls per-row

Phase 2: Custom code generation for complex types
- CometColumnarToRowExec now generates optimized code for ArrayType and
  MapType columns
- Per-batch caching of offset buffer addresses and child vectors
- Uses Platform.getInt() directly in generated code for offset access

TODO:
- Phase 3: Reusable wrapper objects (MutableColumnarArray/Map)
- Benchmarking to measure performance improvements
- Additional test coverage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 3 of CometColumnarToRow optimization:

- Add CometColumnarArray: mutable ArrayData impl that allows updating
  offset/length without creating new objects
- Add CometColumnarMap: mutable MapData impl with same reusability
- Update code generation to create wrapper once per batch and reuse
  across all rows via update() calls

This eliminates object allocation per-row for array and map columns,
reducing GC pressure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 4 of CometColumnarToRow optimization:

- Add CometColumnarRow: mutable InternalRow impl that allows updating
  the rowId without creating new objects
- Update code generation to detect StructType columns and use
  CometColumnarRow wrappers created once per batch

This eliminates object allocation per-row for struct columns,
reducing GC pressure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov-commenter
Copy link

codecov-commenter commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 83.00000% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.25%. Comparing base (f09f8af) to head (a18e0e1).
⚠️ Report is 855 commits behind head on main.

Files with missing lines Patch % Lines
...java/org/apache/comet/vector/CometColumnarMap.java 0.00% 18 Missing ⚠️
.../java/org/apache/comet/vector/CometListVector.java 0.00% 6 Missing ⚠️
...n/java/org/apache/comet/vector/CometMapVector.java 0.00% 6 Missing ⚠️
...pache/spark/sql/comet/CometColumnarToRowExec.scala 97.64% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3213      +/-   ##
============================================
+ Coverage     56.12%   60.25%   +4.12%     
- Complexity      976     1448     +472     
============================================
  Files           119      171      +52     
  Lines         11743    15890    +4147     
  Branches       2251     2603     +352     
============================================
+ Hits           6591     9574    +2983     
- Misses         4012     4998     +986     
- Partials       1140     1318     +178     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

andygrove and others added 2 commits January 17, 2026 20:30
Add getVariant() method to CometColumnarArray and CometColumnarRow
without @OverRide annotation to maintain compatibility with both
Spark 3.x (which doesn't have this method) and Spark 4.x (which
requires it).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… dirs

Move these classes to spark-3.x and spark-4.0 directories to handle
the getVariant() method difference:
- Spark 3.x: SpecializedGetters doesn't have getVariant()
- Spark 4.x: SpecializedGetters.getVariant() returns VariantVal

This fixes the CI build failures for Spark 4.x.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants