GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter #3533
iemejia wants to merge 1 commit into
…ning counter

IntList.size() was iterating over all slabs to sum their lengths on every call. Replace with a simple totalSize counter incremented on each add(). This eliminates O(slabs) overhead from dictionary encoding hot paths where size() is called frequently.

    private void allocateSlab() {
      currentSlab = new int[currentSlabSize];
      currentSlabPos = 0;

Should we also set totalSize to 0 here?

    currentSlab[currentSlabPos] = i;
    ++currentSlabPos;
    ++totalSize;

Looking at the code, totalSize will always be the same as currentSlabSize

    for (int[] slab : slabs) {
      size += slab.length;
    }

I don't think this change is correct. We're returning the slab length, and this one can grow as well (doubling until MAX_SLAB_SIZE).
Closing in favor of #3566. I initially submitted a series of small, focused PRs thinking they'd be easier to review. In practice the sheer number (~16 PRs, with more pending) made things harder to follow, even for me. I've regrouped the changes by encoding type / performance area so that each PR is self-contained with its own benchmarks and test coverage, which should make review and performance analysis much more straightforward. Apologies for the churn. If you've been reviewing this PR, please continue the discussion on #3566, which supersedes it. Thank you.
Summary
- Replace IntList.size() slab-iteration with a simple totalSize counter incremented on each add()
- Eliminates O(slabs) overhead from dictionary encoding hot paths where size() is called frequently

Details
IntList.size() was iterating over all slabs to sum their lengths on every call. For delta encoding writers that check size thresholds frequently, this becomes O(slabs) per check. With a running counter it's O(1).

The improvement is primarily relevant for large row groups where the number of slabs grows.
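
To make the trade-off concrete, here is a minimal sketch of a slab-backed list that offers both ways of computing the size. It is not the actual IntList source: the class name SlabIntListSketch, the fullSlabs field, the sizeByIteration() and slabCount() helpers, and the tiny slab constants are all assumptions chosen for illustration. The point it tries to show is that the running counter tracks elements added, so it stays correct when slab capacity doubles and does not need to be reset when a new slab is allocated.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a slab-backed int list; not the actual Parquet IntList. */
final class SlabIntListSketch {
  // Assumed, deliberately tiny constants so slab growth is easy to follow.
  private static final int INITIAL_SLAB_SIZE = 4;
  private static final int MAX_SLAB_SIZE = 16;

  private final List<int[]> fullSlabs = new ArrayList<>(); // retired, completely filled slabs
  private int[] currentSlab;
  private int currentSlabPos;
  private int currentSlabSize = INITIAL_SLAB_SIZE;
  private int totalSize; // running element count; never reset on slab allocation

  void add(int value) {
    if (currentSlab == null) {
      currentSlab = new int[currentSlabSize];
    } else if (currentSlabPos == currentSlab.length) {
      // Current slab is full: retire it and allocate a larger one (capped).
      fullSlabs.add(currentSlab);
      currentSlabSize = Math.min(currentSlabSize * 2, MAX_SLAB_SIZE);
      currentSlab = new int[currentSlabSize];
      currentSlabPos = 0;
    }
    currentSlab[currentSlabPos++] = value;
    ++totalSize;
  }

  /** O(slabs): sums the retired slabs plus the filled part of the current slab. */
  int sizeByIteration() {
    int size = currentSlabPos;
    for (int[] slab : fullSlabs) {
      size += slab.length;
    }
    return size;
  }

  /** O(1): returns the running counter. */
  int size() {
    return totalSize;
  }

  /** Number of slabs allocated so far, including the one being filled. */
  int slabCount() {
    return fullSlabs.size() + (currentSlab == null ? 0 : 1);
  }

  public static void main(String[] args) {
    SlabIntListSketch list = new SlabIntListSketch();
    for (int i = 0; i < 50; i++) {
      list.add(i);
    }
    // Slabs of capacity 4, 8, 16, 16 are full (44 elements); the fifth holds 6.
    System.out.println(list.size() + " " + list.sizeByIteration()); // prints "50 50"
  }
}
```

In this sketch both methods agree after every add(), while the cost of sizeByIteration() grows with the number of retired slabs. Whether the gain is measurable in practice depends on how many slabs a real workload produces, which the sketch does not attempt to model.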
All 576 parquet-column tests pass.