Skip to content

GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter#3533

Closed
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:perf-intlist-size
Closed

GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter#3533
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:perf-intlist-size

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented May 1, 2026

Summary

  • Replace IntList.size() slab-iteration with a simple totalSize counter incremented on each add()
  • Eliminates O(slabs) overhead from dictionary encoding hot paths where size() is called frequently

Details

IntList.size() was iterating over all slabs to sum their lengths on every call. For delta encoding writers that check size thresholds frequently, this becomes O(slabs) per check. With a running counter it's O(1).

The improvement is primarily relevant for large row groups where the number of slabs grows.

All 576 parquet-column tests pass.

…ning counter

IntList.size() was iterating over all slabs to sum their lengths on every
call. Replace with a simple totalSize counter incremented on each add().
This eliminates O(slabs) overhead from dictionary encoding hot paths where
size() is called frequently.

private void allocateSlab() {
currentSlab = new int[currentSlabSize];
currentSlabPos = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also set totalSize to 0 here?


currentSlab[currentSlabPos] = i;
++currentSlabPos;
++totalSize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, totalSize will always be the same as currentSlabSize

Comment on lines -154 to -156
for (int[] slab : slabs) {
size += slab.length;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is correct. We're returning the slab length, and this one can grow as well (double until MAX_SLAB_SIZE)

@iemejia iemejia marked this pull request as draft May 15, 2026 09:35
@iemejia
Copy link
Copy Markdown
Member Author

iemejia commented May 17, 2026

Closing in favor of #3566.

I initially submitted a series of small, focused PRs thinking they'd be easier to review. In practice the sheer number (~16 PRs, with more pending) made things harder to follow — even for me. I've regrouped the changes by encoding type / performance area so that each PR is self-contained with its own benchmarks and test coverage, which should make review and performance analysis much more straightforward.

Apologies for the churn. If you've been reviewing this PR, please continue the discussion on #3566 which supersedes it. Thank you.

@iemejia iemejia closed this May 17, 2026
@iemejia iemejia deleted the perf-intlist-size branch May 17, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants