Skip to content

row-spine: inline DatumContainer::index to fix CLU-116 regression#37036

Draft
antiguru wants to merge 1 commit into
MaterializeInc:mainfrom
antiguru:moritz/clu-116-dictionary-compressed-arrangements-causes-10-time-regression
Draft

row-spine: inline DatumContainer::index to fix CLU-116 regression#37036
antiguru wants to merge 1 commit into
MaterializeInc:mainfrom
antiguru:moritz/clu-116-dictionary-compressed-arrangements-causes-10-time-regression

Conversation

@antiguru

@antiguru antiguru commented Jun 15, 2026

Copy link
Copy Markdown
Member

Motivation

#32095 (dictionary-compressed arrangements, alpha, default off) introduced a ~10% wallclock regression in the ParallelDataflows feature benchmark, present even with the feature disabled (CLU-116).

Description

The dictionary read item DatumSeq { ColumnsIter } is 32 bytes (codec pointer + column + slice), exceeding the System V two-register return threshold. Emitted out-of-line, DatumContainer::index returns via a hidden sret pointer and spills registers around the offset-decode jump table — ~250ms of the regression, on the read path of every arrangement regardless of the flag. Before the dictionary work the read item was a 16-byte &[u8] returned in registers.

Marking index #[inline(always)] removes the call boundary: the caller builds the read item in registers (SROA) and drops the unused codec/column fields on the no-codec path (DCE), restoring the pre-dictionary cost. This is not an inlining-loss or branch-mispredict issue — the codec check is a branchless cmovnz and the reduce compute closure self-time is unchanged; it is purely the return-value ABI.

Verification

A/B of optimized builds at the PR commit vs its parent, cold (fresh environmentd --reset per trial), profiled with samply

  • Regression reproduced: +2.2% (n=100k×25), +1.7% (n=2M×10).
  • With the fix: +0.3% (within run-to-run noise).
  • DatumContainer::index disappears as a profile symbol (inlined), leaving only the inner BytesContainer::index at its pre-dictionary self-cost (1374 vs 1357 samples).
  • clusterd memory unaffected (≤0.7% delta at both scales); the CI memory delta was sub-threshold noise.

The dictionary read item `DatumSeq { ColumnsIter }` is 32 bytes, exceeding
the System V two-register return threshold. Emitted out-of-line, `index`
returns via a hidden `sret` pointer and spills registers around the
offset-decode jump table — ~250ms of a ~10% `ParallelDataflows` wallclock
regression, present even with dictionary compression disabled.

Marking `index` `inline(always)` lets the caller build the read item in
registers (SROA) and drop the unused codec/column fields on the no-codec
path (DCE), restoring the pre-dictionary cost.

Confirmed by A/B + samply on optimized builds: the regression (+2.2% at
n=100k x25, +1.7% at n=2M x10) drops to noise (+0.3%), and
`DatumContainer::index` disappears as a profile symbol (inlined), leaving
only the inner `BytesContainer::index` at the pre-dictionary self-cost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@antiguru antiguru requested review from def- and frankmcsherry June 15, 2026 09:28

@def- def- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regression reproduced: +2.2% (n=100k×25), +1.7% (n=2M×10).

Hm, why is it only 2% and not 10%?

Otherwise this looks harmless, so no complaints.

antiguru commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

I can't tell. This mostly restored performance for me locally, after the patch it was within .3% of the pre-dictionary-compression PR.

@def-

def- commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

I'll try running it on my machine with this PR

Copy link
Copy Markdown
Member Author

Also, I was running on x86, what does feature benchmark use?

@def-

def- commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Same.

@def-

def- commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Seems to indeed fix the regressino in ParallelDataflows. Thanks!

@frankmcsherry frankmcsherry left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems very mergeable. I'm trying to get my head around where the regression comes from, as we switch the control flow around to not return iterators, and to use internal iteration instead. I can understand "yes, but until it is deleted the codegen matters" and happy to go ahead with that premise (and I'll work to rip iterators out entirely).

@antiguru antiguru marked this pull request as draft June 16, 2026 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants