row-spine: inline DatumContainer::index to fix CLU-116 regression#37036
Conversation
The dictionary read item `DatumSeq { ColumnsIter }` is 32 bytes, exceeding
the System V two-register return threshold. Emitted out-of-line, `index`
returns via a hidden `sret` pointer and spills registers around the
offset-decode jump table — ~250ms of a ~10% `ParallelDataflows` wallclock
regression, present even with dictionary compression disabled.
Marking `index` `inline(always)` lets the caller build the read item in
registers (SROA) and drop the unused codec/column fields on the no-codec
path (DCE), restoring the pre-dictionary cost.
Confirmed by A/B + samply on optimized builds: the regression (+2.2% at
n=100k x25, +1.7% at n=2M x10) drops to noise (+0.3%), and
`DatumContainer::index` disappears as a profile symbol (inlined), leaving
only the inner `BytesContainer::index` at the pre-dictionary self-cost.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
def-
left a comment
There was a problem hiding this comment.
Regression reproduced: +2.2% (n=100k×25), +1.7% (n=2M×10).
Hm, why is it only 2% and not 10%?
Otherwise this looks harmless, so no complaints.
|
I can't tell. This mostly restored performance for me locally, after the patch it was within .3% of the pre-dictionary-compression PR. |
|
I'll try running it on my machine with this PR |
|
Also, I was running on x86, what does feature benchmark use? |
|
Same. |
|
Seems to indeed fix the regressino in ParallelDataflows. Thanks! |
frankmcsherry
left a comment
There was a problem hiding this comment.
Seems very mergeable. I'm trying to get my head around where the regression comes from, as we switch the control flow around to not return iterators, and to use internal iteration instead. I can understand "yes, but until it is deleted the codegen matters" and happy to go ahead with that premise (and I'll work to rip iterators out entirely).
Motivation
#32095 (dictionary-compressed arrangements, alpha, default off) introduced a ~10% wallclock regression in the
ParallelDataflowsfeature benchmark, present even with the feature disabled (CLU-116).Description
The dictionary read item
DatumSeq { ColumnsIter }is 32 bytes (codec pointer + column + slice), exceeding the System V two-register return threshold. Emitted out-of-line,DatumContainer::indexreturns via a hiddensretpointer and spills registers around the offset-decode jump table — ~250ms of the regression, on the read path of every arrangement regardless of the flag. Before the dictionary work the read item was a 16-byte&[u8]returned in registers.Marking
index#[inline(always)]removes the call boundary: the caller builds the read item in registers (SROA) and drops the unused codec/column fields on the no-codec path (DCE), restoring the pre-dictionary cost. This is not an inlining-loss or branch-mispredict issue — the codec check is a branchlesscmovnzand the reducecomputeclosure self-time is unchanged; it is purely the return-value ABI.Verification
A/B of optimized builds at the PR commit vs its parent, cold (fresh
environmentd --resetper trial), profiled with samplyDatumContainer::indexdisappears as a profile symbol (inlined), leaving only the innerBytesContainer::indexat its pre-dictionary self-cost (1374 vs 1357 samples).