row-spine: inline DatumContainer::index to fix CLU-116 regression by antiguru · Pull Request #37036 · MaterializeInc/materialize

antiguru · 2026-06-15T09:28:21Z

Motivation

#32095 (dictionary-compressed arrangements, alpha, default off) introduced a ~10% wallclock regression in the ParallelDataflows feature benchmark, present even with the feature disabled (CLU-116).

Description

The dictionary read item DatumSeq { ColumnsIter } is 32 bytes (codec pointer + column + slice), exceeding the System V two-register return threshold. Emitted out-of-line, DatumContainer::index returns via a hidden sret pointer and spills registers around the offset-decode jump table — ~250ms of the regression, on the read path of every arrangement regardless of the flag. Before the dictionary work the read item was a 16-byte &[u8] returned in registers.
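The size argument can be checked with a minimal sketch. `Codec` and `ReadItem` below are hypothetical stand-ins that only mirror the field shape described above (codec pointer + column + slice); they are not the actual `DatumSeq` definition. The sizes assume a 64-bit target, and Rust's own ABI is unspecified, so the two-register cutoff is what is typically observed on x86-64 rather than a guarantee.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the dictionary codec; its contents do not matter here.
#[allow(dead_code)]
struct Codec {
    dictionary: Vec<u8>,
}

/// Hypothetical stand-in for the post-dictionary read item:
/// codec pointer (8) + column (8) + slice (16) = 32 bytes on a 64-bit target.
#[allow(dead_code)]
struct ReadItem<'a> {
    codec: Option<&'a Codec>,
    column: usize,
    bytes: &'a [u8],
}

/// Size in bytes of the pre-dictionary read item, a plain byte slice.
fn slice_item_size() -> usize {
    size_of::<&[u8]>()
}

/// Size in bytes of the hypothetical post-dictionary read item.
fn dict_item_size() -> usize {
    size_of::<ReadItem<'static>>()
}

fn main() {
    // A fat pointer (pointer + length) fits in two general-purpose registers.
    println!("&[u8]: {} bytes", slice_item_size());
    // Four words do not, so an out-of-line call typically returns it through memory.
    println!("ReadItem: {} bytes", dict_item_size());
}
```

The slice fits in a register pair, while the four-word struct does not, which is the boundary the description attributes the extra cost to.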

Marking index #[inline(always)] removes the call boundary: the caller builds the read item in registers (SROA) and drops the unused codec/column fields on the no-codec path (DCE), restoring the pre-dictionary cost. This is not an inlining-loss or branch-mispredict issue — the codec check is a branchless cmovnz and the reduce compute closure self-time is unchanged; it is purely the return-value ABI.
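As an illustration of the shape of the fix, here is a minimal sketch with hypothetical `Container`, `Codec`, and `ReadItem` types (not the real `DatumContainer`). With `#[inline(always)]` on `index`, a caller that reads only `bytes` gives the optimizer the chance to keep the item in registers and discard the unused fields; whether it does so in a given build is best confirmed in the emitted assembly or a profile.

```rust
/// Hypothetical codec marker; a unit struct is enough for the sketch.
struct Codec;

/// Hypothetical read item with the same field shape as described above.
#[allow(dead_code)]
struct ReadItem<'a> {
    codec: Option<&'a Codec>,
    column: usize,
    bytes: &'a [u8],
}

/// Hypothetical container: rows are stored back to back in `bytes`,
/// with `offsets[i]..offsets[i + 1]` delimiting row `i`.
struct Container {
    codec: Option<Codec>,
    offsets: Vec<usize>,
    bytes: Vec<u8>,
}

impl Container {
    /// Forcing inlining removes the call boundary, so the 32-byte item
    /// never has to be returned through a hidden out-pointer.
    #[inline(always)]
    fn index(&self, i: usize) -> ReadItem<'_> {
        ReadItem {
            codec: self.codec.as_ref(),
            column: 0,
            bytes: &self.bytes[self.offsets[i]..self.offsets[i + 1]],
        }
    }
}

/// A no-codec caller: only `bytes` is read, so after inlining the
/// `codec` and `column` fields are dead and can be eliminated.
fn total_len(c: &Container) -> usize {
    let rows = c.offsets.len().saturating_sub(1);
    (0..rows).map(|i| c.index(i).bytes.len()).sum()
}

/// Small fixture with three rows: "abc", "de", "fghi".
fn sample() -> Container {
    Container {
        codec: None,
        offsets: vec![0, 3, 5, 9],
        bytes: b"abcdefghi".to_vec(),
    }
}

fn main() {
    let c = sample();
    println!("total bytes: {}", total_len(&c));
}
```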

Verification

A/B comparison of optimized builds at the PR commit vs its parent, cold (fresh environmentd --reset per trial), profiled with samply:

Regression reproduced: +2.2% (n=100k×25), +1.7% (n=2M×10).
With the fix: +0.3% (within run-to-run noise).
DatumContainer::index disappears as a profile symbol (inlined), leaving only the inner BytesContainer::index at its pre-dictionary self-cost (1374 vs 1357 samples).
clusterd memory unaffected (≤0.7% delta at both scales); the CI memory delta was sub-threshold noise.
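For a sense of scale, the two self-cost sample counts quoted above (1374 and 1357) differ by roughly 1.25%. The helper below is a hypothetical convenience for that arithmetic, not part of the PR or of samply.

```rust
/// Relative difference between two sample counts, as a percentage of the smaller one.
fn relative_delta_percent(a: u64, b: u64) -> f64 {
    let (lo, hi) = if a <= b { (a, b) } else { (b, a) };
    (hi - lo) as f64 / lo as f64 * 100.0
}

fn main() {
    // Sample counts quoted in the verification notes.
    let delta = relative_delta_percent(1374, 1357);
    println!("self-cost delta: {:.2}%", delta);
}
```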

The dictionary read item `DatumSeq { ColumnsIter }` is 32 bytes, exceeding the System V two-register return threshold. Emitted out-of-line, `index` returns via a hidden `sret` pointer and spills registers around the offset-decode jump table. This accounts for ~250ms of a ~10% `ParallelDataflows` wallclock regression, present even with dictionary compression disabled. Marking `index` `inline(always)` lets the caller build the read item in registers (SROA) and drop the unused codec/column fields on the no-codec path (DCE), restoring the pre-dictionary cost. Confirmed by A/B + samply on optimized builds: the regression (+2.2% at n=100k x25, +1.7% at n=2M x10) drops to noise (+0.3%), and `DatumContainer::index` disappears as a profile symbol (inlined), leaving only the inner `BytesContainer::index` at the pre-dictionary self-cost.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

def-

> Regression reproduced: +2.2% (n=100k×25), +1.7% (n=2M×10).

Hm, why is it only 2% and not 10%?

Otherwise this looks harmless, so no complaints.

antiguru · 2026-06-15T11:31:45Z

I can't tell. This mostly restored performance for me locally; after the patch it was within 0.3% of the pre-dictionary-compression PR.

def- · 2026-06-15T11:39:08Z

I'll try running it on my machine with this PR.

antiguru · 2026-06-15T11:39:53Z

Also, I was running on x86; what does the feature benchmark use?

def- · 2026-06-15T11:42:05Z

Same.

def- · 2026-06-15T13:25:00Z

Seems to indeed fix the regression in ParallelDataflows. Thanks!

frankmcsherry

Seems very mergeable. I'm trying to get my head around where the regression comes from, as we switch the control flow around to not return iterators, and to use internal iteration instead. I can understand "yes, but until it is deleted the codegen matters" and happy to go ahead with that premise (and I'll work to rip iterators out entirely).

antiguru requested review from def- and frankmcsherry June 15, 2026 09:28

def- approved these changes Jun 15, 2026

frankmcsherry approved these changes Jun 15, 2026

antiguru marked this pull request as draft June 16, 2026 07:13

antiguru wants to merge 1 commit into MaterializeInc:main from antiguru:moritz/clu-116-dictionary-compressed-arrangements-causes-10-time-regression

3 participants
