Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive by lyang24 · Pull Request #9391 · apache/arrow-rs

lyang24 · 2026-02-11T05:52:07Z

Which issue does this PR close?

On clickbench query profiles Noticed the buffers allocated on producer thread, dropped on consumer thread. Mimalloc's deferred-free mechanism caused up to 9.26% CPU in _mi_heap_delayed_free_partial.

Rationale for this change

The previous implementation treated take as a purely generic gather, leaving buffer compaction to the downstream coalescing step. While correct, this meant that in sparse cases we would:

Read and copy string data twice (once during take, again during coalescing), and
Keep large backing buffers alive across thread boundaries via Arc, even when only a small subset of the data was referenced.

This PR optimizes the sparse case by recognizing that compaction is effectively inevitable downstream. When take selects a small fraction of rows, we fuse gather and compaction into a single pass. This:

Avoids the second read/copy of string data
Prevents retaining large unused buffers
Reduces cross-thread memory retention

The dense case remains unchanged, so the existing fast path is preserved. The heuristic is intentionally conservative and only triggers when sparsity suggests compaction will be beneficial.

What changes are included in this PR?

The optimization fuses the take and compaction into a single pass via take_byte_view_compact. It kicks in automatically when the take is sparse (indices.len() < array.len() / 2):

Are these changes tested?

arrow select test passed

Are there any user-facing changes?

no

Dandandan · 2026-02-11T07:55:05Z

run benchmark take_kernels

alamb-ghbot · 2026-02-11T09:02:43Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing take_byte_view_compact (c4d8432) to 7dbe58a diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=take_byte_view_compact
Results will be posted here when complete

alamb-ghbot · 2026-02-11T09:13:41Z

🤖: Benchmark completed

Details

group                                                                     main                                   take_byte_view_compact
-----                                                                     ----                                   ----------------------
take bool 1024                                                            1.00  1338.6±10.73ns        ? ?/sec    1.00   1332.0±2.99ns        ? ?/sec
take bool 512                                                             1.01    734.4±2.15ns        ? ?/sec    1.00    729.3±1.27ns        ? ?/sec
take bool null indices 1024                                               1.20  1256.1±45.38ns        ? ?/sec    1.00   1043.1±5.96ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
take bool null values null indices 1024                                   1.44      3.0±0.03µs        ? ?/sec    1.00      2.1±0.07µs        ? ?/sec
take check bounds i32 1024                                                1.00   849.9±19.98ns        ? ?/sec    1.00    848.9±3.09ns        ? ?/sec
take check bounds i32 512                                                 1.01    592.5±3.26ns        ? ?/sec    1.00   589.1±12.71ns        ? ?/sec
take i32 1024                                                             1.00    718.7±1.15ns        ? ?/sec    1.00   715.7±10.72ns        ? ?/sec
take i32 512                                                              1.01    444.9±2.09ns        ? ?/sec    1.00    442.5±1.05ns        ? ?/sec
take i32 null indices 1024                                                1.00    997.1±2.51ns        ? ?/sec    1.01  1009.0±86.37ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.01µs        ? ?/sec    1.01      2.1±0.02µs        ? ?/sec
take i32 null values null indices 1024                                    1.02      2.2±0.08µs        ? ?/sec    1.00      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00      3.5±0.04µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      4.8±0.06µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.02     20.7±0.07µs        ? ?/sec    1.00     20.3±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.1±0.04µs        ? ?/sec    1.00     11.2±0.05µs        ? ?/sec
take str 512                                                              1.01      5.4±0.03µs        ? ?/sec    1.00      5.4±0.03µs        ? ?/sec
take str null indices 1024                                                1.00      7.8±0.03µs        ? ?/sec    1.01      7.8±0.03µs        ? ?/sec
take str null indices 512                                                 1.00      3.7±0.02µs        ? ?/sec    1.01      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.00      8.8±0.08µs        ? ?/sec    1.00      8.8±0.05µs        ? ?/sec
take str null values null indices 1024                                    1.02      7.0±0.11µs        ? ?/sec    1.00      6.8±0.16µs        ? ?/sec
take stringview 1024                                                      1.10   898.5±10.53ns        ? ?/sec    1.00    819.5±7.92ns        ? ?/sec
take stringview 512                                                       1.18    599.4±8.94ns        ? ?/sec    1.00    509.6±4.78ns        ? ?/sec
take stringview null indices 1024                                         1.00  1431.0±26.67ns        ? ?/sec    1.00  1431.8±15.38ns        ? ?/sec
take stringview null indices 512                                          1.00    738.8±2.44ns        ? ?/sec    1.00    738.6±2.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.02µs        ? ?/sec    1.04      2.2±0.01µs        ? ?/sec
take stringview null values null indices 1024                             1.07      2.5±0.03µs        ? ?/sec    1.00      2.3±0.02µs        ? ?/sec

arrow-select/src/take.rs

Dandandan · 2026-02-11T19:33:39Z

arrow-select/src/take.rs

+    // This avoids reallocations during the copy phase. We only read the u128
+    // view descriptors here, not the actual string data.
+    let mut total_bytes: usize = 0;
+    for (i, idx) in indices.values().iter().enumerate() {


I think it makes sense to add some special cases, e.g. :

no input buffers (i.e. source was all inlined)

no output buffer (target is all inlined)

a single output buffer (offset always zero)

good catch added special cases especially all inlined case could be a fast path

alamb · 2026-02-11T19:34:11Z

run benchmark take_kernels

alamb-ghbot · 2026-02-11T21:50:04Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing take_byte_view_compact (c4d8432) to 7dbe58a diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=take_byte_view_compact
Results will be posted here when complete

alamb-ghbot · 2026-02-11T22:00:33Z

🤖: Benchmark completed

Details

group                                                                     main                                   take_byte_view_compact
-----                                                                     ----                                   ----------------------
take bool 1024                                                            1.00   1333.8±1.21ns        ? ?/sec    1.00   1331.1±1.78ns        ? ?/sec
take bool 512                                                             1.01    734.8±1.01ns        ? ?/sec    1.00    730.8±1.32ns        ? ?/sec
take bool null indices 1024                                               1.00  1292.5±28.41ns        ? ?/sec    1.61      2.1±0.03µs        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.01µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.9±0.11µs        ? ?/sec    1.48      4.3±0.09µs        ? ?/sec
take check bounds i32 1024                                                1.00    846.0±2.16ns        ? ?/sec    1.00    846.6±2.51ns        ? ?/sec
take check bounds i32 512                                                 1.14    588.9±7.75ns        ? ?/sec    1.00    517.2±6.61ns        ? ?/sec
take i32 1024                                                             1.01   721.6±20.07ns        ? ?/sec    1.00    716.5±1.13ns        ? ?/sec
take i32 512                                                              1.00    444.7±2.67ns        ? ?/sec    1.00    445.4±1.94ns        ? ?/sec
take i32 null indices 1024                                                1.44  1448.4±12.46ns        ? ?/sec    1.00  1007.1±79.72ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.02µs        ? ?/sec    1.01      2.1±0.02µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.2±0.02µs        ? ?/sec    1.38      3.0±0.04µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.01      3.5±0.07µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      4.8±0.12µs        ? ?/sec    1.00      4.8±0.09µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.8±0.10µs        ? ?/sec    1.00     20.8±0.23µs        ? ?/sec
take str 1024                                                             1.00     11.1±0.04µs        ? ?/sec    1.00     11.1±0.06µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.00      5.4±0.11µs        ? ?/sec
take str null indices 1024                                                1.01      7.9±0.05µs        ? ?/sec    1.00      7.8±0.09µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.01      8.8±0.26µs        ? ?/sec    1.00      8.8±0.14µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.09µs        ? ?/sec    1.09      7.6±0.05µs        ? ?/sec
take stringview 1024                                                      1.10    894.5±9.39ns        ? ?/sec    1.00    810.3±1.60ns        ? ?/sec
take stringview 512                                                       1.16    591.9±6.99ns        ? ?/sec    1.00    510.2±2.35ns        ? ?/sec
take stringview null indices 1024                                         1.00  1454.3±33.02ns        ? ?/sec    1.00  1448.6±54.65ns        ? ?/sec
take stringview null indices 512                                          1.01    739.6±2.48ns        ? ?/sec    1.00    731.9±4.32ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.07µs        ? ?/sec    1.01      2.1±0.11µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.5±0.05µs        ? ?/sec    1.25      3.1±0.03µs        ? ?/sec

…ginal buffers alive when selectivity is high Signed-off-by: lyang24 <lanqingy93@gmail.com>

github-actions bot added the arrow Changes to the arrow crate label Feb 11, 2026

lyang24 changed the title ~~Parquet Compact StringView buffer during sparse take to avoid holding the original buffers alive when selectivity is high~~ Parquet Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive Feb 11, 2026

lyang24 force-pushed the take_byte_view_compact branch from 0a86a90 to c4d8432 Compare February 11, 2026 06:11

lyang24 changed the title ~~Parquet Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive~~ Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive Feb 11, 2026

lyang24 marked this pull request as ready for review February 11, 2026 17:47

Dandandan reviewed Feb 11, 2026

View reviewed changes

arrow-select/src/take.rs Show resolved Hide resolved

Dandandan reviewed Feb 11, 2026

View reviewed changes

arrow-select/src/take.rs Show resolved Hide resolved

Dandandan reviewed Feb 11, 2026

View reviewed changes

lyang24 force-pushed the take_byte_view_compact branch from c4d8432 to 21830ad Compare February 12, 2026 05:22

Compact StringView buffer during sparse take to avoid holding the ori…

b8be571

…ginal buffers alive when selectivity is high Signed-off-by: lyang24 <lanqingy93@gmail.com>

lyang24 force-pushed the take_byte_view_compact branch from 21830ad to b8be571 Compare February 12, 2026 05:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391

Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391
lyang24 wants to merge 1 commit intoapache:mainfrom
lyang24:take_byte_view_compact

lyang24 commented Feb 11, 2026 •

edited

Loading

Uh oh!

Dandandan commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

Uh oh!

Uh oh!

Dandandan Feb 11, 2026

Uh oh!

lyang24 Feb 12, 2026

Uh oh!

alamb commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lyang24 commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Dandandan commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

Uh oh!

Uh oh!

Dandandan Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

lyang24 Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

alamb commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

alamb-ghbot commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lyang24 commented Feb 11, 2026 •

edited

Loading