Skip to content

Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391

Open
lyang24 wants to merge 1 commit intoapache:mainfrom
lyang24:take_byte_view_compact
Open

Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive#9391
lyang24 wants to merge 1 commit intoapache:mainfrom
lyang24:take_byte_view_compact

Conversation

@lyang24
Copy link
Contributor

@lyang24 lyang24 commented Feb 11, 2026

Which issue does this PR close?

On clickbench query profiles Noticed the buffers allocated on producer thread, dropped on consumer thread. Mimalloc's deferred-free mechanism caused up to 9.26% CPU in _mi_heap_delayed_free_partial.

Rationale for this change

The previous implementation treated take as a purely generic gather, leaving buffer compaction to the downstream coalescing step. While correct, this meant that in sparse cases we would:

  1. Read and copy string data twice (once during take, again during coalescing), and
  2. Keep large backing buffers alive across thread boundaries via Arc, even when only a small subset of the data was referenced.

This PR optimizes the sparse case by recognizing that compaction is effectively inevitable downstream. When take selects a small fraction of rows, we fuse gather and compaction into a single pass. This:

  1. Avoids the second read/copy of string data
  2. Prevents retaining large unused buffers
  3. Reduces cross-thread memory retention

The dense case remains unchanged, so the existing fast path is preserved. The heuristic is intentionally conservative and only triggers when sparsity suggests compaction will be beneficial.

What changes are included in this PR?

The optimization fuses the take and compaction into a single pass via take_byte_view_compact. It kicks in automatically when the take is sparse (indices.len() < array.len() / 2):

Are these changes tested?

arrow select test passed

Are there any user-facing changes?

no

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 11, 2026
@lyang24 lyang24 changed the title Parquet Compact StringView buffer during sparse take to avoid holding the original buffers alive when selectivity is high Parquet Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive Feb 11, 2026
@lyang24 lyang24 force-pushed the take_byte_view_compact branch from 0a86a90 to c4d8432 Compare February 11, 2026 06:11
@lyang24 lyang24 changed the title Parquet Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive Compact StringView buffer during sparse (<50%) take to avoid holding the original buffers alive Feb 11, 2026
@Dandandan
Copy link
Contributor

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing take_byte_view_compact (c4d8432) to 7dbe58a diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=take_byte_view_compact
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     main                                   take_byte_view_compact
-----                                                                     ----                                   ----------------------
take bool 1024                                                            1.00  1338.6±10.73ns        ? ?/sec    1.00   1332.0±2.99ns        ? ?/sec
take bool 512                                                             1.01    734.4±2.15ns        ? ?/sec    1.00    729.3±1.27ns        ? ?/sec
take bool null indices 1024                                               1.20  1256.1±45.38ns        ? ?/sec    1.00   1043.1±5.96ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
take bool null values null indices 1024                                   1.44      3.0±0.03µs        ? ?/sec    1.00      2.1±0.07µs        ? ?/sec
take check bounds i32 1024                                                1.00   849.9±19.98ns        ? ?/sec    1.00    848.9±3.09ns        ? ?/sec
take check bounds i32 512                                                 1.01    592.5±3.26ns        ? ?/sec    1.00   589.1±12.71ns        ? ?/sec
take i32 1024                                                             1.00    718.7±1.15ns        ? ?/sec    1.00   715.7±10.72ns        ? ?/sec
take i32 512                                                              1.01    444.9±2.09ns        ? ?/sec    1.00    442.5±1.05ns        ? ?/sec
take i32 null indices 1024                                                1.00    997.1±2.51ns        ? ?/sec    1.01  1009.0±86.37ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.01µs        ? ?/sec    1.01      2.1±0.02µs        ? ?/sec
take i32 null values null indices 1024                                    1.02      2.2±0.08µs        ? ?/sec    1.00      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00      3.5±0.04µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      4.8±0.06µs        ? ?/sec    1.00      4.8±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.02     20.7±0.07µs        ? ?/sec    1.00     20.3±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.1±0.04µs        ? ?/sec    1.00     11.2±0.05µs        ? ?/sec
take str 512                                                              1.01      5.4±0.03µs        ? ?/sec    1.00      5.4±0.03µs        ? ?/sec
take str null indices 1024                                                1.00      7.8±0.03µs        ? ?/sec    1.01      7.8±0.03µs        ? ?/sec
take str null indices 512                                                 1.00      3.7±0.02µs        ? ?/sec    1.01      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.00      8.8±0.08µs        ? ?/sec    1.00      8.8±0.05µs        ? ?/sec
take str null values null indices 1024                                    1.02      7.0±0.11µs        ? ?/sec    1.00      6.8±0.16µs        ? ?/sec
take stringview 1024                                                      1.10   898.5±10.53ns        ? ?/sec    1.00    819.5±7.92ns        ? ?/sec
take stringview 512                                                       1.18    599.4±8.94ns        ? ?/sec    1.00    509.6±4.78ns        ? ?/sec
take stringview null indices 1024                                         1.00  1431.0±26.67ns        ? ?/sec    1.00  1431.8±15.38ns        ? ?/sec
take stringview null indices 512                                          1.00    738.8±2.44ns        ? ?/sec    1.00    738.6±2.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.02µs        ? ?/sec    1.04      2.2±0.01µs        ? ?/sec
take stringview null values null indices 1024                             1.07      2.5±0.03µs        ? ?/sec    1.00      2.3±0.02µs        ? ?/sec

@lyang24 lyang24 marked this pull request as ready for review February 11, 2026 17:47
// This avoids reallocations during the copy phase. We only read the u128
// view descriptors here, not the actual string data.
let mut total_bytes: usize = 0;
for (i, idx) in indices.values().iter().enumerate() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to add some special cases, e.g. :

  • no input buffers (i.e. source was all inlined)
  • no output buffer (target is all inlined)
  • a single output buffer (offset always zero)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch added special cases especially all inlined case could be a fast path

@alamb
Copy link
Contributor

alamb commented Feb 11, 2026

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing take_byte_view_compact (c4d8432) to 7dbe58a diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=take_byte_view_compact
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     main                                   take_byte_view_compact
-----                                                                     ----                                   ----------------------
take bool 1024                                                            1.00   1333.8±1.21ns        ? ?/sec    1.00   1331.1±1.78ns        ? ?/sec
take bool 512                                                             1.01    734.8±1.01ns        ? ?/sec    1.00    730.8±1.32ns        ? ?/sec
take bool null indices 1024                                               1.00  1292.5±28.41ns        ? ?/sec    1.61      2.1±0.03µs        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.01µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.9±0.11µs        ? ?/sec    1.48      4.3±0.09µs        ? ?/sec
take check bounds i32 1024                                                1.00    846.0±2.16ns        ? ?/sec    1.00    846.6±2.51ns        ? ?/sec
take check bounds i32 512                                                 1.14    588.9±7.75ns        ? ?/sec    1.00    517.2±6.61ns        ? ?/sec
take i32 1024                                                             1.01   721.6±20.07ns        ? ?/sec    1.00    716.5±1.13ns        ? ?/sec
take i32 512                                                              1.00    444.7±2.67ns        ? ?/sec    1.00    445.4±1.94ns        ? ?/sec
take i32 null indices 1024                                                1.44  1448.4±12.46ns        ? ?/sec    1.00  1007.1±79.72ns        ? ?/sec
take i32 null values 1024                                                 1.00      2.0±0.02µs        ? ?/sec    1.01      2.1±0.02µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.2±0.02µs        ? ?/sec    1.38      3.0±0.04µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.01      3.5±0.07µs        ? ?/sec    1.00      3.4±0.01µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      4.8±0.12µs        ? ?/sec    1.00      4.8±0.09µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.8±0.10µs        ? ?/sec    1.00     20.8±0.23µs        ? ?/sec
take str 1024                                                             1.00     11.1±0.04µs        ? ?/sec    1.00     11.1±0.06µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.00      5.4±0.11µs        ? ?/sec
take str null indices 1024                                                1.01      7.9±0.05µs        ? ?/sec    1.00      7.8±0.09µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.02µs        ? ?/sec
take str null values 1024                                                 1.01      8.8±0.26µs        ? ?/sec    1.00      8.8±0.14µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.09µs        ? ?/sec    1.09      7.6±0.05µs        ? ?/sec
take stringview 1024                                                      1.10    894.5±9.39ns        ? ?/sec    1.00    810.3±1.60ns        ? ?/sec
take stringview 512                                                       1.16    591.9±6.99ns        ? ?/sec    1.00    510.2±2.35ns        ? ?/sec
take stringview null indices 1024                                         1.00  1454.3±33.02ns        ? ?/sec    1.00  1448.6±54.65ns        ? ?/sec
take stringview null indices 512                                          1.01    739.6±2.48ns        ? ?/sec    1.00    731.9±4.32ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.07µs        ? ?/sec    1.01      2.1±0.11µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.5±0.05µs        ? ?/sec    1.25      3.1±0.03µs        ? ?/sec

@lyang24 lyang24 force-pushed the take_byte_view_compact branch from c4d8432 to 21830ad Compare February 12, 2026 05:22
…ginal buffers alive when selectivity is high

Signed-off-by: lyang24 <lanqingy93@gmail.com>
@lyang24 lyang24 force-pushed the take_byte_view_compact branch from 21830ad to b8be571 Compare February 12, 2026 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants