perf(parquet): Improve cache locality in BYTE_STREAM_SPLIT decoding #10007
pchintar wants to merge 1 commit into
run benchmark parquet_round_trip
env:
  BENCH_FILTER: byte_stream_split

run benchmark arrow_reader
env:
  BENCH_FILTER: byte_stream_split
🤖 Arrow criterion benchmark running (GKE). Comparing parquet-byte-stream-split-cache-locality (3aac60b) to 2f923f7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
Which issue does this PR close?
Rationale for this change
BYTE_STREAM_SPLIT decoding currently reconstructs values using a nested scalar loop with strided reads across byte streams in:
Current logic:
This results in poor cache locality and inefficient memory access patterns during reconstruction of the original value layout.
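To make the access pattern concrete, here is a minimal sketch of a value-by-value reconstruction loop. It is an illustration under assumptions, not the code from the repository: the function name `join_streams_scalar` is hypothetical, and the layout assumed is the one the BYTE_STREAM_SPLIT encoding describes, with stream `j` holding byte `j` of every value and the streams stored back to back.

```rust
/// Hypothetical scalar reference (not the actual arrow-rs function).
/// `src` holds `TYPE_SIZE` byte streams back to back; stream `j` occupies
/// `src[j * stride..(j + 1) * stride]` and contains byte `j` of every value.
fn join_streams_scalar<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert_eq!(src.len() % TYPE_SIZE, 0);
    let stride = src.len() / TYPE_SIZE; // number of values in the buffer

    for i in 0..stride {
        for j in 0..TYPE_SIZE {
            // Every inner iteration reads from a different stream, so
            // consecutive reads are `stride` bytes apart in `src`.
            dst[i * TYPE_SIZE + j] = src[i + j * stride];
        }
    }
}
```

For two 4-byte values, the input `[1, 5, 2, 6, 3, 7, 4, 8]` decodes to `[1, 2, 3, 4, 5, 6, 7, 8]`. With a large buffer, each value touches `TYPE_SIZE` cache lines that are far apart, which is the locality problem described above.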
What changes are included in this PR?
This PR changes the reconstruction logic in join_streams_const and join_streams_variable to process values in contiguous blocks instead of value-by-value scalar iteration.
The updated implementation reads contiguous regions from each byte stream before writing reconstructed values back into the destination buffer.
Example:
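The following is a rough sketch of what block-wise reconstruction can look like, not the implementation in this PR. The function name `join_streams_blocked` and the block size `BLOCK` are hypothetical choices made for illustration; the stream layout is assumed to be the standard BYTE_STREAM_SPLIT one (stream `j` holds byte `j` of every value).

```rust
/// Hypothetical number of values handled per block; not taken from the PR.
const BLOCK: usize = 64;

/// Hypothetical blocked variant: for each block of values, read one
/// contiguous run from each byte stream and scatter it into a small
/// destination window.
fn join_streams_blocked<const TYPE_SIZE: usize>(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    assert_eq!(src.len() % TYPE_SIZE, 0);
    let stride = src.len() / TYPE_SIZE; // number of values in the buffer

    let mut start = 0;
    while start < stride {
        // The last block may be shorter than BLOCK.
        let len = BLOCK.min(stride - start);
        // Destination window for this block: `len` values of TYPE_SIZE bytes.
        let out = &mut dst[start * TYPE_SIZE..(start + len) * TYPE_SIZE];

        for j in 0..TYPE_SIZE {
            // Contiguous run of `len` bytes from stream `j`.
            let stream = &src[j * stride + start..j * stride + start + len];
            for (i, &byte) in stream.iter().enumerate() {
                out[i * TYPE_SIZE + j] = byte;
            }
        }
        start += len;
    }
}
```

In this sketch the reads from each stream are sequential within a block, and the writes stay inside a window of `BLOCK * TYPE_SIZE` bytes, so both sides of the copy remain cache-resident while a block is processed. A runtime-width variant would take the type size as an ordinary argument instead of a const generic.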
Are these changes tested?
Existing parquet tests pass:
Benchmarks from parquet/benches/encoding.rs show considerable improvements for BYTE_STREAM_SPLIT decoding:
Are there any user-facing changes?
No.