
sketch out improved performance by refactoring codec pipeline logic #3719

Draft
d-v-b wants to merge 37 commits into zarr-developers:main from d-v-b:perf/smarter-codecs

Conversation

@d-v-b (Contributor) commented Feb 25, 2026

This builds on top of #3715 and achieves even more perf improvements by refactoring the basic logic of codec encoding / decoding. The design document behind these changes is here.
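For orientation, a minimal sketch of what a codec pipeline does conceptually: an array-to-bytes step followed by an optional bytes-to-bytes compression step, applied in reverse on decode. This is an illustration of the concept only — the `Pipeline` class below is hypothetical, and the actual zarr-python codec classes refactored in this PR are considerably more involved.

```python
# Conceptual sketch of a codec pipeline (illustrative, not zarr-python API):
# encode = array -> bytes -> (optionally) compressed bytes; decode reverses it.
import gzip
import numpy as np

class Pipeline:
    def __init__(self, dtype, compress=True):
        self.dtype = np.dtype(dtype)
        self.compress = compress

    def encode(self, arr: np.ndarray) -> bytes:
        raw = arr.astype(self.dtype).tobytes()        # array-to-bytes codec
        return gzip.compress(raw) if self.compress else raw

    def decode(self, data: bytes, shape) -> np.ndarray:
        raw = gzip.decompress(data) if self.compress else data
        return np.frombuffer(raw, dtype=self.dtype).reshape(shape)

pipe = Pipeline("<f8")
chunk = np.arange(6, dtype="<f8").reshape(2, 3)
roundtrip = pipe.decode(pipe.encode(chunk), chunk.shape)
```

The point of the refactor is not this round-trip itself but reorganizing when and how these steps run across many chunks at once.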

I do not think this is merge-worthy as-is, since it's far too big. But I'm going to post the performance gains and start figuring out how to break this into pieces.

A big feature this adds is the ability to write individual subchunks to uncompressed shards on storage backends that support range writes (local and memory).
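Why range writes help: when a shard's inner codecs don't compress, every subchunk has a fixed byte size, so its byte offset inside the shard is computable and a single subchunk can be overwritten without reading or rewriting the rest of the shard. A minimal sketch of that offset arithmetic — `shard_byte_offset` and `RangeWriter` are illustrative names, not zarr-python API, and a real store would also need to maintain the shard index:

```python
# Hypothetical sketch: write one subchunk into an uncompressed shard via a
# byte-range write. Assumes chunks are laid out contiguously in C (row-major)
# order of the chunk grid with a fixed byte size per chunk, which only holds
# when the inner codecs do not compress.
import numpy as np

def shard_byte_offset(chunk_coord, chunks_per_shard, chunk_nbytes):
    """Byte offset of a subchunk inside an uncompressed shard."""
    linear = int(np.ravel_multi_index(chunk_coord, chunks_per_shard))
    return linear * chunk_nbytes

class RangeWriter:
    """Minimal in-memory stand-in for a store that supports range writes."""
    def __init__(self, size):
        self.buf = bytearray(size)
    def set_partial(self, offset, data):
        self.buf[offset:offset + len(data)] = data

# Shard of 2x2 chunks, each chunk holding 4 float64 values (32 bytes).
chunks_per_shard = (2, 2)
chunk_nbytes = 4 * 8
store = RangeWriter(4 * chunk_nbytes)

chunk = np.arange(4, dtype="<f8")
off = shard_byte_offset((1, 0), chunks_per_shard, chunk_nbytes)
store.set_partial(off, chunk.tobytes())  # touches only 32 of the 128 bytes
```

This is why the uncompressed `test_sharded_morton_write_single_chunk` cases below improve so dramatically: writing one chunk no longer requires round-tripping the entire shard.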

Benchmark comparison: perf/smarter-codecs vs main

| Test Name | perf/smarter-codecs (ms) | main (ms) | Speedup |
|---|---:|---:|---:|
| test_write_array[memory-chunks=100,shards=1M-None] | 46.85 | 1018.60 | 21.74× |
| test_write_array[local-chunks=100,shards=1M-None] | 68.45 | 1006.27 | 14.70× |
| test_sharded_morton_indexing[(32,32,32)] | 22.45 | 247.49 | 11.02× |
| test_slice_indexing[None-(0,0,0)] | 0.03 | 0.28 | 10.05× |
| test_sharded_morton_indexing_large[(33,33,33)] | 254.85 | 2521.32 | 9.89× |
| test_slice_indexing[(50,50,50)-full_slice] | 9.12 | 89.80 | 9.85× |
| test_sharded_morton_indexing_large[(32,32,32)] | 226.80 | 2153.47 | 9.50× |
| test_sharded_morton_indexing_large[(30,30,30)] | 181.86 | 1725.29 | 9.49× |
| test_slice_indexing[(50,50,50)-(0,0,0)] | 0.06 | 0.60 | 9.45× |
| test_write_array[memory-chunks=100,shards=1M-gzip] | 211.85 | 1978.29 | 9.34× |
| test_slice_indexing[None-(slice(None,10,None))*3] | 0.03 | 0.28 | 9.28× |
| test_write_array[local-chunks=100,shards=1M-gzip] | 217.00 | 1965.37 | 9.06× |
| test_slice_indexing[(50,50,50)-strided_4] | 8.96 | 80.44 | 8.98× |
| test_slice_indexing[(50,50,50)-strided_4_offset] | 4.83 | 43.21 | 8.95× |
| test_sharded_morton_indexing[(16,16,16)] | 2.78 | 23.26 | 8.37× |
| test_slice_indexing[(50,50,50)-(slice(None,10,None))*3] | 0.07 | 0.60 | 8.36× |
| test_slice_indexing[None-full_slice] | 11.07 | 85.12 | 7.69× |
| test_read_array[memory-chunks=100,shards=1M-gzip] | 181.44 | 1372.13 | 7.56× |
| test_read_array[memory-chunks=100,shards=1M-None] | 80.73 | 609.53 | 7.55× |
| test_read_array[memory-chunks=1K,no_shards-None] | 7.21 | 53.52 | 7.43× |
| test_read_array[local-chunks=100,shards=1M-None] | 83.79 | 612.18 | 7.31× |
| test_slice_indexing[None-strided_4] | 12.19 | 84.98 | 6.97× |
| test_read_array[memory-chunks=1K,no_shards-gzip] | 16.79 | 115.69 | 6.89× |
| test_write_array[memory-chunks=1K,no_shards-gzip] | 32.54 | 219.93 | 6.76× |
| test_read_array[local-chunks=100,shards=1M-gzip] | 190.20 | 1277.12 | 6.71× |
| test_read_array[local-chunks=1K,no_shards-None] | 24.23 | 142.33 | 5.87× |
| test_slice_indexing[None-mixed_slice] | 0.12 | 0.71 | 5.84× |
| test_read_array[local-chunks=1K,no_shards-gzip] | 37.83 | 216.45 | 5.72× |
| test_slice_indexing[None-strided_4_offset] | 5.77 | 32.55 | 5.64× |
| test_write_array[memory-chunks=1K,no_shards-None] | 19.16 | 106.12 | 5.54× |
| test_slice_indexing[(50,50,50)-mixed_slice] | 0.23 | 1.21 | 5.29× |
| test_slice_indexing[(50,50,50)-strided_4-get_latency] | 15.82 | 82.14 | 5.19× |
| test_write_array[memory-chunks=1K,shards=1K-gzip] | 100.60 | 451.93 | 4.49× |
| test_read_array[local-chunks=1K,shards=1K-None] | 66.94 | 298.83 | 4.46× |
| test_write_array[memory-chunks=1K,shards=1K-None] | 71.06 | 315.12 | 4.43× |
| test_read_array[memory-chunks=1K,shards=1K-None] | 43.90 | 192.21 | 4.38× |
| test_sharded_morton_single_chunk[(32,32,32)] | 0.18 | 0.74 | 4.12× |
| test_read_array[memory-chunks=1K,shards=1K-gzip] | 68.96 | 283.15 | 4.11× |
| test_read_array[local-chunks=1K,shards=1K-gzip] | 90.06 | 368.79 | 4.09× |
| test_sharded_morton_single_chunk[(33,33,33)] | 0.19 | 0.76 | 4.03× |
| test_slice_indexing[(50,50,50)-full_slice-get_latency] | 20.53 | 79.62 | 3.88× |
| test_sharded_morton_single_chunk[(30,30,30)] | 0.20 | 0.72 | 3.69× |
| test_slice_indexing[(50,50,50)-strided_4_offset-get_latency] | 17.73 | 50.27 | 2.84× |
| test_slice_indexing[None-strided_4_offset-get_latency] | 19.29 | 48.98 | 2.54× |
| test_slice_indexing[None-strided_4-get_latency] | 43.55 | 102.36 | 2.35× |
| test_slice_indexing[None-full_slice-get_latency] | 46.77 | 101.82 | 2.18× |
| test_write_array[local-chunks=1K,shards=1K-gzip] | 367.61 | 733.76 | 2.00× |
| test_slice_indexing[(50,50,50)-(0,0,0)-get_latency] | 0.49 | 0.87 | 1.79× |
| test_slice_indexing[(50,50,50)-(slice(None,10,None))*3-get_latency] | 0.50 | 0.88 | 1.77× |
| test_slice_indexing[None-mixed_slice-get_latency] | 0.58 | 1.00 | 1.74× |
| test_slice_indexing[(50,50,50)-mixed_slice-get_latency] | 1.12 | 1.92 | 1.72× |
| test_write_array[local-chunks=1K,shards=1K-None] | 390.76 | 619.05 | 1.58× |
| test_write_array[local-chunks=1K,no_shards-None] | 225.86 | 340.42 | 1.51× |
| test_write_array[local-chunks=1K,no_shards-gzip] | 284.20 | 400.62 | 1.41× |
| test_slice_indexing[None-(slice(None,10,None))*3-get_latency] | 0.31 | 0.43 | 1.39× |
| test_slice_indexing[None-(0,0,0)-get_latency] | 0.33 | 0.43 | 1.28× |
| test_morton_order_iter[(30,30,30)] | 104.17 | 123.84 | 1.19× |
| test_morton_order_iter[(16,16,16)] | 2.56 | 3.01 | 1.18× |
| test_morton_order_iter[(10,10,10)] | 7.44 | 8.61 | 1.16× |
| test_morton_order_iter[(8,8,8)] | 0.31 | 0.36 | 1.15× |
| test_morton_order_iter[(33,33,33)] | 644.86 | 740.97 | 1.15× |
| test_morton_order_iter[(20,20,20)] | 78.57 | 88.98 | 1.13× |
| test_sharded_morton_write_single_chunk[(33,33,33)] | 668.32 | 754.39 | 1.13× |
| test_morton_order_iter[(32,32,32)] | 22.80 | 24.85 | 1.09× |
| test_sharded_morton_write_single_chunk[(30,30,30)] | 130.08 | 129.83 | 1.00× |
| test_sharded_morton_write_single_chunk[(32,32,32)] | 48.33 | 32.43 | 0.67× |
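Several of the benchmarks above (`test_morton_order_iter`, `test_sharded_morton_indexing`) iterate shard chunks in Morton (Z-) order, the traversal order the sharding codec's chunk index uses. For readers unfamiliar with it, a generic bit-interleaving sketch — this illustrates the ordering only and is not necessarily how zarr-python computes it:

```python
# Morton (Z-order) encoding by bit interleaving: the code of an n-d
# coordinate interleaves one bit from each dimension per level, so sorting
# coordinates by their Morton code yields the Z-order traversal.
def morton_encode(coord, nbits=16):
    """Interleave the bits of an n-dimensional integer coordinate."""
    ndim = len(coord)
    code = 0
    for bit in range(nbits):
        for dim, c in enumerate(coord):
            code |= ((c >> bit) & 1) << (bit * ndim + dim)
    return code

# Traverse a 4x4 chunk grid in Morton order.
order = sorted(
    ((x, y) for x in range(4) for y in range(4)),
    key=morton_encode,
)
```

Generating this ordering for every shard access is itself measurable work, which is why `test_morton_order_iter` appears in the table at all.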

d-v-b added 30 commits February 18, 2026 21:48
@d-v-b d-v-b added the benchmark Code will be benchmarked in a CI job. label Mar 10, 2026
@codspeed-hq (bot) commented Mar 10, 2026

Merging this PR will improve performance by ×270

⚡ 59 improved benchmarks
✅ 7 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---:|---:|---:|
| WallTime | test_sharded_morton_single_chunk[(32, 32, 32)-memory] | 1,875.7 µs | 687.2 µs | ×2.7 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip] | 1,014.9 ms | 285.3 ms | ×3.6 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 555.8 ms | 193.7 ms | ×2.9 |
| WallTime | test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory] | 1,758 µs | 537.8 µs | ×3.3 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None] | 1,616.3 ms | 596.8 ms | ×2.7 |
| WallTime | test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory_get_latency] | 4 ms | 3.2 ms | +26.19% |
| WallTime | test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory_get_latency] | 4.1 ms | 2.2 ms | +82.89% |
| WallTime | test_sharded_morton_single_chunk[(30, 30, 30)-memory] | 1,965 µs | 651.9 µs | ×3 |
| WallTime | test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory] | 3.7 ms | 1.1 ms | ×3.4 |
| WallTime | test_sharded_morton_indexing_large[(33, 33, 33)-memory] | 10.2 s | 1.8 s | ×5.8 |
| WallTime | test_sharded_morton_indexing_large[(30, 30, 30)-memory] | 7.7 s | 1.3 s | ×5.8 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip] | 9.5 s | 1.2 s | ×7.7 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory] | 418.6 ms | 83.5 ms | ×5 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory] | 413.4 ms | 78.8 ms | ×5.2 |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None] | 5,355.3 ms | 253.9 ms | ×21 |
| WallTime | test_sharded_morton_single_chunk[(33, 33, 33)-memory] | 1,959.6 µs | 703.4 µs | ×2.8 |
| WallTime | test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency] | 435.3 ms | 117.6 ms | ×3.7 |
| WallTime | test_sharded_morton_write_single_chunk[(30, 30, 30)-memory] | 147,069.6 µs | 727.5 µs | ×200 |
| WallTime | test_write_array[local-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None] | 1,212.5 ms | 626.5 ms | +93.53% |
| WallTime | test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip] | 2,117.8 ms | 712.4 ms | ×3 |
| … | … | … | … | … |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing d-v-b:perf/smarter-codecs (116e417) with main (fa61ed8)


Footnotes

¹ 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in the CodSpeed app to remove them from the performance reports.

@d-v-b (Contributor, Author) commented Mar 10, 2026

@zarr-developers/python-core-devs please look at these numbers. Getting this PR into main is probably not practical, but I am confident that breaking it into pieces will work.

@normanrz (Member) commented:

The numbers are certainly impressive! Great work.
Breaking it up makes a lot of sense. The changes in tier 1, and probably also tier 2, are not really controversial. For tier 3, I think we'll need some discussion and probably extended benchmarking.
