sketch out improved performance by refactoring codec pipeline logic by d-v-b · Pull Request #3719 · zarr-developers/zarr-python

d-v-b · 2026-02-25T22:15:39Z

This builds on top of #3715 and achieves even more perf improvements by refactoring the basic logic of codec encoding / decoding. The design document behind these changes is here.

I do not think this is merge-worthy, as it's far too big. But i'm going to post the performance gains, and start figuring out how to break this into pieces.

A big feature this adds is the ability to write individual subchunks to uncompressed shards on storage backends that support range writes (local and memory).

Benchmark comparison: perf/smarter-codecs vs main

Test Name	perf/smarter-codecs (ms)	main (ms)	Speedup
test_write_array[memory-chunks=100,shards=1M-None]	46.85	1018.60	21.74×
test_write_array[local-chunks=100,shards=1M-None]	68.45	1006.27	14.70×
test_sharded_morton_indexing[(32,32,32)]	22.45	247.49	11.02×
test_slice_indexing[None-(0,0,0)]	0.03	0.28	10.05×
test_sharded_morton_indexing_large[(33,33,33)]	254.85	2521.32	9.89×
test_slice_indexing[(50,50,50)-full_slice]	9.12	89.80	9.85×
test_sharded_morton_indexing_large[(32,32,32)]	226.80	2153.47	9.50×
test_sharded_morton_indexing_large[(30,30,30)]	181.86	1725.29	9.49×
test_slice_indexing[(50,50,50)-(0,0,0)]	0.06	0.60	9.45×
test_write_array[memory-chunks=100,shards=1M-gzip]	211.85	1978.29	9.34×
test_slice_indexing[None-(slice(None,10,None))*3]	0.03	0.28	9.28×
test_write_array[local-chunks=100,shards=1M-gzip]	217.00	1965.37	9.06×
test_slice_indexing[(50,50,50)-strided_4]	8.96	80.44	8.98×
test_slice_indexing[(50,50,50)-strided_4_offset]	4.83	43.21	8.95×
test_sharded_morton_indexing[(16,16,16)]	2.78	23.26	8.37×
test_slice_indexing[(50,50,50)-(slice(None,10,None))*3]	0.07	0.60	8.36×
test_slice_indexing[None-full_slice]	11.07	85.12	7.69×
test_read_array[memory-chunks=100,shards=1M-gzip]	181.44	1372.13	7.56×
test_read_array[memory-chunks=100,shards=1M-None]	80.73	609.53	7.55×
test_read_array[memory-chunks=1K,no_shards-None]	7.21	53.52	7.43×
test_read_array[local-chunks=100,shards=1M-None]	83.79	612.18	7.31×
test_slice_indexing[None-strided_4]	12.19	84.98	6.97×
test_read_array[memory-chunks=1K,no_shards-gzip]	16.79	115.69	6.89×
test_write_array[memory-chunks=1K,no_shards-gzip]	32.54	219.93	6.76×
test_read_array[local-chunks=100,shards=1M-gzip]	190.20	1277.12	6.71×
test_read_array[local-chunks=1K,no_shards-None]	24.23	142.33	5.87×
test_slice_indexing[None-mixed_slice]	0.12	0.71	5.84×
test_read_array[local-chunks=1K,no_shards-gzip]	37.83	216.45	5.72×
test_slice_indexing[None-strided_4_offset]	5.77	32.55	5.64×
test_write_array[memory-chunks=1K,no_shards-None]	19.16	106.12	5.54×
test_slice_indexing[(50,50,50)-mixed_slice]	0.23	1.21	5.29×
test_slice_indexing[(50,50,50)-strided_4-get_latency]	15.82	82.14	5.19×
test_write_array[memory-chunks=1K,shards=1K-gzip]	100.60	451.93	4.49×
test_read_array[local-chunks=1K,shards=1K-None]	66.94	298.83	4.46×
test_write_array[memory-chunks=1K,shards=1K-None]	71.06	315.12	4.43×
test_read_array[memory-chunks=1K,shards=1K-None]	43.90	192.21	4.38×
test_sharded_morton_single_chunk[(32,32,32)]	0.18	0.74	4.12×
test_read_array[memory-chunks=1K,shards=1K-gzip]	68.96	283.15	4.11×
test_read_array[local-chunks=1K,shards=1K-gzip]	90.06	368.79	4.09×
test_sharded_morton_single_chunk[(33,33,33)]	0.19	0.76	4.03×
test_slice_indexing[(50,50,50)-full_slice-get_latency]	20.53	79.62	3.88×
test_sharded_morton_single_chunk[(30,30,30)]	0.20	0.72	3.69×
test_slice_indexing[(50,50,50)-strided_4_offset-get_latency]	17.73	50.27	2.84×
test_slice_indexing[None-strided_4_offset-get_latency]	19.29	48.98	2.54×
test_slice_indexing[None-strided_4-get_latency]	43.55	102.36	2.35×
test_slice_indexing[None-full_slice-get_latency]	46.77	101.82	2.18×
test_write_array[local-chunks=1K,shards=1K-gzip]	367.61	733.76	2.00×
test_slice_indexing[(50,50,50)-(0,0,0)-get_latency]	0.49	0.87	1.79×
test_slice_indexing[(50,50,50)-(slice(None,10,None))*3-get_latency]	0.50	0.88	1.77×
test_slice_indexing[None-mixed_slice-get_latency]	0.58	1.00	1.74×
test_slice_indexing[(50,50,50)-mixed_slice-get_latency]	1.12	1.92	1.72×
test_write_array[local-chunks=1K,shards=1K-None]	390.76	619.05	1.58×
test_write_array[local-chunks=1K,no_shards-None]	225.86	340.42	1.51×
test_write_array[local-chunks=1K,no_shards-gzip]	284.20	400.62	1.41×
test_slice_indexing[None-(slice(None,10,None))*3-get_latency]	0.31	0.43	1.39×
test_slice_indexing[None-(0,0,0)-get_latency]	0.33	0.43	1.28×
test_morton_order_iter[(30,30,30)]	104.17	123.84	1.19×
test_morton_order_iter[(16,16,16)]	2.56	3.01	1.18×
test_morton_order_iter[(10,10,10)]	7.44	8.61	1.16×
test_morton_order_iter[(8,8,8)]	0.31	0.36	1.15×
test_morton_order_iter[(33,33,33)]	644.86	740.97	1.15×
test_morton_order_iter[(20,20,20)]	78.57	88.98	1.13×
test_sharded_morton_write_single_chunk[(33,33,33)]	668.32	754.39	1.13×
test_morton_order_iter[(32,32,32)]	22.80	24.85	1.09×
test_sharded_morton_write_single_chunk[(30,30,30)]	130.08	129.83	1.00×
test_sharded_morton_write_single_chunk[(32,32,32)]	48.33	32.43	0.67×

…thon into perf/faster-codecs

…ospection more efficient

…into perf/faster-codecs

codspeed-hq · 2026-03-10T18:46:47Z

Merging this PR will improve performance by ×260

⚡ 59 improved benchmarks
✅ 7 untouched benchmarks
⏩ 6 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-gzip]`	2,196.4 ms	704.2 ms	×3.1
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-None]`	556.8 ms	193.8 ms	×2.9
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory]`	1,742.5 µs	549.1 µs	×3.2
⚡	WallTime	`test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory]`	3.8 ms	1.1 ms	×3.5
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=None)-gzip]`	1,031.3 ms	283.1 ms	×3.6
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory_get_latency]`	436.2 ms	117.8 ms	×3.7
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, None, None), slice(None, None, None), slice(None, None, None))-memory]`	419.6 ms	83.5 ms	×5
⚡	WallTime	`test_sharded_morton_indexing_large[(33, 33, 33)-memory]`	10.3 s	1.8 s	×5.8
⚡	WallTime	`test_slice_indexing[None-(slice(None, None, None), slice(0, 3, 2), slice(0, 10, None))-memory_get_latency]`	4.2 ms	2.2 ms	+84.66%
⚡	WallTime	`test_sharded_morton_single_chunk[(32, 32, 32)-memory]`	2,018.4 µs	681 µs	×3
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(0, 0, 0)-memory_get_latency]`	4.2 ms	3.2 ms	+28.54%
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(1000,), shards=(1000,))-None]`	1,622.2 ms	590.1 ms	×2.7
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-None]`	5,367.9 ms	252.9 ms	×21
⚡	WallTime	`test_sharded_morton_single_chunk[(33, 33, 33)-memory]`	1,978.3 µs	703.4 µs	×2.8
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(None, 10, None), slice(None, 10, None), slice(None, 10, None))-memory_get_latency]`	4.2 ms	3.3 ms	+27.81%
⚡	WallTime	`test_write_array[memory-Layout(shape=(1000000,), chunks=(100,), shards=(1000000,))-gzip]`	9.5 s	1.2 s	×7.7
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory]`	414.6 ms	78.9 ms	×5.3
⚡	WallTime	`test_sharded_morton_single_chunk[(30, 30, 30)-memory]`	1,958 µs	659.6 µs	×3
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(0, None, 4), slice(0, None, 4), slice(0, None, 4))-memory_get_latency]`	431.3 ms	113.2 ms	×3.8
⚡	WallTime	`test_slice_indexing[(50, 50, 50)-(slice(10, -10, 4), slice(10, -10, 4), slice(10, -10, 4))-memory_get_latency]`	243.5 ms	70.6 ms	×3.4
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

_{Comparing d-v-b:perf/smarter-codecs (d5c712c) with main (a02d996)}

6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

d-v-b · 2026-03-10T20:18:40Z

@zarr-developers/python-core-devs please look at these numbers. Getting this PR into main is probably not practical, but I am confident that breaking it into pieces will work.

normanrz · 2026-03-10T20:25:31Z

The numbers are certainly impressive! Great work.
Breaking it up makes a lot of sense. The changes in tier 1 and probably also tier 2 are not really controversial. For tier 3, I think we'll need some discussions and probably extended benchmarking.

d-v-b · 2026-06-18T20:29:54Z

closing this as superseded by #3885

d-v-b added 30 commits February 18, 2026 21:48

sketch out sync codecs + threadpool

f427898

Merge branch 'main' into perf/faster-codecs

dbdc3d4

fix perf regressions

65d1230

Merge branch 'perf/faster-codecs' of https://github.com/d-v-b/zarr-py…

e24fe7e

…thon into perf/faster-codecs

add partial encode / decode

f979eaa

add sync hotpath

a934899

add comments and documentation

b53ac3e

refactor sharding to allow sync

73ac845

fix array spec propagation

aeecda8

fix countingdict tests

69172fb

update design doc

28d0def

dynamic pool allocation

f8e39e6

default to 1 itemsize for data types that don't declare it

b388911

Merge branch 'main' into perf/faster-codecs

7e29ef3

Merge branch 'main' into perf/faster-codecs

00dde0b

remove extra codec pipeline

9d77ca5

remove garbage

88a4875

lint

284e5e2

use protocols for new sync behavior

b1b876a

remove batch size parameter; add changelog entry

6996284

prune dead code, make protocols useful

204dda1

restore batch size but it's only there for warnings

e9db616

fix type hints, prevent thread pool leakage, make codec pipeline intr…

01e1f73

…ospection more efficient

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

fbde3af

…into perf/faster-codecs

restore old comments / docstrings

11534b0

simplify threadpool management

b40d53a

use isinstance instead of explicit list of codec names

83c1dc1

consolidate thread pool configuration

e8a0cc6

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

9a1d5eb

…into perf/faster-codecs

Merge remote-tracking branch 'origin/main' into perf/smarter-codecs

9071954

d-v-b added 2 commits February 25, 2026 16:58

execute performance improvement plan

3297e0d

Merge branch 'main' into perf/smarter-codecs

0766289

d-v-b mentioned this pull request Feb 25, 2026

chunk encoding performance improvements #3720

Open

8 tasks

v2codec supports sync

aa38698

TomNicholas mentioned this pull request Mar 10, 2026

Zarr3 vs Zarr2 benchmarking ecmwf/anemoi-datasets#486

Open

d-v-b added the benchmark Code will be benchmarked in a CI job. label Mar 10, 2026

Merge remote-tracking branch 'origin/main' into perf/smarter-codecs

ea8eaf1

d-v-b added 2 commits March 10, 2026 20:01

be a little smarter about morton coding

abda3b9

set_range is a protocol

b290881

d-v-b mentioned this pull request Mar 10, 2026

define store behavior via protocols #3758

Open

d-v-b added 2 commits March 11, 2026 16:09

Merge branch 'main' into perf/smarter-codecs

116e417

Merge branch 'main' into perf/smarter-codecs

d5c712c

d-v-b mentioned this pull request Mar 26, 2026

perf/chunktransform #3722

Merged

d-v-b mentioned this pull request Apr 8, 2026

perf: use sync methods for chunk encoding / decoding #3885

Merged

d-v-b closed this Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

sketch out improved performance by refactoring codec pipeline logic#3719

sketch out improved performance by refactoring codec pipeline logic#3719
d-v-b wants to merge 38 commits into
zarr-developers:mainfrom
d-v-b:perf/smarter-codecs

d-v-b commented Feb 25, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

d-v-b commented Mar 10, 2026

Uh oh!

normanrz commented Mar 10, 2026

Uh oh!

d-v-b commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

d-v-b commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark comparison: perf/smarter-codecs vs main

Uh oh!

codspeed-hq Bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by ×260

Performance Changes

Footnotes

Uh oh!

d-v-b commented Mar 10, 2026

Uh oh!

normanrz commented Mar 10, 2026

Uh oh!

d-v-b commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

d-v-b commented Feb 25, 2026 •

edited

Loading

codspeed-hq Bot commented Mar 10, 2026 •

edited

Loading