7 changes: 6 additions & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ crate-type = ["cdylib", "rlib"]

[dependencies]
pyo3 = { version = "0.27.1", features = ["abi3-py311"] }
zarrs = { version = "0.23.0", features = ["async", "zlib", "pcodec", "bz2"] }
zarrs = { version = "0.23.6", features = ["async", "zlib", "pcodec", "bz2"] }
rayon_iter_concurrent_limit = "0.2.0"
rayon = "1.10.0"
# fix for https://stackoverflow.com/questions/76593417/package-openssl-was-not-found-in-the-pkg-config-search-path
@@ -29,3 +29,8 @@ zarrs_object_store = "0.5.0" # object_store 0.12

[profile.release]
lto = true

[patch.crates-io]
zarrs = { git = "https://github.com/zarrs/zarrs.git", branch = "feat/DecodeMode" }
zarrs_storage = { git = "https://github.com/zarrs/zarrs.git", branch = "feat/DecodeMode" }
zarrs_codec = { git = "https://github.com/zarrs/zarrs.git", branch = "feat/DecodeMode" }
5 changes: 5 additions & 0 deletions README.md
@@ -49,6 +49,10 @@ The `ZarrsCodecPipeline` specific options are:
  - Defaults to `True`. See [here](https://docs.rs/zarrs/latest/zarrs/config/struct.Config.html#validate-checksums) for more info.
- `codec_pipeline.direct_io`: enable `O_DIRECT` read/write, needs support from the operating system (currently only Linux) and file system.
  - Defaults to `False`.
- `codec_pipeline.decode_mode`: controls the decode path used when reading a chunk subset.
  - `"auto"` (default): use the full-chunk decode path when the requested subset covers the entire chunk, and the partial-decoder path otherwise.
  - `"partial"`: always use the partial-decoder path, even for whole-chunk reads. Useful for codecs (e.g. sharding) where partial decoding is more efficient even for full-chunk reads.
  - `"full"`: always decode the full chunk and extract the subset.
- `codec_pipeline.strict`: raise exceptions for unsupported operations instead of falling back to the default codec pipeline of `zarr-python`.
  - Defaults to `False`.
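As a minimal sketch of how the options above fit together (the dotted keys mirror the option names listed here; with `zarr` and this package installed, the dict would be passed to `zarr.config.set`):

```python
# Minimal sketch: force the partial-decoder path for all reads.
# The dotted keys follow the option names documented above.
config = {
    "codec_pipeline.path": "zarrs.ZarrsCodecPipeline",
    "codec_pipeline.decode_mode": "partial",  # "auto" (default), "partial", or "full"
}
# With zarr installed, apply it with: zarr.config.set(config)
```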

@@ -63,6 +67,7 @@ zarr.config.set({
        "chunk_concurrent_maximum": None,
        "chunk_concurrent_minimum": 4,
        "direct_io": False,
        "decode_mode": None,
        "strict": False
    }
})
1 change: 1 addition & 0 deletions examples/.gitignore
@@ -0,0 +1 @@
issue_152.zarr
54 changes: 54 additions & 0 deletions examples/issue_152.py
@@ -0,0 +1,54 @@
import platform
import subprocess
import time

import numpy as np
import zarr


def clear_cache():
    if platform.system() == "Darwin":
        # "&&" has no meaning inside an argv list; run through a shell instead
        subprocess.call("sync && sudo purge", shell=True)
    elif platform.system() == "Linux":
        subprocess.call(["sudo", "sh", "-c", "sync; echo 3 > /proc/sys/vm/drop_caches"])
    else:
        raise Exception("Unsupported platform")


zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# zarr.config.set({"codec_pipeline.decode_mode": "auto"})
# full read took: 3.3279900550842285
# partial shard read (4095) took: 1.211921215057373
# partial shard read (4096) took: 2.3402509689331055

zarr.config.set({"codec_pipeline.decode_mode": "partial"})
# full read took: 2.2892508506774902
# partial shard read (4095) took: 1.1934266090393066
# partial shard read (4096) took: 1.1788337230682373
Comment on lines +20 to +28
Member Author:
perf exactly matches my expectations with this mode @ilan-gold

Collaborator:
I think there is something strange going on here: these numbers are reversed for me, i.e., forcing partial makes (4096) slower (which makes sense, since it is a full outer-chunk read) but leaves the 4095 one unchanged. What is also constant on my own machine is that 4095 takes the same time in both cases, but I would expect this feature to help with that request specifically, i.e., somehow make it faster.

Collaborator:
As a follow-up, if I use auto here (i.e., from what I can tell, what is currently on main in zarrs-python) or go back to the original example in #152 with the main branch, but make the partial read 2048, it's still slow - it takes just as long to read 2048 as it does the full 4096 for me.

Apologies for the crosstalk, but I mentioned in my comment on the zarrs PR that I believe the hot path is in the partial decoder, and this behavior would confirm that.

Reading half the data takes just as long as reading the full data - even with sequential i/o, I can't believe those numbers.

Member Author:

I wonder if this is a consequence of using a compressed filesystem. Mine isn't compressed, and I assume yours on Mac is? Otherwise I can't really explain this without actually profiling on a machine like yours.

Collaborator (@ilan-gold, Mar 3, 2026):

You are definitely right. With this reproducer on linux, I see that now. But I think my bad reproducer masked my original problem when I moved from linux to mac to create the repro (since the numbers looked like the ones I was seeing on the linux machine).

The data is fundamentally not random, and contains a lot of zeros (so compression kicks in, presumably). Aghh, this is getting to be a mess.

I guess my question would be why this specific fix works, i.e., why is partial mode making the full-shard reads faster? I suspect this is somehow related to my problem.


z = zarr.create_array(
    "examples/issue_152.zarr",
    shape=(8192, 4, 128, 128),
    shards=(4096, 4, 128, 128),
    chunks=(1, 1, 128, 128),
    dtype=np.float64,
    overwrite=True,
)
z[...] = np.random.randn(8192, 4, 128, 128)
Collaborator (@ilan-gold, Mar 3, 2026):
Suggested change:
- z[...] = np.random.randn(8192, 4, 128, 128)
+ z[...] = np.ones((8192, 4, 128, 128))

Ok so if you do this (i.e., let compression do its thing on non-random data), the issue becomes apparent. I also changed the sub-shard read size to highlight this isn't some chunk-boundary condition of sorts. On my linux machine, with auto:

full read took:  0.2127237319946289
partial shard read (2048) took:  0.5163655281066895
partial shard read (4096) took:  0.09389019012451172

with partial:

full read took:  1.429715633392334
partial shard read (2048) took:  0.4784278869628906
partial shard read (4096) took:  0.8891105651855469

and full:

full read took:  0.2066342830657959
partial shard read (2048) took:  0.6641993522644043
partial shard read (4096) took:  0.0905311107635498

These numbers are much more like what I'm seeing. I believe (as I just learned from you) that this also explains the mac-linux performance difference.

Again, apologies, but this appears to be the core of the issue. This matches the behavior I see on my "real" data.

Collaborator:
Note this problem also appears to be independent of the chunking: chunks of (128, 4, 128, 128) also reproduce this behavior of full-shard reads being orders of magnitude faster, even though they are much more data (although small chunk sizes exacerbate the problem).


clear_cache()
t = time.time()
z[...]
print("full read took: ", time.time() - t)

clear_cache()
t = time.time()
z[:4095, ...]
print("partial shard read (4095) took: ", time.time() - t)


clear_cache()
t = time.time()
z[:4096, ...]
print("partial shard read (4096) took: ", time.time() - t)
1 change: 1 addition & 0 deletions python/zarrs/pipeline.py
@@ -62,6 +62,7 @@ def get_codec_pipeline_impl(
            ),
            num_threads=config.get("threading.max_workers", None),
            direct_io=config.get("codec_pipeline.direct_io", False),
            decode_mode=config.get("codec_pipeline.decode_mode", None),
        )
    except TypeError as e:
        if strict:
24 changes: 20 additions & 4 deletions src/lib.rs
@@ -21,7 +21,7 @@ use utils::is_whole_chunk;
use zarrs::array::{
    ArrayBytes, ArrayBytesDecodeIntoTarget, ArrayBytesFixedDisjointView, ArrayMetadata,
    ArrayPartialDecoderTraits, ArrayToBytesCodecTraits, CodecChain, CodecOptions, DataType,
    FillValue, StoragePartialDecoder, copy_fill_value_into, update_array_bytes,
    DecodeMode, FillValue, StoragePartialDecoder, copy_fill_value_into, update_array_bytes,
};
use zarrs::config::global_config;
use zarrs::convert::array_metadata_v2_to_v3;
@@ -218,6 +218,7 @@ impl CodecPipelineImpl {
        chunk_concurrent_maximum=None,
        num_threads=None,
        direct_io=false,
        decode_mode=None,
    ))]
    #[new]
    fn new(
@@ -228,6 +229,7 @@ impl CodecPipelineImpl {
        chunk_concurrent_maximum: Option<usize>,
        num_threads: Option<usize>,
        direct_io: bool,
        decode_mode: Option<&str>,
    ) -> PyResult<Self> {
        store_config.direct_io(direct_io);
        let metadata = serde_json::from_str(array_metadata).map_py_err::<PyTypeError>()?;
@@ -239,7 +241,19 @@ }
        };
        let codec_chain =
            Arc::new(CodecChain::from_metadata(&metadata_v3.codecs).map_py_err::<PyTypeError>()?);
        let codec_options = CodecOptions::default().with_validate_checksums(validate_checksums);
        let decode_mode = match decode_mode {
            None | Some("auto") => DecodeMode::Auto,
            Some("partial") => DecodeMode::Partial,
            Some("full") => DecodeMode::Full,
            Some(s) => {
                return Err(PyErr::new::<PyValueError, _>(format!(
                    "invalid decode_mode {s:?}, expected \"auto\", \"partial\", or \"full\""
                )));
            }
        };
        let codec_options = CodecOptions::default()
            .with_validate_checksums(validate_checksums)
            .with_decode_mode(decode_mode);

        let chunk_concurrent_minimum =
            chunk_concurrent_minimum.unwrap_or(global_config().chunk_concurrent_minimum());
@@ -300,7 +314,9 @@ impl CodecPipelineImpl {
        // Assemble partial decoders ahead of time and in parallel
        let partial_chunk_items = chunk_descriptions
            .iter()
            .filter(|item| !(is_whole_chunk(item)))
            .filter(|item| {
                !is_whole_chunk(item) || codec_options.decode_mode() == DecodeMode::Partial
            })
            .unique_by(|item| item.key.clone())
            .collect::<Vec<_>>();
        let mut partial_decoder_cache: HashMap<StoreKey, Arc<dyn ArrayPartialDecoderTraits>> =
@@ -350,7 +366,7 @@ impl CodecPipelineImpl {
            };
            let target = ArrayBytesDecodeIntoTarget::Fixed(&mut output_view);
            // See zarrs::array::Array::retrieve_chunk_subset_into
            if is_whole_chunk(&item) {
            if is_whole_chunk(&item) && codec_options.decode_mode() != DecodeMode::Partial {
                // See zarrs::array::Array::retrieve_chunk_into
                if let Some(chunk_encoded) =
                    self.store.get(&item.key).map_py_err::<PyRuntimeError>()?
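For reference, the string-to-`DecodeMode` mapping added in `CodecPipelineImpl::new` can be mirrored in Python to document the accepted values (a hypothetical helper for illustration only; the real validation happens in the Rust match above):

```python
def parse_decode_mode(mode):
    """Mirror the Rust match on `decode_mode` in CodecPipelineImpl::new.

    None and "auto" select the automatic path; "partial" and "full"
    force the corresponding decode path; any other string is rejected.
    """
    if mode is None or mode == "auto":
        return "auto"
    if mode in ("partial", "full"):
        return mode
    raise ValueError(
        f'invalid decode_mode {mode!r}, expected "auto", "partial", or "full"'
    )
```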