Skip to content

REF: Reuse async loading logic for DataTree and open_groups (Fixes #11131)#11149

Open
AAlexxis222 wants to merge 51 commits intopydata:mainfrom
AAlexxis222:refactor-async-zarr-loading
Open

REF: Reuse async loading logic for DataTree and open_groups (Fixes #11131)#11149
AAlexxis222 wants to merge 51 commits intopydata:mainfrom
AAlexxis222:refactor-async-zarr-loading

Conversation

@AAlexxis222
Copy link

Fixes #11131.
Previously, _open_datatree_from_stores_async (robust) and open_groups_as_dict_async (naive) contained duplicated logic with inconsistent behavior. The former handled semaphores and async indexing correctly, while the latter did not.

Changes

  • Extracted the robust logic (Semaphores, TaskGroup, open_dataset_async, _maybe_create_default_indexes_async) into a shared private method: _open_groups_from_stores_async.
  • Updated open_groups_as_dict_async to open stores asynchronously and then delegate to the shared core.
  • Updated open_datatree to delegate to the shared core (via zarr_sync), removing the need for _open_datatree_from_stores_async.

Implementation Details
I chose to extract a shared _open_groups_from_stores_async instead of making one public function call the other directly. This allows open_datatree (which holds pre-opened sync stores) and open_groups_as_dict_async (which opens stores internally) to share the exact same execution engine.

Verification

  • Ran tests locally: pytest xarray/tests/test_backends.py xarray/tests/test_datatree.py -k "zarr or async".
  • Result: 767 passed (identical to baseline).

Changes:
- Refactor open_datatree() to use zarr_sync() with async implementation
  for concurrent dataset and index creation across groups
- Add _open_datatree_from_stores_async() helper that opens datasets and
  creates indexes concurrently using asyncio.gather with a semaphore
  to limit concurrency (avoids deadlocks with stores like Icechunk)
- Add open_datatree_async() method for explicit async API
- Remove duplicate _maybe_create_default_indexes_async from zarr.py,
  now imports from api.py (single source of truth)

This significantly improves performance when opening DataTrees from
high-latency storage backends (e.g., ~2 seconds vs sequential loading).
Remove the asyncio.Semaphore that was limiting concurrency to 10
concurrent operations. Investigation showed:

- Zarr already has built-in concurrency control (async.concurrency=10)
- The semaphore only applied to asyncio.to_thread() calls, not zarr I/O
- Removing it improves performance by ~30-40% (~2s -> ~1.2-1.4s)

The semaphore was defensive code for a problem that doesn't exist -
zarr and icechunk handle their own concurrency limits internally.
The async implementation uses zarr.core.sync which only exists in
zarr v3. Add a conditional check using _zarr_v3() to:
- Use async path with zarr_sync() for zarr v3 (concurrent loading)
- Fall back to sequential loading for zarr v2

This fixes CI failures on min-versions environment which uses zarr v2.
The previous commit (0ee2a73) removed the semaphore thinking zarr handles
its own concurrency, but icechunk can deadlock when too many asyncio.to_thread()
calls try to access it simultaneously. This was discovered when testing with
larger stores (23+ groups) where all threads would start but never complete.

The semaphore limits concurrent to_thread calls to 10, which prevents the
deadlock while still providing significant performance benefits over sequential
loading.
- Add helper methods _build_group_members and _create_stores_from_members
  to reduce code duplication between sync and async store opening
- Use zarr_sync() to run async index creation in _datatree_from_backend_datatree
  for zarr engine, making open_datatree fully async behind the scenes
- Fix missing chunks validation and source encoding in open_datatree_async
- Add tests for chunks validation, source encoding, and chunks parameter
- Add type annotations to nested async functions in _datatree_from_backend_datatree
  to fix mypy annotation-unchecked notes breaking pytest-mypy-plugins tests
- Use os.path.join and os.path.normpath in test_async_source_encoding
  for cross-platform compatibility on Windows
Add type annotations to _maybe_create_default_indexes_async and its
nested functions (load_var, create_index, _create) to satisfy mypy's
annotation-unchecked checks. Also add Variable and Hashable imports
to the TYPE_CHECKING block.

This fixes pytest-mypy-plugins tests that were failing due to mypy
emitting annotation-unchecked notes for untyped nested functions.
aladinor and others added 15 commits January 16, 2026 09:40
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
Benchmarking showed async index creation provides no measurable benefit
since it's CPU-bound work. Simplified to sync loop per reviewer feedback.
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
- Replace asyncio.gather with asyncio.TaskGroup for better error handling
  (cancels outstanding tasks on error)
- Add max_concurrency parameter to open_datatree for controlling parallel
  I/O operations (defaults to 10)
- Add StoreBackendEntrypoint.open_dataset_async method
- Add test for open_dataset_async equivalence
@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Feb 9, 2026
@AAlexxis222 AAlexxis222 force-pushed the refactor-async-zarr-loading branch from 7148848 to a07e5d8 Compare February 9, 2026 09:55
@AAlexxis222 AAlexxis222 force-pushed the refactor-async-zarr-loading branch from a07e5d8 to c204229 Compare February 9, 2026 14:18
@AAlexxis222
Copy link
Author

@keewis Logic is ready and tested, but depends on unmerged PR #10742.

I rebased on #10742 to get open_dataset_async, inheriting its commits and a conflict in whats-new.rst.

Should I wait for #10742 to merge first, or should I resolve the conflicts here?

@keewis
Copy link
Collaborator

keewis commented Feb 11, 2026

at the moment it's pretty hard to review this, so I think we should try to merge #10742 first and rebase on main afterwards

@AAlexxis222
Copy link
Author

Sounds good. I'll put this on hold until #10742 is merged, and then I'll rebase on main. Thanks for the guidance!

Copilot AI review requested due to automatic review settings February 16, 2026 08:37
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors Zarr async group loading so open_datatree and async group-opening share a single, more robust execution path (bounded concurrency, TaskGroup, and async dataset opening/indexing), and exposes a max_concurrency control to callers.

Changes:

  • Added StoreBackendEntrypoint.open_dataset_async() and ZarrStore.open_store_async() to support non-blocking async loading.
  • Extracted shared async group opening core into ZarrBackendEntrypoint._open_groups_from_stores_async() and wired it into open_datatree and open_groups_as_dict_async.
  • Added max_concurrency parameter to open_datatree, plus changelog entry and new async-focused tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
xarray/backends/zarr.py Adds async store discovery and shared async group-opening core; wires concurrency-limited loading into open_datatree / open_groups_as_dict_async.
xarray/backends/store.py Introduces open_dataset_async() via asyncio.to_thread() to avoid blocking the event loop.
xarray/backends/api.py Adds async default-index creation helper and uses it for zarr-v3 datatree index creation; threads max_concurrency into backend kwargs.
xarray/tests/test_backends_zarr_async.py Adds targeted tests for async zarr group opening and async default-index creation.
doc/whats-new.rst Documents new open_datatree(max_concurrency=...) feature and performance note.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix typo in comment (rootshalgive -> root)
- Add max_concurrency validation (raise ValueError if < 1)
- Update open_store_async docstring to reflect actual behavior
- Add semaphore bound to _iter_zarr_groups_async
- Add create_default_indexes param to _open_groups_from_stores_async
- Add semaphore bound to create_indexes_async in api.py
- Fix whats-new.rst PR references (10742 -> 11149)
@AAlexxis222 AAlexxis222 force-pushed the refactor-async-zarr-loading branch from 73b2eb2 to 547ddd4 Compare February 20, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

io topic-backends topic-zarr Related to zarr storage library

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor async datatree loading to reuse open_groups_as_dict_async

4 participants