
Commit 53dc5f5

feat: implement Windows read limiter to prevent crashes from concurrent TensorStore reads
- Add `read_limiter.py` with a semaphore-based context manager for managing TensorStore reads on Windows.
- Update `CellMapDataset` to use the read limiter in `__getitem__` methods.
- Introduce logging warnings for potential threading issues on Windows with multiple DataLoader workers.
- Implement `close()` method in `CellMapDataset` for safe cleanup.
- Add tests in `test_windows_stress.py` to validate read limiter functionality and executor lifecycle.
- Enhance README with Windows compatibility guidelines and environment variable configurations.
1 parent d09f8b3 commit 53dc5f5

7 files changed

Lines changed: 660 additions & 13 deletions


CHANGELOG.md

Lines changed: 50 additions & 0 deletions
# CHANGELOG

## Unreleased

### Fix

* fix: prevent Windows hard-crash from concurrent TensorStore reads

  - Add `src/cellmap_data/read_limiter.py`: global `threading.Semaphore` that gates TensorStore materializations on Windows with the TensorStore backend. No-op on Linux/macOS and when using the Dask backend.
  - Wrap the three read-triggering lines in `CellMapDataset.__getitem__` (`get_input_array`, `get_target_array` raw-only, `get_label_array`) with the `limit_tensorstore_reads()` context manager. Torch-only operations (`infer_label_array`, stacking, `.to(device)`) are left unconstrained.
  - Configure via the `CELLMAP_MAX_CONCURRENT_READS` env var (default `"1"` on Windows; unlimited elsewhere). Must be set before importing `cellmap_data`.

### Feature

* feat: add `CellMapDataset.close()` and `atexit` registration

  - `close()` calls `executor.shutdown(wait=True, cancel_futures=True)` and resets `_executor` to `None`, enabling safe deterministic cleanup.
  - `atexit.register(self.close)` ensures the executor is always shut down at interpreter exit, even when `__del__` is not called.

* feat: add nested-worker warning on Windows

  - When `CellMapDataset.executor` is lazily created inside a DataLoader worker process (`torch.utils.data.get_worker_info() is not None`) on Windows with `max_workers > 1`, a `logger.warning` is emitted.
  - `CellMapDataLoader` warns when `num_workers > 0` on Windows.

* feat: improve init logging

  - Replaced `logger.debug` with `logger.info` at dataset construction time, now including OS, backend, `max_workers`, and `max_concurrent_reads`.

### Test

* test: add `tests/test_windows_stress.py`

  - `TestReadLimiterUnit`: semaphore state, context manager correctness, exception propagation, 50-thread deadlock test.
  - `TestExecutorLifecycle`: `close()` idempotency, executor recreation.
  - `TestConcurrentGetitem`: 200-iteration serial tests (both multi-class and raw-only paths); multi-thread tests where each thread has its own dataset instance, accurately mirroring DataLoader `num_workers > 0` behavior.
  - `test_windows_high_concurrency_no_crash`: 8 simulated workers × 100 iterations each; skipped on non-Windows.

## v0.1.0 (2024-09-06)

### Build

README.md

Lines changed: 48 additions & 3 deletions
## Windows Compatibility

CellMap-Data includes specific hardening for Windows to prevent native hard-crashes caused by concurrent TensorStore reads from multiple threads.

### TensorStore Read Limiter

On Windows, concurrent materializations of TensorStore-backed xarray arrays (triggered by `source[center]`, `.interp`, `.__array__`, etc.) can cause the Python process to abort. A global semaphore serializes these reads automatically:

```python
# The limiter activates automatically on Windows with the default TensorStore backend.
# No code changes required; it is transparent to all callers.

# Override the concurrency limit (default is 1 on Windows):
import os
os.environ["CELLMAP_MAX_CONCURRENT_READS"] = "2"  # set BEFORE importing cellmap_data

from cellmap_data import CellMapDataset
```
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `CELLMAP_DATA_BACKEND` | `"tensorstore"` | Backend for array reads (`"tensorstore"` or `"dask"`) |
| `CELLMAP_MAX_WORKERS` | `8` | Max threads in the internal `ThreadPoolExecutor` |
| `CELLMAP_MAX_CONCURRENT_READS` | `1` (Windows) / unlimited | Max concurrent TensorStore reads (Windows+TensorStore only) |
### Recommendations for Windows
430+
431+
- Keep the default `num_workers=0` in `CellMapDataLoader` (safest on Windows); the internal executor still parallelizes per-array I/O within each `__getitem__` call.
432+
- If you need `num_workers > 0`, each DataLoader worker process gets its own dataset copy and its own read semaphore — this is safe.
433+
- Do **not** share a single `CellMapDataset` instance across multiple threads that each call `__getitem__` concurrently. Use separate dataset instances instead (which is exactly what DataLoader workers do).
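The per-instance pattern above can be sketched with plain threads. This is illustrative only: `make_dataset` is a placeholder standing in for constructing a fresh `CellMapDataset` per worker, not the library API.

```python
import threading

# Placeholder for building a fresh dataset per thread (NOT the real API):
# each instance carries its own state, so threads never contend.
def make_dataset():
    return {"reads": 0}

results = []
results_lock = threading.Lock()

def worker(n_items):
    dataset = make_dataset()  # one instance per thread, as DataLoader workers do
    for _ in range(n_items):
        dataset["reads"] += 1  # stands in for dataset[idx]
    with results_lock:
        results.append(dataset["reads"])

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == [100, 100, 100, 100]  # no cross-thread interference
```

Because no instance is shared, no locking is needed around the reads themselves; only the result list is guarded.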
### Explicit Shutdown

`CellMapDataset` registers an `atexit` handler and exposes an explicit `close()` method for deterministic cleanup:

```python
dataset = CellMapDataset(...)
try:
    ...  # training loop
finally:
    dataset.close()  # shuts down the internal ThreadPoolExecutor immediately
```
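Since `close()` follows the standard close protocol, `contextlib.closing` also works as a tidier alternative to `try`/`finally`. A minimal sketch, using a stub in place of a real `CellMapDataset`:

```python
from contextlib import closing

# Stub standing in for CellMapDataset; only the close() protocol matters here.
class StubDataset:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

with closing(StubDataset()) as dataset:
    assert not dataset.closed  # dataset is usable inside the block

assert dataset.closed  # close() was called on exit, even if the block raised
```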
## Performance Optimization

### Memory Management

- Automatic GPU memory management
- Streaming data loading for large volumes

### Parallel Processing

- Multi-threaded data loading via persistent `ThreadPoolExecutor`
- CUDA streams for GPU optimization
- Process-safe dataset pickling

### Caching Strategy

- Persistent `ThreadPoolExecutor` per process (lazy-initialized, PID-tracked)
- Optimized coordinate transformations
- Minimal redundant computations

src/cellmap_data/dataloader.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -78,6 +78,15 @@ def __init__(
         self.is_train = is_train
         self.rng = rng

+        if platform.system() == "Windows" and num_workers > 0:
+            logger.warning(
+                "CellMapDataLoader: num_workers=%d on Windows may cause nested "
+                "threading x multiprocessing issues with TensorStore. "
+                "The internal read limiter serializes reads, but num_workers=0 "
+                "is safer if crashes occur.",
+                num_workers,
+            )
+
         # Set device
         if device is None:
             if torch.cuda.is_available():
```

src/cellmap_data/dataset.py

Lines changed: 49 additions & 8 deletions
```diff
@@ -1,7 +1,9 @@
 # %%
+import atexit
 import functools
 import logging
 import os
+import platform
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from typing import Any, Callable, Mapping, Optional, Sequence

@@ -15,12 +17,17 @@
 from .empty_image import EmptyImage
 from .image import CellMapImage
 from .mutable_sampler import MutableSubsetRandomSampler
+from .read_limiter import MAX_CONCURRENT_READS, limit_tensorstore_reads
 from .utils import get_sliced_shape, is_array_2D, min_redundant_inds, split_target_path

 logger = logging.getLogger(__name__)
 if logger.level == logging.NOTSET:
     logger.setLevel(logging.INFO)

+# Cache system values to avoid repeated calls during dataset instantiation
+_OS_NAME = platform.system()
+_DATA_BACKEND = os.environ.get("CELLMAP_DATA_BACKEND", "tensorstore")
+

 # %%
 class CellMapDataset(CellMapBaseDataset, Dataset):
@@ -154,14 +161,22 @@ def __init__(
             int(os.environ.get("CELLMAP_MAX_WORKERS", 8)),  # Cap at 8 by default
         )

-        logger.debug(
-            "CellMapDataset initialized with %d inputs, %d targets, %d classes. "
-            "Using ThreadPoolExecutor with %d workers for parallel I/O.",
+        logger.info(
+            "CellMapDataset: OS=%s backend=%s max_workers=%d max_concurrent_reads=%s "
+            "inputs=%d targets=%d classes=%d",
+            _OS_NAME,
+            _DATA_BACKEND,
+            self._max_workers,
+            (
+                str(MAX_CONCURRENT_READS)
+                if MAX_CONCURRENT_READS is not None
+                else "unlimited"
+            ),
             len(self.input_arrays),
             len(self.target_arrays),
             len(self.classes),
-            self._max_workers,
         )
+        atexit.register(self.close)

     @property
     def executor(self) -> ThreadPoolExecutor:
@@ -177,6 +192,19 @@ def executor(self) -> ThreadPoolExecutor:
             self._executor_pid = current_pid

         if self._executor is None:
+            if _OS_NAME == "Windows":
+                worker_info = torch.utils.data.get_worker_info()
+                if worker_info is not None and self._max_workers > 1:
+                    logger.warning(
+                        "CellMapDataset running inside DataLoader worker "
+                        "(id=%d, total=%d) on Windows with max_workers=%d. "
+                        "Prefer max_workers=1 or num_workers=0 to avoid nested "
+                        "threading x multiprocessing crashes. "
+                        "TensorStore reads are still serialized by the read limiter.",
+                        worker_info.id,
+                        worker_info.num_workers,
+                        self._max_workers,
+                    )
             self._executor = ThreadPoolExecutor(max_workers=self._max_workers)
         return self._executor

@@ -188,6 +216,16 @@ def __del__(self):
         if hasattr(self, "_executor") and self._executor is not None:
             self._executor.shutdown(wait=True)

+    def close(self) -> None:
+        """Shut down the ThreadPoolExecutor and release resources.
+
+        Called automatically via atexit to ensure clean shutdown at interpreter
+        exit, regardless of whether __del__ is invoked.
+        """
+        if hasattr(self, "_executor") and self._executor is not None:
+            self._executor.shutdown(wait=True, cancel_futures=True)
+            self._executor = None
+
     def __new__(
         cls,
         raw_path: str,
@@ -218,7 +256,7 @@ def __new__(
         ):
             from cellmap_data.multidataset import CellMapMultiDataset

-            logger.warning(
+            logger.info(
                 "2D arrays requested without slicing axis. Creating datasets "
                 "that each slice along one axis. If this is not intended, "
                 "specify the slicing axis in the input and target arrays."
@@ -549,7 +587,8 @@ def __getitem__(self, idx: ArrayLike) -> dict[str, torch.Tensor]:

         def get_input_array(array_name: str) -> tuple[str, torch.Tensor]:
             self.input_sources[array_name].set_spatial_transforms(spatial_transforms)
-            array = self.input_sources[array_name][center]
+            with limit_tensorstore_reads():
+                array = self.input_sources[array_name][center]
             return array_name, array.squeeze()[None, ...]

         futures = [
@@ -563,7 +602,8 @@ def get_target_array(array_name: str) -> tuple[str, torch.Tensor]:
             self.target_sources[array_name].set_spatial_transforms(
                 spatial_transforms
             )
-            array = self.target_sources[array_name][center]
+            with limit_tensorstore_reads():
+                array = self.target_sources[array_name][center]
             return array_name, array.squeeze()[None, ...]

         else:
@@ -578,7 +618,8 @@ def get_label_array(
             source = self.target_sources[array_name].get(label)
             if isinstance(source, (CellMapImage, EmptyImage)):
                 source.set_spatial_transforms(spatial_transforms)
-                array = source[center].squeeze()
+                with limit_tensorstore_reads():
+                    array = source[center].squeeze()
             else:
                 array = None
             return label, array
```

src/cellmap_data/read_limiter.py

Lines changed: 70 additions & 0 deletions
```python
"""
Global TensorStore read limiter for Windows crash prevention.

On Windows, concurrent TensorStore materializations from multiple threads
(triggered by source[center], .interp, ._TensorStoreAdapter.__array__, etc.)
cause native hard crashes / aborts. This module provides a semaphore-backed
context manager that serializes those reads on Windows+TensorStore while
acting as a no-op on all other platforms.

Configuration
-------------
CELLMAP_DATA_BACKEND : str
    Set to "tensorstore" (default) to enable the limiter on Windows.
    Set to anything else (e.g. "dask") to disable it entirely.

CELLMAP_MAX_CONCURRENT_READS : int
    Maximum concurrent TensorStore reads allowed on Windows.
    Defaults to 1 (fully serialized). Increase cautiously.

Notes
-----
Both environment variables must be set **before** this module is imported,
as the semaphore is created once at import time.
"""

import os
import platform
import threading
from contextlib import contextmanager

_IS_WINDOWS = platform.system() == "Windows"
_IS_TENSORSTORE = (
    os.environ.get("CELLMAP_DATA_BACKEND", "tensorstore").lower() == "tensorstore"
)

MAX_CONCURRENT_READS: int | None
_read_semaphore: threading.Semaphore | None

if _IS_WINDOWS and _IS_TENSORSTORE:
    MAX_CONCURRENT_READS = int(os.environ.get("CELLMAP_MAX_CONCURRENT_READS", "1"))
    _read_semaphore = threading.Semaphore(MAX_CONCURRENT_READS)
else:
    MAX_CONCURRENT_READS = None
    _read_semaphore = None


@contextmanager
def limit_tensorstore_reads():
    """Context manager that gates TensorStore reads on Windows.

    On Windows with the TensorStore backend, at most ``MAX_CONCURRENT_READS``
    threads may be inside this context at once. On all other platforms (or
    when using the Dask backend) this is a true no-op with zero overhead.

    Usage
    -----
    ::

        with limit_tensorstore_reads():
            array = source[center]  # the unsafe read
        # torch-only work continues here unconstrained
    """
    if _read_semaphore is not None:
        _read_semaphore.acquire()
        try:
            yield
        finally:
            _read_semaphore.release()
    else:
        yield
```
