| 1 | +--- |
| 2 | +title: "PolyStore: Unified Storage Abstraction with Streaming Backends for Scientific Python" |
| 3 | +tags: |
| 4 | + - Python |
| 5 | + - storage |
| 6 | + - scientific computing |
| 7 | + - microscopy |
| 8 | + - streaming |
| 9 | +authors: |
| 10 | + - name: Tristan Simas |
| 11 | + orcid: 0000-0002-6526-3149 |
| 12 | + affiliation: 1 |
| 13 | +affiliations: |
| 14 | + - name: McGill University |
| 15 | + index: 1 |
| 16 | +date: 15 January 2026 |
| 17 | +bibliography: paper.bib |
| 18 | +--- |
| 19 | + |
| 20 | +# Summary |
| 21 | + |
| 22 | +PolyStore provides a single API for heterogeneous storage backends (disk, memory, Zarr, and live streaming to Napari or Fiji). The key insight is that **streaming viewers are just backends**: |
| 23 | + |
| 24 | +```python |
| 25 | +from polystore import FileManager, BackendRegistry |
| 26 | + |
| 27 | +fm = FileManager(BackendRegistry()) |
| 28 | + |
| 29 | +# Same API for persistent storage, cache, and live visualization |
| 30 | +fm.save(image, "result.npy", backend="disk") |
| 31 | +fm.save(image, "result.npy", backend="memory") |
| 32 | +fm.save(image, "result.npy", backend="napari_stream") # Appears in Napari |
| 33 | +``` |
| 34 | + |
| 35 | +The `FileManager` routes operations to explicitly selected backends with no implicit fallback. Backends auto-register via metaclass, support lazy imports for optional dependencies, and provide atomic file operations for concurrent metadata updates. |
| 36 | + |
| 37 | +# Statement of Need |
| 38 | + |
| 39 | +Scientific pipelines move data between arrays, files, chunked formats, and visualization tools. Each destination has different I/O conventions: |
| 40 | + |
| 41 | +```python |
| 42 | +# Without PolyStore: per-backend code everywhere |
| 43 | +np.save("result.npy", data) # Disk |
| 44 | +memory_store["result.npy"] = data # Memory |
| 45 | +zarr.save("result.zarr", data) # Zarr |
| 46 | +socket.send(msgpack.packb({"data": data})) # Streaming |
| 47 | +``` |
| 48 | + |
| 49 | +With PolyStore, one call handles all backends. The explicit `backend=` parameter ensures deterministic behavior—no silent fallbacks, no hidden resolution logic. |
| 50 | + |
| 51 | +# State of the Field |
| 52 | + |
| 53 | +| Feature | PolyStore | fsspec | zarr | xarray | |
| 54 | +|---------|:---------:|:------:|:----:|:------:| |
| 55 | +| Unified storage API | ✓ | ✓ | — | — | |
| 56 | +| Streaming backends | ✓ | — | — | — | |
| 57 | +| Multi-framework I/O | ✓ | — | — | ✓ | |
| 58 | +| Atomic concurrent writes | ✓ | — | — | — | |
| 59 | +| Explicit backend selection | ✓ | — | — | — | |
| 60 | +| Zero implicit fallback | ✓ | — | — | — | |
| 61 | + |
| 62 | +**fsspec** [@fsspec] provides unified filesystem access but offers neither streaming backends nor multi-framework array I/O. **zarr** [@zarr] handles chunked arrays but is a single format, not a storage abstraction. **xarray** [@xarray] provides multi-dimensional arrays with NetCDF/Zarr backends but no streaming or explicit backend routing. |
| 63 | + |
| 64 | +# Software Design |
| 65 | + |
| 66 | +**Backend Hierarchy**: `DataSource` (read-only), `DataSink` (write-only), `StorageBackend` (read/write). Backends auto-register via `metaclass-registry` [@metaclassregistry] and are lazily instantiated. |
| 67 | + |
| 68 | +**FileManager**: Thin router enforcing explicit backend selection. No magic resolution—if you don't specify a backend, you get an error. |
| 69 | + |
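The registration and routing ideas above can be illustrated with a minimal sketch. The names `RegistryMeta`, `Backend`, `MemoryBackend`, and `Router` are hypothetical and are not PolyStore's actual classes; the package itself delegates registration to `metaclass-registry`. The sketch only shows the general pattern of a metaclass that records subclasses and a router that refuses to guess a backend.

```python
class RegistryMeta(type):
    """Hypothetical metaclass: records every concrete subclass under its `name`."""

    registry = {}

    def __init__(cls, clsname, bases, namespace):
        super().__init__(clsname, bases, namespace)
        name = namespace.get("name")
        if name is not None:  # skip the abstract base, which declares no name
            RegistryMeta.registry[name] = cls


class Backend(metaclass=RegistryMeta):
    def save(self, data, path):
        raise NotImplementedError

    def load(self, path):
        raise NotImplementedError


class MemoryBackend(Backend):
    name = "memory"  # defining the class is enough to register it

    def __init__(self):
        self._store = {}

    def save(self, data, path):
        self._store[path] = data

    def load(self, path):
        return self._store[path]


class Router:
    """Hypothetical thin router: the backend must be named, never guessed."""

    def __init__(self, registry):
        self._registry = registry
        self._instances = {}  # backends are instantiated lazily, on first use

    def _get(self, backend):
        if backend not in self._registry:
            raise KeyError(f"unknown backend: {backend!r}")
        if backend not in self._instances:
            self._instances[backend] = self._registry[backend]()
        return self._instances[backend]

    # `backend` is keyword-only with no default, so omitting it is a TypeError.
    def save(self, data, path, *, backend):
        self._get(backend).save(data, path)

    def load(self, path, *, backend):
        return self._get(backend).load(path)


router = Router(RegistryMeta.registry)
router.save([1, 2, 3], "result.npy", backend="memory")
```

Making `backend` a required keyword-only argument is one way to get the "no implicit fallback" behavior: a missing backend fails at the call site instead of being resolved silently.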
| 70 | +**Streaming Backends**: ZeroMQ transport with shared memory for zero-copy image transfer. An ROI data model provides backend-neutral shapes and points, with converters for Napari and Fiji. |
| 71 | + |
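As a rough illustration of the shared-memory handoff described above, the sketch below uses only the Python standard library. It is not PolyStore's wire format: the header fields (`shm_name`, `shape`, `dtype`, `nbytes`) are assumptions, and the socket is replaced by a plain JSON string so the example stays self-contained. In a real streaming backend, only the small header would travel over the ZeroMQ socket.

```python
import json
from multiprocessing import shared_memory

# Producer: copy the pixel buffer into a named shared-memory block once.
pixels = bytes(range(12))  # stand-in for a 3x4 uint8 image
shm = shared_memory.SharedMemory(create=True, size=len(pixels))
shm.buf[:len(pixels)] = pixels

# Only this small header would be sent over the socket (hypothetical fields).
# `nbytes` is included because the OS may round the block size up.
header = json.dumps({
    "shm_name": shm.name,
    "shape": [3, 4],
    "dtype": "uint8",
    "nbytes": len(pixels),
})

# Consumer: attach to the block by name instead of receiving the pixels.
meta = json.loads(header)
view = shared_memory.SharedMemory(name=meta["shm_name"])
received = bytes(view.buf[:meta["nbytes"]])

# Cleanup: every handle is closed; the creator also unlinks the block.
view.close()
shm.close()
shm.unlink()
```

The message size stays constant regardless of image size, which is the main reason to pair a message transport with shared memory for large arrays.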
| 72 | +**Atomic Operations**: Cross-platform file locking (`fcntl` on Unix, `portalocker` on Windows) with `atomic_update_json()` for concurrent metadata writes from multiple pipeline workers. |
| 73 | + |
| 74 | +```python |
| 75 | +# Multiple workers safely update shared metadata |
| 76 | +from polystore import AtomicMetadataWriter |
| 77 | + |
| 78 | +writer = AtomicMetadataWriter() |
| 79 | +writer.merge_subdirectory_metadata(metadata_path, { |
| 80 | + "TimePoint_1": {"available_backends": {"zarr": True}} |
| 81 | +}) |
| 82 | +``` |
| 83 | + |
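The locking pattern behind such concurrent metadata updates can be sketched as follows. This is a minimal sketch under stated assumptions, not PolyStore's implementation: `update_json_atomically` is a hypothetical helper, the lock is a sidecar `.lock` file, and locking is skipped where `fcntl` is unavailable (on Windows a package such as `portalocker` would take its place).

```python
import json
import os
import tempfile

try:
    import fcntl  # Unix only
except ImportError:  # e.g. Windows, where portalocker would be used instead
    fcntl = None


def update_json_atomically(path, update_fn):
    """Hypothetical helper: read-modify-write a JSON file under an exclusive lock."""
    lock_path = path + ".lock"
    with open(lock_path, "w") as lock_file:
        if fcntl is not None:
            fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            try:
                with open(path) as f:
                    data = json.load(f)
            except FileNotFoundError:
                data = {}  # first writer starts from an empty document
            update_fn(data)
            # Write to a temp file in the same directory, then rename it over
            # the target so readers never observe a half-written file.
            fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
            with os.fdopen(fd, "w") as tmp:
                json.dump(data, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, path)
        finally:
            if fcntl is not None:
                fcntl.flock(lock_file, fcntl.LOCK_UN)
    return data


# Two sequential updates stand in for two pipeline workers.
workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "metadata.json")
update_json_atomically(
    metadata_path, lambda d: d.setdefault("TimePoint_1", {}).update({"zarr": True})
)
update_json_atomically(
    metadata_path, lambda d: d.setdefault("TimePoint_2", {}).update({"zarr": True})
)
```

The lock serializes the read-modify-write cycle across workers, while `os.replace` ensures that a reader without the lock sees either the old file or the new one.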
| 84 | +# Research Application |
| 85 | + |
| 86 | +PolyStore was developed for OpenHCS (Open High-Content Screening), where microscopy pipelines: |
| 87 | + |
| 88 | +- Load images from disk or a virtual workspace |
| 89 | +- Process in memory (avoiding I/O between steps) |
| 90 | +- Write results to Zarr (chunked, compressed) |
| 91 | +- Stream intermediate results to Napari for live preview |
| 92 | + |
| 93 | +All through one interface: |
| 94 | + |
| 95 | +```python |
| 96 | +# Load → process → save → stream: same API |
| 97 | +images = fm.load_batch(paths, backend="disk") |
| 98 | +processed = pipeline(images) |
| 99 | +fm.save_batch(processed, paths, backend="zarr") |
| 100 | +fm.save_batch(processed, paths, backend="napari_stream") |
| 101 | +``` |
| 102 | + |
| 103 | +The explicit backend model eliminated an entire class of bugs where code assumed disk storage but ran against memory or streaming backends. |
| 104 | + |
| 105 | +# AI Usage Disclosure |
| 106 | + |
| 107 | +Generative AI (Claude) assisted with code generation and documentation. All content was reviewed and tested. |
| 108 | + |
| 109 | +# Acknowledgements |
| 110 | + |
| 111 | +This work was supported in part by the Fournier lab at the Montreal Neurological Institute, McGill University. |
| 112 | + |
| 113 | +# References |