docs: clarify DataSink abstraction vs fsspec in JOSS paper
- Explain why fsspec cannot support streaming (filesystem vs data sink abstraction)
- Show DataSink/StorageBackend hierarchy with code example
- Highlight streaming complexity hidden behind save_batch() interface
- Add fsspec, xarray, metaclass-registry references to bibliography
paper.md (17 additions, 13 deletions)
@@ -59,27 +59,31 @@ With PolyStore, one call handles all backends. The explicit `backend=` parameter
| Explicit backend selection | ✓ | — | — | — |
| Zero implicit fallback | ✓ | — | — | — |
-**fsspec**[@fsspec] provides unified filesystem access but lacks streaming and array framework handling. **zarr**[@zarr] handles chunked arrays but is a single format, not a storage abstraction. **xarray**[@xarray] provides multi-dimensional arrays with NetCDF/Zarr backends but no streaming or explicit backend routing.
+**fsspec**[@fsspec] provides unified filesystem access but cannot support streaming because its abstraction is *filesystems*—everything must behave like a file. PolyStore's abstraction is *data sinks*, which includes destinations that consume data without persisting it. This distinction is fundamental: a Napari viewer is not a filesystem, but it is a valid data sink.
+
+**zarr**[@zarr] handles chunked arrays but is a single format, not a storage abstraction. **xarray**[@xarray] provides multi-dimensional arrays with NetCDF/Zarr backends but no streaming or explicit backend routing.
# Software Design
-**Backend Hierarchy**: `DataSource` (read-only), `DataSink` (write-only), `StorageBackend` (read/write). Backends auto-register via `metaclass-registry`[@metaclassregistry] and are lazily instantiated.
+**Backend Hierarchy**: The key architectural decision is the base abstraction. `DataSink` (write-only) is the root interface—not `StorageBackend`. This allows streaming backends that consume data without supporting reads:
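The code example this commit adds is not visible in the hunk. A minimal sketch of the hierarchy the paragraph describes, assuming illustrative names (`load_batch`, `ViewerSink`) beyond the documented `DataSink`/`DataSource`/`StorageBackend`:

```python
from abc import ABC, abstractmethod

class DataSink(ABC):
    """Write-only root of the hierarchy: consumes batches, need not persist them."""
    @abstractmethod
    def save_batch(self, data, **kwargs): ...

class DataSource(ABC):
    """Read-only counterpart (load_batch is an illustrative name, not PolyStore's API)."""
    @abstractmethod
    def load_batch(self, key, **kwargs): ...

class StorageBackend(DataSink, DataSource, ABC):
    """Read/write backends (disk, zarr, memory) combine both sides."""

class ViewerSink(DataSink):
    """Hypothetical example: a viewer is a valid sink that never writes files."""
    def __init__(self):
        self.frames = []
    def save_batch(self, data, **kwargs):
        self.frames.extend(data)  # push to a display instead of persisting
```

Making `DataSink` the root, rather than a read/write `StorageBackend`, is what lets `ViewerSink` satisfy the interface without any read path.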
-**FileManager**: Thin router enforcing explicit backend selection. No magic resolution—if you don't specify a backend, you get an error.
-**Streaming Backends**: ZeroMQ transport with shared memory for zero-copy image transfer. ROI data model provides backend-neutral shapes/points with converters for Napari and Fiji.
-**Atomic Operations**: Cross-platform file locking (`fcntl` on Unix, `portalocker` on Windows) with `atomic_update_json()` for concurrent metadata writes from multiple pipeline workers.
+The `FileManager` routes to any `DataSink`. Pipeline code doesn't know whether data goes to disk, memory, or a live Napari viewer—and doesn't need to.
-```python
-# Multiple workers safely update shared metadata
-from polystore import AtomicMetadataWriter
+**Streaming Internals Hidden**: Streaming backends handle substantial complexity internally—GPU tensor conversion, shared memory allocation, ZMQ socket management, ROI serialization—all behind the same `save_batch()` interface. The orchestrator remains backend-agnostic.
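A sketch of how that complexity can stay behind `save_batch()`; the internals here are pure-Python stand-ins for the real GPU-to-host conversion, shared-memory allocation, and ZMQ sends, and none of these method names beyond `save_batch` are claimed to be PolyStore's:

```python
class StreamingSink:
    """Hypothetical streaming sink: same save_batch() surface as a disk backend."""
    def __init__(self):
        self.sent = []

    def _to_host(self, array):
        # stand-in for GPU tensor -> host array conversion (tensor.cpu() if present)
        return getattr(array, "cpu", lambda: array)()

    def save_batch(self, batch, **metadata):
        for array in batch:
            host = self._to_host(array)   # tensor conversion happens inside
            payload = (metadata, host)    # stand-in for shm allocation + ZMQ send
            self.sent.append(payload)
```

The orchestrator calls `save_batch()` identically on this sink and on a file backend; only the sink knows a socket is involved.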
+**Atomic Operations**: Cross-platform file locking (`fcntl` on Unix, `portalocker` on Windows) with `atomic_update_json()` for concurrent metadata writes from multiple pipeline workers.
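The general pattern behind an `atomic_update_json()` can be sketched as write-to-temp-then-rename; this is the textbook technique only, not PolyStore's implementation, which additionally holds the cross-platform lock described above around the read-modify-write:

```python
import json
import os
import tempfile

def atomic_update_json(path, update):
    """Read-modify-write a JSON file so readers never observe a partial file.
    Sketch only: a real concurrent version also takes an exclusive lock here."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        data = {}
    data = update(data)
    # write to a temp file on the same filesystem, then atomically swap it in
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic rename on both POSIX and Windows
    return data
```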
+
+Backends auto-register via `metaclass-registry`[@metaclassregistry] and are lazily instantiated, keeping optional dependencies unloaded until used.
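The registration idea can be sketched with a plain metaclass as a simplified stand-in for the `metaclass-registry` package; storing classes rather than instances is what keeps instantiation (and any optional imports inside `__init__`) lazy:

```python
class BackendMeta(type):
    """Registers every concrete subclass at class-definition time."""
    registry = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if bases:  # skip the abstract root class
            # store the class, not an instance: nothing heavy runs yet
            BackendMeta.registry[name.lower()] = cls

class Backend(metaclass=BackendMeta):
    pass

class ZarrBackend(Backend):
    """Hypothetical backend; a real one would import zarr inside __init__."""

# Nothing was instantiated at import time; lookup triggers construction:
backend = BackendMeta.registry["zarrbackend"]()
```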