From 12f6879fd679925ad61e8e05b998fb1d6063e246 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 11:06:36 -0600 Subject: [PATCH 1/9] docs: fix external dtype notation in codec comparison table Changed to in the External dtype row to match the correct store-only notation used throughout the documentation. --- src/reference/specs/type-system.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/reference/specs/type-system.md b/src/reference/specs/type-system.md index 9fe16489..80c6bd05 100644 --- a/src/reference/specs/type-system.md +++ b/src/reference/specs/type-system.md @@ -636,7 +636,7 @@ def garbage_collect(store_name): |---------|----------|------------|-------------|--------------|---------------| | Storage modes | Both | Both | External only | External only | External only | | Internal dtype | `bytes` | `bytes` | N/A | N/A | N/A | -| External dtype | `` | `` | `json` | `json` | `json` | +| External dtype | `` | `` | `json` | `json` | `json` | | Addressing | Hash | Hash | Primary key | Hash | Relative path | | Deduplication | Yes (external) | Yes (external) | No | Yes | No | | Structure | Single blob | Single file | Files, folders | Single blob | Any | From 3aa243dfc38fe33491b6ceeb7cb870aef5bc1158 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 11:49:19 -0600 Subject: [PATCH 2/9] docs: Add plugin codecs guide with dj-zarr-codecs example Add comprehensive how-to guide for using plugin codecs - codec packages that extend DataJoint via entry point discovery. Uses dj-zarr-codecs as the primary example. 
Key sections: - Installation and automatic registration via entry points - Complete Zarr codec usage example with storage structure - Finding DataJoint-maintained and community codecs - Comparison with built-in codecs (, ) - Best practices for dependency management - Troubleshooting common issues Terminology: Uses 'plugin codecs' instead of 'external/third-party' to accurately describe the architectural pattern (separate packages with entry point discovery) without implying ownership. --- mkdocs.yaml | 1 + src/how-to/index.md | 1 + src/how-to/use-plugin-codecs.md | 330 ++++++++++++++++++++++++++++++++ 3 files changed, 332 insertions(+) create mode 100644 src/how-to/use-plugin-codecs.md diff --git a/mkdocs.yaml b/mkdocs.yaml index 0a0812a9..0a8d29ab 100644 --- a/mkdocs.yaml +++ b/mkdocs.yaml @@ -74,6 +74,7 @@ nav: - Object Storage: - Use Object Storage: how-to/use-object-storage.md - Use NPY Codec: how-to/use-npy-codec.md + - Use Plugin Codecs: how-to/use-plugin-codecs.md - Create Custom Codecs: how-to/create-custom-codec.md - Manage Large Data: how-to/manage-large-data.md - Clean Up Storage: how-to/garbage-collection.md diff --git a/src/how-to/index.md b/src/how-to/index.md index 95c99f28..d54246ab 100644 --- a/src/how-to/index.md +++ b/src/how-to/index.md @@ -42,6 +42,7 @@ they assume you understand the basics and focus on getting things done. 
- [Object Storage Overview](object-storage-overview.md) — Navigation guide for all storage docs - [Choose a Storage Type](choose-storage-type.md) — Decision guide for codecs - [Use Object Storage](use-object-storage.md) — When and how +- [Use Plugin Codecs](use-plugin-codecs.md) — Install codec packages via entry points - [Create Custom Codecs](create-custom-codec.md) — Domain-specific types - [Manage Large Data](manage-large-data.md) — Blobs, streaming, efficiency - [Clean Up External Storage](garbage-collection.md) — Garbage collection diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md new file mode 100644 index 00000000..6820915c --- /dev/null +++ b/src/how-to/use-plugin-codecs.md @@ -0,0 +1,330 @@ +# Use Plugin Codecs + +Install and use plugin codec packages to extend DataJoint's type system. + +## Overview + +Plugin codecs are distributed as separate Python packages that extend DataJoint's type system. They add support for domain-specific data types without modifying DataJoint itself. Once installed, they register automatically via Python's entry point system and work seamlessly with DataJoint. + +**Benefits:** +- Automatic registration via entry points - no code changes needed +- Domain-specific types maintained independently +- Clean separation of core framework from specialized formats +- Easy to share across projects and teams + +## Quick Start + +### 1. Install the Codec Package + +```bash +pip install dj-zarr-codecs +``` + +### 2. Use in Table Definitions + +```python +import datajoint as dj + +schema = dj.Schema('my_schema') + +@schema +class Recording(dj.Manual): + definition = """ + recording_id : int + --- + waveform : # Automatically available after install + """ +``` + +That's it! No imports or registration needed. The codec is automatically discovered via Python's entry point system. 
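To make the discovery step concrete, here is a minimal sketch that lists registered entry points using only the standard library. It assumes the entry point group is named `datajoint.codecs`, and `list_codec_entry_points` is a hypothetical helper for illustration, not part of DataJoint's API.

```python
from importlib.metadata import entry_points


def list_codec_entry_points(group="datajoint.codecs"):
    """Return the sorted names of entry points registered under `group`.

    The group name is an assumption made for this sketch.
    """
    try:
        # Python 3.10+ supports selecting by group directly.
        eps = entry_points(group=group)
    except TypeError:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        eps = entry_points().get(group, [])
    return sorted(ep.name for ep in eps)


# Prints an empty list when no codec package is installed.
print(list_codec_entry_points())
```

An empty list here would suggest that the codec package is not installed in the active environment.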
+ +## Example: Zarr Array Storage + +The `dj-zarr-codecs` package adds support for storing NumPy arrays in Zarr format with schema-addressed paths. + +### Installation + +```bash +pip install dj-zarr-codecs +``` + +### Configuration + +Configure object storage for external data: + +```python +import datajoint as dj + +dj.config['stores'] = { + 'mystore': { + 'protocol': 's3', + 'endpoint': 's3.amazonaws.com', + 'bucket': 'my-bucket', + 'location': 'data', + } +} +``` + +### Basic Usage + +```python +import numpy as np + +schema = dj.Schema('neuroscience') + +@schema +class Recording(dj.Manual): + definition = """ + recording_id : int + --- + waveform : # Store as Zarr array + """ + +# Insert NumPy array +Recording.insert1({ + 'recording_id': 1, + 'waveform': np.random.randn(1000, 32), +}) + +# Fetch returns zarr.Array (read-only) +zarr_array = (Recording & {'recording_id': 1}).fetch1('waveform') + +# Use with NumPy +mean_waveform = np.mean(zarr_array, axis=0) + +# Access Zarr features +print(zarr_array.shape) # (1000, 32) +print(zarr_array.chunks) # Zarr chunking info +print(zarr_array.dtype) # float64 +``` + +### Storage Structure + +Zarr arrays are stored with schema-addressed paths that mirror your database structure: + +``` +s3://my-bucket/data/ +└── neuroscience/ # Schema name + └── recording/ # Table name + └── recording_id=1/ # Primary key + └── waveform.zarr/ # Field name + .zarr extension + ├── .zarray + └── 0.0 +``` + +This organization makes external storage browsable and self-documenting. 
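The layout above can be sketched as a small path-building function. This is an illustration only: `schema_addressed_path` is a hypothetical helper, and the handling of multi-attribute primary keys (one `name=value` segment per attribute, in dictionary order) is an assumption rather than the package's documented behavior.

```python
def schema_addressed_path(schema, table, key, field, ext=".zarr"):
    """Build a schema-addressed relative path (hypothetical helper).

    schema: schema name, e.g. "neuroscience"
    table:  table name as it appears in storage, e.g. "recording"
    key:    dict of primary key attributes, e.g. {"recording_id": 1}
    field:  attribute name holding the array, e.g. "waveform"
    """
    # One "name=value" path segment per primary key attribute (assumed layout).
    key_part = "/".join(f"{name}={value}" for name, value in key.items())
    return f"{schema}/{table}/{key_part}/{field}{ext}"


print(schema_addressed_path("neuroscience", "recording", {"recording_id": 1}, "waveform"))
# neuroscience/recording/recording_id=1/waveform.zarr
```

Because the path is derived from the schema, table, and primary key, it can be predicted from a row's key without querying the store.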
+ +### When to Use `` + +**Use `` when:** +- Arrays are large (> 10 MB) +- You need chunked access patterns +- Compression is beneficial +- Cross-language compatibility matters (any Zarr library can read) +- You want browsable, organized storage paths + +**Use `` instead when:** +- You need lazy loading with metadata inspection before download +- Memory mapping is important +- Storage format simplicity is preferred + +**Use `` instead when:** +- Arrays are small (< 10 MB) +- Deduplication of repeated values is important +- Storing mixed Python objects (not just arrays) + +## Finding Plugin Codecs + +### DataJoint-Maintained Codecs + +- **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — Zarr array storage +- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Anscombe variance stabilization for imaging + +### Community Codecs + +Search PyPI for packages with the `datajoint` keyword (the `pip search` command is no longer supported by PyPI, so use the website): + +```bash +# pip search is disabled on PyPI; open https://pypi.org/search/?q=datajoint in a browser instead +``` + +Or browse GitHub: https://github.com/topics/datajoint + +### Domain-Specific Examples + +**Neuroscience:** +- Spike train formats (NEO, NWB) +- Neural network models +- Connectivity matrices + +**Imaging:** +- OME-TIFF, OME-ZARR +- DICOM medical images +- Point cloud data + +**Genomics:** +- BAM/SAM alignments +- VCF variant calls +- Phylogenetic trees + +## Verifying Installation + +Check that a codec is registered: + +```python +import datajoint as dj + +# List all available codecs +print(dj.list_codecs()) +# ['blob', 'attach', 'hash', 'object', 'npy', 'filepath', 'zarr', ...] + +# Check specific codec +assert 'zarr' in dj.list_codecs() +``` + +## How Auto-Registration Works + +Plugin codecs use Python's entry point system for automatic discovery.
When you install a codec package, it registers itself via `pyproject.toml`: + +```toml +[project.entry-points."datajoint.codecs"] +zarr = "dj_zarr_codecs:ZarrCodec" +``` + +DataJoint discovers these entry points at import time, so the codec is immediately available after `pip install`. + +**No manual registration needed** — unlike DataJoint 0.x which required `dj.register_codec()`. + +## Troubleshooting + +### "Unknown codec: \" + +The codec package is not installed or not found. Verify installation: + +```bash +pip list | grep dj-zarr-codecs +``` + +If installed but not working, check that its entry point is visible: + +```python +# List entry points in the datajoint.codecs group (Python 3.10+); this inspects them, it does not reload anything +import importlib.metadata +print(importlib.metadata.entry_points().select(group='datajoint.codecs')) +``` + +### Codec Not Found After Installation + +Restart your Python session or kernel. Entry points are discovered at import time: + +```python +# Restart kernel, then: +import datajoint as dj +print('zarr' in dj.list_codecs()) # Should be True +``` + +### Version Conflicts + +Check compatibility with your DataJoint version: + +```bash +pip show dj-zarr-codecs +# Requires: datajoint>=2.0.0a22 +``` + +Upgrade DataJoint if needed: + +```bash +pip install --upgrade datajoint +``` + +## Creating Your Own Codecs + +If you need a codec that doesn't exist yet, see: + +- [Create Custom Codecs](create-custom-codec.md) — Step-by-step guide +- [Codec API Specification](../reference/specs/codec-api.md) — Technical reference +- [Custom Codecs Explanation](../explanation/custom-codecs.md) — Design concepts + +Consider publishing your codec as a package so others can benefit! + +## Best Practices + +### 1. Install Codecs with Your Project + +Add plugin codecs to your project dependencies: + +**requirements.txt:** +``` +datajoint>=2.0.0a22 +dj-zarr-codecs>=0.1.0 +``` + +**pyproject.toml:** +```toml +dependencies = [ + "datajoint>=2.0.0a22", + "dj-zarr-codecs>=0.1.0", +] +``` + +### 2.
Document Codec Requirements + +In your pipeline documentation, specify required codecs: + +```python +""" +My Pipeline +=========== + +Requirements: +- datajoint>=2.0.0a22 +- dj-zarr-codecs>=0.1.0 # For waveform storage + +Install: + pip install datajoint dj-zarr-codecs +""" +``` + +### 3. Pin Versions for Reproducibility + +Use exact versions in production: + +``` +dj-zarr-codecs==0.1.0 # Exact version +``` + +Use minimum versions in libraries: + +``` +dj-zarr-codecs>=0.1.0 # Minimum version +``` + +### 4. Test Codec Availability + +Add checks in your pipeline setup: + +```python +import datajoint as dj + +REQUIRED_CODECS = ['zarr'] + +def check_requirements(): + available = dj.list_codecs() + missing = [c for c in REQUIRED_CODECS if c not in available] + + if missing: + raise ImportError( + f"Missing required codecs: {missing}\n" + f"Install with: pip install dj-zarr-codecs" + ) + +check_requirements() +``` + +## See Also + +- [Use Object Storage](use-object-storage.md) — Object storage configuration +- [Create Custom Codecs](create-custom-codec.md) — Build your own codecs +- [Type System](../reference/specs/type-system.md) — Complete type reference +- [dj-zarr-codecs Repository](https://github.com/datajoint/dj-zarr-codecs) — Example implementation From 81c9f99c8c21b493d20a206ae8eb89868f8044ca Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:09:30 -0600 Subject: [PATCH 3/9] docs: Add dj-photon-codecs to plugin codecs guide Update plugin codecs documentation to include dj-photon-codecs: - Add to DataJoint-maintained codecs list - Include in imaging domain examples - Reference in See Also section dj-photon-codecs provides Anscombe transformation + Zarr compression for photon-limited imaging data (calcium imaging, low-light microscopy). 
--- src/how-to/use-plugin-codecs.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md index 6820915c..aa3d626e 100644 --- a/src/how-to/use-plugin-codecs.md +++ b/src/how-to/use-plugin-codecs.md @@ -137,8 +137,9 @@ This organization makes external storage browsable and self-documenting. ### DataJoint-Maintained Codecs -- **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — Zarr array storage -- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Anscombe variance stabilization for imaging +- **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — Zarr array storage for general numpy arrays +- **[dj-photon-codecs](https://github.com/datajoint/dj-photon-codecs)** — Photon-limited movies with Anscombe transformation and compression +- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Anscombe variance stabilization (Zarr/Numcodecs integration) ### Community Codecs @@ -158,6 +159,7 @@ Or browse GitHub: https://github.com/topics/datajoint - Connectivity matrices **Imaging:** +- Photon-limited movies (calcium imaging, low-light microscopy) - OME-TIFF, OME-ZARR - DICOM medical images - Point cloud data @@ -327,4 +329,5 @@ check_requirements() - [Use Object Storage](use-object-storage.md) — Object storage configuration - [Create Custom Codecs](create-custom-codec.md) — Build your own codecs - [Type System](../reference/specs/type-system.md) — Complete type reference -- [dj-zarr-codecs Repository](https://github.com/datajoint/dj-zarr-codecs) — Example implementation +- [dj-zarr-codecs Repository](https://github.com/datajoint/dj-zarr-codecs) — General Zarr array storage +- [dj-photon-codecs Repository](https://github.com/datajoint/dj-photon-codecs) — Photon-limited movies with compression From c0802b78a415f2220a5e6395cb3949946a722e34 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 
12:10:02 -0600 Subject: [PATCH 4/9] docs: Reference plugin codecs in custom codecs explanation Add 'Before Creating Your Own' section to custom-codecs.md that directs readers to check existing plugin codecs (dj-zarr-codecs, dj-photon-codecs, anscombe-transform) before implementing their own. Encourages reuse and ensures users are aware of existing solutions. --- src/explanation/custom-codecs.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/src/explanation/custom-codecs.md b/src/explanation/custom-codecs.md index 44acf9f8..296b890b 100644 --- a/src/explanation/custom-codecs.md +++ b/src/explanation/custom-codecs.md @@ -334,3 +334,13 @@ Custom codecs enable: The codec system makes DataJoint extensible to any scientific domain without modifying the core framework. + +## Before Creating Your Own + +Check for existing plugin codecs that may already solve your needs: + +- **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — General numpy arrays with Zarr storage +- **[dj-photon-codecs](https://github.com/datajoint/dj-photon-codecs)** — Photon-limited movies with Anscombe transformation +- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Variance stabilization for imaging + +See the [Use Plugin Codecs](../how-to/use-plugin-codecs.md) guide for installation and usage of existing codec packages. Creating a custom codec is straightforward, but reusing existing ones saves time and ensures compatibility. From cfa05d166c3412f842fe813c578d2a6c0a312aaa Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:16:57 -0600 Subject: [PATCH 5/9] docs: Remove anscombe-transform from DataJoint plugin codecs list anscombe-transform is a Zarr/Numcodecs codec (not a DataJoint codec). It doesn't have a datajoint.codecs entry point - it's a dependency used by dj-photon-codecs, not a standalone DataJoint plugin codec. 
Removed from: - DataJoint-maintained codecs list in use-plugin-codecs.md - Before Creating Your Own section in custom-codecs.md --- src/explanation/custom-codecs.md | 3 +-- src/how-to/use-plugin-codecs.md | 1 - 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/src/explanation/custom-codecs.md b/src/explanation/custom-codecs.md index 296b890b..5fefbc28 100644 --- a/src/explanation/custom-codecs.md +++ b/src/explanation/custom-codecs.md @@ -340,7 +340,6 @@ modifying the core framework. Check for existing plugin codecs that may already solve your needs: - **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — General numpy arrays with Zarr storage -- **[dj-photon-codecs](https://github.com/datajoint/dj-photon-codecs)** — Photon-limited movies with Anscombe transformation -- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Variance stabilization for imaging +- **[dj-photon-codecs](https://github.com/datajoint/dj-photon-codecs)** — Photon-limited movies with Anscombe transformation and compression See the [Use Plugin Codecs](../how-to/use-plugin-codecs.md) guide for installation and usage of existing codec packages. Creating a custom codec is straightforward, but reusing existing ones saves time and ensures compatibility. diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md index aa3d626e..7bcaa77b 100644 --- a/src/how-to/use-plugin-codecs.md +++ b/src/how-to/use-plugin-codecs.md @@ -139,7 +139,6 @@ This organization makes external storage browsable and self-documenting. 
- **[dj-zarr-codecs](https://github.com/datajoint/dj-zarr-codecs)** — Zarr array storage for general numpy arrays - **[dj-photon-codecs](https://github.com/datajoint/dj-photon-codecs)** — Photon-limited movies with Anscombe transformation and compression -- **[anscombe-transform](https://github.com/datajoint/anscombe-transform)** — Anscombe variance stabilization (Zarr/Numcodecs integration) ### Community Codecs From 0652c85005ad6d3dfb3b65ca2706bd95dcc3e2bc Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:28:49 -0600 Subject: [PATCH 6/9] docs: Add comprehensive versioning and backward compatibility guide Add detailed guidance on versioning plugin codecs for backward compatibility: - Version strategy: package version vs data format version - When to bump versions (breaking vs non-breaking changes) - Implementation patterns for version dispatch - Migration strategies (lazy, explicit, deprecation warnings) - Real-world example with dj-photon-codecs evolution - Testing version compatibility - Semantic versioning guidelines for codec packages Critical for maintaining data accessibility as codecs evolve. --- src/how-to/use-plugin-codecs.md | 258 ++++++++++++++++++++++++++++++++ 1 file changed, 258 insertions(+) diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md index 7bcaa77b..4c5bb291 100644 --- a/src/how-to/use-plugin-codecs.md +++ b/src/how-to/use-plugin-codecs.md @@ -239,6 +239,264 @@ Upgrade DataJoint if needed: pip install --upgrade datajoint ``` +## Versioning and Backward Compatibility + +Plugin codecs evolve over time. Following versioning best practices ensures your data remains accessible across codec updates. + +### Version Strategy + +**Two version numbers matter:** + +1. **Package version** (semantic versioning: `0.1.0`, `1.0.0`, `2.0.0`) + - For codec package releases + - Follows standard semantic versioning + +2. 
**Data format version** (stored with each encoded value) + - Tracks storage format changes + - Enables decode() to handle multiple formats + +### Implementing Versioning + +**Include version in encoded metadata:** + +```python +def encode(self, value, *, key=None, store_name=None): + # ... encoding logic ... + + return { + "path": path, + "store": store_name, + "codec_version": "1.0", # Data format version + "shape": list(value.shape), + "dtype": str(value.dtype), + } +``` + +**Handle multiple versions in decode:** + +```python +def decode(self, stored, *, key=None): + version = stored.get("codec_version", "1.0") # Default for old data + + if version == "2.0": + return self._decode_v2(stored) + elif version == "1.0": + return self._decode_v1(stored) + else: + raise DataJointError(f"Unsupported codec version: {version}") +``` + +### When to Bump Versions + +**Bump data format version when:** +- ✅ Changing storage structure or encoding algorithm +- ✅ Modifying metadata schema +- ✅ Changing compression parameters that affect decode + +**Don't bump for:** +- ❌ Bug fixes that don't affect stored data format +- ❌ Performance improvements to encode/decode logic +- ❌ Adding new optional features (store version in attributes instead) + +### Backward Compatibility Patterns + +**Pattern 1: Version dispatch in decode()** + +```python +class MyCodec(SchemaCodec): + name = "mycodec" + CURRENT_VERSION = "2.0" + + def encode(self, value, *, key=None, store_name=None): + # Always encode with current version + metadata = { + "codec_version": self.CURRENT_VERSION, + # ... other metadata ... + } + return metadata + + def decode(self, stored, *, key=None): + version = stored.get("codec_version", "1.0") + + if version == "2.0": + # Current version - optimized path + return self._decode_current(stored) + elif version == "1.0": + # Legacy version - compatibility path + return self._decode_legacy_v1(stored) + else: + raise DataJointError( + f"Cannot decode {self.name} version {version}. 
" + f"Upgrade codec package or migrate data." + ) +``` + +**Pattern 2: Zarr attributes for feature versions** + +For codecs using Zarr (like dj-zarr-codecs, dj-photon-codecs): + +```python +def encode(self, value, *, key=None, store_name=None): + # ... write to Zarr ... + + z = zarr.open(store_map, mode="r+") + z.attrs["codec_version"] = "2.0" + z.attrs["codec_name"] = self.name + z.attrs["feature_flags"] = ["compression", "chunking"] + + return { + "path": path, + "store": store_name, + "codec_version": "2.0", # Also in DB for quick access + } + +def decode(self, stored, *, key=None): + z = zarr.open(store_map, mode="r") + version = z.attrs.get("codec_version", "1.0") + + # Handle version-specific decoding + if version == "2.0": + return z # Return Zarr array directly + else: + return self._migrate_v1_to_v2(z) +``` + +### Migration Strategies + +**Strategy 1: Lazy migration (recommended)** + +Old data is migrated when accessed: + +```python +def decode(self, stored, *, key=None): + version = stored.get("codec_version", "1.0") + + if version == "1.0": + # Decode old format + data = self._decode_v1(stored) + + # Optionally: re-encode to new format in background + # (requires database write access) + return data + + return self._decode_current(stored) +``` + +**Strategy 2: Explicit migration script** + +For breaking changes, provide migration tools: + +```python +# migration_tool.py +def migrate_table_to_v2(table, field_name): + """Migrate all rows to codec version 2.0.""" + for key in table.fetch("KEY"): + # Fetch with old codec + data = (table & key).fetch1(field_name) + + # Re-insert with new codec (triggers encode) + table.update1({**key, field_name: data}) +``` + +**Strategy 3: Deprecation warnings** + +```python +def decode(self, stored, *, key=None): + version = stored.get("codec_version", "1.0") + + if version == "1.0": + import warnings + warnings.warn( + f"Reading {self.name} v1.0 data. Support will be removed in v3.0. 
" + f"Please migrate: pip install {self.name}-migrate && migrate-data", + DeprecationWarning + ) + return self._decode_v1(stored) +``` + +### Real-World Example: dj-photon-codecs Evolution + +**Version 1.0** (current): +- Stores Anscombe-transformed data +- Fixed compression (Blosc zstd level 5) +- Fixed chunking (100 frames) + +**Hypothetical Version 2.0** (backward compatible): +```python +def encode(self, value, *, key=None, store_name=None): + # New: configurable compression + compression_level = getattr(self, 'compression_level', 5) + + zarr.save_array( + store_map, + transformed, + compressor=zarr.Blosc(cname="zstd", clevel=compression_level), + ) + + z = zarr.open(store_map, mode="r+") + z.attrs["codec_version"] = "2.0" + z.attrs["compression_level"] = compression_level + + return { + "path": path, + "codec_version": "2.0", # <-- NEW + # ... rest same ... + } + +def decode(self, stored, *, key=None): + z = zarr.open(store_map, mode="r") + version = z.attrs.get("codec_version", "1.0") + + # Both versions return zarr.Array - fully compatible! 
+ if version in ("1.0", "2.0"): + return z + else: + raise DataJointError(f"Unsupported version: {version}") +``` + +### Testing Version Compatibility + +Include tests for version compatibility: + +```python +def test_decode_v1_data(): + """Ensure new codec can read old data.""" + # Load fixture with v1.0 data + old_data = load_v1_fixture() + + # Decode with current codec + codec = PhotonCodec() + result = codec.decode(old_data) + + assert result.shape == (1000, 512, 512) + assert result.dtype == np.float64 +``` + +### Package Version Guidelines + +Follow semantic versioning for codec packages: + +- **Patch (0.1.0 → 0.1.1)**: Bug fixes, no data format changes +- **Minor (0.1.0 → 0.2.0)**: New features, backward compatible +- **Major (0.1.0 → 1.0.0)**: Breaking changes (may require migration) + +**Example changelog:** + +``` +v2.0.0 (2026-02-01) - BREAKING + - Changed default compression from zstd-5 to zstd-3 + - Data format v2.0 (can still read v1.0) + - Migration guide: docs/migration-v2.md + +v1.1.0 (2026-01-15) + - Added configurable chunk sizes (backward compatible) + - Data format still v1.0 + +v1.0.1 (2026-01-10) + - Fixed edge case in Anscombe inverse transform + - Data format unchanged (v1.0) +``` + ## Creating Your Own Codecs If you need a codec that doesn't exist yet, see: From 656bceda9f4a15f4b4b6f7eea7a75f338ccfb73c Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:37:01 -0600 Subject: [PATCH 7/9] docs: Clarify built-in vs plugin codec versioning Add section explaining why built-in codecs don't need explicit versioning: - Built-in codecs versioned with DataJoint releases - Plugin codecs have independent lifecycles and need codec_version - DataJoint's semantic versioning handles built-in codec evolution - Plugin versioning protects against independent evolution Key distinction: Built-in codecs are part of DataJoint's API surface (versioned by framework), while plugin codecs are independent packages (need self-versioning). 
--- src/how-to/use-plugin-codecs.md | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md index 4c5bb291..3bd5ce54 100644 --- a/src/how-to/use-plugin-codecs.md +++ b/src/how-to/use-plugin-codecs.md @@ -243,9 +243,28 @@ pip install --upgrade datajoint Plugin codecs evolve over time. Following versioning best practices ensures your data remains accessible across codec updates. +### Built-in vs Plugin Codec Versioning + +**Built-in codecs** (``, ``, ``, etc.) are versioned with DataJoint: +- ✅ Shipped with datajoint-python +- ✅ Versioned by DataJoint release (2.0.0, 2.1.0, 3.0.0) +- ✅ Upgraded when you upgrade DataJoint +- ✅ Stability guaranteed by DataJoint's semantic versioning +- ❌ **No explicit codec_version field needed** - DataJoint version is the codec version + +**Plugin codecs** (dj-zarr-codecs, dj-photon-codecs, etc.) have independent lifecycles: +- ✅ Installed separately from DataJoint +- ✅ Independent version numbers (0.1.0 → 1.0.0 → 2.0.0) +- ✅ Users choose when to upgrade +- ✅ **Must include explicit codec_version field** for backward compatibility + +**Why the difference?** + +Plugin codecs evolve independently and need to handle data encoded by different plugin versions. Built-in codecs are part of DataJoint's API surface and evolve with the framework itself. When you upgrade DataJoint 2.0 → 3.0, you expect potential breaking changes. When you upgrade a plugin 1.0 → 2.0 while keeping DataJoint 2.0, backward compatibility is critical. + ### Version Strategy -**Two version numbers matter:** +**Two version numbers matter for plugin codecs:** 1. 
**Package version** (semantic versioning: `0.1.0`, `1.0.0`, `2.0.0`) - For codec package releases From 78047a97c4351185e405e01df20a358e0292a47d Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:42:38 -0600 Subject: [PATCH 8/9] docs: Add detailed serialization format documentation Add comprehensive documentation of DataJoint's custom blob serialization: Explanation docs (type-system.md): - Protocol headers (mYm for MATLAB compat, dj0 for Python-extended) - Optional zlib compression for data > 1KB - Type-specific encoding with serialization codes - Version detection via embedded protocol headers - Supported types list - Storage modes ( vs ) Reference docs (type-system.md): - Detailed type code mapping for all supported Python types - Protocol header format (mYm\0, dj0\0) - Version detection mechanism - MD5 deduplication for Clarifies that does NOT use pickle - it uses DataJoint's custom binary format with intrinsic versioning via protocol headers. --- src/explanation/type-system.md | 19 ++++++++++++++++- src/reference/specs/type-system.md | 34 +++++++++++++++++++++++++++--- 2 files changed, 49 insertions(+), 4 deletions(-) diff --git a/src/explanation/type-system.md b/src/explanation/type-system.md index 4e2c4bb2..74b76d76 100644 --- a/src/explanation/type-system.md +++ b/src/explanation/type-system.md @@ -111,7 +111,20 @@ Codecs provide `encode()`/`decode()` semantics for complex Python objects. ### `` — Serialized Python Objects -Stores NumPy arrays, dicts, lists, and other Python objects. +Stores NumPy arrays, dicts, lists, and other Python objects using DataJoint's custom binary serialization format. 
+ +**Serialization format:** +- **Protocol headers**: `mYm` (MATLAB-compatible) or `dj0` (Python-extended) +- **Optional compression**: zlib compression for data > 1KB +- **Type-specific encoding**: Each Python type has a specific serialization code +- **Version detection**: Protocol header embedded in blob enables format detection + +**Supported types:** +- NumPy arrays (numeric, structured, recarrays) +- Collections (dict, list, tuple, set) +- Scalars (int, float, bool, complex, str, bytes) +- Date/time objects (datetime, date, time) +- UUID, Decimal ```python class Results(dj.Computed): @@ -124,6 +137,10 @@ class Results(dj.Computed): """ ``` +**Storage modes:** +- `` — Stored in database as LONGBLOB (up to ~1GB depending on MySQL config) +- `` — Stored externally via `` with MD5 deduplication + ### `` — File Attachments Stores files with filename preserved. diff --git a/src/reference/specs/type-system.md b/src/reference/specs/type-system.md index 80c6bd05..cb3d483a 100644 --- a/src/reference/specs/type-system.md +++ b/src/reference/specs/type-system.md @@ -498,11 +498,39 @@ The `json` database type: **Supports both internal and external storage.** -Serializes Python objects (NumPy arrays, dicts, lists, etc.) using DataJoint's -blob format. Compatible with MATLAB. +Serializes Python objects using DataJoint's custom binary serialization format. The format uses protocol headers and type-specific encoding to serialize complex Python objects efficiently. + +**Serialization format:** + +- **Protocol headers**: + - `mYm` — Original MATLAB-compatible format for numeric arrays, structs, cells + - `dj0` — Extended format supporting Python-specific types (UUID, Decimal, datetime, etc.) 
+- **Compression**: Automatic zlib compression for data > 1KB +- **Type codes**: Each Python type has a specific serialization code: + - `'A'` — NumPy arrays (numeric) + - `'F'` — NumPy recarrays (structured arrays with fields) + - `'\x01'` — Tuples + - `'\x02'` — Lists + - `'\x03'` — Sets + - `'\x04'` — Dicts + - `'\x05'` — Strings (UTF-8) + - `'\x06'` — Bytes + - `'\x0a'` — Unbounded integers + - `'\x0b'` — Booleans + - `'\x0c'` — Complex numbers + - `'\x0d'` — Floats + - `'d'` — Decimal + - `'t'` — Datetime/date/time + - `'u'` — UUID + - `'S'` — MATLAB structs + - `'C'` — MATLAB cell arrays + +**Version detection**: The protocol header (`mYm\0` or `dj0\0`) is embedded at the start of the blob, enabling automatic format detection and backward compatibility. + +**Storage modes:** - **``**: Stored in database (`bytes` → `LONGBLOB`/`BYTEA`) -- **``**: Stored externally via `` with deduplication +- **``**: Stored externally via `` with MD5 deduplication - **``**: Stored in specific named store ```python From b341385e0aa5d49532f08ec775ad4812417f7b94 Mon Sep 17 00:00:00 2001 From: Dimitri Yatsenko Date: Fri, 16 Jan 2026 12:43:32 -0600 Subject: [PATCH 9/9] docs: Add mYm references and intrinsic versioning explanation Add references to mYm format documentation: - MATLAB FileExchange: https://www.mathworks.com/matlabcentral/fileexchange/81208-mym - GitHub repository: https://github.com/datajoint/mym Add intrinsic versioning explanation to plugin codecs guide: - How built-in codecs embed version in data format - Protocol headers in (mYm\0, dj0\0) - NumPy format version in headers - Self-describing structure in - Why built-in codecs don't need explicit codec_version field Clarifies the distinction between built-in codecs (intrinsic versioning) and plugin codecs (explicit codec_version field). 
---
 src/explanation/type-system.md     |  4 +++-
 src/how-to/use-plugin-codecs.md    | 11 +++++++++++
 src/reference/specs/type-system.md |  2 +-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/explanation/type-system.md b/src/explanation/type-system.md
index 74b76d76..3b84dbd5 100644
--- a/src/explanation/type-system.md
+++ b/src/explanation/type-system.md
@@ -114,7 +114,9 @@ Codecs provide `encode()`/`decode()` semantics for complex Python objects.
 Stores NumPy arrays, dicts, lists, and other Python objects using DataJoint's custom binary serialization format.

 **Serialization format:**
-- **Protocol headers**: `mYm` (MATLAB-compatible) or `dj0` (Python-extended)
+- **Protocol headers**:
+  - `mYm` — MATLAB-compatible format (see [mYm on MATLAB FileExchange](https://www.mathworks.com/matlabcentral/fileexchange/81208-mym) and [mym on GitHub](https://github.com/datajoint/mym))
+  - `dj0` — Python-extended format supporting additional types
 - **Optional compression**: zlib compression for data > 1KB
 - **Type-specific encoding**: Each Python type has a specific serialization code
 - **Version detection**: Protocol header embedded in blob enables format detection
diff --git a/src/how-to/use-plugin-codecs.md b/src/how-to/use-plugin-codecs.md
index 3bd5ce54..d59fa45d 100644
--- a/src/how-to/use-plugin-codecs.md
+++ b/src/how-to/use-plugin-codecs.md
@@ -262,6 +262,17 @@ Plugin codecs evolve over time. Following versioning best practices ensures your

 Plugin codecs evolve independently and need to handle data encoded by different plugin versions. Built-in codecs are part of DataJoint's API surface and evolve with the framework itself. When you upgrade DataJoint 2.0 → 3.0, you expect potential breaking changes. When you upgrade a plugin 1.0 → 2.0 while keeping DataJoint 2.0, backward compatibility is critical.

+**How built-in codecs handle versioning:**
+
+Built-in formats have **intrinsic versioning** - the format version is embedded in the data itself:
+
+- `` — Protocol header (`mYm\0` or `dj0\0`) at start of blob
+- `` — NumPy format version in `.npy` file header
+- `` — Self-describing directory structure
+- `` — Filename + content (format-agnostic)
+
+When DataJoint needs to change a built-in codec's format, it can detect the old format from the embedded version information and handle migration transparently. This is why built-in codecs don't need an explicit `codec_version` field in database metadata.
+
 ### Version Strategy

 **Two version numbers matter for plugin codecs:**
diff --git a/src/reference/specs/type-system.md b/src/reference/specs/type-system.md
index cb3d483a..0efa0e71 100644
--- a/src/reference/specs/type-system.md
+++ b/src/reference/specs/type-system.md
@@ -503,7 +503,7 @@ Serializes Python objects using DataJoint's custom binary serialization format.
 **Serialization format:**

 - **Protocol headers**:
-  - `mYm` — Original MATLAB-compatible format for numeric arrays, structs, cells
+  - `mYm` — Original MATLAB-compatible format for numeric arrays, structs, cells (see [mYm on MATLAB FileExchange](https://www.mathworks.com/matlabcentral/fileexchange/81208-mym) and [mym on GitHub](https://github.com/datajoint/mym))
   - `dj0` — Extended format supporting Python-specific types (UUID, Decimal, datetime, etc.)
 - **Compression**: Automatic zlib compression for data > 1KB
 - **Type codes**: Each Python type has a specific serialization code:
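
Note on the intrinsic versioning described in this series: the docs state that the protocol header (`mYm\0` or `dj0\0`) sits at the start of the blob and enables format detection. A minimal sketch of that idea follows. The names `PROTOCOL_HEADERS` and `detect_protocol` are hypothetical and are not part of the DataJoint API, and the sketch assumes an uncompressed payload, since the patch does not say how zlib compression wraps the header.

```python
# Hypothetical illustration only; not DataJoint's implementation.
# Header bytes are taken from the documentation text in this patch.
PROTOCOL_HEADERS = {
    b"mYm\0": "mYm",  # MATLAB-compatible format
    b"dj0\0": "dj0",  # Python-extended format
}


def detect_protocol(blob: bytes) -> str:
    """Return the protocol name whose header the blob starts with.

    Assumes the blob is not compressed (an assumption of this sketch).
    """
    for header, name in PROTOCOL_HEADERS.items():
        if blob.startswith(header):
            return name
    raise ValueError("blob does not start with a known protocol header")


if __name__ == "__main__":
    print(detect_protocol(b"dj0\0example-payload"))
```

Because the format is identified from the data itself, a reader along these lines would not need a separate `codec_version` column, which is the distinction the plugin codecs guide draws.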