Navigate DataJoint's object storage documentation to find what you need.
I want to...
| Task | Guide | Est. Time |
|---|---|---|
| ✅ Decide which storage type to use | Choose a Storage Type | 5-10 min |
| ✅ Set up S3, MinIO, or file storage | Configure Object Storage | 10-15 min |
| ✅ Store and retrieve large data | Use Object Storage | 15-20 min |
| ✅ Work with NumPy arrays efficiently | Use NPY Codec | 10 min |
| ✅ Create domain-specific types | Create Custom Codec | 30-45 min |
| ✅ Optimize storage performance | Manage Large Data | 20 min |
| ✅ Clean up unused data | Garbage Collection | 10 min |
Why does DataJoint have object storage?
Traditional databases excel at structured, relational data but struggle with large arrays, files, and streaming data. DataJoint's Object-Augmented Schema (OAS) unifies relational tables with object storage into a single coherent system:
- Relational database: Metadata, keys, relationships (structured data < 1 MB)
- Object storage: Arrays, files, datasets (large data > 1 MB)
- Full referential integrity: Maintained across both layers
Read: Object-Augmented Schemas for a conceptual overview.
- **What:** Data stored directly in a database column
- **When:** Small objects < 1 MB (JSON, thumbnails, small arrays)
- **Why:** Fast access, transactional consistency, no store setup

```
metadata : <blob>  # stored in MySQL
```

Guide: Use Object Storage
- **What:** DataJoint-managed storage in S3, file systems, or cloud storage
- **When:** Large data (arrays, files, datasets) needing lifecycle management
- **Why:** Deduplication, garbage collection, transaction safety, referential integrity

Two addressing schemes:

- Content-based paths (MD5 hash)
    - Automatic deduplication
    - Best for: write-once data, attachments

```
waveform : <blob@>    # hash: _hash/{schema}/{hash}
document : <attach@>  # hash: _hash/{schema}/{hash}
```

- Key-based paths (browsable)
    - Streaming access, partial reads
    - Best for: Zarr, HDF5, large arrays

```
traces : <npy@>     # schema: _schema/{schema}/{table}/{key}/
volume : <object@>  # schema: _schema/{schema}/{table}/{key}/
```

Guides:

- Choose a Storage Type — Decision criteria
- Use Object Storage — How to use codecs
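The two path templates above can be sketched in plain Python to show why content addressing deduplicates while key addressing stays browsable. This is a minimal illustration, not DataJoint's internal implementation; the `key_part` encoding in `schema_path` is hypothetical.

```python
import hashlib

def hash_path(schema: str, payload: bytes) -> str:
    """Content-based path: identical payloads map to the same object."""
    return f"_hash/{schema}/{hashlib.md5(payload).hexdigest()}"

def schema_path(schema: str, table: str, key: dict) -> str:
    """Key-based path: predictable, browsable location per table row."""
    key_part = "/".join(f"{k}={v}" for k, v in key.items())  # hypothetical encoding
    return f"_schema/{schema}/{table}/{key_part}/"

# Two inserts of identical bytes land at one hash-addressed object:
p1 = hash_path("ephys", b"waveform-bytes")
p2 = hash_path("ephys", b"waveform-bytes")
assert p1 == p2  # deduplicated

# A key-addressed object has a stable, human-readable location:
p3 = schema_path("ephys", "Recording", {"subject": 12, "session": 3})
```

Deduplication falls out of the hashing: the store never holds two copies of the same bytes, while key-based paths trade that away for predictable, streamable locations.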
- **What:** User-managed file paths (DataJoint stores the path string only)
- **When:** Existing data archives, externally managed files
- **Why:** No file lifecycle management, no deduplication, user controls paths

```
raw_data : <filepath@>  # user-managed path
```

Guide: Use Object Storage
- Choose a Storage Type — Start here
    - Quick decision tree (5 minutes)
    - Size guidelines (< 1 MB, 1-100 MB, > 100 MB)
    - Access pattern considerations
    - Lifecycle management options
- Configure Object Storage — Setup
    - File system, S3, MinIO configuration
    - Single vs multiple stores
    - Credentials management
    - Store verification
- Use Object Storage — Basic usage
    - Insert/fetch patterns
    - In-table vs object store
    - Addressing schemes (hash vs schema)
    - `ObjectRef` for lazy access
- Use NPY Codec — NumPy arrays
    - Lazy loading (doesn't load until accessed)
    - Efficient slicing (fetch subsets)
    - Shape/dtype metadata
    - When to use `<npy@>` vs `<blob@>`
- Manage Large Data — Optimization
    - Storage tiers (hot/warm/cold)
    - Compression strategies
    - Batch operations
    - Performance tuning
- Garbage Collection — Cleanup
    - Automatic cleanup for integrated storage
    - Manual cleanup for filepath references
    - Orphan detection
    - Recovery procedures
- Create Custom Codec — Extensibility
    - Domain-specific types
    - Codec API (encode/decode)
    - `HashCodec` vs `SchemaCodec` patterns
    - Integration with existing formats
For implementation details and specifications:
- Type System Spec — Three-layer architecture
- Codec API Spec — Custom codec interface
- NPY Codec Spec — NumPy array storage
- Object Store Configuration Spec — Store config details
- Type System — Conceptual overview
- Data Pipelines (OAS section) — Why OAS exists
- Custom Codecs — Design patterns
- Configure Object Storage — Set up store
- Choose a Storage Type — Select codec
- Update table definitions with the `@` modifier
- Use Object Storage — Insert/fetch patterns
Estimate: 30-60 minutes
- Choose a Storage Type — Determine new codec
- Add new column with object storage codec
- Migrate data (see Use Object Storage)
- Verify data integrity
- Drop old column (see Alter Tables)
Estimate: 1-2 hours for small datasets
- Use `<object@>` or `<npy@>` (not `<blob@>`)
- Configure Object Storage — Ensure adequate storage
- For Zarr: Store as `<object@>` with the `.zarr` extension
- For streaming: Use `ObjectRef.fsmap` (see Use Object Storage)
Key advantage: No need to download full dataset into memory
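The partial-read idea can be demonstrated with a standard-library sketch: `mmap` over a local file stands in for a ranged read against an object store, pulling in only the requested byte range rather than the whole object. (The real streaming path goes through `ObjectRef.fsmap`; this is only an analogy.)

```python
import mmap
import os
import tempfile

# Create a ~1 MB sample file standing in for a large stored object.
path = os.path.join(tempfile.mkdtemp(), "traces.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4096)

# Map the file and read a 10-byte slice; only the touched pages are loaded,
# so memory use stays flat no matter how large the object is.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    chunk = mm[1000:1010]

print(len(chunk))  # prints 10
```

The same access pattern over S3 is what makes `<object@>` and `<npy@>` suitable for Zarr and large arrays: readers request byte ranges, not whole downloads.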
- Read Custom Codecs — Understand patterns
- Create Custom Codec — Implementation guide
- Codec API Spec — Technical reference
- Test with small dataset
- Deploy to production
Estimate: 2-4 hours for simple codecs
```
Is data < 1 MB per row?
├─ YES → <blob> (in-table)
└─ NO → Continue...

Is data managed externally?
├─ YES → <filepath@> (user-managed reference)
└─ NO → Continue...

Need streaming or partial reads?
├─ YES → <object@> or <npy@> (schema-addressed)
└─ NO → <blob@> (hash-addressed, full download)
```
Full guide: Choose a Storage Type
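The tree above maps directly onto a small helper. A sketch only: the function name and the exact byte threshold for "1 MB" are illustrative, not part of DataJoint's API.

```python
def choose_storage(size_bytes: int, managed_externally: bool, needs_streaming: bool) -> str:
    """Walk the decision tree: in-table, filepath, schema-addressed, or hash-addressed."""
    if size_bytes < 1_000_000:          # < 1 MB per row
        return "<blob>"                 # in-table
    if managed_externally:
        return "<filepath@>"            # user-managed reference
    if needs_streaming:
        return "<object@> or <npy@>"    # schema-addressed
    return "<blob@>"                    # hash-addressed, full download

print(choose_storage(10_000, False, False))      # <blob>
print(choose_storage(500_000_000, False, True))  # <object@> or <npy@>
```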
```
NumPy arrays that benefit from lazy loading?
├─ YES → <npy@>
└─ NO → Continue...

Large files (> 100 MB) needing streaming?
├─ YES → <object@>
└─ NO → Continue...

Write-once data with potential duplicates?
├─ YES → <blob@> (deduplication via hashing)
└─ NO → <blob@> or <object@> (choose based on access pattern)
```
Full guide: Choose a Storage Type
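Likewise, the codec tree above as code. The function name and the byte threshold for "100 MB" are illustrative assumptions:

```python
def choose_codec(is_numpy_array: bool, size_bytes: int, write_once_duplicates: bool) -> str:
    """Pick a codec for data already destined for object storage."""
    if is_numpy_array:
        return "<npy@>"                  # lazy loading, efficient slicing
    if size_bytes > 100_000_000:         # > 100 MB
        return "<object@>"               # streaming access
    if write_once_duplicates:
        return "<blob@>"                 # deduplication via hashing
    return "<blob@> or <object@>"        # choose based on access pattern
```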
| Problem | Likely Cause | Solution Guide |
|---|---|---|
| "Store not configured" | Missing stores config | Configure Object Storage |
| Out of memory loading array | Using `<blob@>` for huge data | Choose a Storage Type → use `<object@>` |
| Slow fetches | Wrong codec choice | Manage Large Data |
| Data not deduplicated | Using wrong codec | Choose a Storage Type |
| Path conflicts with reserved prefixes | `<filepath@>` using `_hash/` or `_schema/` | Use Object Storage |
| Missing files after delete | Expected behavior for integrated storage | Garbage Collection |
- Check FAQ for common questions
- Search GitHub Discussions
- Review specification for exact behavior
- Type System — Three-layer type architecture
- Data Pipelines — Object-Augmented Schemas
- Manage Secrets — Credentials for S3/cloud storage
- Define Tables — Table definition syntax
- Insert Data — Data insertion patterns
- Object Storage Tutorial — Hands-on learning
- Custom Codecs Tutorial — Build your own codec