Navigate DataJoint's object storage documentation to find what you need.
I want to...
| Task | Guide | Est. Time |
|---|---|---|
| ✅ Decide which storage type to use | Choose a Storage Type | 5-10 min |
| ✅ Set up S3, MinIO, or file storage | Configure Object Storage | 10-15 min |
| ✅ Store and retrieve large data | Use Object Storage | 15-20 min |
| ✅ Work with NumPy arrays efficiently | Use NPY Codec | 10 min |
| ✅ Create domain-specific types | Create Custom Codec | 30-45 min |
| ✅ Optimize storage performance | Manage Large Data | 20 min |
| ✅ Clean up unused data | Garbage Collection | 10 min |
Why does DataJoint have object storage?
Traditional databases excel at structured, relational data but struggle with large arrays, files, and streaming data. DataJoint's Object-Augmented Schema (OAS) unifies relational tables with object storage into a single coherent system:
- Relational database: Metadata, keys, relationships (structured data < 1 MB)
- Object storage: Arrays, files, datasets (large data > 1 MB)
- Full referential integrity: Maintained across both layers
Read: Object-Augmented Schemas for a conceptual overview.
- **What:** Data stored directly in a database column
- **When:** Small objects < 1 MB (JSON, thumbnails, small arrays)
- **Why:** Fast access, transactional consistency, no store setup

```
metadata : <blob>  # stored in MySQL
```

Guide: Use Object Storage
- **What:** DataJoint-managed storage in S3, file systems, or cloud storage
- **When:** Large data (arrays, files, datasets) needing lifecycle management
- **Why:** Deduplication, garbage collection, transaction safety, referential integrity

Two addressing schemes:

- Content-based paths (MD5 hash)
    - Automatic deduplication
    - Best for: write-once data, attachments

```
waveform : <blob@>    # hash: _hash/{schema}/{hash}
document : <attach@>  # hash: _hash/{schema}/{hash}
```

- Key-based paths (browsable)
    - Streaming access, partial reads
    - Best for: Zarr, HDF5, large arrays

```
traces : <npy@>     # schema: _schema/{schema}/{table}/{key}/
volume : <object@>  # schema: _schema/{schema}/{table}/{key}/
```

Guides:

- Choose a Storage Type — Decision criteria
- Use Object Storage — How to use codecs
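The two path templates above can be sketched in plain Python to show why content addressing deduplicates while key addressing stays browsable. This is a minimal illustration, not DataJoint's internal implementation; the `key_part` encoding in `schema_path` is hypothetical.

```python
import hashlib

def hash_path(schema: str, payload: bytes) -> str:
    """Content-based path: identical payloads map to the same object."""
    return f"_hash/{schema}/{hashlib.md5(payload).hexdigest()}"

def schema_path(schema: str, table: str, key: dict) -> str:
    """Key-based path: predictable, browsable location per table row."""
    key_part = "/".join(f"{k}={v}" for k, v in key.items())  # hypothetical encoding
    return f"_schema/{schema}/{table}/{key_part}/"

# Two inserts of identical bytes land at one hash-addressed object:
p1 = hash_path("ephys", b"waveform-bytes")
p2 = hash_path("ephys", b"waveform-bytes")
assert p1 == p2  # deduplicated

# A key-addressed object has a stable, human-readable location:
p3 = schema_path("ephys", "Recording", {"subject": 12, "session": 3})
```

Deduplication falls out of the hashing: the store never holds two copies of the same bytes, while key-based paths trade that away for predictable, streamable locations.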
- **What:** User-managed file paths (DataJoint stores the path string only)
- **When:** Existing data archives, externally managed files
- **Why:** No file lifecycle management, no deduplication, user controls paths

```
raw_data : <filepath@>  # user-managed path
```

Guide: Use Object Storage
- Choose a Storage Type — Start here
    - Quick decision tree (5 minutes)
    - Size guidelines (< 1 MB, 1-100 MB, > 100 MB)
    - Access pattern considerations
    - Lifecycle management options
- Configure Object Storage — Setup
    - File system, S3, MinIO configuration
    - Single vs multiple stores
    - Credentials management
    - Store verification
- Use Object Storage — Basic usage
    - Insert/fetch patterns
    - In-table vs object store
    - Addressing schemes (hash vs schema)
    - `ObjectRef` for lazy access
- Use NPY Codec — NumPy arrays
    - Lazy loading (doesn't load until accessed)
    - Efficient slicing (fetch subsets)
    - Shape/dtype metadata
    - When to use `<npy@>` vs `<blob@>`
- Manage Large Data — Optimization
    - Storage tiers (hot/warm/cold)
    - Compression strategies
    - Batch operations
    - Performance tuning
- Garbage Collection — Cleanup
    - Automatic cleanup for integrated storage
    - Manual cleanup for filepath references
    - Orphan detection
    - Recovery procedures
- Create Custom Codec — Extensibility
    - Domain-specific types
    - Codec API (encode/decode)
    - `HashCodec` vs `SchemaCodec` patterns
    - Integration with existing formats
For implementation details and specifications:
- Type System Spec — Three-layer architecture
- Codec API Spec — Custom codec interface
- NPY Codec Spec — NumPy array storage
- Object Store Configuration Spec — Store config details
- Type System — Conceptual overview
- Data Pipelines (OAS section) — Why OAS exists
- Custom Codecs — Design patterns
- Configure Object Storage — Set up store
- Choose a Storage Type — Select codec
- Update table definitions with the `@` modifier
- Use Object Storage — Insert/fetch patterns
Estimate: 30-60 minutes
- Choose a Storage Type — Determine new codec
- Add new column with object storage codec
- Migrate data (see Use Object Storage)
- Verify data integrity
- Drop old column (see Alter Tables)
Estimate: 1-2 hours for small datasets
- Use `<object@>` or `<npy@>` (not `<blob@>`)
- Configure Object Storage — Ensure adequate storage
- For Zarr: Store as `<object@>` with the `.zarr` extension
- For streaming: Use `ObjectRef.fsmap` (see Use Object Storage)
Key advantage: No need to download full dataset into memory
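The partial-read idea can be demonstrated with a standard-library sketch: `mmap` over a local file stands in for a ranged read against an object store, pulling in only the requested byte range rather than the whole object. (The real streaming path goes through `ObjectRef.fsmap`; this is only an analogy.)

```python
import mmap
import os
import tempfile

# Create a ~1 MB sample file standing in for a large stored object.
path = os.path.join(tempfile.mkdtemp(), "traces.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 4096)

# Map the file and read a 10-byte slice; only the touched pages are loaded,
# so memory use stays flat no matter how large the object is.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    chunk = mm[1000:1010]

print(len(chunk))  # prints 10
```

The same access pattern over S3 is what makes `<object@>` and `<npy@>` suitable for Zarr and large arrays: readers request byte ranges, not whole downloads.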
- Read Custom Codecs — Understand patterns
- Create Custom Codec — Implementation guide
- Codec API Spec — Technical reference
- Test with small dataset
- Deploy to production
Estimate: 2-4 hours for simple codecs
```
Is data < 1 MB per row?
├─ YES → <blob> (in-table)
└─ NO → Continue...

Is data managed externally?
├─ YES → <filepath@> (user-managed reference)
└─ NO → Continue...

Need streaming or partial reads?
├─ YES → <object@> or <npy@> (schema-addressed)
└─ NO → <blob@> (hash-addressed, full download)
```
Full guide: Choose a Storage Type
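The tree above maps directly onto a small helper. A sketch only: the function name and the exact byte threshold for "1 MB" are illustrative, not part of DataJoint's API.

```python
def choose_storage(size_bytes: int, managed_externally: bool, needs_streaming: bool) -> str:
    """Walk the decision tree: in-table, filepath, schema-addressed, or hash-addressed."""
    if size_bytes < 1_000_000:          # < 1 MB per row
        return "<blob>"                 # in-table
    if managed_externally:
        return "<filepath@>"            # user-managed reference
    if needs_streaming:
        return "<object@> or <npy@>"    # schema-addressed
    return "<blob@>"                    # hash-addressed, full download

print(choose_storage(10_000, False, False))      # <blob>
print(choose_storage(500_000_000, False, True))  # <object@> or <npy@>
```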
```
NumPy arrays that benefit from lazy loading?
├─ YES → <npy@>
└─ NO → Continue...

Large files (> 100 MB) needing streaming?
├─ YES → <object@>
└─ NO → Continue...

Write-once data with potential duplicates?
├─ YES → <blob@> (deduplication via hashing)
└─ NO → <blob@> or <object@> (choose based on access pattern)
```
Full guide: Choose a Storage Type
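Likewise, the codec tree above as code. The function name and the byte threshold for "100 MB" are illustrative assumptions:

```python
def choose_codec(is_numpy_array: bool, size_bytes: int, write_once_duplicates: bool) -> str:
    """Pick a codec for data already destined for object storage."""
    if is_numpy_array:
        return "<npy@>"                  # lazy loading, efficient slicing
    if size_bytes > 100_000_000:         # > 100 MB
        return "<object@>"               # streaming access
    if write_once_duplicates:
        return "<blob@>"                 # deduplication via hashing
    return "<blob@> or <object@>"        # choose based on access pattern
```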
| Problem | Likely Cause | Solution Guide |
|---|---|---|
| "Store not configured" | Missing stores config | Configure Object Storage |
| Out of memory loading array | Using `<blob@>` for huge data | Choose a Storage Type → use `<object@>` |
| Slow fetches | Wrong codec choice | Manage Large Data |
| Data not deduplicated | Using wrong codec | Choose a Storage Type |
| Path conflicts with reserved prefixes | `<filepath@>` using `_hash/` or `_schema/` | Use Object Storage |
| Missing files after delete | Expected behavior for integrated storage | Garbage Collection |
- Check FAQ for common questions
- Search GitHub Discussions
- Review specification for exact behavior
- Type System — Three-layer type architecture
- Data Pipelines — Object-Augmented Schemas
- Manage Secrets — Credentials for S3/cloud storage
- Define Tables — Table definition syntax
- Insert Data — Data insertion patterns
- Object Storage Tutorial — Hands-on learning
- Custom Codecs Tutorial — Build your own codec