This document specifies data manipulation operations in DataJoint Python: insert, update, and delete. These operations maintain referential integrity across the pipeline while supporting the workflow normalization paradigm.
DataJoint pipelines are designed around insert and delete as the primary data manipulation operations:
- Insert: Add complete entities (rows) to tables
- Delete: Remove entities and all dependent data (cascading)
This design maintains referential integrity at the entity level—each row represents a complete, self-consistent unit of data.
Updates are intentionally limited to the update1() method, which modifies a single row at a time. This is by design:
- Updates bypass the normal workflow
- They can create inconsistencies with derived data
- They should be used sparingly for corrective operations
Appropriate uses of update1():
- Fixing data entry errors
- Correcting metadata after the fact
- Administrative annotations
Inappropriate uses:
- Regular workflow operations
- Batch modifications
- Anything that should trigger recomputation
When source data changes, the correct pattern is:
```python
# 1. Delete the incorrect data (cascades to all derived tables)
(SourceTable & {"key": value}).delete()

# 2. Insert the corrected data
SourceTable.insert1(corrected_row)

# 3. Recompute derived tables
DerivedTable.populate()
```

This ensures all derived data remains consistent with its sources.
Signature:
```python
def insert(
    self,
    rows,
    replace=False,
    skip_duplicates=False,
    ignore_extra_fields=False,
    allow_direct_insert=None,
    chunk_size=None,
)
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `rows` | iterable | — | Data to insert |
| `replace` | bool | `False` | Replace existing rows with matching primary key |
| `skip_duplicates` | bool | `False` | Silently skip duplicate keys |
| `ignore_extra_fields` | bool | `False` | Ignore fields not in the table |
| `allow_direct_insert` | bool | `None` | Allow insert into auto-populated tables |
| `chunk_size` | int | `None` | Insert in batches of this size |
| Format | Example |
|---|---|
| List of dicts | `[{"id": 1, "name": "Alice"}, ...]` |
| pandas DataFrame | `pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})` |
| polars DataFrame | `pl.DataFrame({"id": [1, 2], "name": ["A", "B"]})` |
| numpy structured array | `np.array([(1, "A")], dtype=[("id", int), ("name", "U10")])` |
| QueryExpression | `OtherTable.proj(...)` (INSERT...SELECT) |
| Path to CSV | `Path("data.csv")` |
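A brief sketch of the two formats not shown in the examples below (a numpy structured array and a CSV file path), reusing the Subject schema from those examples; the assumption that the CSV's header row names the table's attributes is mine, not stated in this reference:

```python
import numpy as np
from pathlib import Path

# numpy structured array: field names must match the table's attributes
arr = np.array(
    [(1, "Mouse001", "2024-01-15"), (2, "Mouse002", "2024-01-16")],
    dtype=[("subject_id", int), ("name", "U10"), ("dob", "U10")],
)
Subject.insert(arr)

# path to a CSV file (assumed: header row names the table's attributes)
Subject.insert(Path("subjects.csv"))
```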
```python
# Single row
Subject.insert1({"subject_id": 1, "name": "Mouse001", "dob": "2024-01-15"})

# Multiple rows
Subject.insert([
    {"subject_id": 1, "name": "Mouse001", "dob": "2024-01-15"},
    {"subject_id": 2, "name": "Mouse002", "dob": "2024-01-16"},
])

# From DataFrame
df = pd.DataFrame({"subject_id": [1, 2], "name": ["M1", "M2"], "dob": ["2024-01-15", "2024-01-16"]})
Subject.insert(df)

# From query (INSERT...SELECT)
ActiveSubjects.insert(Subject & "status = 'active'")
```

```python
# Error on duplicate (default)
Subject.insert1({"subject_id": 1, ...})  # Raises DuplicateError if exists

# Skip duplicates silently
Subject.insert(rows, skip_duplicates=True)

# Replace existing rows
Subject.insert(rows, replace=True)
```

Difference between skip and replace:

- `skip_duplicates`: Keeps the existing row unchanged
- `replace`: Overwrites the existing row with the new values
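A small sketch of this difference, using the Subject table from the examples above:

```python
Subject.insert1({"subject_id": 1, "name": "Mouse001", "dob": "2024-01-15"})

# skip_duplicates=True: the existing row is kept, so name stays "Mouse001"
Subject.insert(
    [{"subject_id": 1, "name": "Renamed", "dob": "2024-01-15"}],
    skip_duplicates=True,
)

# replace=True: the existing row is overwritten, so name becomes "Renamed"
Subject.insert(
    [{"subject_id": 1, "name": "Renamed", "dob": "2024-01-15"}],
    replace=True,
)
```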
```python
# Error on extra fields (default)
Subject.insert1({"subject_id": 1, "unknown_field": "x"})  # Raises error

# Ignore extra fields
Subject.insert1({"subject_id": 1, "unknown_field": "x"}, ignore_extra_fields=True)
```

Computed and Imported tables normally only accept inserts from their make() method:
```python
# Raises DataJointError by default
ComputedTable.insert1({"key": 1, "result": 42})

# Explicit override
ComputedTable.insert1({"key": 1, "result": 42}, allow_direct_insert=True)
```

For large datasets, insert in batches:

```python
# Insert 10,000 rows at a time
Subject.insert(large_dataset, chunk_size=10000)
```

Each chunk is a separate transaction. If interrupted, completed chunks persist.
Convenience wrapper for single-row inserts:
```python
def insert1(self, row, **kwargs)
```

Equivalent to `insert((row,), **kwargs)`.
For large objects (Zarr arrays, HDF5 files), use staged insert to write directly to object storage:
```python
with table.staged_insert1 as staged:
    # Set primary key and metadata
    staged.rec["session_id"] = 123
    staged.rec["timestamp"] = datetime.now()

    # Write large data directly to storage
    zarr_path = staged.store("raw_data", ".zarr")
    z = zarr.open(zarr_path, mode="w", shape=large_array.shape, dtype=large_array.dtype)
    z[:] = large_array
    staged.rec["raw_data"] = z

# Row automatically inserted on successful exit
# Storage cleaned up if an exception occurs
```

Signature:
```python
def update1(self, row: dict) -> None
```

Parameters:

- `row`: Dictionary containing all primary key values plus the attributes to update
```python
# Update a single attribute
Subject.update1({"subject_id": 1, "name": "NewName"})

# Update multiple attributes
Subject.update1({
    "subject_id": 1,
    "name": "NewName",
    "notes": "Updated on 2024-01-15",
})
```

update1() enforces the following requirements:

- Complete primary key: All PK attributes must be provided
- Exactly one match: Must match exactly one existing row
- No restrictions: Cannot call on restricted table
```python
# Error: incomplete primary key
Subject.update1({"name": "NewName"})

# Error: row doesn't exist
Subject.update1({"subject_id": 999, "name": "Ghost"})

# Error: cannot update a restricted table
(Subject & "subject_id > 10").update1({...})
```

Setting an attribute to None resets it to its default value:
```python
# Reset 'notes' to its default (NULL if nullable)
Subject.update1({"subject_id": 1, "notes": None})
```

Appropriate:
```python
# Fix a typo in metadata
Subject.update1({"subject_id": 1, "name": "Mouse001"})  # Was "Mous001"

# Add a note to an existing record
Session.update1({"session_id": 5, "notes": "Excluded from analysis"})
```

Inappropriate (use delete + insert + populate instead):
```python
# DON'T: Update source data that affects computed results
Trial.update1({"trial_id": 1, "stimulus": "new_stim"})  # Computed tables now stale!

# DO: Delete and recompute
(Trial & {"trial_id": 1}).delete()  # Cascades to computed tables
Trial.insert1({"trial_id": 1, "stimulus": "new_stim"})
ComputedResults.populate()
```

DataJoint intentionally does not provide update() for multiple rows:
- Consistency: Bulk updates easily create inconsistencies with derived data
- Auditability: Single-row updates are explicit and traceable
- Workflow: The insert/delete pattern maintains referential integrity
If you need to update many rows, iterate explicitly:
```python
for key in (Subject & condition).keys():
    Subject.update1({**key, "status": "archived"})
```

Signature:
```python
def delete(
    self,
    transaction: bool = True,
    prompt: bool | None = None,
    part_integrity: str = "enforce",
) -> int
```

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `transaction` | bool | `True` | Wrap in an atomic transaction |
| `prompt` | bool | `None` | Prompt for confirmation (default: config setting) |
| `part_integrity` | str | `"enforce"` | Master-part integrity policy (see below) |
part_integrity values:
| Value | Behavior |
|---|---|
| `"enforce"` | Error if parts would be deleted without their masters |
| `"ignore"` | Allow deleting parts without masters (breaks integrity) |
| `"cascade"` | Also delete masters when parts are deleted |
Returns: Number of deleted rows from the primary table.
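The return value can be captured when a script needs to record how many entities were removed; a small sketch using the parameters described above:

```python
# delete() returns the number of rows removed from the table it was called on;
# cascaded deletions in dependent tables are not included in this count
n_deleted = (Session & "session_date < '2024-01-01'").delete(prompt=False)
print(f"Removed {n_deleted} sessions (plus their dependents)")
```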
Delete automatically cascades to all dependent tables:
```python
# Deleting a subject deletes all their sessions, trials, and computed results
(Subject & {"subject_id": 1}).delete()
```

Cascade order:

1. Identify all tables with foreign keys referencing the target
2. Recursively delete matching rows in child tables
3. Delete rows in the target table
!!! version-added "New in 2.2"
    Table.delete() now uses graph-driven cascade internally via dj.Diagram. User-facing behavior is unchanged: the same parameters and return values apply. For direct control over the cascade (preview, multi-schema operations), use the Diagram operational methods.
```python
# Delete specific rows
(Subject & {"subject_id": 1}).delete()

# Delete rows matching a condition
(Session & "session_date < '2024-01-01'").delete()

# Delete all rows (use with caution!)
Subject.delete()
```

When prompt=True (default from config):
```text
About to delete:
Subject: 1 rows
Session: 5 rows
Trial: 150 rows
ProcessedData: 150 rows
Commit deletes? [yes, No]:
```
Disable for automated scripts:
```python
Subject.delete(prompt=False)
```

```python
# Atomic delete (default) - all or nothing
(Subject & condition).delete(transaction=True)

# Non-transactional (use when the delete runs inside an outer transaction)
(Subject & condition).delete(transaction=False)
```

Cannot delete from part tables without deleting from the master (by default):
```python
# Error: cannot delete part without master
Session.Recording.delete()

# Allow breaking master-part integrity
Session.Recording.delete(part_integrity="ignore")

# Delete parts AND cascade up to delete the master
Session.Recording.delete(part_integrity="cascade")
```

part_integrity parameter:
| Value | Behavior |
|---|---|
| `"enforce"` (default) | Error if parts would be deleted without their masters |
| `"ignore"` | Allow deleting parts without masters (breaks integrity) |
| `"cascade"` | Also delete masters when parts are deleted (maintains integrity) |
Fast delete without cascade or confirmation:
```python
def delete_quick(self, get_count: bool = False) -> int | None
```

Use cases:
- Internal cleanup
- Tables with no dependents
- When you've already handled dependencies
Behavior:
- No cascade to child tables
- No user confirmation
- Fails on FK constraint violation
```python
# Quick delete (fails if the table has dependents)
(TempTable & condition).delete_quick()

# Get the count of deleted rows
n = (TempTable & condition).delete_quick(get_count=True)
```

Pre-validate rows before insertion:
```python
def validate(self, rows, *, ignore_extra_fields=False) -> ValidationResult
```

Returns: ValidationResult with:

- `is_valid`: Boolean indicating all rows passed
- `errors`: List of (row_idx, field_name, error_message)
- `rows_checked`: Number of rows validated
```python
result = Subject.validate(rows)
if result:
    Subject.insert(rows)
else:
    print(result.summary())
    # Row 3, field 'dob': Invalid date format
    # Row 7, field 'subject_id': Missing required field
```

| Check | Description |
|---|---|
| Field existence | All fields must exist in table |
| NULL constraints | Required fields must have values |
| Primary key completeness | All PK fields must be present |
| UUID format | Valid UUID string or object |
| JSON serializability | JSON fields must be serializable |
| Codec validation | Custom type validation via codecs |
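When failures need to be handled programmatically rather than printed via summary(), the ValidationResult fields listed above can be inspected directly; a minimal sketch:

```python
result = Subject.validate(rows, ignore_extra_fields=True)
if not result.is_valid:
    # each entry is (row_idx, field_name, error_message)
    for row_idx, field_name, message in result.errors:
        print(f"row {row_idx}, field {field_name!r}: {message}")
```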
These constraints are checked only at the database level (see the sketch after this list):
- Foreign key references
- Unique constraints (beyond PK)
- Custom CHECK constraints
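The practical consequence is that a row can pass validate() yet still be rejected on insert. A hedged sketch, assuming a Session table with a foreign key to Subject:

```python
row = {"subject_id": 999, "session_id": 1, "date": "2024-01-15"}
result = Session.validate([row])          # field-level checks may all pass
if result:
    try:
        Session.insert1(row)              # fails if subject 999 does not exist
    except dj.errors.IntegrityError:
        print("Foreign key violated: insert the parent Subject row first")
```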
Part tables are inserted via their master:
```python
@schema
class Session(dj.Manual):
    definition = """
    session_id : int
    ---
    date : date
    """

    class Recording(dj.Part):
        definition = """
        -> master
        recording_id : int
        ---
        duration : float
        """

# Insert master with parts
Session.insert1({"session_id": 1, "date": "2024-01-15"})
Session.Recording.insert([
    {"session_id": 1, "recording_id": 1, "duration": 60.0},
    {"session_id": 1, "recording_id": 2, "duration": 45.5},
])
```

Deleting the master cascades to its parts:
```python
# Deletes the session AND all its recordings
(Session & {"session_id": 1}).delete()
```

Cannot delete parts independently (by default):
```python
# Error
Session.Recording.delete()

# Allow breaking master-part integrity
Session.Recording.delete(part_integrity="ignore")

# Or cascade up to also delete the master
Session.Recording.delete(part_integrity="cascade")
```

Single operations are atomic:
```python
Subject.insert1(row)   # Atomic
Subject.update1(row)   # Atomic
Subject.delete()       # Atomic (by default)
```

For multi-table operations:
```python
with dj.conn().transaction:
    Parent.insert1(parent_row)
    Child.insert(child_rows)
# Commits on successful exit
# Rolls back on exception
```

With chunk_size, each chunk is a separate transaction:
```python
# Each chunk of 1000 rows commits independently
Subject.insert(large_dataset, chunk_size=1000)
```

If interrupted, completed chunks persist.
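If a chunked load is interrupted, one simple recovery pattern is to rerun the same insert with skip_duplicates=True so rows from chunks that already committed are skipped; a sketch, assuming the source data has not changed in the meantime:

```python
try:
    Subject.insert(large_dataset, chunk_size=1000)
except Exception:
    # chunks that committed before the failure are already in the table;
    # rerunning with skip_duplicates=True inserts only the remaining rows
    Subject.insert(large_dataset, chunk_size=1000, skip_duplicates=True)
```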
| Error | Cause | Resolution |
|---|---|---|
| `DuplicateError` | Primary key already exists | Use `skip_duplicates=True` or `replace=True` |
| `IntegrityError` | Foreign key constraint violated | Insert parent rows first |
| `MissingAttributeError` | Required field not provided | Include all required fields |
| `UnknownAttributeError` | Field not in table | Use `ignore_extra_fields=True` or fix the field name |
| `DataJointError` | Various validation failures | Check the error message for details |
```python
try:
    Subject.insert(rows)
except dj.errors.DuplicateError as e:
    # Handle the specific duplicate
    print(f"Duplicate: {e}")
except dj.errors.IntegrityError as e:
    # Missing parent reference
    print(f"Missing parent: {e}")
except dj.DataJointError as e:
    # Other DataJoint errors
    print(f"Error: {e}")
```

```python
# Good: Delete and reinsert
(Trial & key).delete()
Trial.insert1(corrected_trial)
DerivedTable.populate()

# Avoid: Update that creates stale derived data
Trial.update1({**key, "value": new_value})  # Derived tables now inconsistent!
```

```python
# Validate before a bulk insert
result = Subject.validate(rows)
if not result:
    raise ValueError(result.summary())
Subject.insert(rows)
```

```python
# Insert master and parts together in one transaction
with dj.conn().transaction:
    session_key = Session.insert1(session_data, skip_duplicates=True)
    Session.Recording.insert(recordings)
    Session.Stimulus.insert(stimuli)
```

```python
# Good: Single insert call
Subject.insert(all_rows)

# Avoid: Loop of insert1 calls
for row in all_rows:
    Subject.insert1(row)  # Slow!
```

```python
# Always use prompt in interactive sessions
(Subject & condition).delete(prompt=True)

# Disable only in tested automated scripts
(Subject & condition).delete(prompt=False)
```

| Operation | Method | Cascades | Transaction | Typical Use |
|---|---|---|---|---|
| Insert one | `insert1()` | — | Implicit | Adding a single entity |
| Insert many | `insert()` | — | Per-chunk | Bulk data loading |
| Insert large object | `staged_insert1` | — | On exit | Zarr, HDF5 files |
| Update one | `update1()` | — | Implicit | Surgical corrections |
| Delete | `delete()` | Yes | Optional | Removing entities |
| Delete quick | `delete_quick()` | No | No | Internal cleanup |
| Validate | `validate()` | — | — | Pre-insert check |