Entity integrity ensures a one-to-one correspondence between real-world entities and their database records. This is the foundation of reliable data management.
- Each real-world entity → exactly one database record
- Each database record → exactly one real-world entity
Without entity integrity, databases become unreliable:
| Integrity Failure | Consequence |
|---|---|
| Same entity, multiple records | Fragmented data, conflicting information |
| Multiple entities, same record | Mixed data, privacy violations |
| Cannot match entity to record | Lost data, broken workflows |
When designing a primary key, answer these three questions:
Ensure the same entity cannot appear twice in the table.
Ensure different entities cannot share the same record.
When an entity arrives, how do I find its corresponding record?
Consider a neuroscience lab tracking mice:
| Question | Answer |
|---|---|
| Prevent duplicates? | Each mouse gets a unique ear tag at arrival; database rejects duplicate tags |
| Prevent sharing? | Ear tags are never reused; retired tags are archived |
| Match entities? | Read the ear tag → look up record by primary key |
@schema
class Mouse(dj.Manual):
definition = """
ear_tag : char(6) # unique ear tag (e.g., 'M00142')
---
date_of_birth : date
sex : enum('M', 'F', 'U')
strain : varchar(50)
"""The database enforces the first two questions through the primary key constraint. The third question requires a physical identification system—ear tags, barcodes, or RFID chips that link physical entities to database records.
In DataJoint, every table must have a primary key. Primary key attributes:
- Cannot be NULL — Every entity must be identifiable
- Must be unique — No two entities share the same key
- Cannot be changed — Keys are immutable after insertion
- Declared above the
---line — Syntactic convention
Use attributes that naturally identify entities in your domain:
@schema
class Gene(dj.Lookup):
definition = """
gene_symbol : varchar(20) # Official gene symbol (e.g., 'BRCA1')
---
full_name : varchar(200)
chromosome : varchar(5)
"""Advantages:
- Meaningful to humans
- Self-documenting
- No additional lookup needed
A surrogate key is an identifier used primarily inside the database, with minimal or no exposure to end users. Users typically don't search for entities by surrogate keys or use them in conversation.
@schema
class InternalRecord(dj.Manual):
definition = """
record_id : uuid # internal identifier, not exposed to users
---
created_timestamp : datetime(3)
data : <blob>
"""Key distinction from natural keys: Surrogate keys don't require external identification systems because users don't need to match physical entities to records by these keys.
When surrogate keys are appropriate:
- Entities that exist only within the system (no physical counterpart)
- Privacy-sensitive contexts where natural identifiers shouldn't be stored
- Internal system records that users never reference directly
Generating surrogate keys: DataJoint requires explicit key values rather than database-generated auto-increment. This is intentional:
- Auto-increment encourages treating keys as "row numbers" rather than entity identifiers
- It's incompatible with composite keys, which DataJoint uses extensively
- It breaks reproducibility (different IDs when rebuilding pipelines)
- It prevents the client-server handshake needed for proper entity integrity
Use client-side generation instead:
- UUIDs — Generate with
uuid.uuid4()before insertion - ULIDs — Sortable unique IDs
- Client-side counters — Query max value and increment
DataJoint recommendation: Prefer natural keys when they're stable and meaningful. Use surrogates only when no natural identifier exists or for privacy-sensitive contexts.
When no single attribute uniquely identifies an entity, combine multiple attributes:
@schema
class Recording(dj.Manual):
definition = """
-> Session
recording_idx : uint16 # Recording number within session
---
duration : float32 # seconds
"""Here, (subject_id, session_idx, recording_idx) together form the primary key.
Neither alone would be unique.
Foreign keys in DataJoint serve dual purposes:
- Referential integrity — Ensures referenced entities exist
- Workflow dependency — Declares that this entity depends on another
@schema
class Segmentation(dj.Computed):
definition = """
-> Scan # Depends on Scan
---
num_cells : uint32
"""The arrow -> inherits the primary key from Scan and establishes both
referential integrity and workflow dependency.
A dimension is an independent axis of variation in your data. The fundamental principle:
Any table that introduces a new primary key attribute introduces a new dimension.
This is true whether the table has only new attributes or also inherits attributes from foreign keys. The key is simply: new primary key attribute = new dimension.
@schema
class Subject(dj.Manual):
definition = """
subject_id : varchar(16) # NEW dimension: subject_id
---
species : varchar(50)
"""
@schema
class Session(dj.Manual):
definition = """
-> Subject # Inherits subject_id
session_idx : uint16 # NEW dimension: session_idx
---
session_date : date
"""
@schema
class Trial(dj.Manual):
definition = """
-> Session # Inherits subject_id, session_idx
trial_idx : uint16 # NEW dimension: trial_idx
---
outcome : enum('success', 'fail')
"""All three tables introduce dimensions:
Subjectintroducessubject_iddimensionSessionintroducessession_idxdimension (even though it also inheritssubject_id)Trialintroducestrial_idxdimension (even though it also inheritssubject_id,session_idx)
In schema diagrams, tables that introduce at least one new dimension have underlined names.
A table introduces no dimensions when its entire primary key comes from foreign keys:
@schema
class SubjectProfile(dj.Manual):
definition = """
-> Subject # Inherits subject_id only
---
weight : float32
"""SubjectProfile doesn't introduce any new primary key attribute—it extends the Subject dimension with additional attributes. There's exactly one profile per subject.
In schema diagrams, these tables have non-underlined names.
Computed tables never introduce dimensions. Their primary key is entirely inherited from their dependencies:
@schema
class SessionSummary(dj.Computed):
definition = """
-> Session # PK = (subject_id, session_idx)
---
num_trials : uint32
accuracy : float32
"""This makes sense—computed tables derive data from existing entities rather than introducing new ones.
Unlike computed tables, part tables can introduce new dimensions:
@schema
class Detection(dj.Computed):
definition = """
-> Image # Inherits image_id
-> DetectionParams # Inherits params_id
---
num_blobs : uint32
"""
class Blob(dj.Part):
definition = """
-> master # Inherits (image_id, params_id)
blob_idx : uint16 # NEW dimension within detection
---
x : float32
y : float32
"""Detection inherits dimensions (no underline in diagram), but Detection.Blob
introduces a new dimension (blob_idx) for individual blobs within each
detection.
Every foreign key attribute traces back to the dimension where it was first defined. This is called attribute lineage:
Subject.subject_id → myschema.subject.subject_id (origin)
Session.subject_id → myschema.subject.subject_id (inherited via foreign key)
Session.session_idx → myschema.session.session_idx (origin)
Trial.subject_id → myschema.subject.subject_id (inherited via foreign key)
Trial.session_idx → myschema.session.session_idx (inherited via foreign key)
Trial.trial_idx → myschema.trial.trial_idx (origin)
Lineage enables semantic matching—DataJoint only joins attributes that
trace back to the same dimension. Two attributes named id from different
dimensions cannot be accidentally joined.
See Semantic Matching for details.
In schema diagrams:
| Visual | Meaning |
|---|---|
| Underlined name | Introduces at least one new dimension |
| Non-underlined name | All PK attributes inherited (no new dimensions) |
| Thick solid line | One-to-one extension (no new dimension) |
| Thin solid line | Containment (may introduce dimension) |
Common dimensions in neuroscience:
- Subject — Who/what is being studied
- Session — When data was collected
- Trial — Individual experimental unit
- Modality — Type of data (ephys, imaging, behavior)
- Parameter set — Configuration for analysis
Understanding dimensions helps design schemas that naturally express your experimental structure and ensures correct joins through semantic matching.
- Answer the three questions before designing any table
- Choose stable identifiers that won't need to change
- Keep keys minimal — Include only what's necessary for uniqueness
- Document key semantics — Explain what the key represents
- Consider downstream queries — Keys affect join performance
# Wrong: experiment_id alone isn't unique
class Trial(dj.Manual):
definition = """
experiment_id : uint32
---
trial_number : uint16 # Should be part of key!
result : float32
"""# Wrong: timestamp makes every row unique, losing entity semantics
class Measurement(dj.Manual):
definition = """
subject_id : uint32
timestamp : datetime(6) # Microsecond precision
---
value : float32
"""# Risky: names can change
class Patient(dj.Manual):
definition = """
patient_name : varchar(100) # What if they change their name?
---
date_of_birth : date
"""Entity integrity is maintained by:
- Primary keys that uniquely identify each entity
- Foreign keys that establish valid references
- Physical systems that link real-world entities to records
The three questions framework ensures your primary keys provide meaningful, stable identification for your domain entities.