Summary
Roadmap capability from the strict-provenance adoption analysis (#1474 discussion, @ttngu207): a first-class record of where data came from, kept separate from the FK graph.
Why
DataJoint has one graph, used for two different things:
- what is derived from what — durable, invariant, belongs in the FK graph;
- how a row arrived — transient, operational (a file today, an API next quarter, a sync after that).
Forcing the second into the FK graph (the "landing-table" pattern: entities downstream of a ParsedFile ancestor) marries the slow-moving domain model to fast-moving infrastructure — every ingestion change becomes a schema migration, and historical rows permanently carry whichever ingestion ancestor existed at creation time. That is why the strict-provenance write boundary has no clean answer for the common fat ingestion make() (one file parsed into many top-level entity tables): every compliant restructuring corrupts the domain model or abandons DataJoint orchestration.
Shape
- Append-only ingestion-lineage records, per row or per insert batch: (entity key, source descriptor, timestamp, actor). An entity can accumulate multiple sources over its lifetime with zero schema changes.
- Queryable alongside (not through) the FK graph; surfaced in row-lineage views.
- Candidate capture points: the gated insert path (already intercepts every write under strict mode), staged_insert1, and populate job metadata.
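To make the record shape above concrete, here is a minimal in-memory sketch of an append-only lineage log. It is not DataJoint API: `LineageRecord`, `IngestionLineageLog`, and the example keys and source descriptors are all hypothetical names chosen for illustration, and a real implementation would presumably persist records in the database rather than in a Python list.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be mutated after it is written
class LineageRecord:
    entity_key: tuple    # sorted (attribute, value) pairs of the entity's primary key
    source: str          # free-form source descriptor (hypothetical format)
    actor: str           # who or what performed the insert
    timestamp: datetime


class IngestionLineageLog:
    """Hypothetical append-only log: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._records: list[LineageRecord] = []

    def record(self, entity_key: dict, source: str, actor: str,
               timestamp: datetime | None = None) -> LineageRecord:
        rec = LineageRecord(
            entity_key=tuple(sorted(entity_key.items())),
            source=source,
            actor=actor,
            timestamp=timestamp or datetime.now(timezone.utc),
        )
        self._records.append(rec)
        return rec

    def record_batch(self, entity_keys: list[dict], source: str,
                     actor: str) -> list[LineageRecord]:
        # One timestamp for the whole insert batch.
        ts = datetime.now(timezone.utc)
        return [self.record(key, source, actor, ts) for key in entity_keys]

    def sources_for(self, entity_key: dict) -> list[LineageRecord]:
        # Lookup goes through the log only, never through the FK graph.
        wanted = tuple(sorted(entity_key.items()))
        return [r for r in self._records if r.entity_key == wanted]


log = IngestionLineageLog()
# The same entity arrives from a file first and from an API later: no schema change.
log.record({"subject_id": 1}, "file:session_001.csv", "ingest-bot")
log.record({"subject_id": 1}, "api:lims/v2", "sync-job")
batch = log.record_batch([{"subject_id": 2}, {"subject_id": 3}],
                         "file:cohort_b.csv", "ingest-bot")
print([r.source for r in log.sources_for({"subject_id": 1})])
# → ['file:session_001.csv', 'api:lims/v2']
```

The point of the sketch is that a new ingestion route only adds rows with a new source descriptor; the entity tables and their foreign keys are untouched.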
Related use cases to fold in
- Spyglass Exports (discussion "Cascading Restrictions" #1232, @CBroz1): monitor fetch/restrict/join calls, log accessed leaves, and drive an upward trace for mysqldump/DANDI publication. This is the same "record what was touched, separate from the FK graph" mechanism, seen from the read side.
- Platform row-lineage / compute-provenance work (branch identity ↔ per-row code-version tagging) — ingestion lineage is the missing third leg beside code-version and FK derivation.
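The read-side use case can be sketched in the same spirit: log which tables were touched, then walk upward from those leaves to collect everything an export would need. This is a rough illustration under stated assumptions, not the Spyglass implementation. `AccessLog`, `upward_trace`, the `parents` mapping, and the table names are hypothetical, and the parent map is hand-written here where a real system would read it from the schema.

```python
from collections import deque


class AccessLog:
    """Hypothetical recorder for read operations (fetch / restrict / join)."""

    def __init__(self):
        self.touched = []  # (table, operation, restriction) in call order

    def note(self, table, operation, restriction=None):
        self.touched.append((table, operation, restriction))

    def leaves(self):
        # Distinct tables that were accessed, in a stable order.
        return sorted({table for table, _, _ in self.touched})


def upward_trace(leaves, parents):
    """Breadth-first walk from accessed tables to all of their ancestors."""
    seen = set(leaves)
    queue = deque(leaves)
    while queue:
        table = queue.popleft()
        for parent in parents.get(table, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical child -> parents map for a small pipeline.
parents = {
    "Spikes": ["Recording"],
    "Recording": ["Session"],
    "Session": ["Subject"],
    "Unrelated": [],
}

access = AccessLog()
access.note("Spikes", "fetch", {"session_id": 7})
access.note("Spikes", "restrict", {"unit_id": 3})

export_tables = upward_trace(access.leaves(), parents)
print(sorted(export_tables))
# → ['Recording', 'Session', 'Spikes', 'Subject']
```

As with the write-side log, the access log itself lives outside the FK graph; the graph is consulted only when the trace is computed.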