
First-class ingestion lineage, separate from the FK graph #1483

Description

@dimitri-yatsenko

Summary

Roadmap capability from the strict-provenance adoption analysis (#1474 discussion, @ttngu207): a first-class record of where data came from, kept separate from the FK graph.

Why

DataJoint has one graph, used for two different things:

  • what is derived from what — durable, invariant, belongs in the FK graph;
  • how a row arrived — transient, operational (a file today, an API next quarter, a sync after that).

Forcing the second into the FK graph (the "landing-table" pattern, in which entities sit downstream of a ParsedFile ancestor) couples the slow-moving domain model to fast-moving infrastructure: every ingestion change becomes a schema migration, and historical rows permanently carry whichever ingestion ancestor existed at creation time. This is why the strict-provenance write boundary has no clean answer for the common "fat" ingestion make(), in which one file is parsed into many top-level entity tables: every compliant restructuring either corrupts the domain model or abandons DataJoint orchestration.

Shape

  • Append-only ingestion-lineage records, written per row or per insert batch. Each record holds (entity key, source descriptor, timestamp, actor), so an entity can accumulate multiple sources over its lifetime with zero schema changes.
  • Queryable alongside (not through) the FK graph; surfaced in row-lineage views.
  • Candidate capture points: the gated insert path (already intercepts every write under strict mode), staged_insert1, and populate job metadata.
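To make the proposed record shape concrete, here is a minimal sketch of an append-only lineage log. It is not a DataJoint API or a proposed implementation: it uses Python's standard `sqlite3` module as a stand-in store, and every name in it (`open_lineage_log`, `record_ingestion`, `sources_for`, the `ingestion_lineage` table and its columns) is hypothetical. It only illustrates that several sources can attach to one entity key over time without touching the entity's schema.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_lineage_log(path=":memory:"):
    """Open (or create) a hypothetical append-only lineage store."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS ingestion_lineage (
            id           INTEGER PRIMARY KEY AUTOINCREMENT,
            entity_table TEXT NOT NULL,
            entity_key   TEXT NOT NULL,  -- JSON-encoded primary key
            source       TEXT NOT NULL,  -- JSON-encoded source descriptor
            recorded_at  TEXT NOT NULL,  -- UTC ISO 8601 timestamp
            actor        TEXT NOT NULL
        );
        -- Append-only: reject any UPDATE or DELETE on the log.
        CREATE TRIGGER IF NOT EXISTS lineage_no_update
        BEFORE UPDATE ON ingestion_lineage
        BEGIN SELECT RAISE(ABORT, 'ingestion_lineage is append-only'); END;
        CREATE TRIGGER IF NOT EXISTS lineage_no_delete
        BEFORE DELETE ON ingestion_lineage
        BEGIN SELECT RAISE(ABORT, 'ingestion_lineage is append-only'); END;
        """
    )
    return conn


def _encode(mapping):
    # Sort keys so the same entity key always serializes identically.
    return json.dumps(mapping, sort_keys=True)


def record_ingestion(conn, entity_table, entity_key, source, actor):
    """Append one lineage record for a row (or for each row of a batch)."""
    conn.execute(
        "INSERT INTO ingestion_lineage "
        "(entity_table, entity_key, source, recorded_at, actor) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            entity_table,
            _encode(entity_key),
            _encode(source),
            datetime.now(timezone.utc).isoformat(),
            actor,
        ),
    )
    conn.commit()


def sources_for(conn, entity_table, entity_key):
    """Return all lineage records for one entity, oldest first."""
    rows = conn.execute(
        "SELECT source, recorded_at, actor FROM ingestion_lineage "
        "WHERE entity_table = ? AND entity_key = ? ORDER BY id",
        (entity_table, _encode(entity_key)),
    ).fetchall()
    return [
        {"source": json.loads(s), "recorded_at": t, "actor": a}
        for s, t, a in rows
    ]
```

In this sketch a row first loaded from a file and later refreshed from an API simply gets two records under the same entity key. In a real design, the call to `record_ingestion` would presumably sit at one of the capture points listed above rather than in user code.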

Related use cases to fold in

  • Spyglass Exports (discussion "Cascading Restrictions" #1232, @CBroz1): monitor fetch/restrict/join calls, log the accessed leaves, and drive an upward trace for mysqldump/DANDI publication. This is the same "record what was touched, separate from the FK graph" mechanism, applied on the read side.
  • Platform row-lineage / compute-provenance work (branch identity ↔ per-row code-version tagging) — ingestion lineage is the missing third leg beside code-version and FK derivation.
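The read-side use case can be sketched the same way. The example below is not Spyglass's export code and calls no DataJoint API: `AccessLog`, `trace_upward`, and the table names are hypothetical, and the parent map is written by hand where a real system would read it from the FK graph. It shows only the two steps described above: log the tables that were touched, then walk upward to collect everything an export would need.

```python
from collections import deque


class AccessLog:
    """Hypothetical record of which tables a session read, and how."""

    def __init__(self):
        self.entries = []

    def record(self, table, restriction=None):
        # In a real system this would be called from the fetch path.
        self.entries.append((table, restriction))

    def tables(self):
        return {table for table, _ in self.entries}


def trace_upward(accessed, parents):
    """Return the accessed tables plus all of their ancestors.

    `parents` maps a table name to the names of its direct FK parents.
    """
    seen = set(accessed)
    queue = deque(accessed)
    while queue:
        table = queue.popleft()
        for parent in parents.get(table, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hand-written stand-in for the FK graph (child -> direct parents).
PARENTS = {
    "Spikes": ["Recording"],
    "Recording": ["Session"],
    "Video": ["Session"],
    "Session": ["Subject"],
}

log = AccessLog()
log.record("Spikes", "session_id = 3")
export_tables = trace_upward(log.tables(), PARENTS)
```

Here only `Spikes` was read, so the trace collects `Spikes`, `Recording`, `Session`, and `Subject`, and leaves out `Video`, which was never touched.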

Labels: enhancement
