Trustworthy garbage collection: complete reference discovery (#1469) + race-safe deletion (#1445) #1478

Description

@dimitri-yatsenko

Umbrella tracking the invariant: garbage collection must never delete an object that is still referenced. Today it can, via two independent failure modes that look similar but have different root causes:

  1. Incomplete discovery (#1469, "Garbage collection deletes external-store files belonging to existing rows (custom codecs)"): dj.gc.scan hardcodes the built-in codec names (hash/blob/attach, object/npy), so files referenced by a custom codec are never enumerated. The store scan then reports live files as orphans and collect() deletes them. (Row delete also fails to remove custom-codec files.)
  2. Stale discovery / TOCTOU race (#1445, "Two-phase, transaction-safe garbage collection (quarantine -> grace -> purge)"): the scan is correct, but an insert during the scan→delete window has its file deleted out from under it.
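The first failure mode can be illustrated with a toy model. This is a minimal sketch, not DataJoint's implementation: the row layout, the `find_orphans` helper, and the `my_zarr` codec name are all hypothetical; only the built-in codec names come from the description above.

```python
# Toy model of orphan detection (hypothetical; not DataJoint's actual code).
BUILTIN_CODECS = {"hash", "blob", "attach", "object", "npy"}


def scan_referenced(rows, known_codecs):
    """Collect store paths from rows whose codec the scan recognizes."""
    return {row["path"] for row in rows if row["codec"] in known_codecs}


def find_orphans(store_files, rows, known_codecs):
    """Files present in the store that no recognized row references."""
    return set(store_files) - scan_referenced(rows, known_codecs)


rows = [
    {"codec": "blob", "path": "store/a.bin"},
    {"codec": "my_zarr", "path": "store/b.zarr"},  # custom codec, row still exists
]
store_files = {"store/a.bin", "store/b.zarr", "store/old.bin"}

# Codec-blind scan: the live custom-codec file is reported as an orphan.
blind = find_orphans(store_files, rows, BUILTIN_CODECS)

# Complete discovery: only the truly unreferenced file is reported.
complete = find_orphans(store_files, rows, BUILTIN_CODECS | {"my_zarr"})
```

In this model `blind` contains `store/b.zarr` even though its row exists, which is the misclassification #1469 describes.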

They are complementary layers of one goal, not duplicates, and must be fixed in order: #1469 first, then #1445.

Why the sequence matters

#1445's purge() re-check reruns the scan. On a codec-blind scan (pre-#1469) it would still classify a live custom-codec file as an orphan and purge it after the grace window — so the grace window can't save you from a scan that never sees the reference. Complete discovery (#1469) is the precondition for safe deletion (#1445).
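The ordering argument can be sketched with a toy two-phase collector. Everything here is an assumption made for illustration: the `TwoPhaseGC` class, the `make_scan` helper, and the in-memory store are hypothetical; only the quarantine / grace / purge vocabulary and the re-check idea come from #1445.

```python
# Toy two-phase collector (hypothetical sketch; not the design proposed in #1445).
BUILTIN_CODECS = {"hash", "blob", "attach", "object", "npy"}


class TwoPhaseGC:
    def __init__(self, store, scan, grace_seconds):
        self.store = set(store)        # paths currently live in the store
        self.scan = scan               # callable returning the referenced paths
        self.grace_seconds = grace_seconds
        self.quarantined = {}          # path -> time it was quarantined

    def quarantine(self, now):
        """Phase 1: move apparent orphans aside instead of deleting them."""
        for path in sorted(self.store - self.scan()):
            self.store.discard(path)
            self.quarantined[path] = now

    def purge(self, now):
        """Phase 2: rerun the scan, restore re-referenced files, delete the rest."""
        referenced = self.scan()
        purged = []
        for path, since in list(self.quarantined.items()):
            if path in referenced:
                self.store.add(path)           # restore: a reference appeared
                del self.quarantined[path]
            elif now - since >= self.grace_seconds:
                del self.quarantined[path]     # grace elapsed, still unreferenced
                purged.append(path)
        return purged


def make_scan(rows, known_codecs=None):
    """Build a scan over (codec, path) rows; None means every codec is visible."""
    def scan():
        return {p for c, p in rows if known_codecs is None or c in known_codecs}
    return scan


# Case 1: complete scan; an insert lands during the grace window.
rows = [("blob", "a.bin")]
gc = TwoPhaseGC({"a.bin", "late.bin"}, make_scan(rows), grace_seconds=3600)
gc.quarantine(now=0)
rows.append(("blob", "late.bin"))
race_purged = gc.purge(now=7200)       # the re-check restores late.bin

# Case 2: codec-blind scan; the re-check is blind too, so the live file is purged.
rows2 = [("my_zarr", "b.zarr")]
gc2 = TwoPhaseGC({"b.zarr"}, make_scan(rows2, BUILTIN_CODECS), grace_seconds=3600)
gc2.quarantine(now=0)
blind_purged = gc2.purge(now=7200)
```

In case 1 the grace window does its job; in case 2 it only delays the loss, because the re-check asks the same codec-blind scan.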

Documentation (datajoint-docs) — to land with the implementation

Add

  • reference/specs/garbage-collection.md (new normative spec; none exists today). It should cover the orphan-determination model, the referenced_paths codec contract, the two-phase quarantine/grace/purge state machine, config keys (gc.grace_seconds), re-check/concurrency semantics, backend atomic-move requirements, and restore. (#1445, "Two-phase, transaction-safe garbage collection (quarantine -> grace -> purge)", explicitly asks for a written spec.)

Update

  • how-to/garbage-collection.md ("Clean Up Object Storage"): document the two-phase workflow (quarantine / purge / restore, grace_seconds); note that custom-codec external files are now handled; revise the "single-pass, best-effort" admonition added in #189 ("fix issue #186") once two-phase lands.
  • reference/specs/codec-api.md — document referenced_paths as part of the Codec contract (required for any codec that owns external artifacts).
  • how-to/create-custom-codec.md, explanation/custom-codecs.md, how-to/use-plugin-codecs.md — author guidance: if your codec writes external files, implement referenced_paths so delete + GC see them (otherwise files leak or, worse, get misclassified as orphans and deleted).
  • reference/specs/provenance.md — the GC concurrency wording references single-pass semantics; align once two-phase ships (minor).
  • Optionally a short explainer (e.g. in explanation/object-storage-overview.md) on how DataJoint tracks external references — codecs own their paths; delete and GC consult them.
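As a starting point for the codec-author guidance, the idea that codecs own their paths and that delete and GC consult them might look roughly like the sketch below. Only the name `referenced_paths` comes from this issue; its signature, the `Codec` base class, the registry, and both example codecs are hypothetical and do not reflect DataJoint's real codec API.

```python
# Hypothetical sketch of a referenced_paths contract (not DataJoint's real API).
from typing import Any, Iterable


class Codec:
    """Stand-in base class; the default codec owns no external artifacts."""
    name = "base"

    def referenced_paths(self, stored: Any) -> Iterable[str]:
        """Return the external paths owned by one stored value."""
        return ()


class InlineJsonCodec(Codec):
    """Keeps everything in the row, so the default (no paths) is correct."""
    name = "inline_json"


class ChunkedArrayCodec(Codec):
    """Writes one external file per chunk and must report each of them."""
    name = "chunked_array"

    def referenced_paths(self, stored):
        # Assumed stored form: {"root": "<dir>", "chunks": ["0.bin", ...]}
        return [f"{stored['root']}/{chunk}" for chunk in stored["chunks"]]


REGISTRY = {c.name: c for c in (InlineJsonCodec(), ChunkedArrayCodec())}


def discover(rows):
    """Union of every path that any codec reports for (codec_name, stored) rows."""
    paths = set()
    for codec_name, stored in rows:
        paths.update(REGISTRY[codec_name].referenced_paths(stored))
    return paths


live = discover([
    ("inline_json", {"x": 1}),
    ("chunked_array", {"root": "store/arr1", "chunks": ["0.bin", "1.bin"]}),
])
```

Because discovery asks each codec rather than matching a fixed list of names, a newly registered codec is visible to both delete and GC without changes to the scanner.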


Labels: enhancement
