
workspace worktree cleanup: footguns + the conservatism/accumulation tension (must not hoard dead worktrees) #789

Description

@chubes4

Context

Hit several papercuts running wp datamachine-code workspace worktree cleanup and workspace remove during a heavy multi-minion session (2026-06-21, ~14 worktrees in one repo, disk at 81%). The cleanup machinery is powerful, but it has UX footguns and a deeper design tension worth resolving deliberately.

Footguns hit (concrete)

  1. Flag/syntax inconsistency across cleanup commands.

    • worktree cleanup rejected --merged / --yes (unknown params).
    • worktree cleanup run --mode=retention rejected --mode (unknown param) — yet the tool's OWN suggested-alternative output literally tells you to run cleanup run --mode=retention .... The suggested remediation command doesn't match the actual accepted surface.
    • workspace remove <repo> rejected --force (it expects --yes), while worktree remove <handle> does accept --force. So --force vs --yes is inconsistent between the two sibling commands.
    • Net: discovering the working invocation took several trial-and-error round trips. The accepted flags per subcommand should be consistent (or at least the error should suggest the correct flag for THAT command).
  2. safe_to_remove_now is decided from git state, not from whether a process is actively using the worktree. The default dry-run flagged worktrees that active minions were mid-edit in as safe_to_remove_now, because their branches looked merged/clean. An operator running the default cleanup would have deleted worktrees out from under live agent sessions. There is no liveness/lock signal that says "a runtime is currently attached to this worktree, do not remove." The liveness:live bucket exists, but safe_to_remove_now doesn't appear to respect any active-attachment lock.

  3. --older-than=2d is the only thing that saved us: it correctly excluded all same-day worktrees (age_filter:excluded: 66). But that's a blunt instrument. It protects active work only because active work happens to be recent; the filter has no notion of "active," so it cannot tell a burst of live minions from dead worktrees of the same age.
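
For footgun 1, one way an "unknown param" error could point at the flag that this specific command accepts is sketched below. This is an illustration only: the flag table, the synonym group, and the function names are assumptions made up for the sketch, not the tool's real accepted surface.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical per-command flag table. Only --yes on `workspace remove`,
# --force on `worktree remove`, and --older-than come from the report above;
# the table as a whole is an assumption, not the real CLI surface.
ACCEPTED_FLAGS = {
    "worktree cleanup": ["--older-than"],
    "worktree remove": ["--force"],
    "workspace remove": ["--yes"],
}

# Flags that express the same intent on sibling commands.
SYNONYMS = [{"--yes", "--force"}]


def suggest_flag(command: str, rejected: str) -> Optional[str]:
    """Return the flag `command` accepts in place of `rejected`, if any."""
    accepted = ACCEPTED_FLAGS.get(command, [])
    # First try a known synonym that this command actually accepts.
    for group in SYNONYMS:
        if rejected in group:
            for flag in accepted:
                if flag in group:
                    return flag
    # Otherwise fall back to a near-miss spelling of an accepted flag.
    close = get_close_matches(rejected, accepted, n=1)
    return close[0] if close else None


def unknown_flag_error(command: str, rejected: str) -> str:
    """Build an error message that names the correct flag for THIS command."""
    hint = suggest_flag(command, rejected)
    msg = f"unknown parameter {rejected} for `{command}`"
    return f"{msg}; did you mean {hint}?" if hint else msg
```

Driving both the parser and the suggested-remediation output from one table like this would also keep the tool from suggesting a command it then rejects.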

The DEEPER tension (the important part — do NOT over-correct)

The fixes above (liveness locks, active-attachment guards) push toward MORE conservatism. But the system must NOT become so conservative that it hoards dead worktrees and fills the disk. We literally had ~14 worktrees in one repo and a homeboy checkout set eating ~470MB; on a 150GB VPS that churns hard with heavy merge fleets, accumulation is a REAL failure mode (we've had disk-pressure incidents before — see the Action Scheduler / disk reclaim history).

So the design target is a balance, explicitly:

  • Must aggressively reclaim genuinely-dead worktrees (merged + pushed + no live runtime attached + past a short grace window). Dead worktrees accumulating to disk-pressure is a failure.
  • Must never remove a worktree with a live runtime attached, unpushed commits, or dirty state — regardless of age.
  • The grace window should be SHORT (hours, not days) for merged+clean+unattached worktrees — the --older-than=2d blanket guard is too conservative as a permanent default (it would hoard 2 days of dead worktrees).

The right primitive is liveness-aware, not age-aware: "merged + pushed + clean + no attached runtime" should be reclaimable within a short grace period (e.g. 1-2h), while "active runtime attached" is protected at any age. Age should be a tiebreaker/safety net, not the primary guard.
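
A minimal sketch of that ordering, assuming the classifier can see five facts per worktree (merged, pushed, dirty, attached, last activity). The `Worktree` type, the bucket names other than safe_to_remove_now, and the 2h default are hypothetical, chosen only to show hard guards running before the age check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GRACE = timedelta(hours=2)  # assumed default; the proposal above says 1-2h


@dataclass
class Worktree:
    handle: str
    merged: bool
    pushed: bool
    dirty: bool
    attached: bool           # a live runtime is currently using this worktree
    last_activity: datetime


def classify(wt: Worktree, now: datetime, grace: timedelta = GRACE) -> str:
    """Return a cleanup bucket; bucket names here are illustrative."""
    # Hard guards first: these protect the worktree at ANY age.
    if wt.attached:
        return "protected:live_runtime"
    if wt.dirty:
        return "protected:dirty"
    if not wt.pushed:
        return "protected:unpushed"
    if not wt.merged:
        return "keep:unmerged"
    # Age is only a safety net for worktrees that are otherwise dead.
    if now - wt.last_activity < grace:
        return "keep:grace_window"
    return "safe_to_remove_now"
```

With this ordering, a month-old worktree with a runtime attached stays protected, while a merged, pushed, clean, unattached worktree becomes reclaimable once it is idle for longer than the grace window.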

Asks

  1. Add a liveness/attachment signal to the cleanup classifier so safe_to_remove_now excludes worktrees with a live runtime attached (the orchestrator knows which sessions map to which worktrees — surface that as a lock). This fixes the "delete out from under a live minion" footgun WITHOUT relying on age.
  2. Reconcile the flag surface across worktree cleanup / cleanup run / workspace remove / worktree remove so --yes/--force/--mode/--merged are consistent (or errors suggest the correct flag for that specific command). The self-suggested remediation commands must actually work.
  3. Tune the default reclaim policy to the balance above: aggressive on dead-and-unattached (short grace), absolute-protect on live/dirty/unpushed. Explicitly avoid an age-only default that hoards dead worktrees — accumulation to disk-pressure is the failure mode we must prevent.
  4. (Nice-to-have) a one-shot emergency-cleanup that's safe-by-construction (only removes merged+pushed+clean+unattached) so an operator under disk pressure can reclaim without trial-and-error flag archaeology.
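
One possible shape for the attachment signal in ask 1 is a PID-based lock record, sketched below. This is not how the orchestrator works today: the lock directory, the file layout, and every function name are assumptions. The records are kept outside the worktree so they do not show up as untracked files and make a clean worktree look dirty.

```python
import json
import os
from pathlib import Path


def _lock_path(lock_dir: Path, handle: str) -> Path:
    # One record per worktree handle, stored outside the worktree itself.
    return lock_dir / f"{handle}.json"


def attach(lock_dir: Path, handle: str, session_id: str) -> None:
    """Record that the current process is working in worktree `handle`."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    record = {"pid": os.getpid(), "session": session_id}
    _lock_path(lock_dir, handle).write_text(json.dumps(record))


def detach(lock_dir: Path, handle: str) -> None:
    """Drop the record when the session ends cleanly."""
    _lock_path(lock_dir, handle).unlink(missing_ok=True)


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks existence without signalling."""
    if pid <= 0:
        return False  # 0 and negative values address process groups
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def is_attached(lock_dir: Path, handle: str) -> bool:
    """True only if a record exists AND its owning process is still alive."""
    try:
        data = json.loads(_lock_path(lock_dir, handle).read_text())
    except (FileNotFoundError, ValueError):
        return False
    pid = data.get("pid") if isinstance(data, dict) else None
    return isinstance(pid, int) and pid_alive(pid)
```

Because a record whose process has died counts as unattached, a crashed minion does not pin its worktree forever, which keeps the guard from turning into hoarding. PID reuse can produce a false "attached" result, so a real implementation would likely also want a start-time or heartbeat check.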

Severity

Medium. Not blocking (the --older-than guard + manual workspace remove got the job done), but the "could delete a live minion's worktree by default" footgun is sharp, and the accumulation-vs-conservatism balance is a real design decision the tool should make deliberately rather than defaulting to either extreme.
