Orchestrate eject: preflight, config strip, Mimir cleanup by timtalbot · Pull Request #265 · posit-dev/ptd

timtalbot · 2026-04-22T00:58:47Z

Summary

Wires up the full ptd eject --dry-run=false flow, turning the eject command from a documentation-only tool into one that actually disconnects the workload from the control room.

Preflight checks — validates workload credentials, control room credentials, and that control_room_* fields exist in config before proceeding
Config snapshot + strip — snapshots control room fields for the audit record, then zeroes them in ptd.yaml
Mimir password deletion — removes the workload's entry from the control room's mimir-auth secret (fault-tolerant: logs and continues on failure)
Eject record — writes eject-record.json capturing what happened (snapshot, timestamps, success/failure of each step)
Terminology cleanup — sever → eject/removal throughout; sever.go → mimir.go
Connection inventory trimmed — removed Loki and Team Operator connections (no longer applicable)
Secret name fix — mimir-auth secret resource in connection inventory now correctly uses the control room target name (matching actual deletion logic in mimir.go)

Test plan

All existing eject tests pass
New test: full happy path strips config and deletes Mimir entry
New test: preflight fails on empty config (already-ejected workload)
New test: graceful degradation when no control room target available
New test: Mimir deletion failure continues with partial success record
New test: dry-run skips all destructive steps (config, Mimir, eject record)
Manual test against a real workload

Closes #248

…r cleanup Wire up the full eject flow for --dry-run=false: preflight validation, control room config snapshot, ptd.yaml field stripping, Mimir password deletion from the control room secret store, and eject record generation. Closes #248

# Conflicts: # lib/eject/eject.go

timtalbot · 2026-04-22T01:41:20Z

@claude review

claude · 2026-04-22T01:41:32Z

Claude finished @timtalbot's task —— View job

Code Review: `orchestrate-eject`

Overall this is a well-structured PR. The step ordering (snapshot → strip → mimir delete → record) is sound, the fault-tolerance on Mimir deletion is the right call, and the test coverage is thorough. A few observations:

Potential Issues

bundle is unused (eject.go:70): bundle is allocated and its ControlRoom field is set, but it's never serialized or returned. This is dead code carried from the previous PR. Either serialize it into the output directory or remove it to avoid confusion.
eject-record.json file permissions vs metadata.json (eject.go:209): writeEjectRecord uses 0600 while other artifact files (e.g., metadata.json) use 0644. The eject record does contain the snapshot of control room fields (account IDs, domains) — so restricted permissions make sense — but the inconsistency is worth being intentional about. If the whole bundle is meant to be shared with the customer, 0644 might be more appropriate.
Preflight RunPreflightChecks never returns an error (preflight.go:52-77): The function signature returns (*PreflightResult, error) but the error path is never hit — all failures are recorded as CheckFail results. This is fine as an extensibility hook, but the caller in eject.go:79-88 checks both err and !pfResult.Passed, which means the err branch is currently dead code. Not a bug, just something to be aware of.
computePassed() on empty checks returns true (preflight.go:36-43): If no checks run (e.g., all check functions are somehow skipped), preflight passes vacuously. The test TestPreflightResult_ComputePassed_Empty explicitly validates this, so it's intentional, but worth a comment on whether that's the desired behavior for safety.

Minor Nits

Error return from os.WriteFile in tests is ignored (eject_test.go:177,246,268,308,383): Several os.WriteFile calls in tests don't check the error return. While unlikely to fail with t.TempDir(), it's cleaner to use require.NoError.
controlRoomName could be derived inside Run (eject.go:92-95): The pattern of if opts.ControlRoomTarget != nil { controlRoomName = opts.ControlRoomTarget.Name() } could be a method on Options for consistency with configLoader() and ptdYamlPath().

What Looks Good

The snapshot-before-strip ordering guarantees the audit record captures the original values even if something goes wrong mid-eject.
Mimir deletion failure being non-fatal (log-and-continue) is exactly right — the config is already stripped at that point, so failing hard would leave the workload in a confusing half-state with no record.
The terminology cleanup (sever → eject/removal) is consistent throughout all 16 files.
The connection inventory trim (removing Loki and Team Operator) simplifies the model and the mimir-auth secret name fix (using control room name instead of workload name) is a correctness fix.
Test coverage is comprehensive: happy path, preflight failure, graceful degradation, partial success, and dry-run isolation.

Address review feedback on #265: remove unused Bundle type and variable from eject.Run, and wrap os.WriteFile calls in tests with require.NoError for fail-fast on setup errors.

…anup GetSecretValue now wraps cloud-specific not-found errors with a types.ErrSecretNotFound sentinel. RemoveWorkloadMimirPassword reads the secret directly and treats only a genuine not-found as a no-op success, returning any other error (e.g. AccessDenied) so the eject record reflects mimir_secret_removed:false instead of a false success.

StripControlRoomFields and writeEjectRecord now write to a temp file in the same directory and os.Rename it into place, so a crash mid-write leaves the original file intact instead of a truncated/corrupt one. File permissions are preserved (ptd.yaml 0644, eject-record 0600).

Write the eject-record (containing the control-room snapshot) to disk right after taking the snapshot, before stripping ptd.yaml, then rewrite it with final per-step outcomes once the steps complete. If the process dies after the strip but before the final write, the snapshot survives on disk so a re-run can recover instead of failing preflight.

buildConnections derived the mimir-auth secret name from a separately threaded control room target name, which is empty in the dry-run path (the default), producing a malformed '.mimir-auth.posit.team' in the customer handoff doc. Derive it from ControlRoomClusterName instead, which is populated in both dry-run and live paths, and drop the now redundant controlRoomName parameter.

…ct rename Target.Credentials already refreshes internally, so the explicit creds.Refresh call in the preflight credential checks was a guaranteed no-op; remove it. Also complete the severance->eject terminology rename in the handoff doc: '## Eject Plan' header and ejectPlanDryRun/ ejectPlanLive template funcs and text helpers.

timtalbot · 2026-06-22T15:45:06Z

@claude review

claude · 2026-06-22T15:45:28Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

timtalbot · 2026-06-22T16:09:08Z

@claude review

claude · 2026-06-22T16:09:33Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

timtalbot · 2026-06-22T21:21:37Z

@claude review

claude · 2026-06-22T21:22:00Z

Claude finished @timtalbot's task in 3m 33s —— View job

Code Review: `orchestrate-eject`

The overall design is sound: snapshot-before-strip guarantees recovery data survives destructive steps, atomic writes protect against torn files, and the fault-tolerant Mimir deletion is the right call. A few issues worth addressing before merge:

Issues

1. Incomplete intermediate record write — config_stripped: false survives a mid-run crash (eject.go:163-167)

The initial record is written with ConfigStripped: false. After StripControlRoomFields succeeds (line 167), record.ConfigStripped = true is set in memory but never flushed to disk. If the process is killed between the strip and the final writeEjectRecord (line 188), the on-disk record shows config_stripped: false even though the config was already stripped. This contradicts the comment at line 148 ("The steps are marked not-yet-done; the record is rewritten with final outcomes once the steps complete") — the record is not rewritten after each step. A second writeEjectRecord call after line 167 would close the gap.

Fix this →

2. Domain-only config produces ".mimir-auth.posit.team" in the connection inventory (control_room.go:79)

secretName is derived from details.ClusterName, which can be empty if only control_room_domain is set. The comment says this fixes a bug with empty names — but it moves the bug from the live path to the case where domain is set without cluster name. The Secret Sync entry should be guarded by details.ClusterName != "", mirroring how the Mimir deletion itself only fires when ControlRoomTarget != nil. The new test TestBuildConnections_DomainAndClusterName documents the correct case, but the incorrect case (Domain != "", ClusterName == "") goes untested and produces a malformed resource name.

Fix this →

3. Redundant EjectedAt update at step 4 (eject.go:186)

record.EjectedAt is set to time.Now() on line 154 (before any destructive step) and then overwritten to a new time.Now() on line 186. This means the timestamp in the final record reflects when the last step completed, not when eject started. If the intent is to record "when eject began," the line 186 update should be removed. If the intent is "when eject finished," the initial write should keep the first timestamp but the semantics of the field name (ejected_at) should probably be clarified in the struct comment.

4. err variable shadowed in cmd/eject.go (cmd/eject.go:87)

controlRoomTarget, err = legacy.ControlRoomTargetFromName(targetName) uses the same err variable declared at the top of runEject (line 47, from legacy.TargetFromName). This works correctly in Go since the outer err is reused, but it's worth confirming there's no short-variable-declaration (:=) issue here — if controlRoomTarget is also declared at the outer scope, := inside the if !ejectDryRun block would create a new scope-local variable. Looking at the diff, var controlRoomTarget types.Target is declared before the block and = is used, so this is fine. Not a bug, just noting it's worth a quick look.

Minor

5. os.WriteFile errors unchecked in new test assertions (eject_test.go:177, 246, 268, 308, 383)

The first review flagged this. Several os.WriteFile and json.Unmarshal calls in the new tests (e.g., line 232 json.Unmarshal([]byte(store.secrets[secretName]), &remaining)) silently ignore errors. Using require.NoError here would make test failures easier to diagnose.

6. RunPreflightChecks always returns nil error (preflight.go:52)

The signature is (*PreflightResult, error) but all failure paths record a CheckFail result; the error return is always nil. The caller at eject.go:80 checks if err != nil which is dead code today. This is fine as an extension point, but the dead branch could confuse future readers. Either remove the error return or add a comment that it's reserved for future use.

What Looks Good

The two-phase record write (initial record with snapshot before any destructive step) is the right design — the snapshot survives even if the process is killed after the strip.
writeFileAtomic is a clean addition; using it for both StripControlRoomFields and writeEjectRecord is consistent.
ErrSecretNotFound sentinel in lib/types with provider-side wrapping in AWS and Azure is the right pattern. The errors.Is check in mimir.go correctly distinguishes "not found" (success) from permission/transient errors (failure to record).
Multi-error wrapping with two %w verbs (fmt.Errorf("%w: %s: %w", ...)) is valid in Go 1.20+ and this module targets Go 1.26.
fakeSecretStore.getErr field in the new tests lets you simulate transient errors cleanly without needing a full mock.
Test TestCollectControlRoomDetails_DryRunSecretName documents the correctness fix with a regression test — good pattern.

timtalbot added 2 commits April 21, 2026 20:58

Merge remote-tracking branch 'origin/main' into orchestrate-eject

9b5b208

# Conflicts: # lib/eject/eject.go

fix: remove dead Bundle struct and check test WriteFile errors

ecaaf70

Address review feedback on #265: remove unused Bundle type and variable from eject.Run, and wrap os.WriteFile calls in tests with require.NoError for fail-fast on setup errors.

stevenolen force-pushed the orchestrate-eject branch from ebaafc7 to ecaaf70 Compare June 11, 2026 12:35

stevenolen force-pushed the main branch from 473ac02 to f56c461 Compare June 11, 2026 12:35

timtalbot added 5 commits June 22, 2026 11:10

Merge remote-tracking branch 'origin/main' into orchestrate-eject

1e17524

Merge remote-tracking branch 'origin/main' into orchestrate-eject

b685c90

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Orchestrate eject: preflight, config strip, Mimir cleanup#265

Orchestrate eject: preflight, config strip, Mimir cleanup#265
timtalbot wants to merge 10 commits into
mainfrom
orchestrate-eject

timtalbot commented Apr 22, 2026

Uh oh!

timtalbot commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

timtalbot commented Apr 22, 2026

Summary

Test plan

Uh oh!

timtalbot commented Apr 22, 2026

Uh oh!

claude Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: orchestrate-eject

Potential Issues

Minor Nits

What Looks Good

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timtalbot commented Jun 22, 2026

Uh oh!

claude Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: orchestrate-eject

Issues

Minor

What Looks Good

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 22, 2026 •

edited

Loading

Code Review: `orchestrate-eject`

claude Bot commented Jun 22, 2026 •

edited

Loading

claude Bot commented Jun 22, 2026 •

edited

Loading

claude Bot commented Jun 22, 2026 •

edited

Loading

Code Review: `orchestrate-eject`