Skip to content

Orchestrate eject: preflight, config strip, Mimir cleanup#265

Draft
timtalbot wants to merge 10 commits into
mainfrom
orchestrate-eject
Draft

Orchestrate eject: preflight, config strip, Mimir cleanup#265
timtalbot wants to merge 10 commits into
mainfrom
orchestrate-eject

Conversation

@timtalbot

Copy link
Copy Markdown
Contributor

Summary

Wires up the full ptd eject --dry-run=false flow, turning the eject command from a documentation-only tool into one that actually disconnects the workload from the control room.

  • Preflight checks — validates workload credentials, control room credentials, and that control_room_* fields exist in config before proceeding
  • Config snapshot + strip — snapshots control room fields for the audit record, then zeroes them in ptd.yaml
  • Mimir password deletion — removes the workload's entry from the control room's mimir-auth secret (fault-tolerant: logs and continues on failure)
  • Eject record — writes eject-record.json capturing what happened (snapshot, timestamps, success/failure of each step)
  • Terminology cleanupsevereject/removal throughout; sever.gomimir.go
  • Connection inventory trimmed — removed Loki and Team Operator connections (no longer applicable)
  • Secret name fix — mimir-auth secret resource in connection inventory now correctly uses the control room target name (matching actual deletion logic in mimir.go)

Test plan

  • All existing eject tests pass
  • New test: full happy path strips config and deletes Mimir entry
  • New test: preflight fails on empty config (already-ejected workload)
  • New test: graceful degradation when no control room target available
  • New test: Mimir deletion failure continues with partial success record
  • New test: dry-run skips all destructive steps (config, Mimir, eject record)
  • Manual test against a real workload

Closes #248

…r cleanup

Wire up the full eject flow for --dry-run=false: preflight validation,
control room config snapshot, ptd.yaml field stripping, Mimir password
deletion from the control room secret store, and eject record generation.

Closes #248
@timtalbot

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Apr 22, 2026

Copy link
Copy Markdown

Claude finished @timtalbot's task —— View job


Code Review: orchestrate-eject

  • Read all changed files
  • Analyze preflight checks
  • Analyze eject orchestration logic
  • Analyze config strip and Mimir cleanup
  • Review tests
  • Post review feedback

Overall this is a well-structured PR. The step ordering (snapshot → strip → mimir delete → record) is sound, the fault-tolerance on Mimir deletion is the right call, and the test coverage is thorough. A few observations:

Potential Issues

  1. bundle is unused (eject.go:70): bundle is allocated and its ControlRoom field is set, but it's never serialized or returned. This is dead code carried from the previous PR. Either serialize it into the output directory or remove it to avoid confusion.

  2. eject-record.json file permissions vs metadata.json (eject.go:209): writeEjectRecord uses 0600 while other artifact files (e.g., metadata.json) use 0644. The eject record does contain the snapshot of control room fields (account IDs, domains) — so restricted permissions make sense — but the inconsistency is worth being intentional about. If the whole bundle is meant to be shared with the customer, 0644 might be more appropriate.

  3. Preflight RunPreflightChecks never returns an error (preflight.go:52-77): The function signature returns (*PreflightResult, error) but the error path is never hit — all failures are recorded as CheckFail results. This is fine as an extensibility hook, but the caller in eject.go:79-88 checks both err and !pfResult.Passed, which means the err branch is currently dead code. Not a bug, just something to be aware of.

  4. computePassed() on empty checks returns true (preflight.go:36-43): If no checks run (e.g., all check functions are somehow skipped), preflight passes vacuously. The test TestPreflightResult_ComputePassed_Empty explicitly validates this, so it's intentional, but worth a comment on whether that's the desired behavior for safety.

Minor Nits

  1. Error return from os.WriteFile in tests is ignored (eject_test.go:177,246,268,308,383): Several os.WriteFile calls in tests don't check the error return. While unlikely to fail with t.TempDir(), it's cleaner to use require.NoError.

  2. controlRoomName could be derived inside Run (eject.go:92-95): The pattern of if opts.ControlRoomTarget != nil { controlRoomName = opts.ControlRoomTarget.Name() } could be a method on Options for consistency with configLoader() and ptdYamlPath().

What Looks Good

  • The snapshot-before-strip ordering guarantees the audit record captures the original values even if something goes wrong mid-eject.
  • Mimir deletion failure being non-fatal (log-and-continue) is exactly right — the config is already stripped at that point, so failing hard would leave the workload in a confusing half-state with no record.
  • The terminology cleanup (severeject/removal) is consistent throughout all 16 files.
  • The connection inventory trim (removing Loki and Team Operator) simplifies the model and the mimir-auth secret name fix (using control room name instead of workload name) is a correctness fix.
  • Test coverage is comprehensive: happy path, preflight failure, graceful degradation, partial success, and dry-run isolation.

Address review feedback on #265: remove unused Bundle type and
variable from eject.Run, and wrap os.WriteFile calls in tests
with require.NoError for fail-fast on setup errors.
…anup

GetSecretValue now wraps cloud-specific not-found errors with a
types.ErrSecretNotFound sentinel. RemoveWorkloadMimirPassword reads the
secret directly and treats only a genuine not-found as a no-op success,
returning any other error (e.g. AccessDenied) so the eject record
reflects mimir_secret_removed:false instead of a false success.
StripControlRoomFields and writeEjectRecord now write to a temp file in
the same directory and os.Rename it into place, so a crash mid-write
leaves the original file intact instead of a truncated/corrupt one.
File permissions are preserved (ptd.yaml 0644, eject-record 0600).
Write the eject-record (containing the control-room snapshot) to disk
right after taking the snapshot, before stripping ptd.yaml, then rewrite
it with final per-step outcomes once the steps complete. If the process
dies after the strip but before the final write, the snapshot survives
on disk so a re-run can recover instead of failing preflight.
buildConnections derived the mimir-auth secret name from a separately
threaded control room target name, which is empty in the dry-run path
(the default), producing a malformed '.mimir-auth.posit.team' in the
customer handoff doc. Derive it from ControlRoomClusterName instead,
which is populated in both dry-run and live paths, and drop the now
redundant controlRoomName parameter.
…ct rename

Target.Credentials already refreshes internally, so the explicit
creds.Refresh call in the preflight credential checks was a guaranteed
no-op; remove it. Also complete the severance->eject terminology rename
in the handoff doc: '## Eject Plan' header and ejectPlanDryRun/
ejectPlanLive template funcs and text helpers.
@timtalbot

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown

Claude encountered an error —— View job


I'll analyze this and get back to you.

@timtalbot

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown

Claude encountered an error —— View job


I'll analyze this and get back to you.

@timtalbot

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Jun 22, 2026

Copy link
Copy Markdown

Claude finished @timtalbot's task in 3m 33s —— View job


Code Review: orchestrate-eject

  • Read PR diff and context
  • Analyze new/changed files
  • Review preflight checks
  • Review eject orchestration
  • Review config strip, Mimir cleanup, atomic write
  • Review tests
  • Post review feedback

The overall design is sound: snapshot-before-strip guarantees recovery data survives destructive steps, atomic writes protect against torn files, and the fault-tolerant Mimir deletion is the right call. A few issues worth addressing before merge:

Issues

1. Incomplete intermediate record write — config_stripped: false survives a mid-run crash (eject.go:163-167)

The initial record is written with ConfigStripped: false. After StripControlRoomFields succeeds (line 167), record.ConfigStripped = true is set in memory but never flushed to disk. If the process is killed between the strip and the final writeEjectRecord (line 188), the on-disk record shows config_stripped: false even though the config was already stripped. This contradicts the comment at line 148 ("The steps are marked not-yet-done; the record is rewritten with final outcomes once the steps complete") — the record is not rewritten after each step. A second writeEjectRecord call after line 167 would close the gap.

Fix this →

2. Domain-only config produces ".mimir-auth.posit.team" in the connection inventory (control_room.go:79)

secretName is derived from details.ClusterName, which can be empty if only control_room_domain is set. The comment says this fixes a bug with empty names — but it moves the bug from the live path to the case where domain is set without cluster name. The Secret Sync entry should be guarded by details.ClusterName != "", mirroring how the Mimir deletion itself only fires when ControlRoomTarget != nil. The new test TestBuildConnections_DomainAndClusterName documents the correct case, but the incorrect case (Domain != "", ClusterName == "") goes untested and produces a malformed resource name.

Fix this →

3. Redundant EjectedAt update at step 4 (eject.go:186)

record.EjectedAt is set to time.Now() on line 154 (before any destructive step) and then overwritten to a new time.Now() on line 186. This means the timestamp in the final record reflects when the last step completed, not when eject started. If the intent is to record "when eject began," the line 186 update should be removed. If the intent is "when eject finished," the initial write should keep the first timestamp but the semantics of the field name (ejected_at) should probably be clarified in the struct comment.

4. err variable shadowed in cmd/eject.go (cmd/eject.go:87)

controlRoomTarget, err = legacy.ControlRoomTargetFromName(targetName) uses the same err variable declared at the top of runEject (line 47, from legacy.TargetFromName). This works correctly in Go since the outer err is reused, but it's worth confirming there's no short-variable-declaration (:=) issue here — if controlRoomTarget is also declared at the outer scope, := inside the if !ejectDryRun block would create a new scope-local variable. Looking at the diff, var controlRoomTarget types.Target is declared before the block and = is used, so this is fine. Not a bug, just noting it's worth a quick look.

Minor

5. os.WriteFile errors unchecked in new test assertions (eject_test.go:177, 246, 268, 308, 383)

The first review flagged this. Several os.WriteFile and json.Unmarshal calls in the new tests (e.g., line 232 json.Unmarshal([]byte(store.secrets[secretName]), &remaining)) silently ignore errors. Using require.NoError here would make test failures easier to diagnose.

6. RunPreflightChecks always returns nil error (preflight.go:52)

The signature is (*PreflightResult, error) but all failure paths record a CheckFail result; the error return is always nil. The caller at eject.go:80 checks if err != nil which is dead code today. This is fine as an extension point, but the dead branch could confuse future readers. Either remove the error return or add a comment that it's reserved for future use.

What Looks Good

  • The two-phase record write (initial record with snapshot before any destructive step) is the right design — the snapshot survives even if the process is killed after the strip.
  • writeFileAtomic is a clean addition; using it for both StripControlRoomFields and writeEjectRecord is consistent.
  • ErrSecretNotFound sentinel in lib/types with provider-side wrapping in AWS and Azure is the right pattern. The errors.Is check in mimir.go correctly distinguishes "not found" (success) from permission/transient errors (failure to record).
  • Multi-error wrapping with two %w verbs (fmt.Errorf("%w: %s: %w", ...)) is valid in Go 1.20+ and this module targets Go 1.26.
  • fakeSecretStore.getErr field in the new tests lets you simulate transient errors cleanly without needing a full mock.
  • Test TestCollectControlRoomDetails_DryRunSecretName documents the correctness fix with a regression test — good pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Orchestrate the severance

1 participant