Add EFT-3295 Tentacle script abandon design doc#1226
Conversation
| - `source/Octopus.Tentacle.Tests.Integration/Util/SilentProcessRunnerFixture.cs` | ||
| - `source/Octopus.Tentacle.Tests.Integration/Support/TentacleFetchers/LinuxTentacleFetcher.cs` | ||
|
|
||
| **What happens to stdout/stderr after abandon.** The OS process is still running and may still write to its stdout/stderr pipes. The `OutputDataReceived` handler we registered is still wired. On the abandon catch, we let the Process object's handles release through normal GC — the OS process eventually blocks on its next stdout write when the pipe buffer fills (~64KB), which is acceptable per the ticket's "we are not interfering" framing. Tentacle's resources are bounded. |
There was a problem hiding this comment.
Claude is this up to date? If we trigger the abandon token the very next thing looks like it call methods that stops process writing to its pipes
There was a problem hiding this comment.
Caught. Spec was lying. Fixed in the latest commit. The outer using (var process = new Process()) disposes the Process on method unwind, which closes our pipe handles. OS process may get EPIPE on its next stdout/stderr write. Consistent with the ticket because we're closing our own handles, not killing the process. Updated paragraph explains it honestly.
|
|
||
| **Normal-path timing regression check.** Run a 100-iteration benchmark of normal short-script execution (`Write-Host "x"`); compare median wall-clock time vs. a baseline build without the changes. **Verify:** median delta within margin of error. The async swap should not measurably slow normal script execution. | ||
|
|
||
| ### 4.4 Feature flag verification |
There was a problem hiding this comment.
@LukeButters we can't really effectively feature flag this on tentacle can we? Our only option would be to create a script service v2 which we said isn't worth the overhead of supporting that version forever in the wild right?
There was a problem hiding this comment.
Agreed. Tentacle-side flag dropped in the latest commit. Capability advertisement is binary on build version. Server-side AbandonTentacleScriptOnCancellationTimeoutFeatureToggle is the only off-switch. @LukeButters if you disagree, this is the load-bearing comment to push back on.
|
|
||
| ## Section 4 — Automated test strategy | ||
|
|
||
| ### 4.1 `SilentProcessRunner` unit tests (both options) |
There was a problem hiding this comment.
We have selected an option now. So check the whole spec for optionality like this where its not needed. Also make things a bit more succinct
There was a problem hiding this comment.
Cleaned up. Async is locked, "(both options)" headers and Option 1/2 comparison language removed throughout. Section 4.4 (Tentacle-flag tests) deleted, Section 5 M5 rewritten around the server-side flag, ~30 lines trimmed in the latest commit.
| - **Manual `Thread` instead of `Task`.** Same leak problem, just trades threadpool for kernel thread handles + stack memory. | ||
| - **`Thread.Abort` / `Thread.Interrupt` / `TerminateThread` P/Invoke.** No safe managed mechanism to release a thread parked in unmanaged code. `TerminateThread` doesn't unwind stack or release locks; can corrupt Tentacle's own state. | ||
| - **Out-of-process script worker.** Cleanly isolates the stuck-process problem from Tentacle, but is a massive refactor far outside EFT-3295's scope. Worth a separate proposal someday. | ||
| - **Sync cancellable wait via `ManualResetEventSlim.Wait()`** (the earlier "Option 2" we held open for external input). Replaces only the blocking primitive inside `SilentProcessRunner`, leaves everything else synchronous. Smaller diff, but preserves a parked thread per running script in the normal case (same cost as today) and doesn't move the codebase toward async. Rejected in favour of the async approach below — direction matters, not just diff size. |
There was a problem hiding this comment.
And tentacle is very well tested so we believe we can make this change safely
There was a problem hiding this comment.
Good point. Worked it into the rejected bullet: "Rejected in favour of the async approach below. Tentacle's existing test coverage gives us confidence the wider async migration is safe to ship, so the smaller-diff defensiveness isn't compelling."
|
|
||
| **Why a new verb (not a "force" flag on Cancel).** Different semantics: Cancel = "try to stop the OS process gracefully". Abandon = "give up tracking; release the mutex; the OS process may still be running". Two verbs map cleanly to ProcessExecution's two-step escalation (cancel first, abandon if cancel doesn't propagate). | ||
|
|
||
| ## Section 2 — Mutex release mechanics (locked: async) |
There was a problem hiding this comment.
(locked: async) still refers to earlier optionality. You can remove all of this and just leave the considered section as the only mention of optionality
There was a problem hiding this comment.
Done. Dropped (locked: async) from Section 2 header and (locked) from Section 1 header for consistency. Status line at the top simplified too. The rejected-alternatives section is now the only place that mentions past optionality.
| - `TentacleServiceDecoratorBuilder.RecordMethodUsages(...)` decorates the script service so the test can assert how many times each method was called. Use this to verify capability negotiation and call counts for the new `AbandonScript` verb. | ||
| - `Wait.For(condition, timeout, onFail, ct)` is the event-driven polling helper. Always preferred over `Task.Delay` in test bodies. | ||
|
|
||
| **Pattern to follow:** mirror `ClientScriptExecutionIsolationMutex.cs`. Stuck-script tests should use `ScriptBuilder.WaitForFileToExist(...)` as the "kernel-blocked" simulant rather than `sleep 600`. The file-wait is event-driven and the test can release it on demand by creating the file. For the unkillable variant, combine the file-wait pattern with the `Tentacle.Debug.DisableProcessKill` flag described in the manual test setup so `Hitman` becomes a no-op for the test's duration. |
There was a problem hiding this comment.
@LukeButters I can't think of a way to simulate being blocked at the kernel level than actually creating an alternate path in hitman that doesn't do the killing... lmk if you can think of anything better. I dont love this
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. I'm Claude, leaving this review on Jim's behalf. 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
| try | ||
| { | ||
| SilentProcessRunnerExtended.ExecuteCommand( | ||
| // Sync boundary: prerequisite check runs from the WPF installer |
There was a problem hiding this comment.
What does this comment mean its illegible
There was a problem hiding this comment.
Rewritten in the base PR (#1236) with the lighthouse sync-over-async comment pattern: states where we are, why the surrounding code must be sync, the deadlock risk that's being avoided, and a link to Stephen Cleary's canonical reference. The same pattern is applied uniformly at every sync↔async boundary in the codebase.
jimmyp
left a comment
There was a problem hiding this comment.
Need to address these comments
| try | ||
| { | ||
| SilentProcessRunnerExtended.ExecuteCommand( | ||
| // Sync boundary: prerequisite check runs from the WPF installer |
There was a problem hiding this comment.
Same as the other illegibility thread on this file — rewritten in #1236 with the lighthouse sync-over-async pattern.
| var stdOut = new List<string>(); | ||
| var stdErr = new List<string>(); | ||
| var exitCode = silentProcessRunner.ExecuteCommand("du", $"-s -B 1 {directoryPath}", "/", stdOut.Add, stdErr.Add); | ||
| // Sync boundary: called from IMemoryCache.GetOrCreate factory which is synchronous. |
There was a problem hiding this comment.
Brainstorm with me what we should do in each instance of these
There was a problem hiding this comment.
We brainstormed all six sync↔async sites (PowerShellPrerequisite, KubernetesDirectoryInformationProvider, SystemCtlHelper x2, LinuxServiceConfigurator x3, WindowsServiceConfigurator) and landed on a consistent pattern in the base PR (#1236): block via .GetAwaiter().GetResult() with a comment that explains where we are, why the surrounding code is sync, and why the thread (no captured SynchronizationContext) means no deadlock risk. The Kubernetes site here uses that same pattern.
|
|
||
| //non-kubernetes agent tentacles only support the standard script services | ||
| return new CapabilitiesResponseV2(new List<string> { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2) }); | ||
| return new CapabilitiesResponseV2(new List<string> { nameof(IScriptService), nameof(IFileTransferService), nameof(IScriptServiceV2), "AbandonScriptV2" }); |
There was a problem hiding this comment.
Use the nameof operator on the method we added
There was a problem hiding this comment.
Done — the capability now uses nameof(IAsyncClientScriptServiceV2.AbandonScriptAsync) instead of a string literal. Required adding using Octopus.Tentacle.Contracts.ClientServices; to the file.
| return bashScript.ToString(); | ||
| } | ||
|
|
||
| public ScriptBuilder AppendRaw(string bash, string windows) |
There was a problem hiding this comment.
Why did we have to create this why couldnt we use ther existing primitives
There was a problem hiding this comment.
Removed AppendRaw in commit 07bd2728. It was only used by the multi-level test, which was itself deleted in the same commit (the abandon mechanism doesn't care about process tree depth — see the other reply on ClientScriptExecutionAbandon.cs:98).
| exitCode = workspace.ShouldMonitorPowerShellStartup() | ||
| ? await RunPowershellScriptWithMonitoring(shellPath, writer, runningScriptToken) | ||
| : RunScript(shellPath, writer, runningScriptToken); | ||
| : await RunScriptAsync(shellPath, writer, runningScriptToken, abandonToken); |
There was a problem hiding this comment.
Where we catch the canceled exception below, do we want to differentiate abandoned scripts and log appropriately?
There was a problem hiding this comment.
Done in commit 07bd2728. Added a catch (OperationCanceledException) when (abandonToken.IsCancellationRequested) branch that logs "Script execution abandoned." and returns AbandonedExitCode. The plain cancel catch is unchanged.
| return new LegacyTentacleClientBuilder(halibutRuntime, ServiceEndPoint); | ||
| } | ||
|
|
||
| public IAsyncClientScriptServiceV2 CreateScriptServiceV2Client() |
There was a problem hiding this comment.
Added a comment in commit 07bd2728 explaining the rationale: the integration test needs to call AbandonScriptAsync directly to assert on the RPC response and drive the cancel→abandon sequence without going through TentacleClient.ExecuteScript's higher-level orchestrator. TentacleClient deliberately doesn't expose AbandonScript today — the server is the only production consumer and it talks to Halibut directly too. Exposing a direct client here keeps the test focused on the RPC behavior.
| var exitCode = await task; | ||
| sw.Stop(); | ||
|
|
||
| sw.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2), "abandon should return promptly"); |
There was a problem hiding this comment.
Why do we want to assert on this?
There was a problem hiding this comment.
Added an explanatory comment in commit 07bd2728. The whole point of abandon is "return promptly without waiting for the script process". The script we just started runs for 5 minutes. Without the elapsed bound this test would pass even if abandon accidentally waited the full 5 minutes — silently losing the entire contract. 2s is a generous near-instant bound (local: ~tens of ms, CI: under 500ms measured).
| } | ||
|
|
||
| // Latest tentacles also advertise AbandonScriptV2 (added in EFT-3295). Older versioned builds do not. | ||
| if (version == null) |
There was a problem hiding this comment.
Added a comment in commit 07bd2728. version == null indicates the "latest" build under test (the code in this branch). Concrete Version test cases exercise older released tentacles fetched from S3 to verify backwards compatibility. Older builds pre-date EFT-3295 and don't advertise the AbandonScriptAsync capability, so we only assert it for the latest build.
|
|
||
| // Load-bearing: second FullIsolation script with the same mutex name must now be able to start, | ||
| // proving the mutex was released by the abandon | ||
| var secondStartFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "second-start"); |
There was a problem hiding this comment.
Done in commit 07bd2728. The single mixed test was split into two focused tests in ClientScriptExecutionAbandon:
AbandonScript_WhenCancelFailsToKillProcess_ReturnsAbandonedExitCode— covers the abandon-completes path withAbandonedExitCode.AbandonScript_ReleasesIsolationMutexEvenWhileProcessIsStillRunning— covers the mutex-release contract (a second FullIsolation script with the same mutex name must be able to acquire and run).
| var startFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "ml-start"); | ||
| var releaseFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "ml-release"); | ||
|
|
||
| // Multi-level chain: outer shell (bootstrap) → child shell → grandchild polls for the release file. |
There was a problem hiding this comment.
Why do we need to test this directly, does it matter if its multiple levels deep where the stuck thing occurs
There was a problem hiding this comment.
You're right — the multi-level test was redundant. Deleted it in commit 07bd2728. The abandon mechanism is entirely Tentacle-side (fires the abandon token → SilentProcessRunner returns AbandonedExitCode → Execute() unwinds → mutex released). It doesn't touch the script's process tree at all. Hitman is also short-circuited by TentacleDebugDisableProcessKill before any kill happens regardless of tree depth. So process tree depth is irrelevant to the abandon path; the single-level test covers the same mechanism.
cb9761e to
583eb46
Compare
583eb46 to
89077d9
Compare
|
Rebased on top of the new foundational PR #1236 (async migration of SilentProcessRunner). The diff is now focused on the abandon feature itself — the async-migration plumbing has moved to the base PR. Previous head was preserved at git tag claude-safety-2026-05-25-pre-split. |
89077d9 to
0c23f2e
Compare
…, lighthouse comments, async tests - Strip SilentProcessRunner.cs to signature-only async change (move polyfill, EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR) - Apply lighthouse comment pattern uniformly to all sync-over-async sites including test scaffolding and the second/third sites in LinuxServiceConfigurator - Improve LinuxServiceConfigurator justification: name the sync interface chain (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the no-SynchronizationContext property of Main/Topshelf worker threads - Move SilentProcessRunnerFixture sync-over-async comment to actual call site - Convert LinuxConfigureServiceHelperFixture test methods to async Task, eliminating sync-over-async entirely in that file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Adds CancellationToken abandon parameter to ExecuteCommandAsync - Switches the await from WaitForExitAsync(cancel) to WaitForExitAsync(abandon) - Returns ScriptExitCodes.AbandonedExitCode when abandon fires before process exits - Removes process.Close() from DoOurBestToCleanUp (race with WaitForExitAsync's TCS via the Exited event - Close tore down the wait state, hung cancel) - Adds long-form documentation comments explaining the race, the grandchild-pipe scenario, and worst-case cancel latency - Adds TentacleDebugDisableProcessKill test affordance to Hitman Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ace cleanup Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sync behavior Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nc callers, add unit fixtures, advertise capability count Callers outside the abandon hot-path (Manager prereq check, Kubernetes integration tooling, Windows service install, SystemCtl, helm/kubectl downloaders, PowerShell startup detection, Linux tentacle fetcher) pass abandon: CancellationToken.None to satisfy the new ExecuteCommandAsync signature. Also adds RunningScript-level abandon test, ScriptService capabilities count update for AbandonScriptV2, ClientAndTentacle.CreateScriptServiceV2Client helper, and TestDecorators AbandonScriptAsync stub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0c23f2e to
e5ca633
Compare
AbandonScript was waiting up to 5 seconds inside the RPC handler for the running script to reach Complete state before returning the response. The server (Octopus) polls GetStatus afterwards anyway (same pattern as the cancel flow), so the synchronous wait on the Tentacle side just burns RPC time without adding behavior. Integration tests updated to assert against polled GetStatus instead of directly against the abandon response state. Addresses #1226 review: ScriptServiceV2.cs:153 ("Which callers. I dont think we want to do this"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explains why the abandoned-script path tolerates Delete() failure (process still alive, may hold open handles) while the normal completion path surfaces failures. Addresses #1226 review: ScriptServiceV2.cs:176 ('Add some comments'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- RunningScript: differentiate abandoned (log "Script execution abandoned." + AbandonedExitCode) from cancelled (existing behavior). The abandon-specific catch checks abandonToken.IsCancellationRequested so the cancel path is unaffected. - SilentProcessRunner: extract SafelyCancelOutputAndErrorRead helper used by both the abandon catch and the normal completion path. Keeps the two paths consistent (a missed CancelXxxRead leaves async readers firing during dispose). - ClientAndTentacle.CreateScriptServiceV2Client: documented why this helper exists (test needs direct RPC access; TentacleClient deliberately doesn't expose AbandonScript today). - SilentProcessRunnerFixture: documented the 2s elapsed-time assertion (it's the contract: abandon returns promptly without waiting for the underlying process. Without the bound, the test would silently lose the entire contract). - CapabilitiesServiceV2Test: documented why version==null means "latest tentacle under test" (vs older S3-fetched builds for backwards-compat coverage). - ClientScriptExecutionAbandon: split into two focused tests (cancel-fails-then-abandon, and mutex-release). Deleted the multi-level-deep test: the abandon mechanism doesn't touch the script process tree, so process tree depth is irrelevant to the abandon path (Hitman with the env var short- circuits before any kill regardless of depth). - ScriptBuilder: removed AppendRaw (was only used by the deleted multi-level test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| IReadOnlyDictionary<string, string> environmentVariables, | ||
| TimeSpan powerShellStartupTimeout, | ||
| ILog log) : this(shell, workspace, null, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, environmentVariables, powerShellStartupTimeout, log) | ||
| ILog log) : this(shell, workspace, null, scriptLog, taskId, scriptIsolationMutex, runningScriptToken, CancellationToken.None, environmentVariables, powerShellStartupTimeout, log) |
There was a problem hiding this comment.
Why do we want to ever create a running script without an abandon token? At least create well named factory methods to show the intent
Background
EFT-3295. Tentacle script can hang inside endpoint protection (Philips' CrowdStrike + Rapid7 contention). Cancel doesn't propagate, the script-isolation mutex stays held, subsequent deployments queue forever. Customer's only recovery today is RDP-in-and-kill or reboot.
Results
Adds a design doc at
docs/superpowers/specs/2026-05-21-tentacle-script-abandon-design.mdproposing a newAbandonScriptverb onIScriptServiceV2. Tentacle releases the mutex, logs honestly that the script may still be running, frees the .NET thread, does NOT kill the OS process.Implementation approach: async. Replace
SilentProcessRunner.ExecuteCommandoutright withExecuteCommandAsync(no additive overload). All ~20 callers across production code, Kubernetes integration tests, and Tentacle integration test scaffolding migrate to await.Server-side contract is locked with the parallel session via the EFT-3295 Linear thread. Capability check up front. Single feature flag, server-side:
AbandonTentacleScriptOnCancellationTimeoutFeatureToggle. Tentacle's capability advertisement is binary on build version.Before
Status quo. Tentacle hangs indefinitely on
process.WaitForExit()when Cancel can't kill the process. Mutex stays held. Deployments queue forever until reboot.After
Server escalates to
AbandonScriptafter 2 min. Tentacle'sawait process.WaitForExitAsync(abandon)returns. The existingusingblock disposes naturally, mutex released. Tentacle writes an honest line into the script log and accepts new work. OS process is left alone per the ticket.How to review this PR
Doc-only PR. One open question remaining:
Earlier review questions resolved in the latest commit:
Tentacle-side scope only. Server-side has its own design doc in OctopusDeploy.
Quality ✔️
Design review only.
🤖 Generated with Claude Code
[JIM_BOT.EXE v2.13]