refactor(subagents): consolidate sub-agent liveness — guard, delete suspend flag, stop parent over-supervision by Aaronontheweb · Pull Request #1481 · netclaw-dev/netclaw

Aaronontheweb · 2026-06-24T04:32:10Z

What & why

The subagent-boundary half of the #1472 consolidation ("addition through subtraction" of the sub-agent liveness machinery). Builds on #1473 (step 1, already merged). Three independently-tested commits, in dependency order:

Step 3 — fail loud if a SelfMonitoring tool resolves to Opaque (`5ff03bbf`)

The watchdog budget is chosen from GetLivenessMode; a SelfMonitoring-declared tool that silently resolved to Opaque would be put on a wall-clock budget and could kill a healthy self-monitoring run. ToolLivenessValidator runs at startup and fails loud on any such mismatch. (Adapted from the goal's literal "throw in GetLivenessMode" — unsafe, because unknown tools legitimately reach it before ExecuteStreamAsync rejects them — to a startup assertion that achieves the same intent.)

Step 2 — delete the `SuspendsInactivityWatchdog` flag (`b0765503`)

Its only consumer was the parent's watchdog on the spawn_agent stream; its only producer was the sub-agent's approval wait. With step 3 guaranteeing spawn_agent resolves to SelfMonitoring, the parent already stops watching after the first item, so the flag was dead weight. The sub-agent's own approval survival is via ProcessingWatchdog.Stop, unaffected.

Step 4 — parent no longer supervises self-monitoring tools (`1d2d4f53`)

Self-monitoring tools own their liveness end to end: the sub-agent's prefill/no-progress watchdog self-terminates stalls, its guards fast-fail, and SubAgentActor.PostStop guarantees a terminal SubAgentResult on any stop — so the parent's Ask always completes. The parent drains the stream to its terminal item under caller cancellation only, and the FirstItemOnly budget mode is removed.

Contract change: "self-monitoring" now means a tool owns its liveness completely and always yields a terminal item. The parent's blind-startup backstop is gone; spawn_agent satisfies the contract via the PostStop guarantee.

Net effect

Net −67 LOC (my diff: +183 / −250 across 11 files). The flag, the parent's dual-supervision of sub-agents, and the FirstItemOnly mode are all gone; the three invariants (healthy sub-agent never killed; approvals never affect the clock; no silent liveness downgrade) hold by construction at the sub-agent boundary.

Scope / follow-ups

Per checkpoint, the two system-wide steps of #1472 are deferred to their own legs:

Step 5 — replace the off-thread volatile StreamingToolWatchdog (now used only for opaque tools' wall-clock budget) with actor-owned liveness.
Step 6 — unify the daemon's logging path (remove the Akka-async-vs-MEL-sync split + the SessionDiagnosticsContext AsyncLocal gate).

Verification

Full Netclaw.Actors.Tests suite green (2443 passed) before rebase; rebuild + relevant sweep (130) green after rebasing onto latest dev.
dotnet slopwatch analyze clean; copyright headers present.

Part of #1472.

- Cancel handling: a self-monitoring drain cancelled by the caller no longer becomes a tool-result "Error executing tool: The operation was canceled." fed back to the model; it rethrows to the batch handler (clean turn failure). The outer relabel to a wall-clock TimeoutException is kept on purpose — the tool-execution token is also the session's own timeout signal, so the two cannot be distinguished there. - Vacuous test: the old "no parent timeout" test advanced a FakeTimeProvider the drain path never reads, so it proved nothing. Replaced with two honest tests — runs-to-completion, and bounded-only-by-caller-cancellation (the latter also covers the cancel-handling fix above). - Dead code: drop the never-assigned `result` local in ConsumeAsync (after FirstItemOnly was removed, `return result ?? throw` was always the throw). - Stale comment: SpawnAgentTool no longer claims a parent watchdog keeps the delegated run alive — there is none. - Overstated justification: correct the pipeline comment — the drain's terminal item depends on SpawnAsync's finally completing the activity channel, not on PostStop's reply alone (PostStop only unblocks the spawner Ask). - Skill: netclaw-operations now states spawn_agent has no parent timeout (the subagent owns its liveness; bounded by its own watchdog or by cancelling the turn). Bump version 2.15.0 -> 2.16.0. By design (deliberately NOT "fixed"): no parent backstop and an unbounded approval wait. A self-monitoring tool owns its liveness end to end and approval blocks until answered — removing the moving part is the goal. Relates to #1481, #1472.

The startup guard only flagged "declared SelfMonitoring but resolved to a wall-clock mode" (the kill-mid-run direction). With the parent no longer supervising self-monitoring tools at all, the reverse is now equally dangerous: a tool that resolves SelfMonitoring without declaring it via the attribute would be drained with NO watchdog. ToolLivenessValidator now fails loud on either mismatch; adds a test for the reverse direction. Addresses review finding #9. Relates to #1481, #1472.

The streaming-tool-call-execution OpenSpec change and SPEC-016 still described the original two-phase watchdog / "parent first-item startup guard for self-monitoring tools" design, which #1481 superseded. Reconcile the in-flight (unarchived) change's delta spec to the implemented two-class model so it archives correct: - netclaw-tools delta spec: replace "Per-call two-phase inactivity watchdog" with "Per-call liveness by tool class" — opaque tools get one wall-clock budget; self-monitoring tools get NO parent watchdog (drained to a terminal item, bounded only by their own watchdog or caller cancellation); both-direction startup liveness validation. New scenarios cover each. - design.md: D3 update recording the "addition through subtraction" refinement. - tasks.md: Phase H for the consolidation work. - SPEC-016: superseded note pointing at the reconciled OpenSpec requirement. openspec validate passes. Relates to #1481, #1472, #1450.

The per-tool watchdog budget is chosen from IToolExecutor.GetLivenessMode. A SelfMonitoring-declared tool that silently resolves to Opaque would be put on a wall-clock budget and could kill a healthy self-monitoring run — e.g. spawn_agent killing a sub-agent mid-approval. The source generator emits LivenessMode from the [NetclawTool(Liveness=...)] attribute, so the two can only diverge via stale generated code or a hand-rolled override. Add ToolLivenessValidator.AssertSelfMonitoringConsistency and run it at startup (ToolIndexUpdater) over the assembled registry: any tool declaring SelfMonitoring that resolves to a different mode fails loud at startup rather than mis-supervising at runtime. This is the precondition for removing the SuspendsInactivityWatchdog flag (next commit). Note: a literal "make GetLivenessMode throw on miss" is unsafe — unknown tools legitimately reach GetLivenessMode before ExecuteStreamAsync rejects them, so a throw there would break the graceful unknown-tool path. A startup assertion achieves the same intent (no silent downgrade) without that regression. Step 3 of #1472.

The flag's only consumer was the parent's per-tool-call watchdog on the spawn_agent stream; its only producer was the sub-agent's approval wait. With the step-3 guard guaranteeing spawn_agent resolves to SelfMonitoring (FirstItemOnly, which disables the parent watchdog after the first stream item), the parent no longer watches during an approval wait, so the flag is dead weight. The sub-agent's own approval survival is via ProcessingWatchdog.Stop, unaffected. Remove: - ToolActivityUpdate.SuspendsInactivityWatchdog (the flag) - the StreamingToolWatchdog consumer branch - the SubAgentActor producer (EmitActivity param + the approval-wait call) - the now-obsolete suspend watchdog tests (the behaviour no longer exists) The sub-agent still surfaces "awaiting human approval" / "approval resolved" activities to the parent stream for visibility; the renamed actor test covers that. FirstItemOnly survival of a long approval stays covered by First_item_only_budget_disables_parent_liveness_after_startup. Step 2 of #1472.

Self-monitoring tools (spawn_agent) own their liveness end to end: the sub-agent's prefill/no-progress watchdog self-terminates stalls, its guards fast-fail, and SubAgentActor.PostStop guarantees a terminal SubAgentResult on any stop — so the parent's Ask always completes. The parent therefore does not supervise such a tool at all: it drains the stream to its terminal item under caller (turn/user) cancellation only. - SessionToolExecutionPipeline: self-monitoring tools are drained without a watchdog (DrainToCompletionAsync); opaque tools keep the one wall-clock budget. - Remove the FirstItemOnly budget mode (ToolWatchdogResetMode.FirstItemOnly + its factory + the two switch cases) — no longer used. Contract change: "self-monitoring" now means the tool owns its liveness completely and always yields a terminal item; the parent's blind-startup backstop is gone. The removed Self_monitoring_tool_still_times_out_when_startup_never_emits test (and its SelfMonitoringNeverStartsExecutor stub) exercised exactly that backstop. spawn_agent satisfies the contract via the PostStop guarantee; affected tests now drain rather than watch. Step 4 of #1472.

- Cancel handling: a self-monitoring drain cancelled by the caller no longer becomes a tool-result "Error executing tool: The operation was canceled." fed back to the model; it rethrows to the batch handler (clean turn failure). The outer relabel to a wall-clock TimeoutException is kept on purpose — the tool-execution token is also the session's own timeout signal, so the two cannot be distinguished there. - Vacuous test: the old "no parent timeout" test advanced a FakeTimeProvider the drain path never reads, so it proved nothing. Replaced with two honest tests — runs-to-completion, and bounded-only-by-caller-cancellation (the latter also covers the cancel-handling fix above). - Dead code: drop the never-assigned `result` local in ConsumeAsync (after FirstItemOnly was removed, `return result ?? throw` was always the throw). - Stale comment: SpawnAgentTool no longer claims a parent watchdog keeps the delegated run alive — there is none. - Overstated justification: correct the pipeline comment — the drain's terminal item depends on SpawnAsync's finally completing the activity channel, not on PostStop's reply alone (PostStop only unblocks the spawner Ask). - Skill: netclaw-operations now states spawn_agent has no parent timeout (the subagent owns its liveness; bounded by its own watchdog or by cancelling the turn). Bump version 2.15.0 -> 2.16.0. By design (deliberately NOT "fixed"): no parent backstop and an unbounded approval wait. A self-monitoring tool owns its liveness end to end and approval blocks until answered — removing the moving part is the goal. Relates to #1481, #1472.

The startup guard only flagged "declared SelfMonitoring but resolved to a wall-clock mode" (the kill-mid-run direction). With the parent no longer supervising self-monitoring tools at all, the reverse is now equally dangerous: a tool that resolves SelfMonitoring without declaring it via the attribute would be drained with NO watchdog. ToolLivenessValidator now fails loud on either mismatch; adds a test for the reverse direction. Addresses review finding #9. Relates to #1481, #1472.

The streaming-tool-call-execution OpenSpec change and SPEC-016 still described the original two-phase watchdog / "parent first-item startup guard for self-monitoring tools" design, which #1481 superseded. Reconcile the in-flight (unarchived) change's delta spec to the implemented two-class model so it archives correct: - netclaw-tools delta spec: replace "Per-call two-phase inactivity watchdog" with "Per-call liveness by tool class" — opaque tools get one wall-clock budget; self-monitoring tools get NO parent watchdog (drained to a terminal item, bounded only by their own watchdog or caller cancellation); both-direction startup liveness validation. New scenarios cover each. - design.md: D3 update recording the "addition through subtraction" refinement. - tasks.md: Phase H for the consolidation work. - SPEC-016: superseded note pointing at the reconciled OpenSpec requirement. openspec validate passes. Relates to #1481, #1472, #1450.

Aaronontheweb

LGTM - mostly competing timeout ideas being deleted.

Aaronontheweb · 2026-06-25T01:30:45Z

 Status: Planning note for issue #1450. The corresponding OpenSpec artifacts must
 be updated through the OpenSpec workflow before implementation is closed.

+> **Superseded in part by #1472 / PR #1481 ("addition through subtraction").** The


Aaronontheweb · 2026-06-25T01:32:03Z

Aaronontheweb · 2026-06-25T01:32:22Z

-            time,
-            onActivity: _ => firstActivitySeen.TrySetResult(),
-            TestContext.Current.CancellationToken);
+        async Task<string?> DrainAsync()


Aaronontheweb · 2026-06-25T01:33:14Z

+            // SpawnAsync return even on a crash. (Note: PostStop alone only unblocks
+            // the spawner Ask — the terminal stream item depends on that finally
+            // running.) Opaque tools remain bounded by one wall-clock budget.
+            if (executor.GetLivenessMode(toolCall) == ToolLivenessMode.SelfMonitoring)


…out (#1472 step 5) (#1487) After #1481 the off-thread StreamingToolWatchdog was used for exactly one thing: bounding opaque tool calls with a flat wall-clock budget. In that mode its two shared Volatile fields were written once and never changed, so the 1s polling timer only ever computed "has the budget elapsed since start?" — which is exactly what a CancellationTokenSource timeout token does. (The "why is this volatile instead of actor state?" smell.) Replace it with a TimeProvider-driven timeout token — new CancellationTokenSource( timeout, timeProvider) linked to the caller token — draining through the same DrainToCompletionAsync helper self-monitoring tools already use. A budget trip is converted to the same TimeoutException message; caller (turn/user) cancellation is rethrown unchanged (preserves #1481). Per call, so a slow tool times out without affecting siblings, and TimeProvider-virtualizable so tests stay deterministic. - Delete StreamingToolWatchdog, ToolWatchdogBudget, ToolWatchdogResetMode — removes the volatile, the polling timer, and the dead ResetOnItem/FirstItem/InterItem machinery (review follow-up #13). - Unify the two drain paths on DrainToCompletionAsync (review follow-up #11). - Point SpawnAgentStreamingTests at the production drain instead of the watchdog (review follow-up #12). - Opaque wall-clock timeout + concurrent independence are already covered at the pipeline level (FakeTimeProvider); add one test for the stream-without-a-completion- item contract; gut the watchdog unit tests, keeping the INetclawTool default-adapter test. No behavior change: opaque tools still time out at the same wall-clock budget with the same message and per-tool isolation; self-monitoring tools are untouched. The spec-of- record ("Per-call liveness by tool class") already says opaque tools get one wall-clock budget — only the mechanism changed. Net: roughly -150 LOC. Part of #1472.

Aaronontheweb added reliability Retries, resilience, graceful degradation cleanup Code quality improvements and tech debt reduction subagents spawn_agent, SubAgentActor, definition loader, discovery context layer, and related features labels Jun 24, 2026

Aaronontheweb marked this pull request as ready for review June 24, 2026 17:40

Aaronontheweb added 6 commits June 25, 2026 00:40

Aaronontheweb force-pushed the fix/approval-as-state branch from 5f4713f to 9c8c9e6 Compare June 25, 2026 00:43

Aaronontheweb commented Jun 25, 2026

View reviewed changes

Aaronontheweb merged commit 75e7ea2 into dev Jun 25, 2026
21 checks passed

Aaronontheweb deleted the fix/approval-as-state branch June 25, 2026 01:33

Aaronontheweb mentioned this pull request Jun 25, 2026

release: Netclaw 0.24.2 #1497

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(subagents): consolidate sub-agent liveness — guard, delete suspend flag, stop parent over-supervision#1481

refactor(subagents): consolidate sub-agent liveness — guard, delete suspend flag, stop parent over-supervision#1481
Aaronontheweb merged 6 commits into
devfrom
fix/approval-as-state

Aaronontheweb commented Jun 24, 2026

Uh oh!

Aaronontheweb left a comment

Uh oh!

Aaronontheweb Jun 25, 2026

Uh oh!

Aaronontheweb Jun 25, 2026

Uh oh!

Aaronontheweb Jun 25, 2026

Uh oh!

Aaronontheweb Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Aaronontheweb commented Jun 24, 2026

What & why

Step 3 — fail loud if a SelfMonitoring tool resolves to Opaque (5ff03bbf)

Step 2 — delete the SuspendsInactivityWatchdog flag (b0765503)

Step 4 — parent no longer supervises self-monitoring tools (1d2d4f53)

Net effect

Scope / follow-ups

Verification

Uh oh!

Aaronontheweb left a comment

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Step 3 — fail loud if a SelfMonitoring tool resolves to Opaque (`5ff03bbf`)

Step 2 — delete the `SuspendsInactivityWatchdog` flag (`b0765503`)

Step 4 — parent no longer supervises self-monitoring tools (`1d2d4f53`)