.NET: Python: [Bug]: AG-UI host loses tool calls when parallel calls require approval

### Description

## 1. Summary

When a model returns **several tool calls in one turn** and **any of them requires human-in-the-loop approval**, an agent hosted over AG-UI permanently loses every call in the batch except the single one the user approves. The lost calls are never executed and never re-prompted; instead the AG-UI message sanitizer fabricates `"Tool execution skipped …"` results for them — so the model receives false tool results, concludes its calls failed, and re-issues them indefinitely. Auto-approved tools (e.g. `load_skill`, whose approval is supposed to be granted silently by `SkillsProvider.all_tools_auto_approval_rule`) **never execute at all** on this host.

The root cause is a contract violation between three layers that individually behave as designed: MAF's tool-approval flow is *session-stateful* (it parks batch state in `AgentSession.state`), while the AG-UI host constructs a **fresh `AgentSession` per HTTP request** and resolves approvals at the transport layer without a session. Two additional protocol-level defects (cross-run `TOOL_CALL_END`, stream-invisible approval executions) surface once the state loss is fixed.

## 2. Symptoms

User-visible, on any AG-UI frontend (observed with CopilotKit):

- Of N parallel tool calls, only the approved one ever shows a result; sibling call chips stay "Running" forever.
- Only one approval card appears even when several calls are gated; the siblings' cards never appear on later turns either.
- The agent visibly "loops": each reply re-requests the tools it already requested.
- Skills never load on the AG-UI surface (`load_skill` is `always_require` + auto-approve rule — both halves of that contract break here).

Wire-level (captured with `ENABLE_SENSITIVE_DATA` logging, Ollama provider): the request sent to the model after approving 1 of 3 calls:

```text
user: I want a landing zone for my AI app like Microsoft Learn recommends
assistant: tool_calls [load_skill, caf_methodologies, microsoft_docs_search]
tool: load_skill -> "Tool execution skipped - assistant continued before the tool result was available."
tool: caf_methodologies -> "Tool execution skipped - assistant continued before the tool result was available."
tool: microsoft_docs_search -> "Tool execution skipped - ..." <- the call the user JUST approved
assistant: (streamed text)
tool: microsoft_docs_search -> {real results} <- duplicate result for the same call_id
```

Two of three results are fabricated; the approved call gets **two contradictory results** for the same `call_id`. On subsequent turns even the real result disappears (see §4.4) and the fabricated one returns.

### Code Sample

```markdown

```

### Error Messages / Stack Traces

```markdown
## 3. Reproduction

1. Host a harness agent (`create_harness_agent`) over AG-UI (`add_agent_framework_fastapi_endpoint`, `require_confirmation=True`) with at least one tool registered `approval_mode="always_require"` (e.g. an `MCPStreamableHTTPTool`) plus ordinary never-require tools, and skills enabled.
2. Send a prompt that elicits parallel tool calls, e.g. *"I want a landing zone for my AI app like Microsoft Learn recommends"* followed by *"do more MS Learn research"* against the Microsoft Learn MCP — models reliably issue 2–4 parallel `microsoft_docs_search`/`microsoft_docs_fetch` calls plus local tools.
3. Approve the single surfaced request; observe the logs (`Injecting synthetic tool result for pending call_id=…` per orphaned call) and the model's next turn re-issuing the lost calls.

Any provider reproduces it; nothing is provider-specific.
```

### Package Versions

agent-framework-core: 1.10.0, agent-framework-ag-ui: 1.0.0rc7

### Python Version

Python 3.12

### Additional Context

## 4. Root cause analysis

### 4.1 The approval flow is session-stateful by design

Both approval layers write to `AgentSession.state["tool_approval"]` (`DEFAULT_TOOL_APPROVAL_SOURCE_ID` and `_TOOL_APPROVAL_STATE_KEY` are the same key):

- **Function-invocation layer** (`agent_framework/_tools.py`, `_try_execute_function_calls`): when *any* call in a batch is gated, the **whole batch** becomes `function_approval_request` items (documented behavior: *"if the model returns multiple function calls, some that require approval and others that do not, it will ask approval for all of them"*). The never-require siblings are hidden in session state as **already-approved groups** keyed by the visible request ids, to be released by `_pop_already_approved_approval_responses` when those visible approvals are answered.
- **Harness `ToolApprovalMiddleware`** (`agent_framework/_harness/_tool_approval.py`): parks rule-auto-approved responses in **`collected_approval_responses`** (to be injected into the next run's messages) and queues all but the first user-facing request in **`queued_approval_requests`** (one prompt per run).

```mermaid
flowchart TD
 A[Model returns parallel tool-call batch] --> B{Any call gated by approval_mode = always_require?}
 B -- no --> C[Execute all calls immediately]
 B -- yes --> D[Whole batch becomes function_approval_request items]
 D --> E["Never-require siblings session.state['tool_approval'] .already_approved_approval_request_groups"]
 D --> F["Rule-auto-approved calls session.state['tool_approval'] .collected_approval_responses"]
 D --> G[First unresolved request surfaced to the user]
 D --> H["Remaining unresolved requests session.state['tool_approval'] .queued_approval_requests"]
 E -. released when any visible approval of the batch is answered .-> I[Executed on a later run]
 F -. injected into the next run's messages .-> I
 H -. popped one per later run .-> G
 style E fill:#fdd,stroke:#c00
 style F fill:#fdd,stroke:#c00
 style H fill:#fdd,stroke:#c00
```

*(red = state that AG-UI hosting destroys)*

This design works when the host keeps one `AgentSession` alive across runs, as interactive hosts do:

```mermaid
sequenceDiagram
 participant User
 participant Host as Host (DevUI / console)
 participant MW as ToolApprovalMiddleware
 participant FI as Function-invocation layer
 participant LLM

 User->>Host: prompt
 Host->>MW: agent.run(session S)
 MW->>FI: call_next
 FI->>LLM: request
 LLM-->>FI: parallel batch: 1 gated + 2 never-require calls
 FI->>FI: any call gated -> approval requests for ALL
 Note over FI: never-require siblings hidden in S.state ("already-approved group")
 FI-->>MW: 3 approval requests
 Note over MW: rule-approved call parked in S.state (collected), extra prompts parked in S.state (queued)
 MW-->>Host: one approval request
 Host->>User: approve?
 User->>Host: approve
 Host->>MW: agent.run(SAME session S)
 Note over MW,FI: S.state intact: collected responses injected, hidden group popped, queue surfaces next prompt
 MW->>FI: approval responses
 FI->>FI: execute every approved call
 FI->>LLM: request with all real results
```

### 4.2 The AG-UI host violates the session contract

`agent_framework_ag_ui/_agent_run.py::run_agent_stream` constructs `AgentSession(session_id=thread_id)` **per HTTP request** and discards it when the request ends. `AgentSession` is a plain in-memory container; nothing persists or restores `session.state`. Additionally, `_resolve_approval_responses` executes approved calls at the transport layer **without a session**, so `_pop_already_approved_approval_responses` is never called on this host at all.

```mermaid
sequenceDiagram
 participant UI as AG-UI client (CopilotKit)
 participant Run as run_agent_stream
 participant MW as ToolApprovalMiddleware
 participant FI as Function-invocation layer
 participant LLM

 UI->>Run: RunAgentInput (user turn)
 Note over Run: AgentSession S1 = fresh (state = {})
 Run->>MW: agent.run(S1)
 MW->>FI: call_next
 FI->>LLM: request
 LLM-->>FI: load_skill + caf_methodologies + microsoft_docs_search
 FI->>FI: batch gated -> approval requests for all 3
 Note over FI: caf_methodologies (never-require) hidden in S1.state
 Note over MW: load_skill auto-approved by rule -> parked in S1.state microsoft_docs_search surfaced
 Run-->>UI: RUN_FINISHED with interrupt (approve docs_search?)
 Note over Run: HTTP request ends -> S1 discarded queued + collected + hidden state destroyed
 UI->>Run: resume (approval for docs_search)
 Note over Run: AgentSession S2 = fresh (state = {})
 Run->>Run: _sanitize_tool_history injects synthetic "Tool execution skipped" for unanswered calls
 Run->>Run: transport executes docs_search only
 Run->>MW: agent.run(S2) - nothing to recover
 MW->>FI: call_next
 FI->>LLM: request: 1 real result + 2 fabricated "skipped"
 LLM-->>FI: re-issues the lost calls (loop repeats)
```

### 4.3 The sanitizer masks the damage with fabricated results

`_message_adapters._sanitize_tool_history` injects a synthetic `"Tool execution skipped …"` result for any tool call it considers abandoned. Because the thread snapshot stores the assistant's streamed text as a **separate assistant message after** the tool-call message, the sanitizer's *"assistant continued"* branch fires before it ever reaches the approval response — so it fabricates a result **even for the call the user is approving in that very request**, which then coexists with the real result (duplicate `tool` messages for one `call_id`; strict providers can reject this, lenient ones feed the model contradictory data).

```mermaid
flowchart TD
 subgraph History ["Reconstructed history on the resume request (order matters)"]
 M1["user: prompt"] --> M2["assistant: tool_calls A, B, C"]
 M2 --> M3["assistant: streamed text (snapshot stores it as a SEPARATE message after the tool-call message)"]
 M3 --> M4["user: approval response for A"]
 end
 M2 -. "pending = {A, B, C}" .-> S1
 subgraph Sanitizer ["_sanitize_tool_history (single forward pass)"]
 S1{"assistant text arrives while pending is non-empty"} -->|"'assistant continued' branch fires"| S2["inject synthetic results for A, B, C: 'Tool execution skipped ...'"]
 S2 --> S3["approval response for A is reached only AFTER the injection"]
 end
 S3 --> R1["_resolve_approval_responses executes A and writes A's REAL result"]
 R1 --> OUT["Model context: A = fabricated + real (two results, same call_id) B, C = fabricated only"]
 style S2 fill:#fdd,stroke:#c00
 style OUT fill:#fdd,stroke:#c00
```

### 4.4 Approved results are also lost from subsequent turns

`_make_approval_tool_result_events` emits `TOOL_CALL_RESULT` to the client but does **not** record the result in `flow.tool_results`, so the persisted thread snapshot never pairs the approved call with its result under its real `call_id` (`_clean_resolved_approvals_from_snapshot` rewrites the payload under the synthetic `confirm_changes` id instead, which the sanitizer *drops* on the next turn). Net effect: the model sees the real result exactly once — in the resumed run — and the fabricated "skipped" result on every turn thereafter.

Live corroboration (2026-07-04): the client-visible `MESSAGES_SNAPSHOT` after a resume run retained the **raw** approval payload — `{"accepted": true, "steps": […]}` as a tool message under the synthetic confirm id. `_clean_resolved_approvals_from_snapshot` builds its replacement map keyed by the resolved results' **real** `call_id`s, then looks up the snapshot tool message's `toolCallId` — the confirm id — so on this path the intended payload→result rewrite finds no match and never fires at all.

```mermaid
sequenceDiagram
 participant Run as Resume run (turn N)
 participant Flow as flow.tool_results
 participant Snap as Saved thread snapshot
 participant Next as Next turn (N+1)

 Run->>Run: _make_approval_tool_result_events emits TOOL_CALL_RESULT (real call_id) to the client
 Note over Run,Flow: result NOT recorded in flow.tool_results
 Run->>Snap: _clean_resolved_approvals_from_snapshot rewrites the confirm_changes tool message payload with the result
 Note over Snap: result persisted under the SYNTHETIC confirm_changes id, not the real call_id
 Next->>Snap: reconstruct history from snapshot
 Next->>Next: sanitizer filters confirm_changes from the assistant message and drops its orphaned tool message (the real result)
 Next->>Next: real call_id now unanswered again -> synthetic "Tool execution skipped" injected
 Note over Next: the model sees the real result exactly once (turn N) and a fabricated result on every later turn
```

### 4.5 Two latent protocol defects (exposed once state survives)

1. **Cross-run (duplicate) `TOOL_CALL_END`.** A queued approval request popped in a later run makes `_emit_approval_request` emit `TOOL_CALL_END` for a tool call whose `TOOL_CALL_START` happened in the *previous* run. Note this is a **duplicate**, not a late completion: the originating run already emitted the call's `TOOL_CALL_END` when argument streaming finished (via `_emit_approval_request` for the surfaced request, or the end-of-run *"pending tool calls without end event"* sweep for stripped siblings — both observed live: START event 237 and END event 247 in the originating run, then a second END as event 5 of the resume run). The AG-UI client rejects the second END and drops the whole event stream: `Cannot send 'TOOL_CALL_END' event: No active tool call found with ID '…'. A 'TOOL_CALL_START' event must be sent first.` (`runtimeErrorCode: INCOMPLETE_STREAM`). Suppressing it therefore loses no information — it converges on the spec's "emit the lifecycle exactly once, in the run that streamed the call" (§5).
2. **Stream-invisible approval executions.** In the streaming function-invocation loop (`_tools.py`), approved calls executed from inbound approval responses put their results into `prepped_messages` only — the normal in-run execution path yields results as a `ChatResponseUpdate` (`update_role` branch), but the approval path does not. Any host that builds client-visible history from the stream (AG-UI) can never persist these results.

 **This is not limited to cross-run resumes.** Live evidence (2026-07-03, run `40101f41…`): a batch resolved *entirely inside its own run* — `load_skill` rule-auto-approved by `SkillsProvider` with `caf_methodologies`/`waf_pillars` riding along as never-require siblings — executed and fed the model correctly (the results appear in the very next model call's payload), yet the run emitted **zero `TOOL_CALL_RESULT` events among 214** (six `TOOL_CALL_START`s — five real calls plus the synthetic `confirm_changes` — and their six `END`s). Client history and the thread snapshot are permanently missing those results, so tool cards stay "in progress" forever and the sanitizer fabricates "skipped" results for those calls on every later turn **even when approval state itself survives**. Repair layers outside the run are structurally too late: the transport stops consuming the agent stream at the approval interrupt, so anything recovering results at stream cleanup runs only after `RUN_FINISHED` has been emitted. Only yielding the results on the stream — this defect's fix — makes them visible in time.

## 5. AG-UI protocol conformance

The AG-UI interrupts specification (<https://docs.ag-ui.com/concepts/interrupts>) is directly on point for tool-based HITL:

> When `reason: "tool_call"` with `toolCallId`: the agent proposes a tool call in the interrupted run; the client resumes; **the agent emits `ToolCallResult` against the original `toolCallId` in the resumed run**. Crucially: *"The agent does **not** re-emit `ToolCallStart`/`ToolCallArgs`/`ToolCallEnd`."*

and:

> *"A single `resume` array must address every open interrupt from the interrupted run."* …resumption occurs by *"starting a new run whose `RunAgentInput` includes a `resume` array addressing every open interrupt."*

and on state:

> *"The agent must emit any state required for resume via `StateSnapshot`"* before the interrupt; *"State is agent-owned; the client submits resume payloads linking to interrupts, not state objects."*

Conformance assessment of current `agent_framework_ag_ui` behavior:

| Spec expectation | Current behavior | Verdict |
|---|---|---|
| Result-only emission for calls approved across runs (`ToolCallResult` against the original id, no re-emitted Start/Args/End) | `_make_approval_tool_result_events` does this correctly for the *approved* call | ✅ conforms |
| No `ToolCallEnd` without a same-run `ToolCallStart` | `_emit_approval_request` emits `TOOL_CALL_END` for prior-run calls when a queued request surfaces later | ❌ violates (defect 4.5.1) |
| Every open interrupt of the interrupted run addressed by one resume | Only one interrupt is ever opened per run (queueing), so per-run coverage trivially holds; the *batch* however spans runs by design | ⚠️ legal, but see Approach B |
| State required for resume survives to the resumed run | Tool-approval state is destroyed between runs | ❌ the essence of this bug |
| Interrupt objects carry `toolCallId` binding | MAF emits a `CustomEvent("function_approval_request")` + a synthetic `confirm_changes` frontend tool + `RUN_FINISHED.interrupt` entries keyed by the function `call_id` | ⚠️ pre-dates the core taxonomy; out of scope here |

Note the spec's resume-state guidance points at the **shared-state channel** (`StateSnapshot` → client → `RunAgentInput.state`). That is evaluated below as Approach E — and rejected on security grounds for *approval* state specifically; the server-side thread snapshot store (which `agent_framework_ag_ui` already owns for message history) satisfies "state is agent-owned" more strongly.

## 6. Fix approaches considered

### Approach A — persist tool-approval session state in the thread snapshot *(proposed)*

Add `session_state` to `AGUIThreadSnapshot`; `run_agent_stream` restores it into the per-request session and saves back **only** `session.state["tool_approval"]` after each run. Pass the session into `_resolve_approval_responses` so already-approved sibling groups are popped and executed at the transport layer (events + history included).

Storage reuses the existing thread-snapshot abstraction in `agent_framework_ag_ui._snapshots`: the **`AGUIThreadSnapshotStore`** protocol (async `save`/`get`/`delete`/`clear`, keyed by `(scope, thread_id)`) with **`InMemoryAGUIThreadSnapshotStore`** as the only shipped implementation — already bounded (`DEFAULT_MAX_THREAD_SNAPSHOTS = 1_000`, oldest snapshot evicted on overflow), so persisted approval state inherits the same LRU guard. Production deployments plug in their own durable/shared implementation of the protocol.

Today persistence stays inactive unless the endpoint is configured with both a `snapshot_store` and a `snapshot_scope_resolver`. To avoid shipping a fix that is silently off for the common local-dev setup, the proposal is to **default to a bounded `InMemoryAGUIThreadSnapshotStore` (with a constant default scope) when the endpoint serves an agent with approval-gated tools and no store is configured** — process-local and LRU-guarded, so it adds no durability or memory-growth concerns — plus a log line telling multi-instance deployments to supply a shared store. Per the [AG-UI security guidance](https://learn.microsoft.com/en-us/agent-framework/integrations/ag-ui/security-considerations) ("verify session ownership before processing requests"), this default is **dev-only**: a constant scope performs no ownership verification, so multi-user deployments must supply their own `snapshot_scope_resolver` deriving the scope from authenticated user/tenant context (see §8). Alternative if a behavioral default is unwanted: keep opt-in and emit a prominent warning when approval-gated tools are detected without a configured store.

Approach A is the state-persistence core; the proposed branch pairs it with the four companion fixes (detailed at the end of this section). The mapping below shows which §4 defect each piece resolves and which §5 conformance gap it closes:

| Defect / gap | Addressed by | Mechanism |
|---|---|---|
| §4.1 batch-wide gating (design, not a defect) | *unchanged by design* | Semantics preserved identically for all hosts; see Approach C for why changing them is rejected |
| §4.2 per-request session destroys parked state (queued / collected / hidden) | **Approach A core** | `AGUIThreadSnapshot.session_state` restores/persists `session.state["tool_approval"]`; queued prompts re-surface run by run; collected auto-approvals inject into the next run; the session passed into `_resolve_approval_responses` lets hidden already-approved groups pop and execute |
| §4.3 fabricated "Tool execution skipped" results | Companion fix: sanitizer answered-ids guard | `_sanitize_tool_history` skips call ids answered elsewhere in the history (real result or approval response); repair of genuinely abandoned calls is preserved |
| §4.4 approved result vanishes from later turns | Companion fix: record resolved results in `flow.tool_results` | The persisted snapshot pairs the approved call with its result under the real `call_id`; the §4.3 guard then has a real answer to protect on every later turn |
| §4.5.1 duplicate cross-run `TOOL_CALL_END` | Companion fix: same-run guard in `_emit_approval_request` | END is emitted exactly once, in the run that streamed the call (the originating run already emits it — §4.5.1); the later-run duplicate is suppressed, restoring §5 row 2 conformance |
| §4.5.2 stream-invisible approval executions | Companion fix: core streaming yield | Executed approval-response results are yielded as a `ChatResponseUpdate(role="tool")`, so AG-UI (and every streaming host) can render and persist them |
| §5 row "state required for resume survives" (❌) | **Approach A core** | Tool-approval state survives server-side in the snapshot — "state is agent-owned", never client-writable |
| §5 row "no `ToolCallEnd` without same-run `ToolCallStart`" (❌) | Companion fix (§4.5.1 row above) | — |
| §5 row "result-only emission for cross-run approvals" (✅ already) | reinforced | §4.4/§4.5.2 fixes keep those results in history without ever re-emitting Start/Args/End |

| Pros | Cons |
|---|---|
| Fixes the root cause (§4.2) — all three loss classes (queued, collected, hidden) — with the existing UX: sequential prompts keep working, one interrupt per run, protocol-legal; with the bundled companion fixes it covers §4.3, §4.4, §4.5.1 and §4.5.2 and closes both ❌ rows of the §5 conformance table (see mapping above) | Requires a configured snapshot store + scope resolver; without one the host stays broken (mitigated by the local-dev default above: fall back to a bounded `InMemoryAGUIThreadSnapshotStore` when approval-gated tools are present, or at minimum warn) |
| No client or protocol changes; no changes to middleware/core approval semantics — zero impact on console/headless hosts | Widens the snapshot contract (previously "only replayable UI data"); documented narrowly as tool-approval state |
| State stays server-side: clients cannot forge queue/rule/hidden-group contents (contrast E) | Horizontal scaling needs a shared snapshot store (Redis etc.) — but message history already has that exact requirement, so no *new* deployment constraint |
| Piggybacks on the store's existing authz boundary (`(scope, thread_id)`, scope resolver) and eviction | Persisted approval state contains tool names + full arguments → data-at-rest consideration (§8) |
| Scoping to the `tool_approval` key avoids resurrecting history-provider state (full-state persistence demonstrably duplicates AG-UI-owned history — caught by `test_agent_endpoint_prepends_stored_snapshot_for_new_user_turn`) | Other session-stateful features (compaction summaries, invocation budgets) remain per-request; deliberate, but a partial answer to the general problem |

### Approach B — surface all sibling approval requests in one run (no queue)

Change `ToolApprovalMiddleware._process_outbound_messages` to keep every unresolved request visible; the client renders N approval cards; one `resume` answers all.

| Pros | Cons |
|---|---|
| Most protocol-idiomatic: *"a single `resume` array must address every open interrupt"* — the batch resolves in one round trip | Only fixes the *queued* class. Collected auto-approved responses and hidden never-require groups still park in session state → still needs A (or C) anyway |
| Stateless across runs for the prompting part | Changes harness UX for **every** host: console/TUI/DevUI one-at-a-time prompting is a deliberate design; making it host-conditional leaks hosting concerns into the middleware |
| Fewer round trips for large batches | Client support for N simultaneous HITL cards is uneven (CopilotKit `useHumanInTheLoop` renders per action instance; batched `respond()` aggregation into one resume is not guaranteed — observed working reliably only one-at-a-time) |
| | Highest back-compat risk: existing consumers assert single-request rounds |

Worth pursuing upstream *later* as a UX option (`surface_all_approvals=True`), on top of A.

### Approach C — execute never-require siblings immediately (change core batch semantics)

Make `_try_execute_function_calls` run non-gated calls right away and emit approval requests only for gated ones.

| Pros | Cons |
|---|---|
| Eliminates the hidden-group mechanism entirely; results stream in-run and are visible everywhere | Reverses a documented core design decision (*"…it will ask approval for all of them"*) with .NET-parity implications |
| Least total state | **Safety regression**: a never-require tool executes even if the user then *rejects* the gated sibling. Today nothing in the batch runs before the decision — that is a feature for correlated actions (e.g. `read_file` + `send_email` in one batch) |
| | Touches every consumer: console, headless hosting, workflows — largest blast radius for a hosting-specific bug |

### Approach D — long-lived server-side `AgentSession` registry keyed by thread id

Keep real `AgentSession` objects alive in the AG-UI host between requests.

| Pros | Cons |
|---|---|
| Fixes *all* session-stateful features (approvals, compaction, budgets, memory) at once | Duplicates what the snapshot store abstraction already provides, minus its pluggability — an in-process registry breaks on multi-instance deployments (sticky sessions required) |
| No snapshot contract change | Needs its own eviction, lifetime, and serialization policy from scratch |
| | Authorization: thread ids are client-supplied; a bare registry keyed by thread id has no scope boundary. The snapshot store's `snapshot_scope_resolver` exists precisely to bind thread access to an app-defined authz scope — a second mechanism would have to replicate it (risk of divergence) |

### Approach E — round-trip approval state through the client (shared state channel)

Emit the tool-approval state via `StateSnapshot`; the client returns it in `RunAgentInput.state` on resume — the spec's own resume-state channel.

| Pros | Cons |
|---|---|
| Fully stateless server; horizontal scaling for free | **Security-disqualifying for this state**: the queue, standing "always approve" rules, collected auto-approvals, and hidden already-approved groups would become *client-writable*. A tampered client could inject an `already_approved_approval_request_groups` entry or a standing rule and have the server execute arbitrary registered tools **without any approval prompt**. The existing `pending_approvals` server-side registry validates *visible* approval responses precisely because client input is untrusted — this approach would reopen the same hole one layer down |
| Spec-endorsed mechanism (*"emit any state required for resume via StateSnapshot"*) | Mitigation would require signing/encrypting the state blob → key management and complexity out of proportion to the bug |
| | Tool arguments (potentially sensitive) leak to the client beyond what the approval UI needs; payload bloat on every turn |

The spec's guidance fits *UI-relevant* state; for authorization-relevant state, "agent-owned" is better satisfied by server-side storage the client can't write.

### Companion fixes (needed under any approach above)

| Fix | Alternatives considered | Why this one |
|---|---|---|
| **Sanitizer answered-ids guard**: `_sanitize_tool_history` skips synthetic results for call ids answered anywhere in the history (real result or approval response) | (a) run sanitization *after* approval resolution — reorders `normalize_agui_input_messages` for all callers incl. workflows; (b) delete synthetic injection — breaks legitimate abandoned-call repair (user types a new message instead of approving) | Local, additive, keeps the abandoned-call behavior byte-identical |
| **Cross-run `TOOL_CALL_END` suppression**: `_emit_approval_request` emits END only if the call started this run (`flow.tool_calls_by_id`) | (a) client-side tolerance — not ours to change, and the spec explicitly says not to re-emit lifecycle events; (b) re-emit Start+Args+End for the prior call — directly contradicts the spec quote in §5 | Mandated by the spec's normative wording |
| **Stream approval-execution results**: the streaming invocation loop yields a `ChatResponseUpdate(role="tool")` with results executed from inbound approval responses; `_process_function_requests` returns them | (a) AG-UI-only synthesis of result events at the transport — duplicates execution knowledge, leaves DevUI/console equally blind; (b) put results in the final response only — streaming consumers still miss them mid-run | Mirrors the existing in-run execution path (which already yields results); benefits every streaming consumer |
| **Record resolved approval results in `flow.tool_results`** so the snapshot pairs approved calls with results under their real ids | rely on `_clean_resolved_approvals_from_snapshot`'s confirm-id rewrite — demonstrated broken (§4.4) | Makes the persisted history self-consistent; the confirm-id rewrite stays as defense-in-depth |

## 7. Proposed fix — behavior after the change

```mermaid
sequenceDiagram
 participant UI as AG-UI client
 participant Run as run_agent_stream
 participant Store as Thread snapshot store
 participant Agent as Agent (middleware + invocation)

 UI->>Run: resume (approval response)
 Run->>Store: get(scope, thread_id)
 Store-->>Run: snapshot incl. session_state
 Note over Run: restore tool_approval state into the fresh AgentSession
 Run->>Run: resolve approvals: execute approved call, pop + execute released never-require siblings
 Run->>Agent: agent.run(session)
 Note over Agent: middleware injects collected auto-approved calls queue pops the next prompt, if any
 Agent-->>Run: stream (now incl. executed approval results)
 Run->>Store: save(snapshot + tool_approval session_state)
 Run-->>UI: TOOL_CALL_RESULT events + RUN_FINISHED
```

## 8. Security analysis of the proposed fix

- **No new client-controlled input.** `session_state` is written and read exclusively server-side via the snapshot store; it never appears in `RunAgentInput` or any event. There is no injection or forgery channel (this is the decisive difference vs Approach E).
- **Authorization boundary unchanged.** Persisted approval state inherits the store's `(scope, thread_id)` keying; the `snapshot_scope_resolver` remains the app's authz boundary. A client that can access a thread's history can already see its tool calls — the approval state adds queue ordering and standing rules, same sensitivity class.
- **No approval bypass introduced.** Released "already-approved groups" contain only calls that never required approval (`approval_mode="never_require"`) — deferring vs. executing them grants no additional privilege. They are released only after an approval response that passed the existing `pending_approvals` server-side registry validation (anti-bypass, anti-spoofing, anti-replay — untouched). Note the release-on-rejection semantics match the existing `_process_function_requests` behavior (pre-existing, deliberate: rejection of the gated call does not veto never-gated siblings; if that is ever tightened, both paths change together).
- **Standing "always approve" rules now actually persist** on this host — that is the feature working as documented, but it means a rule created via a validated approval response is durable for the thread's snapshot lifetime. Bounded by store eviction (in-memory default: 1000 snapshots LRU); documented in the field docstring.
- **Data at rest.** Persisted approval state includes tool names and full call arguments, which can embed user data. This matches the sensitivity of `snapshot.messages` already stored in the same record, so no *new* class of data is persisted — but stores backed by durable media should treat the whole snapshot as sensitive (encryption at rest is the store implementer's responsibility; worth a line in `AGUIThreadSnapshotStore` docs).
- **Model-context integrity improves.** The sanitizer guard stops fabricated tool results from being presented to the model for calls that have (or will get) real outcomes — fabricated results are a correctness *and* a prompt-integrity concern. Genuinely abandoned calls keep the synthetic repair, so no unpaired-call regressions for strict providers.
- **DoS surface.** Bounded by existing snapshot limits; the queue/groups are bounded by the model's per-turn tool-call count. The 10k-entry LRU `pending_approvals` registry is unchanged.

### Validation against the official AG-UI security guidance

Checked against [Security Considerations for AG-UI (Microsoft Learn)](https://learn.microsoft.com/en-us/agent-framework/integrations/ag-ui/security-considerations) — the framework's own threat model for AG-UI hosting:

| Guidance | How the proposal complies |
|---|---|
| *"Untrusted client input: all data from clients should be treated as potentially malicious"* / **State Injection** threat (*"the messages list and state are the primary vectors for prompt injection attacks"*) | Approval state (queue, standing rules, collected auto-approvals, hidden groups) is kept **out of both primary injection vectors**: `session_state` never appears in `RunAgentInput` or any event. This is precisely why Approach E (round-tripping approval state through the client via shared state) is rejected in §6 — it would move authorization-relevant state into the documented State Injection vector |
| **Message List Injection** threat (*"tool call messages to simulate tool executions or extract data"*) | The sanitizer answered-ids guard reads only the same history the host already consumes; with the snapshot store active (a prerequisite of Approach A), that history is **backend-owned** — `_reconstruct_messages_from_thread_snapshot` + `_filter_untrusted_suffix` (introduced by PR [#6471](https://github.com/microsoft/agent-framework/pull/6471)) drop client-forged non-user messages before the guard ever sees them, so the fix *strengthens* rather than weakens this boundary. The `pending_approvals` registry validation of approval responses is untouched |
| **Session ID Management** (*"Never allow clients to directly access arbitrary Session IDs"*, *"Verify session ownership before processing requests"*) | Persisted approval state is keyed by `(scope, thread_id)` with `snapshot_scope_resolver` as the ownership check — the same boundary already protecting `snapshot.messages`. Consequently the local-dev in-memory default proposed in §6.A **must be flagged dev-only**: its constant scope performs no ownership verification, so production deployments must derive the scope from authenticated user/tenant context (noted in §6.A) |
| **Trusted Frontend Server Pattern** (*"Do not expose AG-UI servers directly to untrusted clients"*) | The proposal is compatible with both deployment models and — because approval state is server-side — remains safe even in the discouraged direct-exposure model, where a hostile client controls every protocol field |
| **Sensitive Data Filtering** (*"Tool responses may inadvertently include sensitive data … Always filter responses before sending to clients"*) | Persisting approval requests (tool names + arguments) adds no new data class: the same store already holds `snapshot.messages` including tool calls and results. The data-at-rest note above applies; response filtering obligations are unchanged |
| **Human-in-the-Loop for Sensitive Operations** (*"Implement approval workflows for high-risk tool operations"*) | This defect silently breaks the article's own recommended control on AG-UI hosting (gated calls never run, auto-approval rules never apply, and the model is fed fabricated results). The fix restores the HITL control's integrity — an argument for treating this as security-relevant, not just functional |

## 9. Impact on non-AG-UI hosts (console, DevUI, headless)

| Change | Console / TUI | DevUI | Headless hosting (`agent-framework-hosting*`) | Workflows |
|---|---|---|---|---|
| Snapshot `session_state` (+ restore) | none — AG-UI package only | none (DevUI has its own host) | none | none |
| Session into `_resolve_approval_responses` + group release | none — AG-UI internal | none | none | none |
| Sanitizer guard | none — AG-UI internal (`normalize_agui_input_messages` callers only; workflow runs already opt out via `sanitize_tool_history=False`) | none | none | none |
| Same-run-only `TOOL_CALL_END` | none | none | none | none |
| **Core: stream approval-execution results** | **visible change**: streaming consumers now receive a `ChatResponseUpdate(role="tool")` with the executed results when a run begins with inbound approval responses. Interactive hosts that render tool results will start showing them (previously silently absent — arguably a fix in itself). Consumers that *aggregate* updates into a final response will include these result contents | same | same | `AgentExecutor` streams updates through; additive content, no ordering change for existing events |

The core change is additive (a new update, no removed or reordered events) and mirrors the payload shape of the existing in-run execution yield. Full upstream test suites pass unmodified except the intentional snapshot field-contract test. Sessions that persist across runs in-process (console/DevUI) never hit the changed AG-UI code paths at all; their approval flow is byte-identical.

## 10. Related upstream issues and PRs (duplicate check, 2026-07-03)

Searched `microsoft/agent-framework` issues and PRs (open and closed, Python and .NET) for: approval, parallel tool calls, queued approval, session state, thread snapshot, `TOOL_CALL_END`/`TOOL_CALL_RESULT`, synthetic/skipped tool results, sanitize tool history, interrupt/resume. **No existing issue reports the core defect** (per-request `AgentSession` destroying tool-approval state for parallel batches). The closest items:

### Same root-cause family (Python) — sub-symptoms this fix addresses or reshapes

| Ref | State | Relation |
|---|---|---|
| [#6828](https://github.com/microsoft/agent-framework/issues/6828) | open | Approval-gated tool reverts to "in progress": resolved results emitted only as transient `TOOL_CALL_RESULT`, never recorded in `flow.tool_results`, so `MESSAGES_SNAPSHOT` omits them — **exactly §4.4**; the `flow.tool_results` companion fix resolves it |
| [#6851](https://github.com/microsoft/agent-framework/issues/6851) | open | Approved tool **re-executes on a later unrelated turn** (stale approval payload re-detected from snapshot); self-identifies the same `_clean_resolved_approvals_from_snapshot` gap as #6828. The result-pairing + answered-ids guard remove the re-detection trigger; verify against its repro before closing as fixed |
| [#6894](https://github.com/microsoft/agent-framework/issues/6894) | open | Foundry: `pending_approvals` registry rejects the approval → tool never executes → stuck "Running" chip + "No tool output found" crash. Different trigger (registry id mismatch on Foundry — **not** fixed by this branch), identical downstream symptom family; the sanitizer guard changes its failure mode |
| [#6266](https://github.com/microsoft/agent-framework/issues/6266) | open | `MessagesSnapshotEvent` splits/reassigns the streamed text message on mixed turns — the snapshot layout that trips the sanitizer's "assistant continued" branch (§4.3) |
| [#4589](https://github.com/microsoft/agent-framework/issues/4589) | closed | Added `TOOL_CALL_RESULT` emission on approval resume (`_make_approval_tool_result_events`) — §4.4 extends that fix from the event stream to the persisted history |
| [#5855](https://github.com/microsoft/agent-framework/issues/5855) | closed | Origin of `_sanitize_tool_history`'s synthetic results (unpaired-call repair for OpenAI). The answered-ids guard deliberately preserves its behavior for genuinely abandoned calls |

### Adjacent Python issues/PRs (same subsystem, not addressed by this fix)

| Ref | State | Relation |
|---|---|---|
| [#5818](https://github.com/microsoft/agent-framework/issues/5818) | open | `AgentSession` pending-request resume failure in workflows (Magentic plan review) — same class of problem (session state across pause/resume), workflow path |
| [#6845](https://github.com/microsoft/agent-framework/issues/6845) / PR [#6867](https://github.com/microsoft/agent-framework/pull/6867) | open | Allow disabling approval for `SkillsProvider` tools — pain amplified by this bug (`load_skill` could never run on AG-UI at all) |
| [#6788](https://github.com/microsoft/agent-framework/issues/6788) / PR [#6822](https://github.com/microsoft/agent-framework/pull/6822) | open | Ollama reuses tool `call_id`s — amplifier observed in our repro (re-issued lost calls collide with their originals) |
| [#5941](https://github.com/microsoft/agent-framework/issues/5941) | open | Multi-turn tool calls 400 on Foundry via AG-UI (replay pairing) — same strict-provider failure class as the **companion ordering report**; not addressed by this fix |
| [#6652](https://github.com/microsoft/agent-framework/issues/6652) | open | Forward HITL approval to hosted/remote FoundryAgent instead of executing locally — adjacent approval-routing gap |
| [#4963](https://github.com/microsoft/agent-framework/issues/4963) / PR [#5300](https://github.com/microsoft/agent-framework/pull/5300) | open | Sub-agent tool approval routing |

### .NET counterparts

| Ref | State | Relation |
|---|---|---|
| [#6756](https://github.com/microsoft/agent-framework/issues/6756) | open | Implement AG-UI interrupts in .NET (*"Python is currently implemented"*) — **the port should not replicate this defect**; §5–§6 of this report apply directly to that design |
| [#2699](https://github.com/microsoft/agent-framework/issues/2699) | open | .NET: AG-UI multi-turn replay produces invalid OpenAI `tool_call` history — .NET member of the invalid-replay family; see the **companion ordering report** for the Python mis-ordering variant |
| [#3054](https://github.com/microsoft/agent-framework/issues/3054) | open | .NET twin of the batch-wide approval semantics (approval asked for *all* functions when one is wrapped) — the same design decision Approach C would reverse; cross-language parity argues for keeping semantics aligned |
| [#5600](https://github.com/microsoft/agent-framework/issues/5600) | open | GroupChat approval flow → HTTP 400 *"tool message without a preceding 'tool_calls'"* — .NET manifestation of unpaired approval results (§4.4's failure mode on a strict provider) |
| [#6786](https://github.com/microsoft/agent-framework/issues/6786) | open | Durable workflows lose executor `TInput` state after HITL approval — .NET cousin of state loss across an approval boundary |
| [#6220](https://github.com/microsoft/agent-framework/issues/6220), [#5587](https://github.com/microsoft/agent-framework/issues/5587) | open | .NET AG-UI client errors on approval rejection / non-JSON `TOOL_CALL_RESULT` — client-side strictness of the same event contract that makes cross-run `TOOL_CALL_END` fatal (§4.5.1) |
| PR [#5805](https://github.com/microsoft/agent-framework/pull/5805) | open | .NET: fix function-approval persistence with per-service-call history persistence — .NET-side recognition that approval state needs durable pairing |

### PRs the proposed fix builds on

| Ref | State | Relation |
|---|---|---|
| PR [#6471](https://github.com/microsoft/agent-framework/pull/6471) | merged | Introduced `AGUIThreadSnapshot` persistence/hydration — the store Approach A extends. Its motivation ("clients that resend full history … allows tampered transcripts") is the same trust argument that disqualifies Approach E for approval state |
| PR [#6376](https://github.com/microsoft/agent-framework/pull/6376) | merged | `pending_approvals` registry argument matching — the validation layer the released-groups path relies on (§8) |
| PR [#3212](https://github.com/microsoft/agent-framework/pull/3212) | merged | MCP tool support in AG-UI approval flows — the code path this repro exercises |
| PR [#6646](https://github.com/microsoft/agent-framework/pull/6646) | open | AG-UI workflow checkpointing — the workflow-side analogue of persisting resume state in the AG-UI host |
| PR [#6594](https://github.com/microsoft/agent-framework/pull/6594) | open | Foundry hosting: extend hosted-session scoping to approval handling — parallel effort recognizing that approval state needs session-scoped ownership on another host |

## 11. Suggested issue framing / follow-ups

- Primary issue: *"AG-UI host loses parallel tool calls when any call requires approval (per-request AgentSession destroys tool-approval state)"* — this report. Cross-link [#6828](https://github.com/microsoft/agent-framework/issues/6828) and [#6851](https://github.com/microsoft/agent-framework/issues/6851) as sub-symptoms of the same family that the accompanying branch also addresses, and [#6756](https://github.com/microsoft/agent-framework/issues/6756) so the .NET AG-UI interrupts port takes this into account.
- Separable follow-ups worth their own issues:
 1. core: approval-execution results invisible to streaming consumers (§4.5.2) — fixes independently of AG-UI;
 2. ag-ui: adopt the AG-UI core interrupt taxonomy (`reason: "tool_call"`, `responseSchema`) instead of `CustomEvent` + synthetic `confirm_changes` (§5, last row) — larger, orthogonal; overlaps with [#6756](https://github.com/microsoft/agent-framework/issues/6756)'s intent;
 3. harness: optional `surface_all_approvals` mode (Approach B) for hosts/clients that support multi-card approval UIs;
 4. ag-ui: approval-resolved results appended **out of order** (after the snapshot's split-off assistant text), producing protocol-invalid history for strict chat providers

Guidance	How the proposal complies
"Untrusted client input: all data from clients should be treated as potentially malicious" / State Injection threat ("the messages list and state are the primary vectors for prompt injection attacks")	Approval state (queue, standing rules, collected auto-approvals, hidden groups) is kept out of both primary injection vectors: `session_state` never appears in `RunAgentInput` or any event. This is precisely why Approach E (round-tripping approval state through the client via shared state) is rejected in §6 — it would move authorization-relevant state into the documented State Injection vector
Message List Injection threat ("tool call messages to simulate tool executions or extract data")	The sanitizer answered-ids guard reads only the same history the host already consumes; with the snapshot store active (a prerequisite of Approach A), that history is backend-owned — `_reconstruct_messages_from_thread_snapshot` + `_filter_untrusted_suffix` (introduced by PR #6471) drop client-forged non-user messages before the guard ever sees them, so the fix strengthens rather than weakens this boundary. The `pending_approvals` registry validation of approval responses is untouched
Session ID Management ("Never allow clients to directly access arbitrary Session IDs", "Verify session ownership before processing requests")	Persisted approval state is keyed by `(scope, thread_id)` with `snapshot_scope_resolver` as the ownership check — the same boundary already protecting `snapshot.messages`. Consequently the local-dev in-memory default proposed in §6.A must be flagged dev-only: its constant scope performs no ownership verification, so production deployments must derive the scope from authenticated user/tenant context (noted in §6.A)
Trusted Frontend Server Pattern ("Do not expose AG-UI servers directly to untrusted clients")	The proposal is compatible with both deployment models and — because approval state is server-side — remains safe even in the discouraged direct-exposure model, where a hostile client controls every protocol field
Sensitive Data Filtering ("Tool responses may inadvertently include sensitive data … Always filter responses before sending to clients")	Persisting approval requests (tool names + arguments) adds no new data class: the same store already holds `snapshot.messages` including tool calls and results. The data-at-rest note above applies; response filtering obligations are unchanged
Human-in-the-Loop for Sensitive Operations ("Implement approval workflows for high-risk tool operations")	This defect silently breaks the article's own recommended control on AG-UI hosting (gated calls never run, auto-approval rules never apply, and the model is fed fabricated results). The fix restores the HITL control's integrity — an argument for treating this as security-relevant, not just functional

Spec expectation	Current behavior	Verdict
Result-only emission for calls approved across runs (`ToolCallResult` against the original id, no re-emitted Start/Args/End)	`_make_approval_tool_result_events` does this correctly for the approved call	✅ conforms
No `ToolCallEnd` without a same-run `ToolCallStart`	`_emit_approval_request` emits `TOOL_CALL_END` for prior-run calls when a queued request surfaces later	❌ violates (defect 4.5.1)
Every open interrupt of the interrupted run addressed by one resume	Only one interrupt is ever opened per run (queueing), so per-run coverage trivially holds; the batch however spans runs by design	⚠️ legal, but see Approach B
State required for resume survives to the resumed run	Tool-approval state is destroyed between runs	❌ the essence of this bug
Interrupt objects carry `toolCallId` binding	MAF emits a `CustomEvent("function_approval_request")` + a synthetic `confirm_changes` frontend tool + `RUN_FINISHED.interrupt` entries keyed by the function `call_id`	⚠️ pre-dates the core taxonomy; out of scope here

Defect / gap	Addressed by	Mechanism
§4.1 batch-wide gating (design, not a defect)	unchanged by design	Semantics preserved identically for all hosts; see Approach C for why changing them is rejected
§4.2 per-request session destroys parked state (queued / collected / hidden)	Approach A core	`AGUIThreadSnapshot.session_state` restores/persists `session.state["tool_approval"]`; queued prompts re-surface run by run; collected auto-approvals inject into the next run; the session passed into `_resolve_approval_responses` lets hidden already-approved groups pop and execute
§4.3 fabricated "Tool execution skipped" results	Companion fix: sanitizer answered-ids guard	`_sanitize_tool_history` skips call ids answered elsewhere in the history (real result or approval response); repair of genuinely abandoned calls is preserved
§4.4 approved result vanishes from later turns	Companion fix: record resolved results in `flow.tool_results`	The persisted snapshot pairs the approved call with its result under the real `call_id`; the §4.3 guard then has a real answer to protect on every later turn
§4.5.1 duplicate cross-run `TOOL_CALL_END`	Companion fix: same-run guard in `_emit_approval_request`	END is emitted exactly once, in the run that streamed the call (the originating run already emits it — §4.5.1); the later-run duplicate is suppressed, restoring §5 row 2 conformance
§4.5.2 stream-invisible approval executions	Companion fix: core streaming yield	Executed approval-response results are yielded as a `ChatResponseUpdate(role="tool")`, so AG-UI (and every streaming host) can render and persist them
§5 row "state required for resume survives" (❌)	Approach A core	Tool-approval state survives server-side in the snapshot — "state is agent-owned", never client-writable
§5 row "no `ToolCallEnd` without same-run `ToolCallStart`" (❌)	Companion fix (§4.5.1 row above)	—
§5 row "result-only emission for cross-run approvals" (✅ already)	reinforced	§4.4/§4.5.2 fixes keep those results in history without ever re-emitting Start/Args/End

Pros	Cons
Fixes the root cause (§4.2) — all three loss classes (queued, collected, hidden) — with the existing UX: sequential prompts keep working, one interrupt per run, protocol-legal; with the bundled companion fixes it covers §4.3, §4.4, §4.5.1 and §4.5.2 and closes both ❌ rows of the §5 conformance table (see mapping above)	Requires a configured snapshot store + scope resolver; without one the host stays broken (mitigated by the local-dev default above: fall back to a bounded `InMemoryAGUIThreadSnapshotStore` when approval-gated tools are present, or at minimum warn)
No client or protocol changes; no changes to middleware/core approval semantics — zero impact on console/headless hosts	Widens the snapshot contract (previously "only replayable UI data"); documented narrowly as tool-approval state
State stays server-side: clients cannot forge queue/rule/hidden-group contents (contrast E)	Horizontal scaling needs a shared snapshot store (Redis etc.) — but message history already has that exact requirement, so no new deployment constraint
Piggybacks on the store's existing authz boundary (`(scope, thread_id)`, scope resolver) and eviction	Persisted approval state contains tool names + full arguments → data-at-rest consideration (§8)
Scoping to the `tool_approval` key avoids resurrecting history-provider state (full-state persistence demonstrably duplicates AG-UI-owned history — caught by `test_agent_endpoint_prepends_stored_snapshot_for_new_user_turn`)	Other session-stateful features (compaction summaries, invocation budgets) remain per-request; deliberate, but a partial answer to the general problem

Pros	Cons
Most protocol-idiomatic: "a single `resume` array must address every open interrupt" — the batch resolves in one round trip	Only fixes the queued class. Collected auto-approved responses and hidden never-require groups still park in session state → still needs A (or C) anyway
Stateless across runs for the prompting part	Changes harness UX for every host: console/TUI/DevUI one-at-a-time prompting is a deliberate design; making it host-conditional leaks hosting concerns into the middleware
Fewer round trips for large batches	Client support for N simultaneous HITL cards is uneven (CopilotKit `useHumanInTheLoop` renders per action instance; batched `respond()` aggregation into one resume is not guaranteed — observed working reliably only one-at-a-time)
	Highest back-compat risk: existing consumers assert single-request rounds

Pros	Cons
Eliminates the hidden-group mechanism entirely; results stream in-run and are visible everywhere	Reverses a documented core design decision ("…it will ask approval for all of them") with .NET-parity implications
Least total state	Safety regression: a never-require tool executes even if the user then rejects the gated sibling. Today nothing in the batch runs before the decision — that is a feature for correlated actions (e.g. `read_file` + `send_email` in one batch)
	Touches every consumer: console, headless hosting, workflows — largest blast radius for a hosting-specific bug

Pros	Cons
Fixes all session-stateful features (approvals, compaction, budgets, memory) at once	Duplicates what the snapshot store abstraction already provides, minus its pluggability — an in-process registry breaks on multi-instance deployments (sticky sessions required)
No snapshot contract change	Needs its own eviction, lifetime, and serialization policy from scratch
	Authorization: thread ids are client-supplied; a bare registry keyed by thread id has no scope boundary. The snapshot store's `snapshot_scope_resolver` exists precisely to bind thread access to an app-defined authz scope — a second mechanism would have to replicate it (risk of divergence)

Pros	Cons
Fully stateless server; horizontal scaling for free	Security-disqualifying for this state: the queue, standing "always approve" rules, collected auto-approvals, and hidden already-approved groups would become client-writable. A tampered client could inject an `already_approved_approval_request_groups` entry or a standing rule and have the server execute arbitrary registered tools without any approval prompt. The existing `pending_approvals` server-side registry validates visible approval responses precisely because client input is untrusted — this approach would reopen the same hole one layer down
Spec-endorsed mechanism ("emit any state required for resume via StateSnapshot")	Mitigation would require signing/encrypting the state blob → key management and complexity out of proportion to the bug
	Tool arguments (potentially sensitive) leak to the client beyond what the approval UI needs; payload bloat on every turn

Fix	Alternatives considered	Why this one
Sanitizer answered-ids guard: `_sanitize_tool_history` skips synthetic results for call ids answered anywhere in the history (real result or approval response)	(a) run sanitization after approval resolution — reorders `normalize_agui_input_messages` for all callers incl. workflows; (b) delete synthetic injection — breaks legitimate abandoned-call repair (user types a new message instead of approving)	Local, additive, keeps the abandoned-call behavior byte-identical
Cross-run `TOOL_CALL_END` suppression: `_emit_approval_request` emits END only if the call started this run (`flow.tool_calls_by_id`)	(a) client-side tolerance — not ours to change, and the spec explicitly says not to re-emit lifecycle events; (b) re-emit Start+Args+End for the prior call — directly contradicts the spec quote in §5	Mandated by the spec's normative wording
Stream approval-execution results: the streaming invocation loop yields a `ChatResponseUpdate(role="tool")` with results executed from inbound approval responses; `_process_function_requests` returns them	(a) AG-UI-only synthesis of result events at the transport — duplicates execution knowledge, leaves DevUI/console equally blind; (b) put results in the final response only — streaming consumers still miss them mid-run	Mirrors the existing in-run execution path (which already yields results); benefits every streaming consumer
Record resolved approval results in `flow.tool_results` so the snapshot pairs approved calls with results under their real ids	rely on `_clean_resolved_approvals_from_snapshot`'s confirm-id rewrite — demonstrated broken (§4.4)	Makes the persisted history self-consistent; the confirm-id rewrite stays as defense-in-depth

Change	Console / TUI	DevUI	Headless hosting (`agent-framework-hosting*`)	Workflows
Snapshot `session_state` (+ restore)	none — AG-UI package only	none (DevUI has its own host)	none	none
Session into `_resolve_approval_responses` + group release	none — AG-UI internal	none	none	none
Sanitizer guard	none — AG-UI internal (`normalize_agui_input_messages` callers only; workflow runs already opt out via `sanitize_tool_history=False`)	none	none	none
Same-run-only `TOOL_CALL_END`	none	none	none	none
Core: stream approval-execution results	visible change: streaming consumers now receive a `ChatResponseUpdate(role="tool")` with the executed results when a run begins with inbound approval responses. Interactive hosts that render tool results will start showing them (previously silently absent — arguably a fix in itself). Consumers that aggregate updates into a final response will include these result contents	same	same	`AgentExecutor` streams updates through; additive content, no ordering change for existing events

Ref	State	Relation
#6828	open	Approval-gated tool reverts to "in progress": resolved results emitted only as transient `TOOL_CALL_RESULT`, never recorded in `flow.tool_results`, so `MESSAGES_SNAPSHOT` omits them — exactly §4.4; the `flow.tool_results` companion fix resolves it
#6851	open	Approved tool re-executes on a later unrelated turn (stale approval payload re-detected from snapshot); self-identifies the same `_clean_resolved_approvals_from_snapshot` gap as #6828. The result-pairing + answered-ids guard remove the re-detection trigger; verify against its repro before closing as fixed
#6894	open	Foundry: `pending_approvals` registry rejects the approval → tool never executes → stuck "Running" chip + "No tool output found" crash. Different trigger (registry id mismatch on Foundry — not fixed by this branch), identical downstream symptom family; the sanitizer guard changes its failure mode
#6266	open	`MessagesSnapshotEvent` splits/reassigns the streamed text message on mixed turns — the snapshot layout that trips the sanitizer's "assistant continued" branch (§4.3)
#4589	closed	Added `TOOL_CALL_RESULT` emission on approval resume (`_make_approval_tool_result_events`) — §4.4 extends that fix from the event stream to the persisted history
#5855	closed	Origin of `_sanitize_tool_history`'s synthetic results (unpaired-call repair for OpenAI). The answered-ids guard deliberately preserves its behavior for genuinely abandoned calls

Ref	State	Relation
#5818	open	`AgentSession` pending-request resume failure in workflows (Magentic plan review) — same class of problem (session state across pause/resume), workflow path
#6845 / PR #6867	open	Allow disabling approval for `SkillsProvider` tools — pain amplified by this bug (`load_skill` could never run on AG-UI at all)
#6788 / PR #6822	open	Ollama reuses tool `call_id`s — amplifier observed in our repro (re-issued lost calls collide with their originals)
#5941	open	Multi-turn tool calls 400 on Foundry via AG-UI (replay pairing) — same strict-provider failure class as the companion ordering report; not addressed by this fix
#6652	open	Forward HITL approval to hosted/remote FoundryAgent instead of executing locally — adjacent approval-routing gap
#4963 / PR #5300	open	Sub-agent tool approval routing

Ref	State	Relation
#6756	open	Implement AG-UI interrupts in .NET ("Python is currently implemented") — the port should not replicate this defect; §5–§6 of this report apply directly to that design
#2699	open	.NET: AG-UI multi-turn replay produces invalid OpenAI `tool_call` history — .NET member of the invalid-replay family; see the companion ordering report for the Python mis-ordering variant
#3054	open	.NET twin of the batch-wide approval semantics (approval asked for all functions when one is wrapped) — the same design decision Approach C would reverse; cross-language parity argues for keeping semantics aligned
#5600	open	GroupChat approval flow → HTTP 400 "tool message without a preceding 'tool_calls'" — .NET manifestation of unpaired approval results (§4.4's failure mode on a strict provider)
#6786	open	Durable workflows lose executor `TInput` state after HITL approval — .NET cousin of state loss across an approval boundary
#6220, #5587	open	.NET AG-UI client errors on approval rejection / non-JSON `TOOL_CALL_RESULT` — client-side strictness of the same event contract that makes cross-run `TOOL_CALL_END` fatal (§4.5.1)
PR #5805	open	.NET: fix function-approval persistence with per-service-call history persistence — .NET-side recognition that approval state needs durable pairing

Ref	State	Relation
PR #6471	merged	Introduced `AGUIThreadSnapshot` persistence/hydration — the store Approach A extends. Its motivation ("clients that resend full history … allows tampered transcripts") is the same trust argument that disqualifies Approach E for approval state
PR #6376	merged	`pending_approvals` registry argument matching — the validation layer the released-groups path relies on (§8)
PR #3212	merged	MCP tool support in AG-UI approval flows — the code path this repro exercises
PR #6646	open	AG-UI workflow checkpointing — the workflow-side analogue of persisting resume state in the AG-UI host
PR #6594	open	Foundry hosting: extend hosted-session scoping to approval handling — parallel effort recognizing that approval state needs session-scoped ownership on another host

Uh oh!

.NET: Python: [Bug]: AG-UI host loses tool calls when parallel calls require approval #6910

Description

Description

1. Summary

2. Symptoms

Code Sample

Error Messages / Stack Traces

Package Versions

Python Version

Additional Context

4. Root cause analysis

4.1 The approval flow is session-stateful by design

4.2 The AG-UI host violates the session contract

4.3 The sanitizer masks the damage with fabricated results

4.4 Approved results are also lost from subsequent turns

4.5 Two latent protocol defects (exposed once state survives)

5. AG-UI protocol conformance

6. Fix approaches considered

Approach A — persist tool-approval session state in the thread snapshot (proposed)

Approach B — surface all sibling approval requests in one run (no queue)

Approach C — execute never-require siblings immediately (change core batch semantics)

Approach D — long-lived server-side AgentSession registry keyed by thread id

Approach E — round-trip approval state through the client (shared state channel)

Companion fixes (needed under any approach above)

7. Proposed fix — behavior after the change

8. Security analysis of the proposed fix

Validation against the official AG-UI security guidance

9. Impact on non-AG-UI hosts (console, DevUI, headless)

10. Related upstream issues and PRs (duplicate check, 2026-07-03)

Same root-cause family (Python) — sub-symptoms this fix addresses or reshapes

Adjacent Python issues/PRs (same subsystem, not addressed by this fix)

.NET counterparts

PRs the proposed fix builds on

11. Suggested issue framing / follow-ups

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Approach D — long-lived server-side `AgentSession` registry keyed by thread id