You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When a model returns several tool calls in one turn and any of them requires human-in-the-loop approval, an agent hosted over AG-UI permanently loses every call in the batch except the single one the user approves. The lost calls are never executed and never re-prompted; instead the AG-UI message sanitizer fabricates "Tool execution skipped …" results for them — so the model receives false tool results, concludes its calls failed, and re-issues them indefinitely. Auto-approved tools (e.g. load_skill, whose approval is supposed to be granted silently by SkillsProvider.all_tools_auto_approval_rule) never execute at all on this host.
The root cause is a contract violation between three layers that individually behave as designed: MAF's tool-approval flow is session-stateful (it parks batch state in AgentSession.state), while the AG-UI host constructs a fresh AgentSession per HTTP request and resolves approvals at the transport layer without a session. Two additional protocol-level defects (cross-run TOOL_CALL_END, stream-invisible approval executions) surface once the state loss is fixed.
2. Symptoms
User-visible, on any AG-UI frontend (observed with CopilotKit):
Of N parallel tool calls, only the approved one ever shows a result; sibling call chips stay "Running" forever.
Only one approval card appears even when several calls are gated; the siblings' cards never appear on later turns either.
The agent visibly "loops": each reply re-requests the tools it already requested.
Skills never load on the AG-UI surface (load_skill is always_require + auto-approve rule — both halves of that contract break here).
Wire-level (captured with ENABLE_SENSITIVE_DATA logging, Ollama provider): the request sent to the model after approving 1 of 3 calls:
user: I want a landing zone for my AI app like Microsoft Learn recommends
assistant: tool_calls [load_skill, caf_methodologies, microsoft_docs_search]
tool: load_skill -> "Tool execution skipped - assistant continued before the tool result was available."
tool: caf_methodologies -> "Tool execution skipped - assistant continued before the tool result was available."
tool: microsoft_docs_search -> "Tool execution skipped - ..." <- the call the user JUST approved
assistant: (streamed text)
tool: microsoft_docs_search -> {real results} <- duplicate result for the same call_id
Two of three results are fabricated; the approved call gets two contradictory results for the same call_id. On subsequent turns even the real result disappears (see §4.4) and the fabricated one returns.
Code Sample
Error Messages / Stack Traces
## 3. Reproduction1. Host a harness agent (`create_harness_agent`) over AG-UI (`add_agent_framework_fastapi_endpoint`, `require_confirmation=True`) with at least one tool registered `approval_mode="always_require"` (e.g. an `MCPStreamableHTTPTool`) plus ordinary never-require tools, and skills enabled.
2. Send a prompt that elicits parallel tool calls, e.g. *"I want a landing zone for my AI app like Microsoft Learn recommends"* followed by *"do more MS Learn research"* against the Microsoft Learn MCP — models reliably issue 2–4 parallel `microsoft_docs_search`/`microsoft_docs_fetch` calls plus local tools.
3. Approve the single surfaced request; observe the logs (`Injecting synthetic tool result for pending call_id=…` per orphaned call) and the model's next turn re-issuing the lost calls.
Any provider reproduces it; nothing is provider-specific.
4.1 The approval flow is session-stateful by design
Both approval layers write to AgentSession.state["tool_approval"] (DEFAULT_TOOL_APPROVAL_SOURCE_ID and _TOOL_APPROVAL_STATE_KEY are the same key):
Function-invocation layer (agent_framework/_tools.py, _try_execute_function_calls): when any call in a batch is gated, the whole batch becomes function_approval_request items (documented behavior: "if the model returns multiple function calls, some that require approval and others that do not, it will ask approval for all of them"). The never-require siblings are hidden in session state as already-approved groups keyed by the visible request ids, to be released by _pop_already_approved_approval_responses when those visible approvals are answered.
Harness ToolApprovalMiddleware (agent_framework/_harness/_tool_approval.py): parks rule-auto-approved responses in collected_approval_responses (to be injected into the next run's messages) and queues all but the first user-facing request in queued_approval_requests (one prompt per run).
flowchart TD
A[Model returns parallel tool-call batch] --> B{Any call gated by<br/>approval_mode = always_require?}
B -- no --> C[Execute all calls immediately]
B -- yes --> D[Whole batch becomes<br/>function_approval_request items]
D --> E["Never-require siblings<br/>session.state['tool_approval']<br/>.already_approved_approval_request_groups"]
D --> F["Rule-auto-approved calls<br/>session.state['tool_approval']<br/>.collected_approval_responses"]
D --> G[First unresolved request<br/>surfaced to the user]
D --> H["Remaining unresolved requests<br/>session.state['tool_approval']<br/>.queued_approval_requests"]
E -. released when any visible approval<br/>of the batch is answered .-> I[Executed on a later run]
F -. injected into the next run's messages .-> I
H -. popped one per later run .-> G
style E fill:#fdd,stroke:#c00
style F fill:#fdd,stroke:#c00
style H fill:#fdd,stroke:#c00
Loading
(red = state that AG-UI hosting destroys)
This design works when the host keeps one AgentSession alive across runs, as interactive hosts do:
sequenceDiagram
participant User
participant Host as Host (DevUI / console)
participant MW as ToolApprovalMiddleware
participant FI as Function-invocation layer
participant LLM
User->>Host: prompt
Host->>MW: agent.run(session S)
MW->>FI: call_next
FI->>LLM: request
LLM-->>FI: parallel batch: 1 gated + 2 never-require calls
FI->>FI: any call gated -> approval requests for ALL
Note over FI: never-require siblings hidden in S.state<br/>("already-approved group")
FI-->>MW: 3 approval requests
Note over MW: rule-approved call parked in S.state (collected),<br/>extra prompts parked in S.state (queued)
MW-->>Host: one approval request
Host->>User: approve?
User->>Host: approve
Host->>MW: agent.run(SAME session S)
Note over MW,FI: S.state intact: collected responses injected,<br/>hidden group popped, queue surfaces next prompt
MW->>FI: approval responses
FI->>FI: execute every approved call
FI->>LLM: request with all real results
Loading
4.2 The AG-UI host violates the session contract
agent_framework_ag_ui/_agent_run.py::run_agent_stream constructs AgentSession(session_id=thread_id)per HTTP request and discards it when the request ends. AgentSession is a plain in-memory container; nothing persists or restores session.state. Additionally, _resolve_approval_responses executes approved calls at the transport layer without a session, so _pop_already_approved_approval_responses is never called on this host at all.
sequenceDiagram
participant UI as AG-UI client (CopilotKit)
participant Run as run_agent_stream
participant MW as ToolApprovalMiddleware
participant FI as Function-invocation layer
participant LLM
UI->>Run: RunAgentInput (user turn)
Note over Run: AgentSession S1 = fresh (state = {})
Run->>MW: agent.run(S1)
MW->>FI: call_next
FI->>LLM: request
LLM-->>FI: load_skill + caf_methodologies + microsoft_docs_search
FI->>FI: batch gated -> approval requests for all 3
Note over FI: caf_methodologies (never-require)<br/>hidden in S1.state
Note over MW: load_skill auto-approved by rule -> parked in S1.state<br/>microsoft_docs_search surfaced
Run-->>UI: RUN_FINISHED with interrupt (approve docs_search?)
Note over Run: HTTP request ends -> S1 discarded<br/>queued + collected + hidden state destroyed
UI->>Run: resume (approval for docs_search)
Note over Run: AgentSession S2 = fresh (state = {})
Run->>Run: _sanitize_tool_history injects synthetic<br/>"Tool execution skipped" for unanswered calls
Run->>Run: transport executes docs_search only
Run->>MW: agent.run(S2) - nothing to recover
MW->>FI: call_next
FI->>LLM: request: 1 real result + 2 fabricated "skipped"
LLM-->>FI: re-issues the lost calls (loop repeats)
Loading
4.3 The sanitizer masks the damage with fabricated results
_message_adapters._sanitize_tool_history injects a synthetic "Tool execution skipped …" result for any tool call it considers abandoned. Because the thread snapshot stores the assistant's streamed text as a separate assistant message after the tool-call message, the sanitizer's "assistant continued" branch fires before it ever reaches the approval response — so it fabricates a result even for the call the user is approving in that very request, which then coexists with the real result (duplicate tool messages for one call_id; strict providers can reject this, lenient ones feed the model contradictory data).
flowchart TD
subgraph History ["Reconstructed history on the resume request (order matters)"]
M1["user: prompt"] --> M2["assistant: tool_calls A, B, C"]
M2 --> M3["assistant: streamed text<br/>(snapshot stores it as a SEPARATE message<br/>after the tool-call message)"]
M3 --> M4["user: approval response for A"]
end
M2 -. "pending = {A, B, C}" .-> S1
subgraph Sanitizer ["_sanitize_tool_history (single forward pass)"]
S1{"assistant text arrives<br/>while pending is non-empty"} -->|"'assistant continued' branch fires"| S2["inject synthetic results for A, B, C:<br/>'Tool execution skipped ...'"]
S2 --> S3["approval response for A is reached<br/>only AFTER the injection"]
end
S3 --> R1["_resolve_approval_responses executes A<br/>and writes A's REAL result"]
R1 --> OUT["Model context:<br/>A = fabricated + real (two results, same call_id)<br/>B, C = fabricated only"]
style S2 fill:#fdd,stroke:#c00
style OUT fill:#fdd,stroke:#c00
Loading
4.4 Approved results are also lost from subsequent turns
_make_approval_tool_result_events emits TOOL_CALL_RESULT to the client but does not record the result in flow.tool_results, so the persisted thread snapshot never pairs the approved call with its result under its real call_id (_clean_resolved_approvals_from_snapshot rewrites the payload under the synthetic confirm_changes id instead, which the sanitizer drops on the next turn). Net effect: the model sees the real result exactly once — in the resumed run — and the fabricated "skipped" result on every turn thereafter.
Live corroboration (2026-07-04): the client-visible MESSAGES_SNAPSHOT after a resume run retained the raw approval payload — {"accepted": true, "steps": […]} as a tool message under the synthetic confirm id. _clean_resolved_approvals_from_snapshot builds its replacement map keyed by the resolved results' realcall_ids, then looks up the snapshot tool message's toolCallId — the confirm id — so on this path the intended payload→result rewrite finds no match and never fires at all.
sequenceDiagram
participant Run as Resume run (turn N)
participant Flow as flow.tool_results
participant Snap as Saved thread snapshot
participant Next as Next turn (N+1)
Run->>Run: _make_approval_tool_result_events emits<br/>TOOL_CALL_RESULT (real call_id) to the client
Note over Run,Flow: result NOT recorded in flow.tool_results
Run->>Snap: _clean_resolved_approvals_from_snapshot rewrites the<br/>confirm_changes tool message payload with the result
Note over Snap: result persisted under the SYNTHETIC<br/>confirm_changes id, not the real call_id
Next->>Snap: reconstruct history from snapshot
Next->>Next: sanitizer filters confirm_changes from the assistant<br/>message and drops its orphaned tool message (the real result)
Next->>Next: real call_id now unanswered again -><br/>synthetic "Tool execution skipped" injected
Note over Next: the model sees the real result exactly once (turn N)<br/>and a fabricated result on every later turn
Loading
4.5 Two latent protocol defects (exposed once state survives)
Cross-run (duplicate) TOOL_CALL_END. A queued approval request popped in a later run makes _emit_approval_request emit TOOL_CALL_END for a tool call whose TOOL_CALL_START happened in the previous run. Note this is a duplicate, not a late completion: the originating run already emitted the call's TOOL_CALL_END when argument streaming finished (via _emit_approval_request for the surfaced request, or the end-of-run "pending tool calls without end event" sweep for stripped siblings — both observed live: START event 237 and END event 247 in the originating run, then a second END as event 5 of the resume run). The AG-UI client rejects the second END and drops the whole event stream: Cannot send 'TOOL_CALL_END' event: No active tool call found with ID '…'. A 'TOOL_CALL_START' event must be sent first. (runtimeErrorCode: INCOMPLETE_STREAM). Suppressing it therefore loses no information — it converges on the spec's "emit the lifecycle exactly once, in the run that streamed the call" (§5).
Stream-invisible approval executions. In the streaming function-invocation loop (_tools.py), approved calls executed from inbound approval responses put their results into prepped_messages only — the normal in-run execution path yields results as a ChatResponseUpdate (update_role branch), but the approval path does not. Any host that builds client-visible history from the stream (AG-UI) can never persist these results.
This is not limited to cross-run resumes. Live evidence (2026-07-03, run 40101f41…): a batch resolved entirely inside its own run — load_skill rule-auto-approved by SkillsProvider with caf_methodologies/waf_pillars riding along as never-require siblings — executed and fed the model correctly (the results appear in the very next model call's payload), yet the run emitted zero TOOL_CALL_RESULT events among 214 (six TOOL_CALL_STARTs — five real calls plus the synthetic confirm_changes — and their six ENDs). Client history and the thread snapshot are permanently missing those results, so tool cards stay "in progress" forever and the sanitizer fabricates "skipped" results for those calls on every later turn even when approval state itself survives. Repair layers outside the run are structurally too late: the transport stops consuming the agent stream at the approval interrupt, so anything recovering results at stream cleanup runs only after RUN_FINISHED has been emitted. Only yielding the results on the stream — this defect's fix — makes them visible in time.
When reason: "tool_call" with toolCallId: the agent proposes a tool call in the interrupted run; the client resumes; the agent emits ToolCallResult against the original toolCallId in the resumed run. Crucially: "The agent does not re-emit ToolCallStart/ToolCallArgs/ToolCallEnd."
and:
"A single resume array must address every open interrupt from the interrupted run." …resumption occurs by "starting a new run whose RunAgentInput includes a resume array addressing every open interrupt."
and on state:
"The agent must emit any state required for resume via StateSnapshot" before the interrupt; "State is agent-owned; the client submits resume payloads linking to interrupts, not state objects."
Conformance assessment of current agent_framework_ag_ui behavior:
Spec expectation
Current behavior
Verdict
Result-only emission for calls approved across runs (ToolCallResult against the original id, no re-emitted Start/Args/End)
_make_approval_tool_result_events does this correctly for the approved call
✅ conforms
No ToolCallEnd without a same-run ToolCallStart
_emit_approval_request emits TOOL_CALL_END for prior-run calls when a queued request surfaces later
❌ violates (defect 4.5.1)
Every open interrupt of the interrupted run addressed by one resume
Only one interrupt is ever opened per run (queueing), so per-run coverage trivially holds; the batch however spans runs by design
⚠️ legal, but see Approach B
State required for resume survives to the resumed run
Tool-approval state is destroyed between runs
❌ the essence of this bug
Interrupt objects carry toolCallId binding
MAF emits a CustomEvent("function_approval_request") + a synthetic confirm_changes frontend tool + RUN_FINISHED.interrupt entries keyed by the function call_id
⚠️ pre-dates the core taxonomy; out of scope here
Note the spec's resume-state guidance points at the shared-state channel (StateSnapshot → client → RunAgentInput.state). That is evaluated below as Approach E — and rejected on security grounds for approval state specifically; the server-side thread snapshot store (which agent_framework_ag_ui already owns for message history) satisfies "state is agent-owned" more strongly.
6. Fix approaches considered
Approach A — persist tool-approval session state in the thread snapshot (proposed)
Add session_state to AGUIThreadSnapshot; run_agent_stream restores it into the per-request session and saves back onlysession.state["tool_approval"] after each run. Pass the session into _resolve_approval_responses so already-approved sibling groups are popped and executed at the transport layer (events + history included).
Storage reuses the existing thread-snapshot abstraction in agent_framework_ag_ui._snapshots: the AGUIThreadSnapshotStore protocol (async save/get/delete/clear, keyed by (scope, thread_id)) with InMemoryAGUIThreadSnapshotStore as the only shipped implementation — already bounded (DEFAULT_MAX_THREAD_SNAPSHOTS = 1_000, oldest snapshot evicted on overflow), so persisted approval state inherits the same LRU guard. Production deployments plug in their own durable/shared implementation of the protocol.
Today persistence stays inactive unless the endpoint is configured with both a snapshot_store and a snapshot_scope_resolver. To avoid shipping a fix that is silently off for the common local-dev setup, the proposal is to default to a bounded InMemoryAGUIThreadSnapshotStore (with a constant default scope) when the endpoint serves an agent with approval-gated tools and no store is configured — process-local and LRU-guarded, so it adds no durability or memory-growth concerns — plus a log line telling multi-instance deployments to supply a shared store. Per the AG-UI security guidance ("verify session ownership before processing requests"), this default is dev-only: a constant scope performs no ownership verification, so multi-user deployments must supply their own snapshot_scope_resolver deriving the scope from authenticated user/tenant context (see §8). Alternative if a behavioral default is unwanted: keep opt-in and emit a prominent warning when approval-gated tools are detected without a configured store.
Approach A is the state-persistence core; the proposed branch pairs it with the four companion fixes (detailed at the end of this section). The mapping below shows which §4 defect each piece resolves and which §5 conformance gap it closes:
Defect / gap
Addressed by
Mechanism
§4.1 batch-wide gating (design, not a defect)
unchanged by design
Semantics preserved identically for all hosts; see Approach C for why changing them is rejected
AGUIThreadSnapshot.session_state restores/persists session.state["tool_approval"]; queued prompts re-surface run by run; collected auto-approvals inject into the next run; the session passed into _resolve_approval_responses lets hidden already-approved groups pop and execute
§4.3 fabricated "Tool execution skipped" results
Companion fix: sanitizer answered-ids guard
_sanitize_tool_history skips call ids answered elsewhere in the history (real result or approval response); repair of genuinely abandoned calls is preserved
§4.4 approved result vanishes from later turns
Companion fix: record resolved results in flow.tool_results
The persisted snapshot pairs the approved call with its result under the real call_id; the §4.3 guard then has a real answer to protect on every later turn
§4.5.1 duplicate cross-run TOOL_CALL_END
Companion fix: same-run guard in _emit_approval_request
END is emitted exactly once, in the run that streamed the call (the originating run already emits it — §4.5.1); the later-run duplicate is suppressed, restoring §5 row 2 conformance
§4.5.2 stream-invisible approval executions
Companion fix: core streaming yield
Executed approval-response results are yielded as a ChatResponseUpdate(role="tool"), so AG-UI (and every streaming host) can render and persist them
§5 row "state required for resume survives" (❌)
Approach A core
Tool-approval state survives server-side in the snapshot — "state is agent-owned", never client-writable
§5 row "no ToolCallEnd without same-run ToolCallStart" (❌)
Companion fix (§4.5.1 row above)
—
§5 row "result-only emission for cross-run approvals" (✅ already)
reinforced
§4.4/§4.5.2 fixes keep those results in history without ever re-emitting Start/Args/End
Pros
Cons
Fixes the root cause (§4.2) — all three loss classes (queued, collected, hidden) — with the existing UX: sequential prompts keep working, one interrupt per run, protocol-legal; with the bundled companion fixes it covers §4.3, §4.4, §4.5.1 and §4.5.2 and closes both ❌ rows of the §5 conformance table (see mapping above)
Requires a configured snapshot store + scope resolver; without one the host stays broken (mitigated by the local-dev default above: fall back to a bounded InMemoryAGUIThreadSnapshotStore when approval-gated tools are present, or at minimum warn)
No client or protocol changes; no changes to middleware/core approval semantics — zero impact on console/headless hosts
Widens the snapshot contract (previously "only replayable UI data"); documented narrowly as tool-approval state
State stays server-side: clients cannot forge queue/rule/hidden-group contents (contrast E)
Horizontal scaling needs a shared snapshot store (Redis etc.) — but message history already has that exact requirement, so no new deployment constraint
Piggybacks on the store's existing authz boundary ((scope, thread_id), scope resolver) and eviction
Persisted approval state contains tool names + full arguments → data-at-rest consideration (§8)
Scoping to the tool_approval key avoids resurrecting history-provider state (full-state persistence demonstrably duplicates AG-UI-owned history — caught by test_agent_endpoint_prepends_stored_snapshot_for_new_user_turn)
Other session-stateful features (compaction summaries, invocation budgets) remain per-request; deliberate, but a partial answer to the general problem
Approach B — surface all sibling approval requests in one run (no queue)
Change ToolApprovalMiddleware._process_outbound_messages to keep every unresolved request visible; the client renders N approval cards; one resume answers all.
Pros
Cons
Most protocol-idiomatic: "a single resume array must address every open interrupt" — the batch resolves in one round trip
Only fixes the queued class. Collected auto-approved responses and hidden never-require groups still park in session state → still needs A (or C) anyway
Stateless across runs for the prompting part
Changes harness UX for every host: console/TUI/DevUI one-at-a-time prompting is a deliberate design; making it host-conditional leaks hosting concerns into the middleware
Fewer round trips for large batches
Client support for N simultaneous HITL cards is uneven (CopilotKit useHumanInTheLoop renders per action instance; batched respond() aggregation into one resume is not guaranteed — observed working reliably only one-at-a-time)
Make _try_execute_function_calls run non-gated calls right away and emit approval requests only for gated ones.
Pros
Cons
Eliminates the hidden-group mechanism entirely; results stream in-run and are visible everywhere
Reverses a documented core design decision ("…it will ask approval for all of them") with .NET-parity implications
Least total state
Safety regression: a never-require tool executes even if the user then rejects the gated sibling. Today nothing in the batch runs before the decision — that is a feature for correlated actions (e.g. read_file + send_email in one batch)
Touches every consumer: console, headless hosting, workflows — largest blast radius for a hosting-specific bug
Approach D — long-lived server-side AgentSession registry keyed by thread id
Keep real AgentSession objects alive in the AG-UI host between requests.
Pros
Cons
Fixes all session-stateful features (approvals, compaction, budgets, memory) at once
Duplicates what the snapshot store abstraction already provides, minus its pluggability — an in-process registry breaks on multi-instance deployments (sticky sessions required)
No snapshot contract change
Needs its own eviction, lifetime, and serialization policy from scratch
Authorization: thread ids are client-supplied; a bare registry keyed by thread id has no scope boundary. The snapshot store's snapshot_scope_resolver exists precisely to bind thread access to an app-defined authz scope — a second mechanism would have to replicate it (risk of divergence)
Approach E — round-trip approval state through the client (shared state channel)
Emit the tool-approval state via StateSnapshot; the client returns it in RunAgentInput.state on resume — the spec's own resume-state channel.
Pros
Cons
Fully stateless server; horizontal scaling for free
Security-disqualifying for this state: the queue, standing "always approve" rules, collected auto-approvals, and hidden already-approved groups would become client-writable. A tampered client could inject an already_approved_approval_request_groups entry or a standing rule and have the server execute arbitrary registered tools without any approval prompt. The existing pending_approvals server-side registry validates visible approval responses precisely because client input is untrusted — this approach would reopen the same hole one layer down
Spec-endorsed mechanism ("emit any state required for resume via StateSnapshot")
Mitigation would require signing/encrypting the state blob → key management and complexity out of proportion to the bug
Tool arguments (potentially sensitive) leak to the client beyond what the approval UI needs; payload bloat on every turn
The spec's guidance fits UI-relevant state; for authorization-relevant state, "agent-owned" is better satisfied by server-side storage the client can't write.
Companion fixes (needed under any approach above)
Fix
Alternatives considered
Why this one
Sanitizer answered-ids guard: _sanitize_tool_history skips synthetic results for call ids answered anywhere in the history (real result or approval response)
(a) run sanitization after approval resolution — reorders normalize_agui_input_messages for all callers incl. workflows; (b) delete synthetic injection — breaks legitimate abandoned-call repair (user types a new message instead of approving)
Local, additive, keeps the abandoned-call behavior byte-identical
Cross-run TOOL_CALL_END suppression: _emit_approval_request emits END only if the call started this run (flow.tool_calls_by_id)
(a) client-side tolerance — not ours to change, and the spec explicitly says not to re-emit lifecycle events; (b) re-emit Start+Args+End for the prior call — directly contradicts the spec quote in §5
Mandated by the spec's normative wording
Stream approval-execution results: the streaming invocation loop yields a ChatResponseUpdate(role="tool") with results executed from inbound approval responses; _process_function_requests returns them
(a) AG-UI-only synthesis of result events at the transport — duplicates execution knowledge, leaves DevUI/console equally blind; (b) put results in the final response only — streaming consumers still miss them mid-run
Mirrors the existing in-run execution path (which already yields results); benefits every streaming consumer
Record resolved approval results in flow.tool_results so the snapshot pairs approved calls with results under their real ids
rely on _clean_resolved_approvals_from_snapshot's confirm-id rewrite — demonstrated broken (§4.4)
Makes the persisted history self-consistent; the confirm-id rewrite stays as defense-in-depth
7. Proposed fix — behavior after the change
sequenceDiagram
participant UI as AG-UI client
participant Run as run_agent_stream
participant Store as Thread snapshot store
participant Agent as Agent (middleware + invocation)
UI->>Run: resume (approval response)
Run->>Store: get(scope, thread_id)
Store-->>Run: snapshot incl. session_state
Note over Run: restore tool_approval state<br/>into the fresh AgentSession
Run->>Run: resolve approvals: execute approved call,<br/>pop + execute released never-require siblings
Run->>Agent: agent.run(session)
Note over Agent: middleware injects collected auto-approved calls<br/>queue pops the next prompt, if any
Agent-->>Run: stream (now incl. executed approval results)
Run->>Store: save(snapshot + tool_approval session_state)
Run-->>UI: TOOL_CALL_RESULT events + RUN_FINISHED
Loading
8. Security analysis of the proposed fix
No new client-controlled input.session_state is written and read exclusively server-side via the snapshot store; it never appears in RunAgentInput or any event. There is no injection or forgery channel (this is the decisive difference vs Approach E).
Authorization boundary unchanged. Persisted approval state inherits the store's (scope, thread_id) keying; the snapshot_scope_resolver remains the app's authz boundary. A client that can access a thread's history can already see its tool calls — the approval state adds queue ordering and standing rules, same sensitivity class.
No approval bypass introduced. Released "already-approved groups" contain only calls that never required approval (approval_mode="never_require") — deferring vs. executing them grants no additional privilege. They are released only after an approval response that passed the existing pending_approvals server-side registry validation (anti-bypass, anti-spoofing, anti-replay — untouched). Note the release-on-rejection semantics match the existing _process_function_requests behavior (pre-existing, deliberate: rejection of the gated call does not veto never-gated siblings; if that is ever tightened, both paths change together).
Standing "always approve" rules now actually persist on this host — that is the feature working as documented, but it means a rule created via a validated approval response is durable for the thread's snapshot lifetime. Bounded by store eviction (in-memory default: 1000 snapshots LRU); documented in the field docstring.
Data at rest. Persisted approval state includes tool names and full call arguments, which can embed user data. This matches the sensitivity of snapshot.messages already stored in the same record, so no new class of data is persisted — but stores backed by durable media should treat the whole snapshot as sensitive (encryption at rest is the store implementer's responsibility; worth a line in AGUIThreadSnapshotStore docs).
Model-context integrity improves. The sanitizer guard stops fabricated tool results from being presented to the model for calls that have (or will get) real outcomes — fabricated results are a correctness and a prompt-integrity concern. Genuinely abandoned calls keep the synthetic repair, so no unpaired-call regressions for strict providers.
DoS surface. Bounded by existing snapshot limits; the queue/groups are bounded by the model's per-turn tool-call count. The 10k-entry LRU pending_approvals registry is unchanged.
Validation against the official AG-UI security guidance
"Untrusted client input: all data from clients should be treated as potentially malicious" / State Injection threat ("the messages list and state are the primary vectors for prompt injection attacks")
Approval state (queue, standing rules, collected auto-approvals, hidden groups) is kept out of both primary injection vectors: session_state never appears in RunAgentInput or any event. This is precisely why Approach E (round-tripping approval state through the client via shared state) is rejected in §6 — it would move authorization-relevant state into the documented State Injection vector
Message List Injection threat ("tool call messages to simulate tool executions or extract data")
The sanitizer answered-ids guard reads only the same history the host already consumes; with the snapshot store active (a prerequisite of Approach A), that history is backend-owned — _reconstruct_messages_from_thread_snapshot + _filter_untrusted_suffix (introduced by PR #6471) drop client-forged non-user messages before the guard ever sees them, so the fix strengthens rather than weakens this boundary. The pending_approvals registry validation of approval responses is untouched
Session ID Management ("Never allow clients to directly access arbitrary Session IDs", "Verify session ownership before processing requests")
Persisted approval state is keyed by (scope, thread_id) with snapshot_scope_resolver as the ownership check — the same boundary already protecting snapshot.messages. Consequently the local-dev in-memory default proposed in §6.A must be flagged dev-only: its constant scope performs no ownership verification, so production deployments must derive the scope from authenticated user/tenant context (noted in §6.A)
Trusted Frontend Server Pattern ("Do not expose AG-UI servers directly to untrusted clients")
The proposal is compatible with both deployment models and — because approval state is server-side — remains safe even in the discouraged direct-exposure model, where a hostile client controls every protocol field
Sensitive Data Filtering ("Tool responses may inadvertently include sensitive data … Always filter responses before sending to clients")
Persisting approval requests (tool names + arguments) adds no new data class: the same store already holds snapshot.messages including tool calls and results. The data-at-rest note above applies; response filtering obligations are unchanged
Human-in-the-Loop for Sensitive Operations ("Implement approval workflows for high-risk tool operations")
This defect silently breaks the article's own recommended control on AG-UI hosting (gated calls never run, auto-approval rules never apply, and the model is fed fabricated results). The fix restores the HITL control's integrity — an argument for treating this as security-relevant, not just functional
9. Impact on non-AG-UI hosts (console, DevUI, headless)
Change
Console / TUI
DevUI
Headless hosting (agent-framework-hosting*)
Workflows
Snapshot session_state (+ restore)
none — AG-UI package only
none (DevUI has its own host)
none
none
Session into _resolve_approval_responses + group release
none — AG-UI internal
none
none
none
Sanitizer guard
none — AG-UI internal (normalize_agui_input_messages callers only; workflow runs already opt out via sanitize_tool_history=False)
none
none
none
Same-run-only TOOL_CALL_END
none
none
none
none
Core: stream approval-execution results
visible change: streaming consumers now receive a ChatResponseUpdate(role="tool") with the executed results when a run begins with inbound approval responses. Interactive hosts that render tool results will start showing them (previously silently absent — arguably a fix in itself). Consumers that aggregate updates into a final response will include these result contents
same
same
AgentExecutor streams updates through; additive content, no ordering change for existing events
The core change is additive (a new update, no removed or reordered events) and mirrors the payload shape of the existing in-run execution yield. Full upstream test suites pass unmodified except the intentional snapshot field-contract test. Sessions that persist across runs in-process (console/DevUI) never hit the changed AG-UI code paths at all; their approval flow is byte-identical.
10. Related upstream issues and PRs (duplicate check, 2026-07-03)
Searched microsoft/agent-framework issues and PRs (open and closed, Python and .NET) for: approval, parallel tool calls, queued approval, session state, thread snapshot, TOOL_CALL_END/TOOL_CALL_RESULT, synthetic/skipped tool results, sanitize tool history, interrupt/resume. No existing issue reports the core defect (per-request AgentSession destroying tool-approval state for parallel batches). The closest items:
Same root-cause family (Python) — sub-symptoms this fix addresses or reshapes
Approval-gated tool reverts to "in progress": resolved results emitted only as transient TOOL_CALL_RESULT, never recorded in flow.tool_results, so MESSAGES_SNAPSHOT omits them — exactly §4.4; the flow.tool_results companion fix resolves it
Approved tool re-executes on a later unrelated turn (stale approval payload re-detected from snapshot); self-identifies the same _clean_resolved_approvals_from_snapshot gap as #6828. The result-pairing + answered-ids guard remove the re-detection trigger; verify against its repro before closing as fixed
Foundry: pending_approvals registry rejects the approval → tool never executes → stuck "Running" chip + "No tool output found" crash. Different trigger (registry id mismatch on Foundry — not fixed by this branch), identical downstream symptom family; the sanitizer guard changes its failure mode
MessagesSnapshotEvent splits/reassigns the streamed text message on mixed turns — the snapshot layout that trips the sanitizer's "assistant continued" branch (§4.3)
Added TOOL_CALL_RESULT emission on approval resume (_make_approval_tool_result_events) — §4.4 extends that fix from the event stream to the persisted history
Origin of _sanitize_tool_history's synthetic results (unpaired-call repair for OpenAI). The answered-ids guard deliberately preserves its behavior for genuinely abandoned calls
Adjacent Python issues/PRs (same subsystem, not addressed by this fix)
AgentSession pending-request resume failure in workflows (Magentic plan review) — same class of problem (session state across pause/resume), workflow path
Multi-turn tool calls 400 on Foundry via AG-UI (replay pairing) — same strict-provider failure class as the companion ordering report; not addressed by this fix
Implement AG-UI interrupts in .NET ("Python is currently implemented") — the port should not replicate this defect; §5–§6 of this report apply directly to that design
.NET: AG-UI multi-turn replay produces invalid OpenAI tool_call history — .NET member of the invalid-replay family; see the companion ordering report for the Python mis-ordering variant
.NET twin of the batch-wide approval semantics (approval asked for all functions when one is wrapped) — the same design decision Approach C would reverse; cross-language parity argues for keeping semantics aligned
.NET AG-UI client errors on approval rejection / non-JSON TOOL_CALL_RESULT — client-side strictness of the same event contract that makes cross-run TOOL_CALL_END fatal (§4.5.1)
Introduced AGUIThreadSnapshot persistence/hydration — the store Approach A extends. Its motivation ("clients that resend full history … allows tampered transcripts") is the same trust argument that disqualifies Approach E for approval state
Foundry hosting: extend hosted-session scoping to approval handling — parallel effort recognizing that approval state needs session-scoped ownership on another host
11. Suggested issue framing / follow-ups
Primary issue: "AG-UI host loses parallel tool calls when any call requires approval (per-request AgentSession destroys tool-approval state)" — this report. Cross-link #6828 and #6851 as sub-symptoms of the same family that the accompanying branch also addresses, and #6756 so the .NET AG-UI interrupts port takes this into account.
Separable follow-ups worth their own issues:
core: approval-execution results invisible to streaming consumers (§4.5.2) — fixes independently of AG-UI;
ag-ui: adopt the AG-UI core interrupt taxonomy (reason: "tool_call", responseSchema) instead of CustomEvent + synthetic confirm_changes (§5, last row) — larger, orthogonal; overlaps with #6756's intent;
harness: optional surface_all_approvals mode (Approach B) for hosts/clients that support multi-card approval UIs;
ag-ui: approval-resolved results appended out of order (after the snapshot's split-off assistant text), producing protocol-invalid history for strict chat providers
Description
1. Summary
When a model returns several tool calls in one turn and any of them requires human-in-the-loop approval, an agent hosted over AG-UI permanently loses every call in the batch except the single one the user approves. The lost calls are never executed and never re-prompted; instead the AG-UI message sanitizer fabricates
"Tool execution skipped …"results for them — so the model receives false tool results, concludes its calls failed, and re-issues them indefinitely. Auto-approved tools (e.g.load_skill, whose approval is supposed to be granted silently bySkillsProvider.all_tools_auto_approval_rule) never execute at all on this host.The root cause is a contract violation between three layers that individually behave as designed: MAF's tool-approval flow is session-stateful (it parks batch state in
AgentSession.state), while the AG-UI host constructs a freshAgentSessionper HTTP request and resolves approvals at the transport layer without a session. Two additional protocol-level defects (cross-runTOOL_CALL_END, stream-invisible approval executions) surface once the state loss is fixed.2. Symptoms
User-visible, on any AG-UI frontend (observed with CopilotKit):
load_skillisalways_require+ auto-approve rule — both halves of that contract break here).Wire-level (captured with
ENABLE_SENSITIVE_DATAlogging, Ollama provider): the request sent to the model after approving 1 of 3 calls:Two of three results are fabricated; the approved call gets two contradictory results for the same
call_id. On subsequent turns even the real result disappears (see §4.4) and the fabricated one returns.Code Sample
Error Messages / Stack Traces
Package Versions
agent-framework-core: 1.10.0, agent-framework-ag-ui: 1.0.0rc7
Python Version
Python 3.12
Additional Context
4. Root cause analysis
4.1 The approval flow is session-stateful by design
Both approval layers write to
AgentSession.state["tool_approval"](DEFAULT_TOOL_APPROVAL_SOURCE_IDand_TOOL_APPROVAL_STATE_KEYare the same key):agent_framework/_tools.py,_try_execute_function_calls): when any call in a batch is gated, the whole batch becomesfunction_approval_requestitems (documented behavior: "if the model returns multiple function calls, some that require approval and others that do not, it will ask approval for all of them"). The never-require siblings are hidden in session state as already-approved groups keyed by the visible request ids, to be released by_pop_already_approved_approval_responseswhen those visible approvals are answered.ToolApprovalMiddleware(agent_framework/_harness/_tool_approval.py): parks rule-auto-approved responses incollected_approval_responses(to be injected into the next run's messages) and queues all but the first user-facing request inqueued_approval_requests(one prompt per run).flowchart TD A[Model returns parallel tool-call batch] --> B{Any call gated by<br/>approval_mode = always_require?} B -- no --> C[Execute all calls immediately] B -- yes --> D[Whole batch becomes<br/>function_approval_request items] D --> E["Never-require siblings<br/>session.state['tool_approval']<br/>.already_approved_approval_request_groups"] D --> F["Rule-auto-approved calls<br/>session.state['tool_approval']<br/>.collected_approval_responses"] D --> G[First unresolved request<br/>surfaced to the user] D --> H["Remaining unresolved requests<br/>session.state['tool_approval']<br/>.queued_approval_requests"] E -. released when any visible approval<br/>of the batch is answered .-> I[Executed on a later run] F -. injected into the next run's messages .-> I H -. popped one per later run .-> G style E fill:#fdd,stroke:#c00 style F fill:#fdd,stroke:#c00 style H fill:#fdd,stroke:#c00(red = state that AG-UI hosting destroys)
This design works when the host keeps one
AgentSessionalive across runs, as interactive hosts do:sequenceDiagram participant User participant Host as Host (DevUI / console) participant MW as ToolApprovalMiddleware participant FI as Function-invocation layer participant LLM User->>Host: prompt Host->>MW: agent.run(session S) MW->>FI: call_next FI->>LLM: request LLM-->>FI: parallel batch: 1 gated + 2 never-require calls FI->>FI: any call gated -> approval requests for ALL Note over FI: never-require siblings hidden in S.state<br/>("already-approved group") FI-->>MW: 3 approval requests Note over MW: rule-approved call parked in S.state (collected),<br/>extra prompts parked in S.state (queued) MW-->>Host: one approval request Host->>User: approve? User->>Host: approve Host->>MW: agent.run(SAME session S) Note over MW,FI: S.state intact: collected responses injected,<br/>hidden group popped, queue surfaces next prompt MW->>FI: approval responses FI->>FI: execute every approved call FI->>LLM: request with all real results4.2 The AG-UI host violates the session contract
agent_framework_ag_ui/_agent_run.py::run_agent_streamconstructsAgentSession(session_id=thread_id)per HTTP request and discards it when the request ends.AgentSessionis a plain in-memory container; nothing persists or restoressession.state. Additionally,_resolve_approval_responsesexecutes approved calls at the transport layer without a session, so_pop_already_approved_approval_responsesis never called on this host at all.sequenceDiagram participant UI as AG-UI client (CopilotKit) participant Run as run_agent_stream participant MW as ToolApprovalMiddleware participant FI as Function-invocation layer participant LLM UI->>Run: RunAgentInput (user turn) Note over Run: AgentSession S1 = fresh (state = {}) Run->>MW: agent.run(S1) MW->>FI: call_next FI->>LLM: request LLM-->>FI: load_skill + caf_methodologies + microsoft_docs_search FI->>FI: batch gated -> approval requests for all 3 Note over FI: caf_methodologies (never-require)<br/>hidden in S1.state Note over MW: load_skill auto-approved by rule -> parked in S1.state<br/>microsoft_docs_search surfaced Run-->>UI: RUN_FINISHED with interrupt (approve docs_search?) Note over Run: HTTP request ends -> S1 discarded<br/>queued + collected + hidden state destroyed UI->>Run: resume (approval for docs_search) Note over Run: AgentSession S2 = fresh (state = {}) Run->>Run: _sanitize_tool_history injects synthetic<br/>"Tool execution skipped" for unanswered calls Run->>Run: transport executes docs_search only Run->>MW: agent.run(S2) - nothing to recover MW->>FI: call_next FI->>LLM: request: 1 real result + 2 fabricated "skipped" LLM-->>FI: re-issues the lost calls (loop repeats)4.3 The sanitizer masks the damage with fabricated results
_message_adapters._sanitize_tool_historyinjects a synthetic"Tool execution skipped …"result for any tool call it considers abandoned. Because the thread snapshot stores the assistant's streamed text as a separate assistant message after the tool-call message, the sanitizer's "assistant continued" branch fires before it ever reaches the approval response — so it fabricates a result even for the call the user is approving in that very request, which then coexists with the real result (duplicatetoolmessages for onecall_id; strict providers can reject this, lenient ones feed the model contradictory data).flowchart TD subgraph History ["Reconstructed history on the resume request (order matters)"] M1["user: prompt"] --> M2["assistant: tool_calls A, B, C"] M2 --> M3["assistant: streamed text<br/>(snapshot stores it as a SEPARATE message<br/>after the tool-call message)"] M3 --> M4["user: approval response for A"] end M2 -. "pending = {A, B, C}" .-> S1 subgraph Sanitizer ["_sanitize_tool_history (single forward pass)"] S1{"assistant text arrives<br/>while pending is non-empty"} -->|"'assistant continued' branch fires"| S2["inject synthetic results for A, B, C:<br/>'Tool execution skipped ...'"] S2 --> S3["approval response for A is reached<br/>only AFTER the injection"] end S3 --> R1["_resolve_approval_responses executes A<br/>and writes A's REAL result"] R1 --> OUT["Model context:<br/>A = fabricated + real (two results, same call_id)<br/>B, C = fabricated only"] style S2 fill:#fdd,stroke:#c00 style OUT fill:#fdd,stroke:#c004.4 Approved results are also lost from subsequent turns
_make_approval_tool_result_eventsemitsTOOL_CALL_RESULTto the client but does not record the result inflow.tool_results, so the persisted thread snapshot never pairs the approved call with its result under its realcall_id(_clean_resolved_approvals_from_snapshotrewrites the payload under the syntheticconfirm_changesid instead, which the sanitizer drops on the next turn). Net effect: the model sees the real result exactly once — in the resumed run — and the fabricated "skipped" result on every turn thereafter.Live corroboration (2026-07-04): the client-visible
MESSAGES_SNAPSHOTafter a resume run retained the raw approval payload —{"accepted": true, "steps": […]}as a tool message under the synthetic confirm id._clean_resolved_approvals_from_snapshotbuilds its replacement map keyed by the resolved results' realcall_ids, then looks up the snapshot tool message'stoolCallId— the confirm id — so on this path the intended payload→result rewrite finds no match and never fires at all.sequenceDiagram participant Run as Resume run (turn N) participant Flow as flow.tool_results participant Snap as Saved thread snapshot participant Next as Next turn (N+1) Run->>Run: _make_approval_tool_result_events emits<br/>TOOL_CALL_RESULT (real call_id) to the client Note over Run,Flow: result NOT recorded in flow.tool_results Run->>Snap: _clean_resolved_approvals_from_snapshot rewrites the<br/>confirm_changes tool message payload with the result Note over Snap: result persisted under the SYNTHETIC<br/>confirm_changes id, not the real call_id Next->>Snap: reconstruct history from snapshot Next->>Next: sanitizer filters confirm_changes from the assistant<br/>message and drops its orphaned tool message (the real result) Next->>Next: real call_id now unanswered again -><br/>synthetic "Tool execution skipped" injected Note over Next: the model sees the real result exactly once (turn N)<br/>and a fabricated result on every later turn4.5 Two latent protocol defects (exposed once state survives)
Cross-run (duplicate)
TOOL_CALL_END. A queued approval request popped in a later run makes_emit_approval_requestemitTOOL_CALL_ENDfor a tool call whoseTOOL_CALL_STARThappened in the previous run. Note this is a duplicate, not a late completion: the originating run already emitted the call'sTOOL_CALL_ENDwhen argument streaming finished (via_emit_approval_requestfor the surfaced request, or the end-of-run "pending tool calls without end event" sweep for stripped siblings — both observed live: START event 237 and END event 247 in the originating run, then a second END as event 5 of the resume run). The AG-UI client rejects the second END and drops the whole event stream:Cannot send 'TOOL_CALL_END' event: No active tool call found with ID '…'. A 'TOOL_CALL_START' event must be sent first.(runtimeErrorCode: INCOMPLETE_STREAM). Suppressing it therefore loses no information — it converges on the spec's "emit the lifecycle exactly once, in the run that streamed the call" (§5).Stream-invisible approval executions. In the streaming function-invocation loop (
_tools.py), approved calls executed from inbound approval responses put their results intoprepped_messagesonly — the normal in-run execution path yields results as aChatResponseUpdate(update_rolebranch), but the approval path does not. Any host that builds client-visible history from the stream (AG-UI) can never persist these results.This is not limited to cross-run resumes. Live evidence (2026-07-03, run
40101f41…): a batch resolved entirely inside its own run —load_skillrule-auto-approved bySkillsProviderwithcaf_methodologies/waf_pillarsriding along as never-require siblings — executed and fed the model correctly (the results appear in the very next model call's payload), yet the run emitted zeroTOOL_CALL_RESULTevents among 214 (sixTOOL_CALL_STARTs — five real calls plus the syntheticconfirm_changes— and their sixENDs). Client history and the thread snapshot are permanently missing those results, so tool cards stay "in progress" forever and the sanitizer fabricates "skipped" results for those calls on every later turn even when approval state itself survives. Repair layers outside the run are structurally too late: the transport stops consuming the agent stream at the approval interrupt, so anything recovering results at stream cleanup runs only afterRUN_FINISHEDhas been emitted. Only yielding the results on the stream — this defect's fix — makes them visible in time.5. AG-UI protocol conformance
The AG-UI interrupts specification (https://docs.ag-ui.com/concepts/interrupts) is directly on point for tool-based HITL:
and:
and on state:
Conformance assessment of current
agent_framework_ag_uibehavior:ToolCallResultagainst the original id, no re-emitted Start/Args/End)_make_approval_tool_result_eventsdoes this correctly for the approved callToolCallEndwithout a same-runToolCallStart_emit_approval_requestemitsTOOL_CALL_ENDfor prior-run calls when a queued request surfaces latertoolCallIdbindingCustomEvent("function_approval_request")+ a syntheticconfirm_changesfrontend tool +RUN_FINISHED.interruptentries keyed by the functioncall_idNote the spec's resume-state guidance points at the shared-state channel (
StateSnapshot→ client →RunAgentInput.state). That is evaluated below as Approach E — and rejected on security grounds for approval state specifically; the server-side thread snapshot store (whichagent_framework_ag_uialready owns for message history) satisfies "state is agent-owned" more strongly.6. Fix approaches considered
Approach A — persist tool-approval session state in the thread snapshot (proposed)
Add
session_statetoAGUIThreadSnapshot;run_agent_streamrestores it into the per-request session and saves back onlysession.state["tool_approval"]after each run. Pass the session into_resolve_approval_responsesso already-approved sibling groups are popped and executed at the transport layer (events + history included).Storage reuses the existing thread-snapshot abstraction in
agent_framework_ag_ui._snapshots: theAGUIThreadSnapshotStoreprotocol (asyncsave/get/delete/clear, keyed by(scope, thread_id)) withInMemoryAGUIThreadSnapshotStoreas the only shipped implementation — already bounded (DEFAULT_MAX_THREAD_SNAPSHOTS = 1_000, oldest snapshot evicted on overflow), so persisted approval state inherits the same LRU guard. Production deployments plug in their own durable/shared implementation of the protocol.Today persistence stays inactive unless the endpoint is configured with both a
snapshot_storeand asnapshot_scope_resolver. To avoid shipping a fix that is silently off for the common local-dev setup, the proposal is to default to a boundedInMemoryAGUIThreadSnapshotStore(with a constant default scope) when the endpoint serves an agent with approval-gated tools and no store is configured — process-local and LRU-guarded, so it adds no durability or memory-growth concerns — plus a log line telling multi-instance deployments to supply a shared store. Per the AG-UI security guidance ("verify session ownership before processing requests"), this default is dev-only: a constant scope performs no ownership verification, so multi-user deployments must supply their ownsnapshot_scope_resolverderiving the scope from authenticated user/tenant context (see §8). Alternative if a behavioral default is unwanted: keep opt-in and emit a prominent warning when approval-gated tools are detected without a configured store.Approach A is the state-persistence core; the proposed branch pairs it with the four companion fixes (detailed at the end of this section). The mapping below shows which §4 defect each piece resolves and which §5 conformance gap it closes:
AGUIThreadSnapshot.session_staterestores/persistssession.state["tool_approval"]; queued prompts re-surface run by run; collected auto-approvals inject into the next run; the session passed into_resolve_approval_responseslets hidden already-approved groups pop and execute_sanitize_tool_historyskips call ids answered elsewhere in the history (real result or approval response); repair of genuinely abandoned calls is preservedflow.tool_resultscall_id; the §4.3 guard then has a real answer to protect on every later turnTOOL_CALL_END_emit_approval_requestChatResponseUpdate(role="tool"), so AG-UI (and every streaming host) can render and persist themToolCallEndwithout same-runToolCallStart" (❌)InMemoryAGUIThreadSnapshotStorewhen approval-gated tools are present, or at minimum warn)(scope, thread_id), scope resolver) and evictiontool_approvalkey avoids resurrecting history-provider state (full-state persistence demonstrably duplicates AG-UI-owned history — caught bytest_agent_endpoint_prepends_stored_snapshot_for_new_user_turn)Approach B — surface all sibling approval requests in one run (no queue)
Change
ToolApprovalMiddleware._process_outbound_messagesto keep every unresolved request visible; the client renders N approval cards; oneresumeanswers all.resumearray must address every open interrupt" — the batch resolves in one round tripuseHumanInTheLooprenders per action instance; batchedrespond()aggregation into one resume is not guaranteed — observed working reliably only one-at-a-time)Worth pursuing upstream later as a UX option (
surface_all_approvals=True), on top of A.Approach C — execute never-require siblings immediately (change core batch semantics)
Make
_try_execute_function_callsrun non-gated calls right away and emit approval requests only for gated ones.read_file+send_emailin one batch)Approach D — long-lived server-side
AgentSessionregistry keyed by thread idKeep real
AgentSessionobjects alive in the AG-UI host between requests.snapshot_scope_resolverexists precisely to bind thread access to an app-defined authz scope — a second mechanism would have to replicate it (risk of divergence)Approach E — round-trip approval state through the client (shared state channel)
Emit the tool-approval state via
StateSnapshot; the client returns it inRunAgentInput.stateon resume — the spec's own resume-state channel.already_approved_approval_request_groupsentry or a standing rule and have the server execute arbitrary registered tools without any approval prompt. The existingpending_approvalsserver-side registry validates visible approval responses precisely because client input is untrusted — this approach would reopen the same hole one layer downThe spec's guidance fits UI-relevant state; for authorization-relevant state, "agent-owned" is better satisfied by server-side storage the client can't write.
Companion fixes (needed under any approach above)
_sanitize_tool_historyskips synthetic results for call ids answered anywhere in the history (real result or approval response)normalize_agui_input_messagesfor all callers incl. workflows; (b) delete synthetic injection — breaks legitimate abandoned-call repair (user types a new message instead of approving)TOOL_CALL_ENDsuppression:_emit_approval_requestemits END only if the call started this run (flow.tool_calls_by_id)ChatResponseUpdate(role="tool")with results executed from inbound approval responses;_process_function_requestsreturns themflow.tool_resultsso the snapshot pairs approved calls with results under their real ids_clean_resolved_approvals_from_snapshot's confirm-id rewrite — demonstrated broken (§4.4)7. Proposed fix — behavior after the change
sequenceDiagram participant UI as AG-UI client participant Run as run_agent_stream participant Store as Thread snapshot store participant Agent as Agent (middleware + invocation) UI->>Run: resume (approval response) Run->>Store: get(scope, thread_id) Store-->>Run: snapshot incl. session_state Note over Run: restore tool_approval state<br/>into the fresh AgentSession Run->>Run: resolve approvals: execute approved call,<br/>pop + execute released never-require siblings Run->>Agent: agent.run(session) Note over Agent: middleware injects collected auto-approved calls<br/>queue pops the next prompt, if any Agent-->>Run: stream (now incl. executed approval results) Run->>Store: save(snapshot + tool_approval session_state) Run-->>UI: TOOL_CALL_RESULT events + RUN_FINISHED8. Security analysis of the proposed fix
session_stateis written and read exclusively server-side via the snapshot store; it never appears inRunAgentInputor any event. There is no injection or forgery channel (this is the decisive difference vs Approach E).(scope, thread_id)keying; thesnapshot_scope_resolverremains the app's authz boundary. A client that can access a thread's history can already see its tool calls — the approval state adds queue ordering and standing rules, same sensitivity class.approval_mode="never_require") — deferring vs. executing them grants no additional privilege. They are released only after an approval response that passed the existingpending_approvalsserver-side registry validation (anti-bypass, anti-spoofing, anti-replay — untouched). Note the release-on-rejection semantics match the existing_process_function_requestsbehavior (pre-existing, deliberate: rejection of the gated call does not veto never-gated siblings; if that is ever tightened, both paths change together).snapshot.messagesalready stored in the same record, so no new class of data is persisted — but stores backed by durable media should treat the whole snapshot as sensitive (encryption at rest is the store implementer's responsibility; worth a line inAGUIThreadSnapshotStoredocs).pending_approvalsregistry is unchanged.Validation against the official AG-UI security guidance
Checked against Security Considerations for AG-UI (Microsoft Learn) — the framework's own threat model for AG-UI hosting:
session_statenever appears inRunAgentInputor any event. This is precisely why Approach E (round-tripping approval state through the client via shared state) is rejected in §6 — it would move authorization-relevant state into the documented State Injection vector_reconstruct_messages_from_thread_snapshot+_filter_untrusted_suffix(introduced by PR #6471) drop client-forged non-user messages before the guard ever sees them, so the fix strengthens rather than weakens this boundary. Thepending_approvalsregistry validation of approval responses is untouched(scope, thread_id)withsnapshot_scope_resolveras the ownership check — the same boundary already protectingsnapshot.messages. Consequently the local-dev in-memory default proposed in §6.A must be flagged dev-only: its constant scope performs no ownership verification, so production deployments must derive the scope from authenticated user/tenant context (noted in §6.A)snapshot.messagesincluding tool calls and results. The data-at-rest note above applies; response filtering obligations are unchanged9. Impact on non-AG-UI hosts (console, DevUI, headless)
agent-framework-hosting*)session_state(+ restore)_resolve_approval_responses+ group releasenormalize_agui_input_messagescallers only; workflow runs already opt out viasanitize_tool_history=False)TOOL_CALL_ENDChatResponseUpdate(role="tool")with the executed results when a run begins with inbound approval responses. Interactive hosts that render tool results will start showing them (previously silently absent — arguably a fix in itself). Consumers that aggregate updates into a final response will include these result contentsAgentExecutorstreams updates through; additive content, no ordering change for existing eventsThe core change is additive (a new update, no removed or reordered events) and mirrors the payload shape of the existing in-run execution yield. Full upstream test suites pass unmodified except the intentional snapshot field-contract test. Sessions that persist across runs in-process (console/DevUI) never hit the changed AG-UI code paths at all; their approval flow is byte-identical.
10. Related upstream issues and PRs (duplicate check, 2026-07-03)
Searched
microsoft/agent-frameworkissues and PRs (open and closed, Python and .NET) for: approval, parallel tool calls, queued approval, session state, thread snapshot,TOOL_CALL_END/TOOL_CALL_RESULT, synthetic/skipped tool results, sanitize tool history, interrupt/resume. No existing issue reports the core defect (per-requestAgentSessiondestroying tool-approval state for parallel batches). The closest items:Same root-cause family (Python) — sub-symptoms this fix addresses or reshapes
TOOL_CALL_RESULT, never recorded inflow.tool_results, soMESSAGES_SNAPSHOTomits them — exactly §4.4; theflow.tool_resultscompanion fix resolves it_clean_resolved_approvals_from_snapshotgap as #6828. The result-pairing + answered-ids guard remove the re-detection trigger; verify against its repro before closing as fixedpending_approvalsregistry rejects the approval → tool never executes → stuck "Running" chip + "No tool output found" crash. Different trigger (registry id mismatch on Foundry — not fixed by this branch), identical downstream symptom family; the sanitizer guard changes its failure modeMessagesSnapshotEventsplits/reassigns the streamed text message on mixed turns — the snapshot layout that trips the sanitizer's "assistant continued" branch (§4.3)TOOL_CALL_RESULTemission on approval resume (_make_approval_tool_result_events) — §4.4 extends that fix from the event stream to the persisted history_sanitize_tool_history's synthetic results (unpaired-call repair for OpenAI). The answered-ids guard deliberately preserves its behavior for genuinely abandoned callsAdjacent Python issues/PRs (same subsystem, not addressed by this fix)
AgentSessionpending-request resume failure in workflows (Magentic plan review) — same class of problem (session state across pause/resume), workflow pathSkillsProvidertools — pain amplified by this bug (load_skillcould never run on AG-UI at all)call_ids — amplifier observed in our repro (re-issued lost calls collide with their originals).NET counterparts
tool_callhistory — .NET member of the invalid-replay family; see the companion ordering report for the Python mis-ordering variantTInputstate after HITL approval — .NET cousin of state loss across an approval boundaryTOOL_CALL_RESULT— client-side strictness of the same event contract that makes cross-runTOOL_CALL_ENDfatal (§4.5.1)PRs the proposed fix builds on
AGUIThreadSnapshotpersistence/hydration — the store Approach A extends. Its motivation ("clients that resend full history … allows tampered transcripts") is the same trust argument that disqualifies Approach E for approval statepending_approvalsregistry argument matching — the validation layer the released-groups path relies on (§8)11. Suggested issue framing / follow-ups
reason: "tool_call",responseSchema) instead ofCustomEvent+ syntheticconfirm_changes(§5, last row) — larger, orthogonal; overlaps with #6756's intent;surface_all_approvalsmode (Approach B) for hosts/clients that support multi-card approval UIs;