Skip to content

.NET: Python: [Bug]: AG-UI host loses tool calls when parallel calls require approval #6910

Description

@antsok

Description

1. Summary

When a model returns several tool calls in one turn and any of them requires human-in-the-loop approval, an agent hosted over AG-UI permanently loses every call in the batch except the single one the user approves. The lost calls are never executed and never re-prompted; instead the AG-UI message sanitizer fabricates "Tool execution skipped …" results for them — so the model receives false tool results, concludes its calls failed, and re-issues them indefinitely. Auto-approved tools (e.g. load_skill, whose approval is supposed to be granted silently by SkillsProvider.all_tools_auto_approval_rule) never execute at all on this host.

The root cause is a contract violation between three layers that individually behave as designed: MAF's tool-approval flow is session-stateful (it parks batch state in AgentSession.state), while the AG-UI host constructs a fresh AgentSession per HTTP request and resolves approvals at the transport layer without a session. Two additional protocol-level defects (cross-run TOOL_CALL_END, stream-invisible approval executions) surface once the state loss is fixed.

2. Symptoms

User-visible, on any AG-UI frontend (observed with CopilotKit):

  • Of N parallel tool calls, only the approved one ever shows a result; sibling call chips stay "Running" forever.
  • Only one approval card appears even when several calls are gated; the siblings' cards never appear on later turns either.
  • The agent visibly "loops": each reply re-requests the tools it already requested.
  • Skills never load on the AG-UI surface (load_skill is always_require + auto-approve rule — both halves of that contract break here).

Wire-level (captured with ENABLE_SENSITIVE_DATA logging, Ollama provider): the request sent to the model after approving 1 of 3 calls:

user:      I want a landing zone for my AI app like Microsoft Learn recommends
assistant: tool_calls [load_skill, caf_methodologies, microsoft_docs_search]
tool:      load_skill        -> "Tool execution skipped - assistant continued before the tool result was available."
tool:      caf_methodologies -> "Tool execution skipped - assistant continued before the tool result was available."
tool:      microsoft_docs_search -> "Tool execution skipped - ..."     <- the call the user JUST approved
assistant: (streamed text)
tool:      microsoft_docs_search -> {real results}                     <- duplicate result for the same call_id

Two of three results are fabricated; the approved call gets two contradictory results for the same call_id. On subsequent turns even the real result disappears (see §4.4) and the fabricated one returns.

Code Sample

Error Messages / Stack Traces

## 3. Reproduction

1. Host a harness agent (`create_harness_agent`) over AG-UI (`add_agent_framework_fastapi_endpoint`, `require_confirmation=True`) with at least one tool registered `approval_mode="always_require"` (e.g. an `MCPStreamableHTTPTool`) plus ordinary never-require tools, and skills enabled.
2. Send a prompt that elicits parallel tool calls, e.g. *"I want a landing zone for my AI app like Microsoft Learn recommends"* followed by *"do more MS Learn research"* against the Microsoft Learn MCP — models reliably issue 2–4 parallel `microsoft_docs_search`/`microsoft_docs_fetch` calls plus local tools.
3. Approve the single surfaced request; observe the logs (`Injecting synthetic tool result for pending call_id=…` per orphaned call) and the model's next turn re-issuing the lost calls.

Any provider reproduces it; nothing is provider-specific.

Package Versions

agent-framework-core: 1.10.0, agent-framework-ag-ui: 1.0.0rc7

Python Version

Python 3.12

Additional Context

4. Root cause analysis

4.1 The approval flow is session-stateful by design

Both approval layers write to AgentSession.state["tool_approval"] (DEFAULT_TOOL_APPROVAL_SOURCE_ID and _TOOL_APPROVAL_STATE_KEY are the same key):

  • Function-invocation layer (agent_framework/_tools.py, _try_execute_function_calls): when any call in a batch is gated, the whole batch becomes function_approval_request items (documented behavior: "if the model returns multiple function calls, some that require approval and others that do not, it will ask approval for all of them"). The never-require siblings are hidden in session state as already-approved groups keyed by the visible request ids, to be released by _pop_already_approved_approval_responses when those visible approvals are answered.
  • Harness ToolApprovalMiddleware (agent_framework/_harness/_tool_approval.py): parks rule-auto-approved responses in collected_approval_responses (to be injected into the next run's messages) and queues all but the first user-facing request in queued_approval_requests (one prompt per run).
flowchart TD
    A[Model returns parallel tool-call batch] --> B{Any call gated by<br/>approval_mode = always_require?}
    B -- no --> C[Execute all calls immediately]
    B -- yes --> D[Whole batch becomes<br/>function_approval_request items]
    D --> E["Never-require siblings<br/>session.state['tool_approval']<br/>.already_approved_approval_request_groups"]
    D --> F["Rule-auto-approved calls<br/>session.state['tool_approval']<br/>.collected_approval_responses"]
    D --> G[First unresolved request<br/>surfaced to the user]
    D --> H["Remaining unresolved requests<br/>session.state['tool_approval']<br/>.queued_approval_requests"]
    E -. released when any visible approval<br/>of the batch is answered .-> I[Executed on a later run]
    F -. injected into the next run's messages .-> I
    H -. popped one per later run .-> G
    style E fill:#fdd,stroke:#c00
    style F fill:#fdd,stroke:#c00
    style H fill:#fdd,stroke:#c00
Loading

(red = state that AG-UI hosting destroys)

This design works when the host keeps one AgentSession alive across runs, as interactive hosts do:

sequenceDiagram
    participant User
    participant Host as Host (DevUI / console)
    participant MW as ToolApprovalMiddleware
    participant FI as Function-invocation layer
    participant LLM

    User->>Host: prompt
    Host->>MW: agent.run(session S)
    MW->>FI: call_next
    FI->>LLM: request
    LLM-->>FI: parallel batch: 1 gated + 2 never-require calls
    FI->>FI: any call gated -> approval requests for ALL
    Note over FI: never-require siblings hidden in S.state<br/>("already-approved group")
    FI-->>MW: 3 approval requests
    Note over MW: rule-approved call parked in S.state (collected),<br/>extra prompts parked in S.state (queued)
    MW-->>Host: one approval request
    Host->>User: approve?
    User->>Host: approve
    Host->>MW: agent.run(SAME session S)
    Note over MW,FI: S.state intact: collected responses injected,<br/>hidden group popped, queue surfaces next prompt
    MW->>FI: approval responses
    FI->>FI: execute every approved call
    FI->>LLM: request with all real results
Loading

4.2 The AG-UI host violates the session contract

agent_framework_ag_ui/_agent_run.py::run_agent_stream constructs AgentSession(session_id=thread_id) per HTTP request and discards it when the request ends. AgentSession is a plain in-memory container; nothing persists or restores session.state. Additionally, _resolve_approval_responses executes approved calls at the transport layer without a session, so _pop_already_approved_approval_responses is never called on this host at all.

sequenceDiagram
    participant UI as AG-UI client (CopilotKit)
    participant Run as run_agent_stream
    participant MW as ToolApprovalMiddleware
    participant FI as Function-invocation layer
    participant LLM

    UI->>Run: RunAgentInput (user turn)
    Note over Run: AgentSession S1 = fresh (state = {})
    Run->>MW: agent.run(S1)
    MW->>FI: call_next
    FI->>LLM: request
    LLM-->>FI: load_skill + caf_methodologies + microsoft_docs_search
    FI->>FI: batch gated -> approval requests for all 3
    Note over FI: caf_methodologies (never-require)<br/>hidden in S1.state
    Note over MW: load_skill auto-approved by rule -> parked in S1.state<br/>microsoft_docs_search surfaced
    Run-->>UI: RUN_FINISHED with interrupt (approve docs_search?)
    Note over Run: HTTP request ends -> S1 discarded<br/>queued + collected + hidden state destroyed
    UI->>Run: resume (approval for docs_search)
    Note over Run: AgentSession S2 = fresh (state = {})
    Run->>Run: _sanitize_tool_history injects synthetic<br/>"Tool execution skipped" for unanswered calls
    Run->>Run: transport executes docs_search only
    Run->>MW: agent.run(S2) - nothing to recover
    MW->>FI: call_next
    FI->>LLM: request: 1 real result + 2 fabricated "skipped"
    LLM-->>FI: re-issues the lost calls (loop repeats)
Loading

4.3 The sanitizer masks the damage with fabricated results

_message_adapters._sanitize_tool_history injects a synthetic "Tool execution skipped …" result for any tool call it considers abandoned. Because the thread snapshot stores the assistant's streamed text as a separate assistant message after the tool-call message, the sanitizer's "assistant continued" branch fires before it ever reaches the approval response — so it fabricates a result even for the call the user is approving in that very request, which then coexists with the real result (duplicate tool messages for one call_id; strict providers can reject this, lenient ones feed the model contradictory data).

flowchart TD
    subgraph History ["Reconstructed history on the resume request (order matters)"]
        M1["user: prompt"] --> M2["assistant: tool_calls A, B, C"]
        M2 --> M3["assistant: streamed text<br/>(snapshot stores it as a SEPARATE message<br/>after the tool-call message)"]
        M3 --> M4["user: approval response for A"]
    end
    M2 -. "pending = {A, B, C}" .-> S1
    subgraph Sanitizer ["_sanitize_tool_history (single forward pass)"]
        S1{"assistant text arrives<br/>while pending is non-empty"} -->|"'assistant continued' branch fires"| S2["inject synthetic results for A, B, C:<br/>'Tool execution skipped ...'"]
        S2 --> S3["approval response for A is reached<br/>only AFTER the injection"]
    end
    S3 --> R1["_resolve_approval_responses executes A<br/>and writes A's REAL result"]
    R1 --> OUT["Model context:<br/>A = fabricated + real (two results, same call_id)<br/>B, C = fabricated only"]
    style S2 fill:#fdd,stroke:#c00
    style OUT fill:#fdd,stroke:#c00
Loading

4.4 Approved results are also lost from subsequent turns

_make_approval_tool_result_events emits TOOL_CALL_RESULT to the client but does not record the result in flow.tool_results, so the persisted thread snapshot never pairs the approved call with its result under its real call_id (_clean_resolved_approvals_from_snapshot rewrites the payload under the synthetic confirm_changes id instead, which the sanitizer drops on the next turn). Net effect: the model sees the real result exactly once — in the resumed run — and the fabricated "skipped" result on every turn thereafter.

Live corroboration (2026-07-04): the client-visible MESSAGES_SNAPSHOT after a resume run retained the raw approval payload — {"accepted": true, "steps": […]} as a tool message under the synthetic confirm id. _clean_resolved_approvals_from_snapshot builds its replacement map keyed by the resolved results' real call_ids, then looks up the snapshot tool message's toolCallId — the confirm id — so on this path the intended payload→result rewrite finds no match and never fires at all.

sequenceDiagram
    participant Run as Resume run (turn N)
    participant Flow as flow.tool_results
    participant Snap as Saved thread snapshot
    participant Next as Next turn (N+1)

    Run->>Run: _make_approval_tool_result_events emits<br/>TOOL_CALL_RESULT (real call_id) to the client
    Note over Run,Flow: result NOT recorded in flow.tool_results
    Run->>Snap: _clean_resolved_approvals_from_snapshot rewrites the<br/>confirm_changes tool message payload with the result
    Note over Snap: result persisted under the SYNTHETIC<br/>confirm_changes id, not the real call_id
    Next->>Snap: reconstruct history from snapshot
    Next->>Next: sanitizer filters confirm_changes from the assistant<br/>message and drops its orphaned tool message (the real result)
    Next->>Next: real call_id now unanswered again -><br/>synthetic "Tool execution skipped" injected
    Note over Next: the model sees the real result exactly once (turn N)<br/>and a fabricated result on every later turn
Loading

4.5 Two latent protocol defects (exposed once state survives)

  1. Cross-run (duplicate) TOOL_CALL_END. A queued approval request popped in a later run makes _emit_approval_request emit TOOL_CALL_END for a tool call whose TOOL_CALL_START happened in the previous run. Note this is a duplicate, not a late completion: the originating run already emitted the call's TOOL_CALL_END when argument streaming finished (via _emit_approval_request for the surfaced request, or the end-of-run "pending tool calls without end event" sweep for stripped siblings — both observed live: START event 237 and END event 247 in the originating run, then a second END as event 5 of the resume run). The AG-UI client rejects the second END and drops the whole event stream: Cannot send 'TOOL_CALL_END' event: No active tool call found with ID '…'. A 'TOOL_CALL_START' event must be sent first. (runtimeErrorCode: INCOMPLETE_STREAM). Suppressing it therefore loses no information — it converges on the spec's "emit the lifecycle exactly once, in the run that streamed the call" (§5).

  2. Stream-invisible approval executions. In the streaming function-invocation loop (_tools.py), approved calls executed from inbound approval responses put their results into prepped_messages only — the normal in-run execution path yields results as a ChatResponseUpdate (update_role branch), but the approval path does not. Any host that builds client-visible history from the stream (AG-UI) can never persist these results.

    This is not limited to cross-run resumes. Live evidence (2026-07-03, run 40101f41…): a batch resolved entirely inside its own runload_skill rule-auto-approved by SkillsProvider with caf_methodologies/waf_pillars riding along as never-require siblings — executed and fed the model correctly (the results appear in the very next model call's payload), yet the run emitted zero TOOL_CALL_RESULT events among 214 (six TOOL_CALL_STARTs — five real calls plus the synthetic confirm_changes — and their six ENDs). Client history and the thread snapshot are permanently missing those results, so tool cards stay "in progress" forever and the sanitizer fabricates "skipped" results for those calls on every later turn even when approval state itself survives. Repair layers outside the run are structurally too late: the transport stops consuming the agent stream at the approval interrupt, so anything recovering results at stream cleanup runs only after RUN_FINISHED has been emitted. Only yielding the results on the stream — this defect's fix — makes them visible in time.

5. AG-UI protocol conformance

The AG-UI interrupts specification (https://docs.ag-ui.com/concepts/interrupts) is directly on point for tool-based HITL:

When reason: "tool_call" with toolCallId: the agent proposes a tool call in the interrupted run; the client resumes; the agent emits ToolCallResult against the original toolCallId in the resumed run. Crucially: "The agent does not re-emit ToolCallStart/ToolCallArgs/ToolCallEnd."

and:

"A single resume array must address every open interrupt from the interrupted run." …resumption occurs by "starting a new run whose RunAgentInput includes a resume array addressing every open interrupt."

and on state:

"The agent must emit any state required for resume via StateSnapshot" before the interrupt; "State is agent-owned; the client submits resume payloads linking to interrupts, not state objects."

Conformance assessment of current agent_framework_ag_ui behavior:

Spec expectation Current behavior Verdict
Result-only emission for calls approved across runs (ToolCallResult against the original id, no re-emitted Start/Args/End) _make_approval_tool_result_events does this correctly for the approved call ✅ conforms
No ToolCallEnd without a same-run ToolCallStart _emit_approval_request emits TOOL_CALL_END for prior-run calls when a queued request surfaces later ❌ violates (defect 4.5.1)
Every open interrupt of the interrupted run addressed by one resume Only one interrupt is ever opened per run (queueing), so per-run coverage trivially holds; the batch however spans runs by design ⚠️ legal, but see Approach B
State required for resume survives to the resumed run Tool-approval state is destroyed between runs ❌ the essence of this bug
Interrupt objects carry toolCallId binding MAF emits a CustomEvent("function_approval_request") + a synthetic confirm_changes frontend tool + RUN_FINISHED.interrupt entries keyed by the function call_id ⚠️ pre-dates the core taxonomy; out of scope here

Note the spec's resume-state guidance points at the shared-state channel (StateSnapshot → client → RunAgentInput.state). That is evaluated below as Approach E — and rejected on security grounds for approval state specifically; the server-side thread snapshot store (which agent_framework_ag_ui already owns for message history) satisfies "state is agent-owned" more strongly.

6. Fix approaches considered

Approach A — persist tool-approval session state in the thread snapshot (proposed)

Add session_state to AGUIThreadSnapshot; run_agent_stream restores it into the per-request session and saves back only session.state["tool_approval"] after each run. Pass the session into _resolve_approval_responses so already-approved sibling groups are popped and executed at the transport layer (events + history included).

Storage reuses the existing thread-snapshot abstraction in agent_framework_ag_ui._snapshots: the AGUIThreadSnapshotStore protocol (async save/get/delete/clear, keyed by (scope, thread_id)) with InMemoryAGUIThreadSnapshotStore as the only shipped implementation — already bounded (DEFAULT_MAX_THREAD_SNAPSHOTS = 1_000, oldest snapshot evicted on overflow), so persisted approval state inherits the same LRU guard. Production deployments plug in their own durable/shared implementation of the protocol.

Today persistence stays inactive unless the endpoint is configured with both a snapshot_store and a snapshot_scope_resolver. To avoid shipping a fix that is silently off for the common local-dev setup, the proposal is to default to a bounded InMemoryAGUIThreadSnapshotStore (with a constant default scope) when the endpoint serves an agent with approval-gated tools and no store is configured — process-local and LRU-guarded, so it adds no durability or memory-growth concerns — plus a log line telling multi-instance deployments to supply a shared store. Per the AG-UI security guidance ("verify session ownership before processing requests"), this default is dev-only: a constant scope performs no ownership verification, so multi-user deployments must supply their own snapshot_scope_resolver deriving the scope from authenticated user/tenant context (see §8). Alternative if a behavioral default is unwanted: keep opt-in and emit a prominent warning when approval-gated tools are detected without a configured store.

Approach A is the state-persistence core; the proposed branch pairs it with the four companion fixes (detailed at the end of this section). The mapping below shows which §4 defect each piece resolves and which §5 conformance gap it closes:

Defect / gap Addressed by Mechanism
§4.1 batch-wide gating (design, not a defect) unchanged by design Semantics preserved identically for all hosts; see Approach C for why changing them is rejected
§4.2 per-request session destroys parked state (queued / collected / hidden) Approach A core AGUIThreadSnapshot.session_state restores/persists session.state["tool_approval"]; queued prompts re-surface run by run; collected auto-approvals inject into the next run; the session passed into _resolve_approval_responses lets hidden already-approved groups pop and execute
§4.3 fabricated "Tool execution skipped" results Companion fix: sanitizer answered-ids guard _sanitize_tool_history skips call ids answered elsewhere in the history (real result or approval response); repair of genuinely abandoned calls is preserved
§4.4 approved result vanishes from later turns Companion fix: record resolved results in flow.tool_results The persisted snapshot pairs the approved call with its result under the real call_id; the §4.3 guard then has a real answer to protect on every later turn
§4.5.1 duplicate cross-run TOOL_CALL_END Companion fix: same-run guard in _emit_approval_request END is emitted exactly once, in the run that streamed the call (the originating run already emits it — §4.5.1); the later-run duplicate is suppressed, restoring §5 row 2 conformance
§4.5.2 stream-invisible approval executions Companion fix: core streaming yield Executed approval-response results are yielded as a ChatResponseUpdate(role="tool"), so AG-UI (and every streaming host) can render and persist them
§5 row "state required for resume survives" (❌) Approach A core Tool-approval state survives server-side in the snapshot — "state is agent-owned", never client-writable
§5 row "no ToolCallEnd without same-run ToolCallStart" (❌) Companion fix (§4.5.1 row above)
§5 row "result-only emission for cross-run approvals" (✅ already) reinforced §4.4/§4.5.2 fixes keep those results in history without ever re-emitting Start/Args/End
Pros Cons
Fixes the root cause (§4.2) — all three loss classes (queued, collected, hidden) — with the existing UX: sequential prompts keep working, one interrupt per run, protocol-legal; with the bundled companion fixes it covers §4.3, §4.4, §4.5.1 and §4.5.2 and closes both ❌ rows of the §5 conformance table (see mapping above) Requires a configured snapshot store + scope resolver; without one the host stays broken (mitigated by the local-dev default above: fall back to a bounded InMemoryAGUIThreadSnapshotStore when approval-gated tools are present, or at minimum warn)
No client or protocol changes; no changes to middleware/core approval semantics — zero impact on console/headless hosts Widens the snapshot contract (previously "only replayable UI data"); documented narrowly as tool-approval state
State stays server-side: clients cannot forge queue/rule/hidden-group contents (contrast E) Horizontal scaling needs a shared snapshot store (Redis etc.) — but message history already has that exact requirement, so no new deployment constraint
Piggybacks on the store's existing authz boundary ((scope, thread_id), scope resolver) and eviction Persisted approval state contains tool names + full arguments → data-at-rest consideration (§8)
Scoping to the tool_approval key avoids resurrecting history-provider state (full-state persistence demonstrably duplicates AG-UI-owned history — caught by test_agent_endpoint_prepends_stored_snapshot_for_new_user_turn) Other session-stateful features (compaction summaries, invocation budgets) remain per-request; deliberate, but a partial answer to the general problem

Approach B — surface all sibling approval requests in one run (no queue)

Change ToolApprovalMiddleware._process_outbound_messages to keep every unresolved request visible; the client renders N approval cards; one resume answers all.

Pros Cons
Most protocol-idiomatic: "a single resume array must address every open interrupt" — the batch resolves in one round trip Only fixes the queued class. Collected auto-approved responses and hidden never-require groups still park in session state → still needs A (or C) anyway
Stateless across runs for the prompting part Changes harness UX for every host: console/TUI/DevUI one-at-a-time prompting is a deliberate design; making it host-conditional leaks hosting concerns into the middleware
Fewer round trips for large batches Client support for N simultaneous HITL cards is uneven (CopilotKit useHumanInTheLoop renders per action instance; batched respond() aggregation into one resume is not guaranteed — observed working reliably only one-at-a-time)
Highest back-compat risk: existing consumers assert single-request rounds

Worth pursuing upstream later as a UX option (surface_all_approvals=True), on top of A.

Approach C — execute never-require siblings immediately (change core batch semantics)

Make _try_execute_function_calls run non-gated calls right away and emit approval requests only for gated ones.

Pros Cons
Eliminates the hidden-group mechanism entirely; results stream in-run and are visible everywhere Reverses a documented core design decision ("…it will ask approval for all of them") with .NET-parity implications
Least total state Safety regression: a never-require tool executes even if the user then rejects the gated sibling. Today nothing in the batch runs before the decision — that is a feature for correlated actions (e.g. read_file + send_email in one batch)
Touches every consumer: console, headless hosting, workflows — largest blast radius for a hosting-specific bug

Approach D — long-lived server-side AgentSession registry keyed by thread id

Keep real AgentSession objects alive in the AG-UI host between requests.

Pros Cons
Fixes all session-stateful features (approvals, compaction, budgets, memory) at once Duplicates what the snapshot store abstraction already provides, minus its pluggability — an in-process registry breaks on multi-instance deployments (sticky sessions required)
No snapshot contract change Needs its own eviction, lifetime, and serialization policy from scratch
Authorization: thread ids are client-supplied; a bare registry keyed by thread id has no scope boundary. The snapshot store's snapshot_scope_resolver exists precisely to bind thread access to an app-defined authz scope — a second mechanism would have to replicate it (risk of divergence)

Approach E — round-trip approval state through the client (shared state channel)

Emit the tool-approval state via StateSnapshot; the client returns it in RunAgentInput.state on resume — the spec's own resume-state channel.

Pros Cons
Fully stateless server; horizontal scaling for free Security-disqualifying for this state: the queue, standing "always approve" rules, collected auto-approvals, and hidden already-approved groups would become client-writable. A tampered client could inject an already_approved_approval_request_groups entry or a standing rule and have the server execute arbitrary registered tools without any approval prompt. The existing pending_approvals server-side registry validates visible approval responses precisely because client input is untrusted — this approach would reopen the same hole one layer down
Spec-endorsed mechanism ("emit any state required for resume via StateSnapshot") Mitigation would require signing/encrypting the state blob → key management and complexity out of proportion to the bug
Tool arguments (potentially sensitive) leak to the client beyond what the approval UI needs; payload bloat on every turn

The spec's guidance fits UI-relevant state; for authorization-relevant state, "agent-owned" is better satisfied by server-side storage the client can't write.

Companion fixes (needed under any approach above)

Fix Alternatives considered Why this one
Sanitizer answered-ids guard: _sanitize_tool_history skips synthetic results for call ids answered anywhere in the history (real result or approval response) (a) run sanitization after approval resolution — reorders normalize_agui_input_messages for all callers incl. workflows; (b) delete synthetic injection — breaks legitimate abandoned-call repair (user types a new message instead of approving) Local, additive, keeps the abandoned-call behavior byte-identical
Cross-run TOOL_CALL_END suppression: _emit_approval_request emits END only if the call started this run (flow.tool_calls_by_id) (a) client-side tolerance — not ours to change, and the spec explicitly says not to re-emit lifecycle events; (b) re-emit Start+Args+End for the prior call — directly contradicts the spec quote in §5 Mandated by the spec's normative wording
Stream approval-execution results: the streaming invocation loop yields a ChatResponseUpdate(role="tool") with results executed from inbound approval responses; _process_function_requests returns them (a) AG-UI-only synthesis of result events at the transport — duplicates execution knowledge, leaves DevUI/console equally blind; (b) put results in the final response only — streaming consumers still miss them mid-run Mirrors the existing in-run execution path (which already yields results); benefits every streaming consumer
Record resolved approval results in flow.tool_results so the snapshot pairs approved calls with results under their real ids rely on _clean_resolved_approvals_from_snapshot's confirm-id rewrite — demonstrated broken (§4.4) Makes the persisted history self-consistent; the confirm-id rewrite stays as defense-in-depth

7. Proposed fix — behavior after the change

sequenceDiagram
    participant UI as AG-UI client
    participant Run as run_agent_stream
    participant Store as Thread snapshot store
    participant Agent as Agent (middleware + invocation)

    UI->>Run: resume (approval response)
    Run->>Store: get(scope, thread_id)
    Store-->>Run: snapshot incl. session_state
    Note over Run: restore tool_approval state<br/>into the fresh AgentSession
    Run->>Run: resolve approvals: execute approved call,<br/>pop + execute released never-require siblings
    Run->>Agent: agent.run(session)
    Note over Agent: middleware injects collected auto-approved calls<br/>queue pops the next prompt, if any
    Agent-->>Run: stream (now incl. executed approval results)
    Run->>Store: save(snapshot + tool_approval session_state)
    Run-->>UI: TOOL_CALL_RESULT events + RUN_FINISHED
Loading

8. Security analysis of the proposed fix

  • No new client-controlled input. session_state is written and read exclusively server-side via the snapshot store; it never appears in RunAgentInput or any event. There is no injection or forgery channel (this is the decisive difference vs Approach E).
  • Authorization boundary unchanged. Persisted approval state inherits the store's (scope, thread_id) keying; the snapshot_scope_resolver remains the app's authz boundary. A client that can access a thread's history can already see its tool calls — the approval state adds queue ordering and standing rules, same sensitivity class.
  • No approval bypass introduced. Released "already-approved groups" contain only calls that never required approval (approval_mode="never_require") — deferring vs. executing them grants no additional privilege. They are released only after an approval response that passed the existing pending_approvals server-side registry validation (anti-bypass, anti-spoofing, anti-replay — untouched). Note the release-on-rejection semantics match the existing _process_function_requests behavior (pre-existing, deliberate: rejection of the gated call does not veto never-gated siblings; if that is ever tightened, both paths change together).
  • Standing "always approve" rules now actually persist on this host — that is the feature working as documented, but it means a rule created via a validated approval response is durable for the thread's snapshot lifetime. Bounded by store eviction (in-memory default: 1000 snapshots LRU); documented in the field docstring.
  • Data at rest. Persisted approval state includes tool names and full call arguments, which can embed user data. This matches the sensitivity of snapshot.messages already stored in the same record, so no new class of data is persisted — but stores backed by durable media should treat the whole snapshot as sensitive (encryption at rest is the store implementer's responsibility; worth a line in AGUIThreadSnapshotStore docs).
  • Model-context integrity improves. The sanitizer guard stops fabricated tool results from being presented to the model for calls that have (or will get) real outcomes — fabricated results are a correctness and a prompt-integrity concern. Genuinely abandoned calls keep the synthetic repair, so no unpaired-call regressions for strict providers.
  • DoS surface. Bounded by existing snapshot limits; the queue/groups are bounded by the model's per-turn tool-call count. The 10k-entry LRU pending_approvals registry is unchanged.

Validation against the official AG-UI security guidance

Checked against Security Considerations for AG-UI (Microsoft Learn) — the framework's own threat model for AG-UI hosting:

Guidance How the proposal complies
"Untrusted client input: all data from clients should be treated as potentially malicious" / State Injection threat ("the messages list and state are the primary vectors for prompt injection attacks") Approval state (queue, standing rules, collected auto-approvals, hidden groups) is kept out of both primary injection vectors: session_state never appears in RunAgentInput or any event. This is precisely why Approach E (round-tripping approval state through the client via shared state) is rejected in §6 — it would move authorization-relevant state into the documented State Injection vector
Message List Injection threat ("tool call messages to simulate tool executions or extract data") The sanitizer answered-ids guard reads only the same history the host already consumes; with the snapshot store active (a prerequisite of Approach A), that history is backend-owned_reconstruct_messages_from_thread_snapshot + _filter_untrusted_suffix (introduced by PR #6471) drop client-forged non-user messages before the guard ever sees them, so the fix strengthens rather than weakens this boundary. The pending_approvals registry validation of approval responses is untouched
Session ID Management ("Never allow clients to directly access arbitrary Session IDs", "Verify session ownership before processing requests") Persisted approval state is keyed by (scope, thread_id) with snapshot_scope_resolver as the ownership check — the same boundary already protecting snapshot.messages. Consequently the local-dev in-memory default proposed in §6.A must be flagged dev-only: its constant scope performs no ownership verification, so production deployments must derive the scope from authenticated user/tenant context (noted in §6.A)
Trusted Frontend Server Pattern ("Do not expose AG-UI servers directly to untrusted clients") The proposal is compatible with both deployment models and — because approval state is server-side — remains safe even in the discouraged direct-exposure model, where a hostile client controls every protocol field
Sensitive Data Filtering ("Tool responses may inadvertently include sensitive data … Always filter responses before sending to clients") Persisting approval requests (tool names + arguments) adds no new data class: the same store already holds snapshot.messages including tool calls and results. The data-at-rest note above applies; response filtering obligations are unchanged
Human-in-the-Loop for Sensitive Operations ("Implement approval workflows for high-risk tool operations") This defect silently breaks the article's own recommended control on AG-UI hosting (gated calls never run, auto-approval rules never apply, and the model is fed fabricated results). The fix restores the HITL control's integrity — an argument for treating this as security-relevant, not just functional

9. Impact on non-AG-UI hosts (console, DevUI, headless)

Change Console / TUI DevUI Headless hosting (agent-framework-hosting*) Workflows
Snapshot session_state (+ restore) none — AG-UI package only none (DevUI has its own host) none none
Session into _resolve_approval_responses + group release none — AG-UI internal none none none
Sanitizer guard none — AG-UI internal (normalize_agui_input_messages callers only; workflow runs already opt out via sanitize_tool_history=False) none none none
Same-run-only TOOL_CALL_END none none none none
Core: stream approval-execution results visible change: streaming consumers now receive a ChatResponseUpdate(role="tool") with the executed results when a run begins with inbound approval responses. Interactive hosts that render tool results will start showing them (previously silently absent — arguably a fix in itself). Consumers that aggregate updates into a final response will include these result contents same same AgentExecutor streams updates through; additive content, no ordering change for existing events

The core change is additive (a new update, no removed or reordered events) and mirrors the payload shape of the existing in-run execution yield. Full upstream test suites pass unmodified except the intentional snapshot field-contract test. Sessions that persist across runs in-process (console/DevUI) never hit the changed AG-UI code paths at all; their approval flow is byte-identical.

10. Related upstream issues and PRs (duplicate check, 2026-07-03)

Searched microsoft/agent-framework issues and PRs (open and closed, Python and .NET) for: approval, parallel tool calls, queued approval, session state, thread snapshot, TOOL_CALL_END/TOOL_CALL_RESULT, synthetic/skipped tool results, sanitize tool history, interrupt/resume. No existing issue reports the core defect (per-request AgentSession destroying tool-approval state for parallel batches). The closest items:

Same root-cause family (Python) — sub-symptoms this fix addresses or reshapes

Ref State Relation
#6828 open Approval-gated tool reverts to "in progress": resolved results emitted only as transient TOOL_CALL_RESULT, never recorded in flow.tool_results, so MESSAGES_SNAPSHOT omits them — exactly §4.4; the flow.tool_results companion fix resolves it
#6851 open Approved tool re-executes on a later unrelated turn (stale approval payload re-detected from snapshot); self-identifies the same _clean_resolved_approvals_from_snapshot gap as #6828. The result-pairing + answered-ids guard remove the re-detection trigger; verify against its repro before closing as fixed
#6894 open Foundry: pending_approvals registry rejects the approval → tool never executes → stuck "Running" chip + "No tool output found" crash. Different trigger (registry id mismatch on Foundry — not fixed by this branch), identical downstream symptom family; the sanitizer guard changes its failure mode
#6266 open MessagesSnapshotEvent splits/reassigns the streamed text message on mixed turns — the snapshot layout that trips the sanitizer's "assistant continued" branch (§4.3)
#4589 closed Added TOOL_CALL_RESULT emission on approval resume (_make_approval_tool_result_events) — §4.4 extends that fix from the event stream to the persisted history
#5855 closed Origin of _sanitize_tool_history's synthetic results (unpaired-call repair for OpenAI). The answered-ids guard deliberately preserves its behavior for genuinely abandoned calls

Adjacent Python issues/PRs (same subsystem, not addressed by this fix)

Ref State Relation
#5818 open AgentSession pending-request resume failure in workflows (Magentic plan review) — same class of problem (session state across pause/resume), workflow path
#6845 / PR #6867 open Allow disabling approval for SkillsProvider tools — pain amplified by this bug (load_skill could never run on AG-UI at all)
#6788 / PR #6822 open Ollama reuses tool call_ids — amplifier observed in our repro (re-issued lost calls collide with their originals)
#5941 open Multi-turn tool calls 400 on Foundry via AG-UI (replay pairing) — same strict-provider failure class as the companion ordering report; not addressed by this fix
#6652 open Forward HITL approval to hosted/remote FoundryAgent instead of executing locally — adjacent approval-routing gap
#4963 / PR #5300 open Sub-agent tool approval routing

.NET counterparts

Ref State Relation
#6756 open Implement AG-UI interrupts in .NET ("Python is currently implemented") — the port should not replicate this defect; §5–§6 of this report apply directly to that design
#2699 open .NET: AG-UI multi-turn replay produces invalid OpenAI tool_call history — .NET member of the invalid-replay family; see the companion ordering report for the Python mis-ordering variant
#3054 open .NET twin of the batch-wide approval semantics (approval asked for all functions when one is wrapped) — the same design decision Approach C would reverse; cross-language parity argues for keeping semantics aligned
#5600 open GroupChat approval flow → HTTP 400 "tool message without a preceding 'tool_calls'" — .NET manifestation of unpaired approval results (§4.4's failure mode on a strict provider)
#6786 open Durable workflows lose executor TInput state after HITL approval — .NET cousin of state loss across an approval boundary
#6220, #5587 open .NET AG-UI client errors on approval rejection / non-JSON TOOL_CALL_RESULT — client-side strictness of the same event contract that makes cross-run TOOL_CALL_END fatal (§4.5.1)
PR #5805 open .NET: fix function-approval persistence with per-service-call history persistence — .NET-side recognition that approval state needs durable pairing

PRs the proposed fix builds on

Ref State Relation
PR #6471 merged Introduced AGUIThreadSnapshot persistence/hydration — the store Approach A extends. Its motivation ("clients that resend full history … allows tampered transcripts") is the same trust argument that disqualifies Approach E for approval state
PR #6376 merged pending_approvals registry argument matching — the validation layer the released-groups path relies on (§8)
PR #3212 merged MCP tool support in AG-UI approval flows — the code path this repro exercises
PR #6646 open AG-UI workflow checkpointing — the workflow-side analogue of persisting resume state in the AG-UI host
PR #6594 open Foundry hosting: extend hosted-session scoping to approval handling — parallel effort recognizing that approval state needs session-scoped ownership on another host

11. Suggested issue framing / follow-ups

  • Primary issue: "AG-UI host loses parallel tool calls when any call requires approval (per-request AgentSession destroys tool-approval state)" — this report. Cross-link #6828 and #6851 as sub-symptoms of the same family that the accompanying branch also addresses, and #6756 so the .NET AG-UI interrupts port takes this into account.
  • Separable follow-ups worth their own issues:
    1. core: approval-execution results invisible to streaming consumers (§4.5.2) — fixes independently of AG-UI;
    2. ag-ui: adopt the AG-UI core interrupt taxonomy (reason: "tool_call", responseSchema) instead of CustomEvent + synthetic confirm_changes (§5, last row) — larger, orthogonal; overlaps with #6756's intent;
    3. harness: optional surface_all_approvals mode (Approach B) for hosts/clients that support multi-card approval UIs;
    4. ag-ui: approval-resolved results appended out of order (after the snapshot's split-off assistant text), producing protocol-invalid history for strict chat providers

Metadata

Metadata

Assignees

No one assigned

    Labels

    .NETUsage: [Issues, PRs], Target: .NetpythonUsage: [Issues, PRs], Target: PythonreproducedUsage: [Issues], Target: all issues that can be reproduced by the triage workflowtriageUsage: [Issues], Target: All issues that still need to be triaged

    Type

    Fields

    No fields configured for Bug.

    Projects

    Status
    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions