Python: [Bug]: AG-UI: 'No tool output found' on Foundry provider #6894

Description

@antsok

With CopilotKit and Foundry provider ...

When an approval-gated tool (e.g. microsoft_docs_search, approval_mode="always_require") is surfaced as an
AG-UI confirm_changes human-in-the-loop card and the user clicks Approve, the backend rejects the approval
with:

WARNING:agent_framework_ag_ui._agent_run:Rejected approval response id=call_MPgkkd1maUPTs4ToqH4Pj7ja:
    no matching pending approval request

The gated tool then never executes. Two symptoms follow from that single failure:

  1. UI: the tool chip is stuck on "Running" forever (no result ever arrives).

  2. Backend crash on the next model call:

    agent_framework.exceptions.ChatClientException: FoundryChatClient service failed to complete the prompt:
    Error code: 400 - {'error': {'message': 'No tool output found for function call
    call_MPgkkd1maUPTs4ToqH4Pj7ja.', 'type': 'invalid_request_error', 'param': 'input', 'code': None}}
    

The root cause is that the server-side pending-approval registry key is built from two different
thread_id values: the registration (during the pausing run) uses the provider's server-side
conversation id (conv_id), while the resolution (during the resume run) uses the client-supplied
thread_id. With a server-side-stateful backend (Azure AI Foundry / OpenAI Responses API) plus a client that
pins its own thread id (CopilotKit), these two values diverge and the lookup always misses.
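
The mismatch can be reproduced in isolation. The sketch below is a minimal stand-in, not the framework's code: make_key is a hypothetical helper, and the identifiers are abbreviated forms of the values observed in this report. Only the "{thread_id}:{request_id}" key format is taken from the source.

```python
from collections import OrderedDict


def make_key(thread_id: str, request_id: str) -> str:
    """Build a registry key in the "{thread_id}:{request_id}" format (hypothetical helper)."""
    return f"{thread_id}:{request_id}"


client_thread_id = "95dd2877"  # client-supplied thread id (abbreviated)
conv_id = "resp_02a1"          # provider conversation id (abbreviated)
call_id = "call_MPgk"          # gated tool call id (abbreviated)

pending_approvals: OrderedDict[str, str] = OrderedDict()

# Run 1 (pause): registration happens after thread_id was overwritten with conv_id.
pending_approvals[make_key(conv_id, call_id)] = "microsoft_docs_search"

# Run 2 (resume): resolution uses the client-supplied thread id.
lookup_key = make_key(client_thread_id, call_id)
approval_found = lookup_key in pending_approvals
print(approval_found)  # False
```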

Code Sample

Error Messages / Stack Traces

Package Versions

agent-framework-ag-ui: 1.0.0rc6, agent-framework-core: 1.10.0, agent-framework-foundry: 1.10.0

Python Version

Python 3.12

Additional Context

Reproduction

  1. Run the full stack (aspire), provider = Foundry, model gpt-5.4-mini.
  2. In the web UI send: "I want a landing zone for my AI app like Microsoft Learn recommends".
  3. The model emits a preamble + parallel tool calls, including approval-gated microsoft_docs_search.
  4. Click Approve on the confirm_changes card.
  5. Observe aspire logs:
    • Rejected approval response id=… : no matching pending approval request
    • the chip stays "Running"
    • a subsequent request fails with 400 … No tool output found for function call ….

Key identifiers from the observed run

| Role | Value |
| --- | --- |
| Client (CopilotKit) thread id | 95dd2877-ea85-4e63-9eee-878c6f162759 |
| Foundry service conversation id (run 1) | resp_02a1c1deeef7d5b2006a464e5927888197b1714203c4e1da6f |
| Gated tool call id | call_MPgkkd1maUPTs4ToqH4Pj7ja (microsoft_docs_search) |

The incoming request logged Thread ID: 95dd2877…, but RUN_STARTED advertised
thread_id: resp_02a1… — direct evidence of the mid-stream reassignment.

Preconditions that arm the trap

  1. AgentFrameworkAgent(require_confirmation=True) turns each gated tool into a confirm_changes HITL card.
  2. Several tools are gated (approval_mode="always_require").
  3. The Foundry Responses API is server-side stateful: run_agent_stream sets
    run_kwargs["options"] = {"metadata": …, "store": True}
    (_agent_run.py run_agent_stream),
    and each response carries its own conversation_id.
  4. A single in-memory registry pending_approvals (an OrderedDict on the AgentFrameworkAgent instance)
    is keyed "{thread_id}:{request_id}" and persists across runs
    (_agent.py self._pending_approvals).

Step-by-step build-up

Run 1 — pause (request → approval card)

  1. UI sends the first request with stable thread_id = 95dd2877….
  2. run_agent_stream sets thread_id = input_data["thread_id"] → 95dd2877….
  3. _resolve_approval_responses(…, thread_id=95dd2877…) runs — no approvals yet, no-op.
  4. The stream starts. On the first update the provider's conversation id is read and the variable is
    overwritten:
    conv_id = get_conversation_id_from_update(update)   # resp_02a1…
    if conv_id:
        thread_id = conv_id                              # 95dd2877… → resp_02a1…
    # NOW emit RunStarted with proper IDs
    yield RunStartedEvent(run_id=run_id, thread_id=thread_id)
  5. The model streams a phase='commentary' preamble, then parallel tool calls incl. the gated
    microsoft_docs_search (call_MPgk…).
  6. The gated call surfaces as a function_approval_request; the registry entry is written using the
    now-reassigned thread_id:
    pending_approvals[f"{thread_id}:{content.id}"] = …   # key = "resp_02a1…:call_MPgk…"
  7. The confirm_changes card (carrying function_call_id = call_MPgk…) is emitted; the run finishes,
    pausing for input.

Trap armed: the pending approval is stored under the service id resp_02a1…:call_MPgk….

User clicks Approve

  1. CopilotKit resumes with a new request, using its stable thread id 95dd2877…, and includes the
    approval response for call_MPgk… (a role="user" turn carrying function_approval_response).

Run 2 — resume (where it breaks)

  1. run_agent_stream again sets thread_id = 95dd2877… from the incoming request.
  2. Before the stream starts, _resolve_approval_responses(…, thread_id=95dd2877…) builds the lookup key:
    registry_key = f"{thread_id}:{resp_id}"     # "95dd2877…:call_MPgk…"
    if registry_key not in pending_approvals:    # registered key was "resp_02a1…:call_MPgk…"
        logger.warning("Rejected approval response id=%s: no matching pending approval request", resp_id)
  3. Key mismatch → the approval is rejected and stripped from messages. The tool is never executed;
    no function_call_output is produced for call_MPgk….

The compounding defect → hard 400

  1. Separately, the history sanitizer _sanitize_tool_history sees the function_approval_response for
    call_MPgk…, removes it from the pending set assuming "the framework will execute it," and therefore
    does not inject a synthetic skip result for it (_message_adapters.py).
    The other, ungated calls that are still pending do receive
    "Tool execution skipped - user provided follow-up message."
  2. The two subsystems disagree:
    • sanitizer: "don't add a result — it will be executed";
    • approval validator: "rejected — I will not execute it".
      Neither produces an output for call_MPgk….
  3. The outbound transcript now has 7 function_call items but only 6 function_call_output items — the
    missing one is exactly call_MPgk….
  4. The Foundry Responses API validates the request input, finds a function_call with no matching output,
    and returns 400 - No tool output found for function call call_MPgkkd1maUPTs4ToqH4Pj7ja, surfaced as
    ChatClientException / BadRequestError.
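
The failing validation can be approximated locally before a request goes out. This sketch assumes Responses-style input items, that is, dicts whose type is function_call or function_call_output and which share a call_id. find_unanswered_calls, call_ungated, and other_tool are hypothetical names used for illustration and are not part of agent_framework.

```python
from typing import Any


def find_unanswered_calls(items: list[dict[str, Any]]) -> list[str]:
    """Return call ids of function_call items that have no function_call_output."""
    answered = {
        item["call_id"] for item in items if item.get("type") == "function_call_output"
    }
    return [
        item["call_id"]
        for item in items
        if item.get("type") == "function_call" and item["call_id"] not in answered
    ]


# Two calls; only the ungated one received the synthetic skip result.
transcript = [
    {"type": "function_call", "call_id": "call_ungated", "name": "other_tool"},
    {"type": "function_call", "call_id": "call_MPgk", "name": "microsoft_docs_search"},
    {
        "type": "function_call_output",
        "call_id": "call_ungated",
        "output": "Tool execution skipped - user provided follow-up message.",
    },
]

missing = find_unanswered_calls(transcript)
print(missing)  # ['call_MPgk']
```

A guard of this shape, run on the outbound transcript, would surface the inconsistency between the sanitizer and the approval validator as a local error instead of a provider-side 400.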

Diagram

sequenceDiagram
  autonumber
  participant UI as CopilotKit UI
  participant RUN as run_agent_stream
  participant REG as pending_approvals
  participant LLM as Foundry Responses API

  rect rgb(245,245,255)
  Note over UI,LLM: RUN 1 — pause
  UI->>RUN: POST / (thread_id=95dd2877…, "landing zone…")
  Note over RUN: thread_id = 95dd2877…
  RUN->>RUN: _resolve_approval_responses (nothing pending)
  RUN->>LLM: agent.run(stream=True, store=True)
  LLM-->>RUN: first update → conv_id = resp_02a1…
  Note over RUN: thread_id = conv_id ⇒ resp_02a1…
  LLM-->>RUN: commentary preamble + gated microsoft_docs_search(call_MPgk…)
  RUN->>REG: register "resp_02a1…:call_MPgk…"
  RUN-->>UI: confirm_changes(function_call_id=call_MPgk…) + RUN_FINISHED
  end

  rect rgb(255,245,245)
  Note over UI,LLM: RUN 2 — resume (Approve)
  UI->>RUN: POST / (thread_id=95dd2877…, approval for call_MPgk…)
  Note over RUN: thread_id = 95dd2877… (pre-stream)
  RUN->>REG: lookup "95dd2877…:call_MPgk…"
  REG-->>RUN: MISS (registered under resp_02a1…)
  Note over RUN: reject + strip approval ⇒ tool NOT executed
  RUN->>RUN: sanitizer skipped injecting a result (assumed execution)
  Note over RUN: transcript has function_call call_MPgk… with NO output
  RUN->>LLM: next request (7 calls, 6 outputs)
  LLM-->>RUN: 400 No tool output found for call_MPgk…
  RUN-->>UI: ChatClientException (chip stuck "Running")
  end

Why the reassignment exists (it is intentional)

The thread_id = conv_id rewrite is a deliberate conversation-continuity mechanism for stateful backends,
not a mistake:

  • The AG-UI client contract is "the server must handle history via thread_id"
    (_client.py docstring). For a Responses-API
    backend, the only handle that can resume the stored conversation is the provider's conversation_id, so the
    server rewrites thread_id to conv_id and advertises it in RUN_STARTED.
  • The provider only reveals its conversation_id in the first streamed update, so the rewrite must happen
    mid-stream (hence "emit RunStarted after first update to get service IDs").
  • After the rewrite, the same thread_id also keys the snapshot store and the RUN_FINISHED event,
    aligning one durable id across client thread, provider conversation, and server snapshot.

The defect is that the pending-approval registry reused this lifecycle-sensitive variable without
accounting for the fact that it is mutated mid-run and is sourced differently across runs. The stable,
run-invariant id already exists: the original client thread id, captured before the rewrite into
base_metadata["ag_ui_thread_id"] (and listed in AG_UI_INTERNAL_METADATA_KEYS).
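
The lifecycle of the two ids can be sketched as a small generator. This is a simplified model under stated assumptions, not the body of run_agent_stream: simulate_run and its event dicts are hypothetical, and each element of updates stands in for the conversation id carried by one streamed update (None when absent).

```python
from typing import Iterable, Iterator, Optional


def simulate_run(client_thread_id: str, updates: Iterable[Optional[str]]) -> Iterator[dict]:
    """Yield simplified events, rewriting thread_id when the first update carries a conv id."""
    base_metadata = {"ag_ui_thread_id": client_thread_id}  # captured before any rewrite
    thread_id = client_thread_id
    started = False
    for conv_id in updates:
        if conv_id and not started:
            thread_id = conv_id  # mid-stream reassignment
        if not started:
            started = True
            yield {"type": "RUN_STARTED", "thread_id": thread_id}
        yield {
            "type": "UPDATE",
            "thread_id": thread_id,  # lifecycle-sensitive value
            "stable_id": base_metadata["ag_ui_thread_id"],  # run-invariant value
        }


events = list(simulate_run("95dd2877", ["resp_02a1", None]))
print(events[0])                # {'type': 'RUN_STARTED', 'thread_id': 'resp_02a1'}
print(events[-1]["stable_id"])  # 95dd2877
```

Anything that must match across runs can read the stable value; anything that must resume the stored provider conversation keeps using the rewritten one.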

Registry usage surface (blast radius)

| # | Site | Operation | Keyed by |
| --- | --- | --- | --- |
| 1 | _agent.py self._pending_approvals | allocation (one per agent, persists across runs) | |
| 2 | _agent.py run_agent_stream(pending_approvals=…) | pass by reference | |
| 3 | _agent_run.py registration | write (only write site) | reassigned thread_id (=conv_id) |
| 4 | _agent_run.py _evict_oldest_approvals | LRU eviction (popitem(last=False)) | oldest key |
| 5 | _agent_run.py _resolve_approval_responses | membership check (in) | incoming thread_id |
| 6 | _agent_run.py _resolve_approval_responses | read + validate (name + canonical args) | incoming thread_id |
| 7 | _agent_run.py _resolve_approval_responses | consume (del) | incoming thread_id |

_pending_approvals is private; there are no external readers or serialization. The dual-key change touches
only sites 3–7 plus the entry TypedDict.

Solution analysis

The registry only works when the key it is registered under (pause run) equals the key the client sends back
(resume run). Clients fall into two families:

  • Case A — pins its own thread id across turns (CopilotKit sends 95dd2877…).
  • Case B — echoes the server-advertised conv_id (the AG-UI reference client's documented direct pattern:
    thread_id = response.additional_properties.get("thread_id") → resend).

Resolution already keys off the incoming id; only registration uses the reassigned conv_id.

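| Backend | Client | register key | resolve key | Today | Option 1: register under client id | Option 2: request-id only | Option 3: dual-key |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stateless | any | client id (no reassignment) | client id | works | works | works | works |
| Stateful | A: pins own id | conv_id | client id | breaks | works | works | works |
| Stateful | B: echoes conv_id | conv_id | conv_id | works | breaks | works | works |
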
  • Option 1 — register under the stable client id (ag_ui_thread_id). Fixes Case A but breaks Case B
    (including the framework's own reference client) on stateful backends. Rejected.
  • Option 2 — drop the thread prefix; key by request_id (call id) only, relying on the existing
    name + canonical-argument validation. Works everywhere, minimal, but discards the per-thread scoping the
    maintainers added as defense-in-depth. Viable fallback, but weakens isolation.
  • Option 3 — dual-key: one entry registered under BOTH the client id and the conv_id. Works across all
    combinations, preserves thread scoping and name/args validation, and collapses to a single key in the
    stateless case. Requires the entry to carry its own key list so consume/eviction can purge all aliases.

Stateless LLMs are unaffected by the bug (no reassignment → both sites use the client id) and remain correct
under Options 2 and 3.
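
The comparison of options can be checked mechanically. The sketch below models each option only by which keys it registers and which key it resolves; keys_registered and key_resolved are hypothetical helpers, and the ids are the abbreviated values from this report. It does not model name/argument validation.

```python
def keys_registered(option: str, client_id: str, conv_id: str, call_id: str) -> set[str]:
    """Return the registry keys written during the pausing run under each strategy."""
    if option == "today":
        return {f"{conv_id}:{call_id}"}
    if option == "option1":
        return {f"{client_id}:{call_id}"}
    if option == "option2":
        return {call_id}
    if option == "option3":
        return {f"{client_id}:{call_id}", f"{conv_id}:{call_id}"}
    raise ValueError(option)


def key_resolved(option: str, incoming_id: str, call_id: str) -> str:
    """Return the key looked up during the resume run."""
    return call_id if option == "option2" else f"{incoming_id}:{call_id}"


CLIENT, CONV, CALL = "95dd2877", "resp_02a1", "call_MPgk"
# (label, thread_id at registration time, thread id the client sends on resume)
scenarios = [
    ("stateless", CLIENT, CLIENT),  # no reassignment: registration sees the client id
    ("stateful A", CONV, CLIENT),   # client pins its own id
    ("stateful B", CONV, CONV),     # client echoes the advertised conv_id
]

results = {
    (label, option): (
        key_resolved(option, incoming, CALL) in keys_registered(option, CLIENT, conv, CALL)
    )
    for label, conv, incoming in scenarios
    for option in ("today", "option1", "option2", "option3")
}
for (label, option), ok in results.items():
    print(f"{label:10} {option:8} {'works' if ok else 'breaks'}")
```

Under this model only two combinations miss: stateful Case A today, and stateful Case B under Option 1.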

Proposed solution — thread-safe dual-key registry

Replace the bare shared OrderedDict with a small PendingApprovalRegistry container that (a) registers each
pending approval as one entry referenced by up to two keys, (b) makes the entry self-describing so
consumption and eviction remove all of its aliases, and (c) guards every compound operation with a
threading.Lock so it is safe under concurrent runs on the same agent instance.

  1. Extend the entry TypedDict: _PendingApproval = {name, arguments, keys: list[str]}.
  2. PendingApprovalRegistry wraps an OrderedDict[str, _PendingApproval] plus a threading.Lock. It
    exposes three atomic operations plus read-only dict-like dunders (__iter__, __contains__, __len__,
    __getitem__) for introspection, and a compatibility __setitem__ that wraps a legacy bare-str value
    into a single-key entry:
    • register(keys, name, arguments) — dedupe keys, store the same entry object under each, then evict.
    • consume(key, name, arguments) -> (status, entry) — membership + name + canonical-argument validation
      and removal of all sibling keys, performed atomically under the lock. status is one of
      ok | missing | name_mismatch | arguments_mismatch.
    • entry-aware LRU eviction — when trimming to max_size, pop the oldest entry and all its alias keys.
  3. Register (site 3): compute both candidate keys and hand them to the registry:
    client_key = f"{client_thread_id}:{content.id}"   # stable client id (captured pre-reassignment)
    conv_key   = f"{thread_id}:{content.id}"           # reassigned conv_id (post-reassignment)
    pending_approvals.register([client_key, conv_key], name, canonical_args)  # dedupes when stateless
  4. Resolve/consume (sites 5–7): one atomic call replaces the previous membership-check → validate → del
    sequence:
    status, entry = pending_approvals.consume(f"{thread_id}:{resp_id}", resp_name, response_arguments)
    # log + strip on missing/name_mismatch/arguments_mismatch; accept on ok
  5. Validation semantics unchanged (name + canonical arguments), preserving anti-spoof/replay guarantees;
    they now execute inside the lock so validate-then-consume cannot race another run.

Properties:

  • Case A: resolve under client id → hits client_key. ✅
  • Case B: resolve under conv_id → hits conv_key. ✅
  • Stateless: client_key == conv_key → single key, identical to today. ✅
  • Replay protection: consume removes all sibling keys atomically → single-use even under concurrency.
  • Thread safety: every read/insert/delete/evict runs under one threading.Lock; no await is held inside
    the lock, so it is safe for both concurrent asyncio tasks and true OS threads sharing the agent instance.
  • Memory: ~2 keys/entry, bounded by entry-aware LRU eviction (max_size = 10_000).
  • str legacy variant: the compatibility __setitem__ wraps a bare-str value into a {name, keys}
    entry, so existing direct-assignment call sites keep working.
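
A minimal sketch of the proposed container follows. It is an illustration of the design described above, not the framework's implementation: the class body, the PendingApproval name, and the choice to leave an entry in place after a name or argument mismatch are assumptions. The compatibility __setitem__ and the remaining dict-like dunders are omitted for brevity.

```python
import threading
from collections import OrderedDict
from typing import Optional, TypedDict


class PendingApproval(TypedDict):
    name: str
    arguments: str   # canonical (already normalized) argument string
    keys: list[str]  # every alias this entry is stored under


class PendingApprovalRegistry:
    """Sketch of a dual-key registry; one entry may be reachable under several keys."""

    def __init__(self, max_size: int = 10_000) -> None:
        self._entries: OrderedDict[str, PendingApproval] = OrderedDict()
        self._lock = threading.Lock()
        self._max_size = max_size  # counts keys, not entries

    def __len__(self) -> int:
        with self._lock:
            return len(self._entries)

    def __contains__(self, key: object) -> bool:
        with self._lock:
            return key in self._entries

    def _remove_entry(self, entry: PendingApproval) -> None:
        # Caller must hold the lock. Purges every alias of the entry.
        for alias in entry["keys"]:
            self._entries.pop(alias, None)

    def register(self, keys: list[str], name: str, arguments: str) -> None:
        unique_keys = list(dict.fromkeys(keys))  # dedupe, preserving order
        entry: PendingApproval = {"name": name, "arguments": arguments, "keys": unique_keys}
        with self._lock:
            for key in unique_keys:
                stale = self._entries.get(key)
                if stale is not None:
                    self._remove_entry(stale)  # drop a previous entry and its aliases
            for key in unique_keys:
                self._entries[key] = entry
            # Entry-aware LRU eviction: drop the oldest entry together with its aliases.
            while len(self._entries) > self._max_size:
                _, oldest = next(iter(self._entries.items()))
                self._remove_entry(oldest)

    def consume(
        self, key: str, name: str, arguments: str
    ) -> tuple[str, Optional[PendingApproval]]:
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return "missing", None
            if entry["name"] != name:
                return "name_mismatch", entry
            if entry["arguments"] != arguments:
                return "arguments_mismatch", entry
            self._remove_entry(entry)  # single-use: all sibling keys go at once
            return "ok", entry


registry = PendingApprovalRegistry()
registry.register(["95dd2877:call_MPgk", "resp_02a1:call_MPgk"], "microsoft_docs_search", "{}")
status, _ = registry.consume("95dd2877:call_MPgk", "microsoft_docs_search", "{}")
replay_status, _ = registry.consume("resp_02a1:call_MPgk", "microsoft_docs_search", "{}")
print(status, replay_status, len(registry))  # ok missing 0
```

The usage at the end shows the Case A path: the approval resolves under the client key, and a replay through the sibling conv key finds nothing because both aliases were purged in the same locked operation.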

Related: snapshot-store keying (same root, separate structure)

The AG-UI thread snapshot store exhibits the same thread_id divergence but in a different structure:

  • Save uses the post-reassignment thread_id (conv_id): _save_thread_snapshot(…, thread_id=thread_id).
  • Hydrate/get uses the incoming thread_id: config.snapshot_store.get(…, thread_id=thread_id) and
    _hydrate_thread_snapshot(…) — both before the stream reassigns it.

Consequences:

  • A Case A client (pins its own id) that relies on the server snapshot store would miss its own
    snapshot across turns, for the same reason approvals miss.
  • It does not currently bite this deployment because CopilotKit replays history client-side, so the
    server snapshot store is effectively unused here. It remains a latent inconsistency.

Recommended handling: fix consistently by keying the snapshot store on the stable client id
(ag_ui_thread_id) as well, or by explicitly documenting that the snapshot store requires clients to echo the
advertised conv_id.
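
The divergence and the first recommended handling can be illustrated with a toy store. InMemorySnapshotStore is a hypothetical stand-in whose save/get signatures are assumptions for this sketch; it does not mirror the real snapshot store interface.

```python
from typing import Optional


class InMemorySnapshotStore:
    """Hypothetical stand-in for a snapshot store keyed by thread id."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict] = {}

    def save(self, thread_id: str, snapshot: dict) -> None:
        self._snapshots[thread_id] = snapshot

    def get(self, thread_id: str) -> Optional[dict]:
        return self._snapshots.get(thread_id)


client_thread_id, conv_id = "95dd2877", "resp_02a1"
store = InMemorySnapshotStore()

# Current behavior: save under the reassigned id, hydrate under the incoming id.
store.save(conv_id, {"turns": 1})
missed = store.get(client_thread_id)

# First recommended handling: key both operations on the stable client id.
store.save(client_thread_id, {"turns": 1})
hit = store.get(client_thread_id)
print(missed, hit)  # None {'turns': 1}
```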


Labels: python, reproduced, triage
