Skip to content

StdioBridgeHost: long ExecuteMenuItem runs N times due to broker retry storm under 30s frame timeout #1130

@vrxgaminginc

Description

@vrxgaminginc

Summary

A long, synchronous ExecuteMenuItem (longer than FrameIOTimeoutMs = 30s) gets executed N times back-to-back instead of once, because StdioBridgeHost.HandleClientAsync times out the awaiting TCS but does not cancel/remove the queued command, and the broker reconnects and enqueues duplicates that all fire when the original finally completes.

In our repro: a Unity menu item that runs BuildPipeline.BuildPlayer for ~6 minutes ran 6 consecutive Android builds before we manually cancelled. Each was a full ~6-min build.

Repro

  1. Define an EditorApplication.ExecuteMenuItem-callable menu that does any synchronous work taking longer than FrameIOTimeoutMs (currently 30000 ms). Easy stub: a menu that does Thread.Sleep(35_000). Realistic case: any BuildPipeline.BuildPlayer invocation.
  2. From an MCP client (e.g. Claude Code), invoke the menu via the execute_menu_item tool.
  3. Observe in Editor.log that the menu item runs N times sequentially, where N equals the number of broker reconnect attempts before it gives up.

The Editor.log shows the same stack repeated for each iteration:

<your menu method>
MCPForUnity.Editor.Tools.ExecuteMenuItem:HandleCommand
MCPForUnity.Editor.Tools.CommandRegistry:ExecuteCommand
TransportCommandDispatcher:ProcessCommand → ProcessQueue
StdioBridgeHost:ProcessCommands
UnityEditor.EditorApplication:Internal_CallUpdateFunctions

Verified against branch beta.

Root cause

There are two independent timeout paths, both set to the same 30s value (FrameIOTimeoutMs = 30_000 in StdioBridgeHost.cs), and one of them creates the retry-storm.

Path 1 — outer per-frame TCS timeout in StdioBridgeHost.HandleClientAsync (~lines 582-637)

HandleClientAsync reads a framed command, enqueues it into commandQueue with a TaskCompletionSource<string>, then awaits that TCS with a Task.Delay(FrameIOTimeoutMs) race.

If the queued command takes longer than 30s to complete (e.g. because ExecuteMenuItem is synchronously running a 6-minute build on the Unity main thread, blocking EditorApplication.update where ProcessCommands is hooked), the outer await sees a timeout, returns a "Command processing timed out after 30000 ms" error frame to the broker, and the loop continues trying to read the next frame.

Crucially: the queued command is NOT cancelled. ExecuteQueuedCommand fires its own Runner() (async void) which only removes the entry from commandQueue in its finally (~lines 989-993) — i.e. when the build eventually finishes.

Path 2 — broker reconnects, enqueueing duplicate commands

When the broker (Python side) sees the timeout response and the original operation never completed semantically, it reconnects. HandleClientAsync detects the new connection and explicitly closes stale clients (~lines 525-537, "Closing N stale client(s) after new connection"). The new connection produces a new commandId for the retried request, which is enqueued behind the still-running first one in commandQueue.

When the first 6-minute build finally returns control to Unity, processingCommands clears, the next EditorApplication.update tick picks up duplicate #2, marks it IsExecuting = true, and runs another full build. And so on, once per stalled retry that the broker queued up. Six retries = six builds.

Why the inner dispatcher does NOT itself loop

TransportCommandDispatcher.ProcessQueue is well-behaved on this axis:

  • Sets pending.IsExecuting = true before dispatch (~lines 298-299).
  • Calls RemovePending immediately for sync results (~line 498).
  • Schedules RemovePending via delayCall from a ContinueWith for async results (~lines 442-475).
  • Guards with _processingFlag (~lines 273-316) so an in-flight ProcessQueue cannot re-enter from another EditorApplication.update tick.

The user-supplied stack repeats not because the dispatcher re-fires the same PendingCommand id, but because ProcessCommandsExecuteQueuedCommandExecuteCommandJsonAsync is invoked once per duplicate commandId that the broker enqueued at the StdioBridgeHost layer.

Suggested fixes (ordered smallest → most correct)

C — Idempotency / dedup on the bridge side (smallest patch, fully fixes symptom)

When HandleClientAsync enqueues a command, hash the payload + tool name and deduplicate against the in-flight set. If an identical command is already executing, return a "duplicate suppressed" response instead of enqueueing a second copy.

A — Make outer timeout cancel the queued work (most correct)

In HandleClientAsync, when the outer TCS race times out:

  1. Look up the QueuedCommand by commandId and call Tcs.TrySetCanceled (or set a "client gone" flag).
  2. Remove the entry from commandQueue immediately so a future ProcessCommands tick won't even start it. If it has already started (IsExecuting == true), pass cancellation down through the CancellationTokenSource already created in ExecuteQueuedCommand (~line 961) so TransportCommandDispatcher.ExecuteCommandJsonAsync honours it. The dispatcher already supports cancellation via its CancellationToken parameter.
  3. Note that EditorApplication.ExecuteMenuItem itself is synchronous and not cancellable, so cancellation only prevents future duplicates, not the in-flight build. That is acceptable — the goal is "don't run it 6 times", not "stop the running build".

B — Decouple the timeouts (architectural)

The 30s FrameIOTimeoutMs is a reasonable I/O frame timeout (don't sit on a half-read header forever), but it is the wrong ceiling for "how long an MCP command may run". Possible refactors:

  • Send the broker a periodic in-band keepalive frame while a command is executing, so the broker doesn't think the channel is stalled.
  • Configurable per-tool execution timeout (with a much higher default, e.g. 10 min) separate from the I/O frame timeout.
  • The broker side should not auto-retry execute_menu_item-class tools at all; they are not idempotent. Idempotency hint should be set on the tool metadata (McpForUnityTool attribute) and respected by the broker.

Code pointers

Concern File Approx. lines
Outer per-command TCS timeout (Path 1) MCPForUnity/Editor/Services/Transport/Transports/StdioBridgeHost.cs 582-637
commandQueue dequeue / IsExecuting set same 856-897
ExecuteQueuedCommand.Runner finally → only place stuck commands are reaped same 955-997
New-connection stale-client close same 525-537
Inner dispatcher reentrancy guard _processingFlag MCPForUnity/Editor/Services/Transport/TransportCommandDispatcher.cs 273-316
Inner dispatcher pending bookkeeping same 320-506
The synchronous, non-cancellable menu invocation MCPForUnity/Editor/Tools/ExecuteMenuItem.cs 38

The dead-letter / dedup / cancellation logic should live in StdioBridgeHost.HandleClientAsync (the only layer that knows the broker disconnected), not in TransportCommandDispatcher (which has no concept of a broker connection).

Suggested regression test

A stub tool that Thread.Sleeps 35s, invoked via the MCP client; assert it executes exactly once on the Editor side regardless of broker timeout.

Environment

  • mcp-unity branch beta
  • Unity 2021.3.45f2
  • Windows host, stdio transport
  • MCP client: Claude Code (Anthropic)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions