Summary
The MCP server serializes every tool call behind a single
Arc<Mutex<Option<Engine>>> (server.rs:821), and with_engine holds that lock
across the entire blocking hyperd operation (server.rs:1228). With no
per-operation timeout, a single slow or stalled hyperd call blocks all other
tool calls, including lightweight ones like status, for as long as the slow
call runs.
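To make the failure mode concrete, here is a minimal sketch of the locking shape described above. It is not the code in server.rs: Engine, blocking_op, status and status_latency_behind are hypothetical stand-ins, and only the Arc<Mutex<Option<Engine>>> type and the with_engine name come from this issue.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real engine; not the project's type.
struct Engine;

impl Engine {
    // Simulates a blocking hyperd call that takes `d` to return.
    fn blocking_op(&self, d: Duration) {
        thread::sleep(d);
    }

    // Trivial on its own: no I/O, returns immediately.
    fn status(&self) -> &'static str {
        "ok"
    }
}

type Shared = Arc<Mutex<Option<Engine>>>;

// Same shape as described in the summary: the lock is held for the whole call.
fn with_engine<T>(shared: &Shared, f: impl FnOnce(&Engine) -> T) -> Option<T> {
    let guard = shared.lock().unwrap();
    guard.as_ref().map(f)
}

// Measures how long a `status` call waits when it is issued while a slow
// operation already holds the engine lock.
fn status_latency_behind(slow: Duration) -> Duration {
    let shared: Shared = Arc::new(Mutex::new(Some(Engine)));
    let (locked_tx, locked_rx) = mpsc::channel();

    let worker_shared = Arc::clone(&shared);
    let worker = thread::spawn(move || {
        with_engine(&worker_shared, |engine| {
            locked_tx.send(()).unwrap(); // signal: the lock is now held
            engine.blocking_op(slow);
        });
    });

    locked_rx.recv().unwrap();
    let start = Instant::now();
    let reply = with_engine(&shared, |engine| engine.status());
    let waited = start.elapsed();

    worker.join().unwrap();
    assert_eq!(reply, Some("ok"));
    waited
}
```

Calling status_latency_behind(Duration::from_millis(300)) should report a wait close to the full 300 ms, even though status itself does no work. The wait tracks the slow call, not the cost of status.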
How it surfaced
A user reported status "hanging." Investigation showed status itself is
trivial, but it was queued behind another operation on a long-idle connection
that had gone half-open (laptop sleep / network blip). The immediate trigger —
no TCP keepalive, so a half-open socket blocked for the 2h OS idle default — is
fixed separately (TCP keepalive, ~90s dead-peer detection). This issue is the
amplifier: the global mutex turns one stalled connection into a total stall of
the MCP surface, and the missing timeout removes the only other backstop.
This became more impactful in v0.5.0, where the daemon became
resident-by-default and connections now live indefinitely across suspends.
Why this is filed as a follow-up (not fixed in the keepalive PR)
Keepalive bounds the worst case to ~90s and is a safe, surgical change.
Removing the serialization is an architectural change (concurrency model of
the engine) and deserves its own design pass rather than a rushed patch. Filing
to track it.
Options to evaluate (not a decision)
- Per-operation timeout / cancellation. Wrap blocking hyperd calls so a
stalled op can't hold the lock unboundedly. The connection builder already
exposes query_timeout (connection_builder.rs:136), but a blanket query
timeout is wrong for HyperDB, where long analytics queries are legitimate.
Any timeout must target liveness (is the peer responding), not duration
(how long the query runs).
- Connection pool instead of a single engine. Let independent tool calls
use independent connections so a slow call doesn't block unrelated ones.
Larger change; interacts with the ephemeral-primary / per-session workspace
model and the catalog-bootstrap-once logic.
- Cheap read-only fast path. Let status (and other non-engine-mutating
introspection) answer without taking the engine lock, e.g. from cached
metadata + the daemon health port, so diagnostics never hang even if the
data plane is stalled.
- Run blocking calls on spawn_blocking with a watchdog that drops/replaces
the engine if an op exceeds a liveness deadline (reusing the existing
ConnectionLost → drop-and-reconnect path in with_engine).
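The liveness-versus-duration distinction in the options above could look roughly like the sketch below. It uses plain std threads rather than spawn_blocking so it stays self-contained. Every name in it (run_with_liveness, Event, OpError, long_but_alive, wedged) is hypothetical, and it assumes the blocking call can emit some sign of progress, which the real hyperd client may or may not expose.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

// Messages the worker sends back while a blocking call is running.
enum Event<T> {
    Heartbeat, // the peer is still responding
    Done(T),   // the call finished with a result
}

#[derive(Debug, PartialEq)]
enum OpError {
    Stalled,    // no sign of life within the liveness deadline
    WorkerGone, // the worker ended without producing a result
}

// Runs `op` on its own thread. The deadline applies to the gap between
// heartbeats, not to the total runtime of the operation.
fn run_with_liveness<T, F>(liveness: Duration, op: F) -> Result<T, OpError>
where
    T: Send + 'static,
    F: FnOnce(&dyn Fn()) -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<Event<T>>();
    thread::spawn(move || {
        let beat_tx = tx.clone();
        let beat = move || {
            let _ = beat_tx.send(Event::Heartbeat);
        };
        let value = op(&beat);
        let _ = tx.send(Event::Done(value));
    });

    loop {
        match rx.recv_timeout(liveness) {
            Ok(Event::Heartbeat) => continue, // progress seen: wait again
            Ok(Event::Done(value)) => return Ok(value),
            Err(RecvTimeoutError::Timeout) => return Err(OpError::Stalled),
            Err(RecvTimeoutError::Disconnected) => return Err(OpError::WorkerGone),
        }
    }
}

// A long query that keeps reporting progress; total runtime is about 320 ms.
fn long_but_alive(beat: &dyn Fn()) -> u32 {
    for _ in 0..8 {
        thread::sleep(Duration::from_millis(40));
        beat();
    }
    42
}

// A wedged call: no progress at all for 600 ms.
fn wedged(_beat: &dyn Fn()) -> u32 {
    thread::sleep(Duration::from_millis(600));
    0
}
```

With a 200 ms liveness deadline, long_but_alive should still complete although it runs for longer than the deadline, while wedged with a 100 ms deadline should be reported as Stalled. The sketch leaves the stalled worker thread running; in a real design, that is the point where the engine would be dropped and replaced so the stuck connection is abandoned.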
Acceptance criteria (rough)
- A stalled or very slow hyperd operation on one connection does not make
unrelated tool calls (especially status) hang indefinitely.
- Legitimate long-running analytics queries are not aborted by a
duration-based cutoff.
- The fix is verified with a test that simulates a wedged/slow connection and
asserts a second concurrent call still returns (or fails fast).
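A test along the lines of the last criterion might be shaped like the sketch below. It is only an illustration under stated assumptions: Server, cached_status, try_query, CallError and check_wedged_server are hypothetical names, the wedged connection is simulated by a thread that holds the engine lock until released, and the fast path and fail-fast behaviour are just two of the options listed earlier, not a chosen design.

```rust
use std::sync::{mpsc, Arc, Mutex, RwLock};
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins; none of these names come from the codebase.
struct Engine;

#[derive(Clone)]
struct Server {
    engine: Arc<Mutex<Option<Engine>>>,
    // Snapshot that status can read without touching the engine.
    cached_status: Arc<RwLock<String>>,
}

#[derive(Debug, PartialEq)]
enum CallError {
    Busy,
}

impl Server {
    fn new() -> Self {
        Server {
            engine: Arc::new(Mutex::new(Some(Engine))),
            cached_status: Arc::new(RwLock::new(String::from("engine: connected"))),
        }
    }

    // Fast path: answers from the snapshot and never takes the engine lock.
    fn status(&self) -> String {
        self.cached_status.read().unwrap().clone()
    }

    // Fail-fast engine call: reports Busy instead of queueing behind the lock.
    fn try_query(&self) -> Result<&'static str, CallError> {
        match self.engine.try_lock() {
            Ok(_guard) => Ok("rows"),
            // Lock held elsewhere (or poisoned); a real implementation
            // would distinguish the two cases.
            Err(_) => Err(CallError::Busy),
        }
    }
}

// Wedges the engine lock on a background thread, then checks that `status`
// answers within `deadline` and that a second engine call fails fast.
fn check_wedged_server(deadline: Duration) -> (Option<String>, Result<&'static str, CallError>) {
    let server = Server::new();
    let (locked_tx, locked_rx) = mpsc::channel();
    let (release_tx, release_rx) = mpsc::channel::<()>();

    let wedged = server.clone();
    let holder = thread::spawn(move || {
        let _guard = wedged.engine.lock().unwrap();
        locked_tx.send(()).unwrap();
        // Simulates a stalled hyperd call: blocks until the test releases it.
        let _ = release_rx.recv();
    });
    locked_rx.recv().unwrap(); // the engine lock is now held

    // Run status on its own thread so a hang shows up as a timeout here
    // instead of a stuck test.
    let (status_tx, status_rx) = mpsc::channel();
    let probe = server.clone();
    thread::spawn(move || {
        let _ = status_tx.send(probe.status());
    });
    let status = status_rx.recv_timeout(deadline).ok();
    let query = server.try_query();

    release_tx.send(()).unwrap();
    holder.join().unwrap();
    (status, query)
}
```

A real test would replace the lock-holding thread with a connection that has actually gone half-open, but the assertions would keep the same shape: status returns within the deadline, and the second engine call either returns or fails fast.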
Related
fix/daemon-cold-start-dedup, PR for v0.5.1.