Summary

In HttpServletStreamableServerTransportProvider, a single failed SSE write
immediately removes the session from the in-memory map, so the client's next
POST gets "Session not found" even though the failure was transient (LB
response timeout, NEG rebalance, pod eviction, laptop sleep, mobile-network
blip). The client is forced into a full initialize round-trip and loses
any server-side session state.
Adding a short, configurable grace period before sessions.remove(...)
— during which a reconnect with the same session id reattaches — would solve
this without touching where sessions are stored or how resumability works.
A secondary, unrelated observation about a duplicate asyncContext.complete()
call is noted at the bottom.
Relationship to existing work
I want to flag upfront how this differs from related open issues/PRs, so it's
easy to triage.
- feat: add McpSessionStore SPI for pluggable session storage #914
  (McpSessionStore SPI) — changes where sessions live (pluggable store:
  Redis, JDBC, Hazelcast). Great for restart/cluster scenarios. Orthogonal
  to this issue: even with a Redis-backed store, one failed write to one
  client still triggers sessions.remove(sessionId) and orphans that
  client's session.
- Support last event Id for resumability of sse #830 (Last-Event-ID
  resumability) — adds an event store and wires replay through GET /mcp.
  Complementary to this issue but requires the session to still exist when
  the client reconnects. With today's eager removal, replay is unreachable
  for the case described here.
- Session not found handling #107 — same symptom, different trigger
  (process restart vs in-process transient write failure), and in the same
  family as #914. #914 addresses #107; neither addresses the scenario
  below.

The ask here is narrower than any of those: don't drop the session on the
first transient write failure.

Version

io.modelcontextprotocol.sdk:mcp-core 1.1.0
Reproduction

1. Run any gateway built on HttpServletStreamableServerTransportProvider
   behind an L7 LB that caps per-response duration below the session
   lifetime (GKE BackendConfig.timeoutSec ≤ 60 s is a clean repro).
2. Connect an MCP client (seen with claude-code, but any streamable-HTTP
   client triggers it).
3. Wait for the LB to close the SSE stream (~60 s in our case).
4. Observe — server log:

   KeepAliveScheduler: Failed to send keep-alive ping to session ...:
   Did not observe any item or terminal signal within 10000ms in 'source(MonoCreate)'
   ServletStreamableServerTransportProvider:
   Failed to send message to session ...: Client disconnected

5. Client gets Session not found on its next POST.
Root cause
HttpServletStreamableMcpSessionTransport.sendMessage (v1.1.0, lines 738–767)
hard-codes session removal in the catch block.
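In condensed form, the failure path behaves like the following sketch (field
and method names are paraphrased, not the SDK's exact code):

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Condensed sketch of the eager-removal pattern; names paraphrased.
class EagerRemovalSketch {

    final Map<String, Object> sessions = new ConcurrentHashMap<>();

    void sendMessage(String sessionId, String message) {
        try {
            writeSseEvent(sessionId, message); // may fail transiently
        }
        catch (IOException e) {
            // One failed write permanently discards the session: the
            // client's next POST with this id gets "Session not found".
            sessions.remove(sessionId);
        }
    }

    void writeSseEvent(String sessionId, String message) throws IOException {
        // Simulate the LB closing the SSE stream mid-session.
        throw new IOException("Client disconnected");
    }
}
```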
No grace, no policy hook, no listener. sessions is a private final Map
and HttpServletStreamableMcpSessionTransport is a private inner class, so
downstream apps cannot override the behaviour without reflection.
Proposal
Two shapes, from least to most invasive:
1. Configurable session-retention grace period (minimal, recommended)

Add Builder.sessionReconnectGracePeriod(Duration), defaulting to
Duration.ZERO (current behaviour, fully backward compatible). On write
failure:

- mark the session as detached (new flag on the session transport);
- schedule a removal task at now + grace on a shared scheduled executor;
- on an incoming GET /mcp whose session id matches, clear the detach flag,
  cancel the scheduled removal, and let the existing Last-Event-ID replay
  path (SDK lines 327–349) or Support last event Id for resumability of
  sse #830's event-store path run.

With grace = Duration.ZERO, behaviour is identical to today. With
grace > 0, transient blips no longer orphan clients. Composes naturally
with #914 (pluggable store) and #830 (replay).
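The detach / schedule / reattach steps above could be sketched roughly like
this (class, method, and field names are all illustrative, not a proposed
SDK API):

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.*;

// Illustrative sketch of the proposed grace-period behaviour.
class GracefulSessionRegistry {

    private final Map<String, Object> sessions = new ConcurrentHashMap<>();
    private final Map<String, ScheduledFuture<?>> pendingRemovals = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "session-reaper");
                t.setDaemon(true); // don't keep the JVM alive for cleanup work
                return t;
            });
    private final Duration grace;

    GracefulSessionRegistry(Duration grace) {
        this.grace = grace;
    }

    void register(String id, Object session) {
        sessions.put(id, session);
    }

    // Called from the write-failure catch block instead of sessions.remove(id).
    void onWriteFailure(String id) {
        if (grace.isZero()) {
            sessions.remove(id); // grace = ZERO keeps today's eager removal
            return;
        }
        // Mark detached by parking a delayed removal task.
        pendingRemovals.put(id, scheduler.schedule(
                () -> { sessions.remove(id); pendingRemovals.remove(id); },
                grace.toMillis(), TimeUnit.MILLISECONDS));
    }

    // Called when a GET /mcp arrives with a known session id.
    boolean reattach(String id) {
        ScheduledFuture<?> removal = pendingRemovals.remove(id);
        if (removal != null) {
            removal.cancel(false); // reconnect within grace: keep the session
        }
        return sessions.containsKey(id);
    }
}
```

A reconnect within the grace window cancels the pending removal and the
existing replay path can then run against the still-live session.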
2. Session lifecycle listener (larger, general-purpose)

Add a SessionLifecycleListener interface on the builder with
onSessionDetached, onSessionReconnected, and onSessionClosed. Ship the
current eager-remove behaviour as the default listener; apps can register
a custom listener with whatever retention policy suits their deployment.
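One possible shape for that interface and its default implementation, purely
for discussion (all names hypothetical):

```java
import java.util.Map;

// Hypothetical listener SPI; names are illustrative only.
interface SessionLifecycleListener {

    // A write to the session's SSE stream failed; the session is still in the map.
    void onSessionDetached(String sessionId);

    // A GET /mcp with a known session id arrived while the session was detached.
    void onSessionReconnected(String sessionId);

    // The session was removed for good.
    void onSessionClosed(String sessionId);
}

// Default listener reproducing today's eager removal.
class EagerRemovalListener implements SessionLifecycleListener {

    private final Map<String, Object> sessions;

    EagerRemovalListener(Map<String, Object> sessions) {
        this.sessions = sessions;
    }

    public void onSessionDetached(String id) { sessions.remove(id); }

    public void onSessionReconnected(String id) { /* never fires with eager removal */ }

    public void onSessionClosed(String id) { sessions.remove(id); }
}
```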
I'd prefer option 1 — it solves the concrete problem without opening a
broader API-surface question.
Secondary observation — duplicate asyncContext.complete()

Same file: the catch block in sendMessage (line 761) and close()
(line ~811) both call asyncContext.complete(). When a write failure is
followed by a close, the second call races with the servlet container's
state machine and produces:

Failed to complete async context ... Async state [COMPLETING]

Cosmetic, but it pollutes the logs. A flag (e.g. asyncContextCompleted, set
on the first call and checked before the second) would silence it.
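A compareAndSet guard would make the completion idempotent; a minimal sketch,
with a counter standing in for the real jakarta.servlet.AsyncContext so the
example is self-contained:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of an idempotent complete() guard.
class GuardedAsyncContext {

    private final AtomicBoolean completed = new AtomicBoolean(false);
    private int completions = 0; // instrumentation for this sketch only

    // Both the sendMessage catch block and close() would call this instead
    // of asyncContext.complete() directly; only the first call goes through.
    void completeOnce() {
        if (completed.compareAndSet(false, true)) {
            completions++; // real code: asyncContext.complete();
        }
        // Later calls are silently ignored, so the container never logs
        // "Async state [COMPLETING]".
    }

    int completions() { return completions; }
}
```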
Willing to contribute
Happy to open a PR for either of the above if the maintainers would welcome
the contribution — just want to confirm the approach and that option 1 is
the direction you'd prefer before writing code.