Skip to content

HttpServletStreamableServerTransportProvider orphans sessions on first transient SSE write failure (no grace period) #920

@andersenleo

Description

@andersenleo

Summary

In HttpServletStreamableServerTransportProvider, a single failed SSE write
immediately removes the session from the in-memory map, so the client's next
POST gets Session not found even though the failure was transient (LB
response-timeout, NEG rebalance, pod eviction, laptop sleep, mobile network
blip). The client is forced into a full initialize round-trip and loses
any server-side session state.

Adding a short, configurable grace period before sessions.remove(...)
— during which a reconnect with the same session id reattaches — would solve
this without touching where sessions are stored or how resumability works.

A secondary, unrelated observation about a duplicate asyncContext.complete()
call is noted at the bottom.

Relationship to existing work

I want to flag upfront how this differs from related open issues/PRs, so it's
easy to triage.

The ask here is narrower than any of those: don't drop the session on the
first transient write failure
.

Version

  • io.modelcontextprotocol.sdk:mcp-core 1.1.0
  • JDK 25, Tomcat (Spring Boot 4), behind a GKE HTTP(S) LB

Reproduction

  1. Run any gateway built on HttpServletStreamableServerTransportProvider
    behind an L7 LB that caps per-response duration below the session lifetime
    (GKE BackendConfig.timeoutSec ≤ 60 s is a clean repro).

  2. Connect an MCP client (seen with claude-code, but any streamable-HTTP
    client triggers it).

  3. Wait for the LB to close the SSE stream (~60 s in our case).

  4. Observe — server log:

    KeepAliveScheduler: Failed to send keep-alive ping to session ...:
      Did not observe any item or terminal signal within 10000ms in 'source(MonoCreate)'
    ServletStreamableServerTransportProvider:
      Failed to send message to session ...: Client disconnected
    

    Client gets Session not found on its next POST.

Root cause

HttpServletStreamableMcpSessionTransport.sendMessage (v1.1.0, lines 738–767)
hard-codes session removal in the catch block:

@Override
public Mono<Void> sendMessage(McpSchema.JSONRPCMessage message, String messageId) {
    return Mono.fromRunnable(() -> {
        ...
        try {
            ...
            String jsonText = jsonMapper.writeValueAsString(message);
            HttpServletStreamableServerTransportProvider.this.sendEvent(writer, MESSAGE_EVENT_TYPE, jsonText,
                    messageId != null ? messageId : this.sessionId);
            ...
        }
        catch (Exception e) {
            logger.error("Failed to send message to session {}: {}", this.sessionId, e.getMessage());
            HttpServletStreamableServerTransportProvider.this.sessions.remove(this.sessionId); // <— here
            this.asyncContext.complete();
        }
        ...
    });
}

No grace, no policy hook, no listener. sessions is a private final Map
and HttpServletStreamableMcpSessionTransport is a private inner class, so
downstream apps cannot override the behaviour without reflection.

Proposal

Two shapes, from least to most invasive:

1. Configurable session-retention grace period (minimal, recommended)

Add Builder.sessionReconnectGracePeriod(Duration), defaulting to
Duration.ZERO (current behaviour, fully backward compatible). On write
failure:

  • mark the session as detached (new flag on the session transport);
  • schedule a removal task at now + grace on a shared scheduled executor;
  • on an incoming GET /mcp whose session id matches, clear the detach flag,
    cancel the scheduled removal, and let the existing Last-Event-ID replay
    path (SDK lines 327–349) or Support last event Id for resumability of sse #830's event-store path run.

With grace = Duration.ZERO, behaviour is identical to today. With
grace > 0, transient blips no longer orphan clients. Composes naturally
with #914 (pluggable store) and #830 (replay).

2. Session lifecycle listener (larger, general-purpose)

Add a SessionLifecycleListener interface on the builder with
onSessionDetached, onSessionReconnected, onSessionClosed. Ship the
current eager-remove behaviour as the default listener; apps can register
a custom listener with whatever retention policy suits their deployment.

I'd prefer option 1 — it solves the concrete problem without opening a
broader API-surface question.

Secondary observation — duplicate asyncContext.complete()

Same file: the catch block in sendMessage (line 761) and close()
(line ~811) both call asyncContext.complete(). When a write failure is
followed by a close, the second call races with the servlet container's
state machine and produces:

Failed to complete async context ... Async state [COMPLETING]

Cosmetic, but pollutes logs. A flag (e.g. asyncContextCompleted set on
first call, checked before the second) would silence it.

Willing to contribute

Happy to open a PR for either of the above if the maintainers would welcome
the contribution — just want to confirm the approach and that option 1 is
the direction you'd prefer before writing code.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions