HttpServletStreamableServerTransportProvider orphans sessions on first transient SSE write failure (no grace period)

## Summary

In `HttpServletStreamableServerTransportProvider`, a single failed SSE write
immediately removes the session from the in-memory map, so the client's next
POST gets `Session not found` even though the failure was transient (LB
response-timeout, NEG rebalance, pod eviction, laptop sleep, mobile network
blip). The client is forced into a full `initialize` round-trip and loses
any server-side session state.

Adding a short, configurable **grace period** before `sessions.remove(...)`
— during which a reconnect with the same session id reattaches — would solve
this without touching where sessions are stored or how resumability works.

A secondary, unrelated observation about a duplicate `asyncContext.complete()`
call is noted at the bottom.

## Relationship to existing work

I want to flag upfront how this differs from related open issues/PRs, so it's
easy to triage.

- **#914** (`McpSessionStore` SPI) — changes *where* sessions live
  (pluggable store: Redis, JDBC, Hazelcast). Great for restart/cluster
  scenarios. Orthogonal to this issue: even with a Redis-backed store, one
  failed write to one client still triggers `sessions.remove(sessionId)`
  and orphans **that** client's session.
- **#830** (`Last-Event-ID` resumability) — adds an event store and wires
  replay through `GET /mcp`. Complementary to this issue but requires the
  session to still exist when the client reconnects. With today's eager
  removal, replay is unreachable for the case described here.
- **#107** (server-restart `Session not found`) — same symptom, different
  trigger (process restart vs in-process transient write failure). #914
  addresses #107; neither addresses the scenario below.
- **#274 / #738 / #376 / #201** — persistence / clustering angles, all in
  the same family as #914.

The ask here is narrower than any of those: **don't drop the session on the
first transient write failure**.

## Version

- `io.modelcontextprotocol.sdk:mcp-core` 1.1.0
- JDK 25, Tomcat (Spring Boot 4), behind a GKE HTTP(S) LB

## Reproduction

1. Run any gateway built on `HttpServletStreamableServerTransportProvider`
   behind an L7 LB that caps per-response duration below the session lifetime
   (GKE `BackendConfig.timeoutSec` ≤ 60 s is a clean repro).
2. Connect an MCP client (seen with `claude-code`, but any streamable-HTTP
   client triggers it).
3. Wait for the LB to close the SSE stream (~60 s in our case).
4. Observe — server log:

   ```
   KeepAliveScheduler: Failed to send keep-alive ping to session ...:
     Did not observe any item or terminal signal within 10000ms in 'source(MonoCreate)'
   ServletStreamableServerTransportProvider:
     Failed to send message to session ...: Client disconnected
   ```

   Client gets `Session not found` on its next POST.

## Root cause

`HttpServletStreamableMcpSessionTransport.sendMessage` (v1.1.0, lines 738–767)
hard-codes session removal in the catch block:

```java
@Override
public Mono<Void> sendMessage(McpSchema.JSONRPCMessage message, String messageId) {
    return Mono.fromRunnable(() -> {
        ...
        try {
            ...
            String jsonText = jsonMapper.writeValueAsString(message);
            HttpServletStreamableServerTransportProvider.this.sendEvent(writer, MESSAGE_EVENT_TYPE, jsonText,
                    messageId != null ? messageId : this.sessionId);
            ...
        }
        catch (Exception e) {
            logger.error("Failed to send message to session {}: {}", this.sessionId, e.getMessage());
            HttpServletStreamableServerTransportProvider.this.sessions.remove(this.sessionId); // <— here
            this.asyncContext.complete();
        }
        ...
    });
}
```

No grace, no policy hook, no listener. `sessions` is a `private final Map`
and `HttpServletStreamableMcpSessionTransport` is a private inner class, so
downstream apps cannot override the behaviour without reflection.

## Proposal

Two shapes, from least to most invasive:

### 1. Configurable session-retention grace period (minimal, recommended)

Add `Builder.sessionReconnectGracePeriod(Duration)`, defaulting to
`Duration.ZERO` (current behaviour, fully backward compatible). On write
failure:

- mark the session as detached (new flag on the session transport);
- schedule a removal task at `now + grace` on a shared scheduled executor;
- on an incoming `GET /mcp` whose session id matches, clear the detach flag,
  cancel the scheduled removal, and let the existing `Last-Event-ID` replay
  path (SDK lines 327–349) or #830's event-store path run.

With `grace = Duration.ZERO`, behaviour is identical to today. With
`grace > 0`, transient blips no longer orphan clients. Composes naturally
with #914 (pluggable store) and #830 (replay).

### 2. Session lifecycle listener (larger, general-purpose)

Add a `SessionLifecycleListener` interface on the builder with
`onSessionDetached`, `onSessionReconnected`, `onSessionClosed`. Ship the
current eager-remove behaviour as the default listener; apps can register
a custom listener with whatever retention policy suits their deployment.

I'd prefer option 1 — it solves the concrete problem without opening a
broader API-surface question.

## Secondary observation — duplicate `asyncContext.complete()`

Same file: the catch block in `sendMessage` (line 761) and `close()`
(line ~811) both call `asyncContext.complete()`. When a write failure is
followed by a close, the second call races with the servlet container's
state machine and produces:

```
Failed to complete async context ... Async state [COMPLETING]
```

Cosmetic, but pollutes logs. A flag (e.g. `asyncContextCompleted` set on
first call, checked before the second) would silence it.

## Willing to contribute

Happy to open a PR for either of the above if the maintainers would welcome
the contribution — just want to confirm the approach and that option 1 is
the direction you'd prefer before writing code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HttpServletStreamableServerTransportProvider orphans sessions on first transient SSE write failure (no grace period) #920

Summary

Relationship to existing work

Version

Reproduction

Root cause

Proposal

1. Configurable session-retention grace period (minimal, recommended)

2. Session lifecycle listener (larger, general-purpose)

Secondary observation — duplicate `asyncContext.complete()`

Willing to contribute

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

HttpServletStreamableServerTransportProvider orphans sessions on first transient SSE write failure (no grace period) #920

Description

Summary

Relationship to existing work

Version

Reproduction

Root cause

Proposal

1. Configurable session-retention grace period (minimal, recommended)

2. Session lifecycle listener (larger, general-purpose)

Secondary observation — duplicate asyncContext.complete()

Willing to contribute

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Secondary observation — duplicate `asyncContext.complete()`