Fix process termination in stdio client#496
Fix process termination in stdio client#496ashrobertsdragon wants to merge 3 commits intomodelcontextprotocol:mainfrom
Conversation
|
@jerome3o-anthropic @jspahrsummers This is my first pull request. I know am supposed to add the two of you as reviewers, but I haven't been able to figure out how to do that. I never got the gear icon that is shown in online tutorials |
src/mcp/client/stdio/__init__.py
Outdated
| else: | ||
| process.terminate() | ||
| tg.cancel_scope.cancel() | ||
| process.terminate() |
There was a problem hiding this comment.
This won't work if the process doesn't react to SIGTERM. We need a timeout and a fallback on SIGKILL.
Please test it with https://github.com/modelcontextprotocol/servers/blob/main/src/everything/index.ts#L13 (windows & unix)
…) and infinite hang (modelcontextprotocol#552) by recursively terminating all spawned child processes before primary subprocess using psutil
|
I've found a solution that works without breaking the termination of unresponsive servers. I have added psutil as a dependency and am using it to kill the child processes attached to the main subprocess before process.terminate() is called. This should negate the need for terminate_windows_process() again, but this time I kept it as a fallback. But in my tests against all of the servers in the MCP servers repo and a dozen others installed on my system on both Windows 11 and Ubuntu 22.04.5 via WSL, using Python versions 3.10-3.13, I was able to terminate all of the servers, even non-responsive ones, cleanly even when terminate_windows_process() was removed. By terminating the child processes first, we eliminate the race condition between the Windows kernel and the Python garbage collector when process.terminate() was closing those child processes. I also wrapped the process termination in the suppression of the ProcessLookupError, because one failing MCP server would find that exception being raise while terminating the process that was never fully realized. This behavior was observed in both Windows and Linux, but is now fixed. |
Kludex
left a comment
There was a problem hiding this comment.
I don't think psutil is necessary. Also, I think there's a similar PR around...
|
Thank you for your contribution! Sorry for delays in reviews. There are a few PRs solving the same issue. Let's continue discussion in #555 |
Github-Issue: #391
tg.cancel_scope.cancel()in thefinallyblock ofstdio_clientto ensure thatstdin_writerandstdout_readertasks are properly cancelledprocess.terminate()terminate_windows_process()functionMotivation and Context
This change addresses a race condition specific to Windows' Proactor event loop, where ignored exceptions would occur during shutdown of the stdio client.
The underlying problem was that subprocess transports (stdin/stdout) remained open when
process.terminate()was called. On Windows, this could trigger a race during garbage collection and transport teardown, resulting in uncatchable exceptions.The previous workaround,
terminate_windows_process(), attempted to mitigate this by waiting before forcefully terminating the process. However, on Windows,kill()is functionally identical toterminate()(https://docs.python.org/3/library/asyncio-subprocess.html#asyncio.subprocess.Process.kill), so the method provided no real benefit. Explicitly cancelling the stdio tasks ensures a clean and predictable shutdown.How Has This Been Tested?
Tested on Windows 11 by repeatedly connecting and disconnecting multiple clients over stdio transport to multiple servers found both within the repository and using installed locally.
The ignored exception could not be captured via standard exception handling (
try/except,sys.unraisablehook, or pytest'scapsys/capfd). As a result, an automated test was not feasible for this specific issue. All testing was conducted through repeated manual runs and verification of clean shutdown behavior.Breaking Changes
No breaking changes are expected. The removed function,
terminate_windows_process(), was internal and not designed for external use. Unless it was being used directly (which would be a misuse), no user code should be affected.Types of changes
Checklist
Additional context
There are two failing tests that predate this change:
test_messages_are_executed_concurrentlyfails with a timing difference of less than 5mstest_desktopfails due to assumptions about POSIX-style file pathsNo changes to documentation or error handling were necessary for this fix.