Skip to content

fix(streaming): drain VM bridge on RPC ctx cancel before vmConn.Close#177

Open
ndeloof wants to merge 1 commit intocontainerd:mainfrom
ndeloof:fix-streaming-vmconn-lifetime
Open

fix(streaming): drain VM bridge on RPC ctx cancel before vmConn.Close#177
ndeloof wants to merge 1 commit intocontainerd:mainfrom
ndeloof:fix-streaming-vmconn-lifetime

Conversation

@ndeloof
Copy link
Copy Markdown

@ndeloof ndeloof commented May 5, 2026

Bug

On Windows, vmConn.Close() traverses the AF_UNIX ↔ vsock proxy and arrives at vminitd as a vsock SHUTDOWN. When the per-RPC ctx cancels mid-stream, the deferred Close() races with in-flight bridge reads; the SHUTDOWN cascades into Task.Kill -> Delete -> Shutdown, tearing down the VM (state_error="ttrpc: closed"). Linux/macOS use native vsock without the proxy, so no cascade.

Fix

In the <-ctx.Done() branch, call vmConn.SetReadDeadline(time.Now()) to interrupt the bridge's binary.Read via the Go runtime poller (no wire-level packet) and drain v2t before the deferred Close() runs.

Tests

  • TestStreamReturnsAfterContextCancelOnIdleStream — cancel ctx with no traffic; handler must drain.
  • TestStreamPreservesStartStreamCancel — cancel ctx during a slow StartStream; abort must propagate.

Both fail on baseline, pass on this PR. The 4 pre-existing tests still pass.

/sign-build

Copilot AI review requested due to automatic review settings May 5, 2026 20:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the shim streaming bridge so the VM-side stream connection is no longer tied directly to the per-RPC ttrpc context, aiming to stop normal end-of-exec context cancellation from cascading into a full VM teardown on Windows.

Changes:

  • Starts the VM stream with context.WithoutCancel(ctx) instead of the per-RPC context.
  • Removes the handler’s early return on ctx.Done() and waits only for the VM-to-client bridge to finish.
  • Expands inline documentation around the stream-closing protocol and the Windows-specific shutdown cascade.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread plugins/shim/streaming/plugin.go Outdated
Comment thread plugins/shim/streaming/plugin.go Outdated
Comment thread plugins/shim/streaming/plugin.go Outdated
@ndeloof ndeloof marked this pull request as ready for review May 6, 2026 05:09
@ndeloof ndeloof marked this pull request as draft May 6, 2026 06:13
Copilot AI review requested due to automatic review settings May 6, 2026 06:44
@ndeloof ndeloof force-pushed the fix-streaming-vmconn-lifetime branch from ba85164 to 338339c Compare May 6, 2026 06:45
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread plugins/shim/streaming/plugin.go
Comment thread plugins/shim/streaming/plugin.go Outdated
Comment thread plugins/shim/streaming/plugin_test.go Outdated
Comment thread plugins/shim/streaming/plugin_test.go Outdated
@ndeloof ndeloof force-pushed the fix-streaming-vmconn-lifetime branch from 338339c to 750e1e2 Compare May 6, 2026 12:18
@ndeloof ndeloof changed the title fix(streaming): detach vmConn lifetime from per-RPC Stream context fix(streaming): drain VM bridge on RPC ctx cancel before vmConn.Close May 6, 2026
On Windows, vmConn.Close() traverses the AF_UNIX <-> vsock proxy and
arrives at vminitd as a vsock SHUTDOWN packet. When the per-RPC ctx
cancels mid-stream, the deferred Close() fires while the VM may still
have in-flight data; the SHUTDOWN cascades into Task.Kill -> Delete ->
Shutdown and tears down the entire VM, surfacing as
state_error="rpc error: code = Unknown desc = ttrpc: closed".

Linux/macOS use native vsock without the AF_UNIX intermediate, so the
cascade does not occur there.

Empirical evidence (docker/sandboxes#2529): across 10 parallel CI runs
of identical content, every PASS produced exactly 1 EOF event on the
proxy AF_UNIX socket while every FAIL produced 13-30. Zero overlap.

This change has two pieces, both of which serialise teardown so that
v2t (the VM->client bridge) drains cleanly before vmConn.Close() runs:

1. In Stream's <-ctx.Done() branch (per-RPC cancel mid-stream): call
   vmConn.SetReadDeadline(time.Now()) to interrupt binary.Read via the
   Go runtime poller -- no wire-level packet -- and drain v2t before
   the deferred Close() fires.

2. On shim shutdown (ContainerStop of the kit container): track every
   in-flight stream's vmConn so the shutdown callback can SetReadDeadline
   on all of them and wait (via a WaitGroup) for every Stream handler
   to return before reporting back. Without this, N concurrent handlers
   race ttrpc.server.Close() and the transport may go away before all
   bridges have drained, leaving partially-drained streams whose
   eventual Close() re-triggers the SHUTDOWN cascade against the
   already-stopping VM.

   Late-arriving Stream RPCs (after shutdown has set closing=true) are
   rejected with errdefs.ErrUnavailable rather than being allowed to
   register a vmConn the drain has already finished waiting for.

Tests:

- TestStreamReturnsAfterContextCancelOnIdleStream (per-RPC drain).
- TestStreamPreservesStartStreamCancel (StartStream observes ctx).
- TestShutdownDrainsAllInFlightStreams (shim-shutdown drain across
  N concurrent streams).
- TestStreamRejectedAfterShutdown (post-shutdown new-stream rejection).
- 3 pre-existing tests still pass.

Refs: docker/sandboxes#2529

Signed-off-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Copilot AI review requested due to automatic review settings May 6, 2026 12:22
@ndeloof ndeloof force-pushed the fix-streaming-vmconn-lifetime branch from 750e1e2 to b12ea73 Compare May 6, 2026 12:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread plugins/shim/streaming/plugin_test.go
@ndeloof ndeloof marked this pull request as ready for review May 6, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants