Skip to content

feat(orchestrator): drain sandboxes during shutdown#3005

Open
wj-e2b wants to merge 1 commit into
mainfrom
wj-orch-fix-4
Open

feat(orchestrator): drain sandboxes during shutdown#3005
wj-e2b wants to merge 1 commit into
mainfrom
wj-orch-fix-4

Conversation

@wj-e2b

@wj-e2b wj-e2b commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Introduce a shared draingate.Gate (counter plus notification channel) and use it in the sandbox factory and gRPC server to reject new sandbox starts while draining, wait for in-flight starts, and drain or force-stop live sandboxes before closers run. Forced shutdown preserves buffered close errors on context cancellation and avoids duplicate final-pass errors. Adds utils.WaitGroupWait to wait on a WaitGroup with context cancellation.

The graceful drain phase is bounded by SHUTDOWN_DRAIN_TIMEOUT; when it expires the drain escalates to a forced sandbox shutdown. By default the drain waits forever, until sandboxes exit on their own or a force-stop API call empties the node.

@cursor

cursor Bot commented Jun 13, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes the production shutdown and sandbox lifecycle path (drain ordering, force-close, and gRPC rejection during drain), which can affect rolling deploys, long-running sandboxes, and checkpoint/resume timing at node exit.

Overview
Orchestrator shutdown now drains sandboxes before tearing down other services instead of closing immediately after marking the node draining.

A reusable draingate.Gate tracks in-flight “start” work: after drain begins, new entries are rejected and Wait blocks until counters drop to zero. The gRPC server and shared sandbox factory each use a gate so API paths (create/checkpoint/acquire) return Unavailable while draining, and CreateSandbox / ResumeSandbox cannot begin once the factory is in drain mode—ordering is chosen so checkpoint can still finish a resume after the old VM is torn down.

Shutdown wiring adds optional SHUTDOWN_DRAIN_TIMEOUT (default: wait indefinitely until sandboxes exit or force-stop). The run loop calls DrainSandboxes (poll until live count is zero, then wait lifecycles) or ForceStopSandboxes on timeout/FORCE_STOP, with a follow-on forced path if graceful drain fails. Force-stop loops until in-flight starts finish, using WaitGroupWait so context cancellation still collects close errors. Minor shutdown noise fixes ignore expected logger sync/EINVAL and expected errgroup “service finished” errors.

Reviewed by Cursor Bugbot for commit 1aa9982. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/server/sandboxes.go

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Compilation errors exist in packages/orchestrator/pkg/server/main.go and packages/orchestrator/pkg/draingate/gate_test.go where wg.Go() is called on a sync.WaitGroup which does not have a Go method. In packages/orchestrator/pkg/server/main.go, calling context.WithoutCancel(ctx) strips the configured shutdown deadline, which can cause the orchestrator to hang indefinitely if a sandbox cleanup hangs. Additionally, the spin-loop using runtime.Gosched() in packages/shared/pkg/utils/waitgroup.go is flaky and can cause premature context cancellation errors under heavy CPU load.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/orchestrator/pkg/server/main.go
Comment thread packages/orchestrator/pkg/draingate/gate_test.go
stopped := make(map[string]struct{})
var errs []error

cleanupCtx := context.WithoutCancel(ctx)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Calling context.WithoutCancel(ctx) here strips the 30-second forceShutdownTimeout deadline that was explicitly configured during shutdown. This causes the subsequent sandbox close operations to run without any timeout, meaning the orchestrator can hang indefinitely if a sandbox cleanup hangs. You should use the passed ctx directly to ensure the force-shutdown phase is properly bounded.

Suggested change
cleanupCtx := context.WithoutCancel(ctx)
cleanupCtx := ctx

Comment on lines +17 to +31
for range 10 {
select {
case <-done:
return nil
default:
runtime.Gosched()
}
}

select {
case <-ctx.Done():
return fmt.Errorf("waiting for wait group: %w", ctx.Err())
case <-done:
return nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The spin-loop using runtime.Gosched() is flaky and does not guarantee that the goroutine closing done will be scheduled and executed within 10 iterations. Under heavy CPU load or thread starvation, the loop can easily fall through to the select block and return a context error even if the WaitGroup was already completed, making the associated test fragile. You should use a standard select block to wait for either the WaitGroup completion or context cancellation.

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("waiting for wait group: %w", ctx.Err())
	}

@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2974 1 2973 8
View the top 1 failed test(s) by shortest run time
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics
Stack Traces | 15s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:47: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:47
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (14.98s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d6b2e7976

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

defer func(g *errgroup.Group) {
err := g.Wait()
if err != nil {
if err != nil && !isServiceDoneError(err) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve service failures when ignoring shutdown sentinels

When a service exits because f() returned an actual error (for example a non-benign grpcServer.Serve or cmuxServer.Serve failure), startService sends that error on serviceError but then returns only serviceDoneError; the serviceError select branch only logs it and never flips success to false. With this new guard, the deferred g.Wait() now also ignores the only error value it can see, so Run can report a clean shutdown and remove the lock file after a real service failure. Either preserve the original service error in the errgroup result or mark success = false when the serviceError branch receives a non-nil error.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Comment thread packages/orchestrator/pkg/factories/run.go
Introduce a shared draingate.Gate (counter plus notification channel)
and use it in the sandbox factory and gRPC server to reject new
sandbox starts while draining, wait for in-flight starts, and drain
or force-stop live sandboxes before closers run. Forced shutdown
preserves buffered close errors on context cancellation and avoids
duplicate final-pass errors. Adds utils.WaitGroupWait to wait on a
WaitGroup with context cancellation.

The graceful drain phase is bounded by SHUTDOWN_DRAIN_TIMEOUT; when
it expires the drain escalates to a forced sandbox shutdown. By
default the drain waits forever, until sandboxes exit on their own or
a force-stop API call empties the node.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1aa9982. Configure here.

s.StartDraining(ctx)
// The sandbox factory is shared by API sandboxes and template-build sandboxes.
// Drain it before waiting so no new starts can enter while shutdown proceeds.
s.sandboxFactory.StartDraining(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force stop drains factory early

Medium Severity

ForceStopSandboxes calls sandboxFactory.StartDraining before in-flight gRPC sandbox starts finish, while DrainSandboxes waits on the server drain gate first so checkpoint resume can still enter the factory. A FORCE_STOP shutdown (or any path that only uses force stop) can reject ResumeSandbox mid-checkpoint or mid-create after the server gate already admitted the RPC.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 1aa9982. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant