Skip to content

Fix sandbox cleanup leak paths#1381

Draft
rasdani wants to merge 1 commit into
mainfrom
codex/sandbox-cleanup-weak-spots
Draft

Fix sandbox cleanup leak paths#1381
rasdani wants to merge 1 commit into
mainfrom
codex/sandbox-cleanup-weak-spots

Conversation

@rasdani
Copy link
Copy Markdown
Contributor

@rasdani rasdani commented May 15, 2026

Summary

Fixes four sandbox cleanup weak spots with focused fake-backed repro tests, without requiring real remote sandboxes or model inference.

  1. ThreadedAsyncSandboxClient.teardown() now closes each worker thread's thread-local AsyncSandboxClient with aclose() before executor shutdown, so keepalive sockets/fds are not left to interpreter teardown.
  2. SandboxMixin.create_sandbox() now handles cancellation during an in-flight create() by waiting for the shielded create result and deleting any sandbox that was created before re-raising cancellation, instead of scheduling an untracked orphan delete task.
  3. Environment.cleanup() and Rubric.cleanup() now attempt all cleanup handlers and then re-raise the first cleanup failure, so later cleanup handlers such as sandbox destruction are not skipped.
  4. SandboxMixin.run_background_job() now classifies prime_sandboxes.CommandTimeoutError as a sandbox timeout: it sets state["sandbox_timeout"] = True and raises vf.SandboxError consistently with SandboxTimeoutError.

Tests / Repros Added

  • tests/test_threaded_sandbox_client.py::test_teardown_closes_per_thread_async_clients
    • Forces two worker threads with fake async clients and asserts both clients receive aclose() during teardown.
  • tests/test_sandbox_mixin.py::test_create_sandbox_cancellation_deletes_sandbox_created_after_cancel
    • Cancels while create() is pending, returns a fake sandbox after cancellation, blocks fake deletion, and asserts cancellation does not finish until deletion is drained.
  • tests/test_decorator_ranks.py::TestCleanupPriorityOrdering::test_cleanup_failure_does_not_skip_later_handler
    • Proves an environment cleanup handler that raises does not skip a later destroy-style handler.
  • tests/test_decorator_ranks.py::TestRubricCleanupTeardown::test_rubric_cleanup_failure_does_not_skip_later_handler
    • Proves a rubric cleanup handler that raises does not skip a later destroy-style handler.
  • Updated tests/test_sandbox_mixin.py::test_run_background_job_command_timeout
    • Proves command timeouts set sandbox_timeout and raise vf.SandboxError.

Validation

/usr/bin/env -u PRIME_API_KEY -u PRIME_TEAM_ID -u PRIME_ORG_ID -u PRIME_SANDBOX_API_KEY -u VIRTUAL_ENV \
  uv run --no-env-file pytest tests/test_threaded_sandbox_client.py tests/test_sandbox_mixin.py tests/test_decorator_ranks.py -q

/usr/bin/env -u VIRTUAL_ENV \
  uv run --no-env-file ruff check verifiers/utils/threaded_sandbox_client.py verifiers/envs/experimental/sandbox_mixin.py verifiers/envs/environment.py verifiers/rubrics/rubric.py tests/test_threaded_sandbox_client.py tests/test_sandbox_mixin.py tests/test_decorator_ranks.py

git diff --check

Local pre-push hooks also passed: ruff check, ruff format, Semgrep v1 policy, Sync AGENTS.md, and ty (ci parity).

Note

Fix sandbox cleanup leak paths by continuing after handler failures and closing per-thread clients on teardown

  • Environment.cleanup and Rubric.cleanup now catch exceptions from individual handlers, log them, continue running remaining handlers, and re-raise only the first exception after all handlers complete.
  • ThreadedAsyncSandboxClient tracks which worker threads have created per-thread AsyncSandboxClient instances and closes them during teardown, preventing connection leaks on shutdown.
  • SandboxMixin.run_background_job now catches CommandTimeoutError, sets state['sandbox_timeout'] = True, and raises vf.SandboxError instead of propagating the raw timeout error.
  • Behavioral Change: cleanup methods previously stopped at the first failure; now all handlers always run, which may surface previously hidden errors in later handlers.
📊 Macroscope summarized 246bf91. 4 files reviewed, 1 issue evaluated, 0 issues filtered, 1 comment posted

🗂️ Filtered Issues

self._client_thread_ids.discard(thread_id)
return thread_id, True

def _close_thread_clients(self, wait: bool) -> None:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium utils/threaded_sandbox_client.py:133

_close_thread_clients submits one task per pending_thread_id, but ThreadPoolExecutor assigns tasks to arbitrary available workers—not necessarily the threads that own the clients. If workers T1-T3 have clients but T4-T6 pick up the close tasks, the barrier completes, those threads find no clients and return False, leaving the actual clients unclosed. Consider tracking which worker thread owns each client and targeting close operations to those specific threads.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file verifiers/utils/threaded_sandbox_client.py around line 133:

`_close_thread_clients` submits one task per `pending_thread_id`, but `ThreadPoolExecutor` assigns tasks to arbitrary available workers—not necessarily the threads that own the clients. If workers T1-T3 have clients but T4-T6 pick up the close tasks, the barrier completes, those threads find no clients and return `False`, leaving the actual clients unclosed. Consider tracking which worker thread owns each client and targeting close operations to those specific threads.

Evidence trail:
verifiers/utils/threaded_sandbox_client.py lines 107-170 (REVIEWED_COMMIT): `_close_thread_client` checks thread-local `sandbox_client` on whatever thread runs it; `_close_thread_clients` submits N tasks but ThreadPoolExecutor picks arbitrary idle threads. verifiers/utils/thread_utils.py lines 1-10 (REVIEWED_COMMIT): `THREAD_LOCAL_STORAGE = threading.local()` is module-level, so each thread has its own view. Python docs: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor — no mechanism to target specific worker threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant