Skip to content

Fix sandbox cleanup on failed rollout groups#1412

Draft
rasdani wants to merge 1 commit into
mainfrom
fix/sandbox-rollout-cleanup
Draft

Fix sandbox cleanup on failed rollout groups#1412
rasdani wants to merge 1 commit into
mainfrom
fix/sandbox-rollout-cleanup

Conversation

@rasdani
Copy link
Copy Markdown
Contributor

@rasdani rasdani commented May 19, 2026

Summary

Fixes the sandbox lifecycle hole behind the SWE sandbox leak/churn incident: sandbox-backed rollouts can finish their env-side cleanup with keep_sandbox_for_scoring=True, leave deletion deferred to rubric cleanup, and then skip that rubric cleanup if grouped rollout gather/scoring fails or is cancelled.

This patch makes the cleanup path fail-safe:

  • Run rubric cleanup in finally for single-rollout and grouped scoring paths.
  • When grouped rollout gather fails, cancel pending sibling rollouts and cleanup any sibling states that already completed.
  • Continue environment/rubric/rubric-group cleanup after an earlier cleanup handler fails, then re-raise the first cleanup error after all cleanup handlers have had a chance to run.
  • Ensure CliAgentEnv.destroy_sandbox() still deletes the sandbox if post_rollout() fails before it can defer scoring cleanup.
  • For SWE taskset rubrics, skip sandbox test/scoring work when the rollout already has any error, including model/inference errors such as no-worker/no-available-worker failures, and log sandbox delete failures instead of silently swallowing them.

Leak Mechanism

SWE/composable rollouts create the sandbox during CliAgentEnv.setup_state() via SandboxMixin.create_sandbox(). For SWE scoring, CliAgentEnv.destroy_sandbox() defers deletion when keep_sandbox_for_scoring=True, so the taskset rubric can run tests and then delete the sandbox in rubric cleanup.

Before this change, Environment._run_group_states() only called rubric cleanup after asyncio.gather(*rollout_tasks) and group scoring both succeeded. A failed gather, failed scoring pass, or cancellation could bypass cleanup for already-created sandbox states. With inference down, those failures combined with rescheduling could keep allocating sandbox resources without a reliable per-group cleanup barrier.

Upstream History Checked

Tests

/usr/bin/env -u PRIME_API_KEY -u PRIME_TEAM_ID -u PRIME_ORG_ID -u PRIME_SANDBOX_API_KEY -u VIRTUAL_ENV \
  uv run --no-env-file pytest tests/test_rollout_cleanup.py -q

/usr/bin/env -u PRIME_API_KEY -u PRIME_TEAM_ID -u PRIME_ORG_ID -u PRIME_SANDBOX_API_KEY -u VIRTUAL_ENV \
  uv run --no-env-file pytest tests/test_rollout_cleanup.py tests/test_decorator_ranks.py tests/test_multiturn_env.py tests/test_composable_env.py -q

uv run ruff check verifiers/envs/environment.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/rubrics/rubric.py verifiers/rubrics/rubric_group.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py tests/test_rollout_cleanup.py

uv run ruff format --check verifiers/envs/environment.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/rubrics/rubric.py verifiers/rubrics/rubric_group.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py tests/test_rollout_cleanup.py

git diff --check

Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.


Note

High Risk
High risk because it changes core rollout execution/cleanup semantics (task cancellation, exception propagation, and cleanup ordering) in Environment and sandbox-backed CliAgentEnv, which can affect resource lifecycle and failure behavior across many envs.

Overview
Fixes sandbox/resource leaks by making cleanup best-effort but comprehensive across failure paths.

Environment._run_rollout_state and _run_group_states now run rubric cleanup in finally, cancel pending sibling rollouts on group failure, and attempt cleanup for any states that completed before gather/scoring errored. Environment.cleanup, Rubric.cleanup, RubricGroup.cleanup, and HarnessMetricsRubricGroup.cleanup now continue executing later cleanup handlers even if an earlier one fails, then re-raise the first cleanup error.

CliAgentEnv.destroy_sandbox now deletes the sandbox (instead of deferring) if post_rollout fails, and SWE taskset rubrics short-circuit scoring when any state['error'] is present plus log sandbox deletion failures rather than silently swallowing them. Adds tests/test_rollout_cleanup.py to cover these error/cleanup scenarios.

Reviewed by Cursor Bugbot for commit ed56ed2. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Fix sandbox cleanup on failed rollout groups to run all cleanup handlers despite errors

  • Environment._run_group_states now cancels pending rollout tasks and cleans up all completed states when any rollout or group scoring fails, instead of skipping cleanup on failure.
  • Environment._run_rollout_state wraps scoring in try/finally so rubric cleanup always runs; if both scoring and cleanup fail, the scoring error takes priority.
  • Rubric.cleanup, RubricGroup.cleanup, and Environment.cleanup now run all registered handlers even if earlier ones fail, logging all errors and re-raising the first.
  • All SWE rubric solved methods now short-circuit to 0.0 for any non-None state.error, not just InfraError; sandbox deletion failures are now logged as warnings instead of silently suppressed.
  • Behavioral Change: previously a failed rollout could leave sandboxes running; now cleanup is guaranteed to be attempted for all completed states in a group.

Macroscope summarized ed56ed2.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.

if isinstance(key, str) and isinstance(value, (int, float)):
state_metrics[key] = float(value)
if cleanup_error is not None:
raise cleanup_error
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HarnessMetricsRubricGroup skips parent cleanup handlers

Low Severity

HarnessMetricsRubricGroup.cleanup reimplements the rubric iteration loop from RubricGroup.cleanup without calling super().cleanup(). The parent RubricGroup.cleanup was updated in this PR to run Rubric._cleanup_handlers via super().cleanup() before iterating child rubrics, but HarnessMetricsRubricGroup.cleanup skips that entirely. This means any @vf.cleanup-decorated handlers registered on the group itself are silently dropped. Delegating to super().cleanup() wrapped in try/except (then processing harness metrics before re-raising) would eliminate the duplication and ensure consistency with the parent class's new fail-safe pattern.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.

@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 19, 2026

Approvability

Verdict: Needs human review

This PR changes runtime behavior by broadening error conditions that skip sandbox scoring (from InfraError-only to any error) and modifies async cleanup flows across multiple components. An open review comment also identifies that HarnessMetricsRubricGroup may not properly call parent cleanup handlers.

You can customize Macroscope's approvability policy. Learn more.

@rasdani rasdani marked this pull request as draft May 19, 2026 01:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant