Fix sandbox cleanup on failed rollout groups by rasdani · Pull Request #1412 · PrimeIntellect-ai/verifiers

rasdani · 2026-05-19T01:02:51Z

Summary

Fixes the sandbox lifecycle hole behind the SWE sandbox leak/churn incident: sandbox-backed rollouts can finish their env-side cleanup with keep_sandbox_for_scoring=True, leave deletion deferred to rubric cleanup, and then skip that rubric cleanup if grouped rollout gather/scoring fails or is cancelled.

This patch makes the cleanup path fail-safe:

Run rubric cleanup in finally for single-rollout and grouped scoring paths.
When grouped rollout gather fails, cancel pending sibling rollouts and cleanup any sibling states that already completed.
Continue environment/rubric/rubric-group cleanup after an earlier cleanup handler fails, then re-raise the first cleanup error after all cleanup handlers have had a chance to run.
Ensure CliAgentEnv.destroy_sandbox() still deletes the sandbox if post_rollout() fails before it can defer scoring cleanup.
For SWE taskset rubrics, skip sandbox test/scoring work when the rollout already has any error, including model/inference errors such as no-worker/no-available-worker failures, and log sandbox delete failures instead of silently swallowing them.

Leak Mechanism

SWE/composable rollouts create the sandbox during CliAgentEnv.setup_state() via SandboxMixin.create_sandbox(). For SWE scoring, CliAgentEnv.destroy_sandbox() defers deletion when keep_sandbox_for_scoring=True, so the taskset rubric can run tests and then delete the sandbox in rubric cleanup.

Before this change, Environment._run_group_states() only called rubric cleanup after asyncio.gather(*rollout_tasks) and group scoring both succeeded. A failed gather, failed scoring pass, or cancellation could bypass cleanup for already-created sandbox states. With inference down, those failures combined with rescheduling could keep allocating sandbox resources without a reliable per-group cleanup barrier.

Upstream History Checked

Fix sandbox cleanup leak paths #1381 Fix sandbox cleanup leak paths: open draft, not merged into current origin/main. Related cleanup work, but it does not fix grouped rollout gather/scoring cleanup and has an unresolved review comment on thread-local client teardown targeting.
[codex] Add structured SWE rollout failure observability #1283 [codex] Add structured SWE rollout failure observability: open draft, related observability/log classification, not a cleanup lifecycle fix.
feat: rollout timeout #1258 feat: rollout timeout: merged. Established the MultiTurnEnv finalize/finally pattern that this PR follows for scoring/rubric cleanup.
Add SWE debug environment #1306 Add SWE debug environment: merged. Related sandbox-backed SWE paths, but not the grouped rollout cleanup hole.
apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes #1284 apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes: merged. Related sandbox setup churn/flakiness, not lifecycle cleanup.
Recent path commits inspected on origin/main: 58b119fa (fix(renderer-client): translate renderers.OverlongPromptError into vf.OverlongPromptError #1408), 242f84af (fix(save_utils): reset delta baseline on non-monotonic trajectories #1400), 4f3516ef (perf(save): skip backward-byte scan on hot path #1376), d968fe1c (Tighten v1 config, bindings, and handler typing #1362), 291fd4ae (apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes #1284), 3eddb081 (Add SWE debug environment #1306), 0b743fb8 (Make sandbox worker caps configurable #1305), a0c6efd1 (Taskset Harness (v1) #1277), eae28482 (feat: rollout timeout #1258). None already covers this group cleanup finalization path.

Tests

/usr/bin/env -u PRIME_API_KEY -u PRIME_TEAM_ID -u PRIME_ORG_ID -u PRIME_SANDBOX_API_KEY -u VIRTUAL_ENV \
  uv run --no-env-file pytest tests/test_rollout_cleanup.py -q

/usr/bin/env -u PRIME_API_KEY -u PRIME_TEAM_ID -u PRIME_ORG_ID -u PRIME_SANDBOX_API_KEY -u VIRTUAL_ENV \
  uv run --no-env-file pytest tests/test_rollout_cleanup.py tests/test_decorator_ranks.py tests/test_multiturn_env.py tests/test_composable_env.py -q

uv run ruff check verifiers/envs/environment.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/rubrics/rubric.py verifiers/rubrics/rubric_group.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py tests/test_rollout_cleanup.py

uv run ruff format --check verifiers/envs/environment.py verifiers/envs/experimental/cli_agent_env.py verifiers/envs/experimental/composable/composable_env.py verifiers/rubrics/rubric.py verifiers/rubrics/rubric_group.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py tests/test_rollout_cleanup.py

git diff --check

Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.

Note

High Risk
High risk because it changes core rollout execution/cleanup semantics (task cancellation, exception propagation, and cleanup ordering) in Environment and sandbox-backed CliAgentEnv, which can affect resource lifecycle and failure behavior across many envs.

Overview
Fixes sandbox/resource leaks by making cleanup best-effort but comprehensive across failure paths.

Environment._run_rollout_state and _run_group_states now run rubric cleanup in finally, cancel pending sibling rollouts on group failure, and attempt cleanup for any states that completed before gather/scoring errored. Environment.cleanup, Rubric.cleanup, RubricGroup.cleanup, and HarnessMetricsRubricGroup.cleanup now continue executing later cleanup handlers even if an earlier one fails, then re-raise the first cleanup error.

CliAgentEnv.destroy_sandbox now deletes the sandbox (instead of deferring) if post_rollout fails, and SWE taskset rubrics short-circuit scoring when any state['error'] is present plus log sandbox deletion failures rather than silently swallowing them. Adds tests/test_rollout_cleanup.py to cover these error/cleanup scenarios.

^{Reviewed by Cursor Bugbot for commit ed56ed2. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Fix sandbox cleanup on failed rollout groups to run all cleanup handlers despite errors

Environment._run_group_states now cancels pending rollout tasks and cleans up all completed states when any rollout or group scoring fails, instead of skipping cleanup on failure.
Environment._run_rollout_state wraps scoring in try/finally so rubric cleanup always runs; if both scoring and cleanup fail, the scoring error takes priority.
Rubric.cleanup, RubricGroup.cleanup, and Environment.cleanup now run all registered handlers even if earlier ones fail, logging all errors and re-raising the first.
All SWE rubric solved methods now short-circuit to 0.0 for any non-None state.error, not just InfraError; sandbox deletion failures are now logged as warnings instead of silently suppressed.
Behavioral Change: previously a failed rollout could leave sandboxes running; now cleanup is guaranteed to be attempted for all completed states in a group.

^{Macroscope summarized ed56ed2.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.}

cursor · 2026-05-19T01:15:02Z

+                if isinstance(key, str) and isinstance(value, (int, float)):
+                    state_metrics[key] = float(value)
+        if cleanup_error is not None:
+            raise cleanup_error


HarnessMetricsRubricGroup skips parent cleanup handlers

Low Severity

HarnessMetricsRubricGroup.cleanup reimplements the rubric iteration loop from RubricGroup.cleanup without calling super().cleanup(). The parent RubricGroup.cleanup was updated in this PR to run Rubric._cleanup_handlers via super().cleanup() before iterating child rubrics, but HarnessMetricsRubricGroup.cleanup skips that entirely. This means any @vf.cleanup-decorated handlers registered on the group itself are silently dropped. Delegating to super().cleanup() wrapped in try/except (then processing harness metrics before re-raising) would eliminate the duplication and ensure consistency with the parent class's new fail-safe pattern.

Additional Locations (1)

verifiers/rubrics/rubric_group.py#L101-L120

^{Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.}

macroscopeapp · 2026-05-19T01:15:33Z

Approvability

Verdict: Needs human review

This PR changes runtime behavior by broadening error conditions that skip sandbox scoring (from InfraError-only to any error) and modifies async cleanup flows across multiple components. An open review comment also identifies that HarnessMetricsRubricGroup may not properly call parent cleanup handlers.

^{You can customize Macroscope's approvability policy. Learn more.}

Fix sandbox cleanup on failed rollouts

ed56ed2

cursor Bot reviewed May 19, 2026

View reviewed changes

rasdani marked this pull request as draft May 19, 2026 01:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sandbox cleanup on failed rollout groups#1412

Fix sandbox cleanup on failed rollout groups#1412
rasdani wants to merge 1 commit into
mainfrom
fix/sandbox-rollout-cleanup

rasdani commented May 19, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 19, 2026

Uh oh!

macroscopeapp Bot commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rasdani commented May 19, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Leak Mechanism

Upstream History Checked

Tests

Fix sandbox cleanup on failed rollout groups to run all cleanup handlers despite errors

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 19, 2026

Choose a reason for hiding this comment

HarnessMetricsRubricGroup skips parent cleanup handlers

Uh oh!

macroscopeapp Bot commented May 19, 2026

Approvability

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rasdani commented May 19, 2026 •

edited by macroscopeapp Bot

Loading