Fix sandbox cleanup on failed rollout groups#1412
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.
| if isinstance(key, str) and isinstance(value, (int, float)): | ||
| state_metrics[key] = float(value) | ||
| if cleanup_error is not None: | ||
| raise cleanup_error |
There was a problem hiding this comment.
HarnessMetricsRubricGroup skips parent cleanup handlers
Low Severity
HarnessMetricsRubricGroup.cleanup reimplements the rubric iteration loop from RubricGroup.cleanup without calling super().cleanup(). The parent RubricGroup.cleanup was updated in this PR to run Rubric._cleanup_handlers via super().cleanup() before iterating child rubrics, but HarnessMetricsRubricGroup.cleanup skips that entirely. This means any @vf.cleanup-decorated handlers registered on the group itself are silently dropped. Delegating to super().cleanup() wrapped in try/except (then processing harness metrics before re-raising) would eliminate the duplication and ensure consistency with the parent class's new fail-safe pattern.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.
ApprovabilityVerdict: Needs human review This PR changes runtime behavior by broadening error conditions that skip sandbox scoring (from InfraError-only to any error) and modifies async cleanup flows across multiple components. An open review comment also identifies that HarnessMetricsRubricGroup may not properly call parent cleanup handlers. You can customize Macroscope's approvability policy. Learn more. |


Summary
Fixes the sandbox lifecycle hole behind the SWE sandbox leak/churn incident: sandbox-backed rollouts can finish their env-side cleanup with
keep_sandbox_for_scoring=True, leave deletion deferred to rubric cleanup, and then skip that rubric cleanup if grouped rollout gather/scoring fails or is cancelled.This patch makes the cleanup path fail-safe:
finallyfor single-rollout and grouped scoring paths.CliAgentEnv.destroy_sandbox()still deletes the sandbox ifpost_rollout()fails before it can defer scoring cleanup.Leak Mechanism
SWE/composable rollouts create the sandbox during
CliAgentEnv.setup_state()viaSandboxMixin.create_sandbox(). For SWE scoring,CliAgentEnv.destroy_sandbox()defers deletion whenkeep_sandbox_for_scoring=True, so the taskset rubric can run tests and then delete the sandbox in rubric cleanup.Before this change,
Environment._run_group_states()only called rubric cleanup afterasyncio.gather(*rollout_tasks)and group scoring both succeeded. A failed gather, failed scoring pass, or cancellation could bypass cleanup for already-created sandbox states. With inference down, those failures combined with rescheduling could keep allocating sandbox resources without a reliable per-group cleanup barrier.Upstream History Checked
Fix sandbox cleanup leak paths: open draft, not merged into currentorigin/main. Related cleanup work, but it does not fix grouped rollout gather/scoring cleanup and has an unresolved review comment on thread-local client teardown targeting.[codex] Add structured SWE rollout failure observability: open draft, related observability/log classification, not a cleanup lifecycle fix.feat: rollout timeout: merged. Established theMultiTurnEnvfinalize/finally pattern that this PR follows for scoring/rubric cleanup.Add SWE debug environment: merged. Related sandbox-backed SWE paths, but not the grouped rollout cleanup hole.apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes: merged. Related sandbox setup churn/flakiness, not lifecycle cleanup.origin/main:58b119fa(fix(renderer-client): translate renderers.OverlongPromptError into vf.OverlongPromptError #1408),242f84af(fix(save_utils): reset delta baseline on non-monotonic trajectories #1400),4f3516ef(perf(save): skip backward-byte scan on hot path #1376),d968fe1c(Tighten v1 config, bindings, and handler typing #1362),291fd4ae(apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes #1284),3eddb081(Add SWE debug environment #1306),0b743fb8(Make sandbox worker caps configurable #1305),a0c6efd1(Taskset Harness (v1) #1277),eae28482(feat: rollout timeout #1258). None already covers this group cleanup finalization path.Tests
Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.
Note
High Risk
High risk because it changes core rollout execution/cleanup semantics (task cancellation, exception propagation, and cleanup ordering) in
Environmentand sandbox-backedCliAgentEnv, which can affect resource lifecycle and failure behavior across many envs.Overview
Fixes sandbox/resource leaks by making cleanup best-effort but comprehensive across failure paths.
Environment._run_rollout_stateand_run_group_statesnow run rubric cleanup infinally, cancel pending sibling rollouts on group failure, and attempt cleanup for any states that completed beforegather/scoring errored.Environment.cleanup,Rubric.cleanup,RubricGroup.cleanup, andHarnessMetricsRubricGroup.cleanupnow continue executing later cleanup handlers even if an earlier one fails, then re-raise the first cleanup error.CliAgentEnv.destroy_sandboxnow deletes the sandbox (instead of deferring) ifpost_rolloutfails, and SWE taskset rubrics short-circuit scoring when anystate['error']is present plus log sandbox deletion failures rather than silently swallowing them. Addstests/test_rollout_cleanup.pyto cover these error/cleanup scenarios.Reviewed by Cursor Bugbot for commit ed56ed2. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fix sandbox cleanup on failed rollout groups to run all cleanup handlers despite errors
Environment._run_group_statesnow cancels pending rollout tasks and cleans up all completed states when any rollout or group scoring fails, instead of skipping cleanup on failure.Environment._run_rollout_statewraps scoring in try/finally so rubric cleanup always runs; if both scoring and cleanup fail, the scoring error takes priority.Rubric.cleanup,RubricGroup.cleanup, andEnvironment.cleanupnow run all registered handlers even if earlier ones fail, logging all errors and re-raising the first.solvedmethods now short-circuit to0.0for any non-Nonestate.error, not justInfraError; sandbox deletion failures are now logged as warnings instead of silently suppressed.Macroscope summarized ed56ed2.