A reproducible benchmark for running LoopGain on outer loops — "fix until green" agent loops where a single iteration is an entire headless agent session, not one model call.
LoopGain is a framework- and model-agnostic convergence monitor for agent verify-revise loops. This benchmark answers one question: does the convergence math still hold when an "iteration" is a whole agent session spending real API dollars? We ran 90 real fix-until-green loops (271 worker sessions, $6.50 total), plus a matched 36-loop cross-model run, and measured it.
Every trial record, both workers, the analysis scripts, and the verdicts are in
this repo. Provenance for every number is aggregate — recomputed from the raw
trial records under results/trials/, not from a summary.
The band classifier is coherent at session scale: across all 90 loops it never
once labeled a monotonically-improving loop as DIVERGING or OSCILLATING, even
though each session does completely different work in different files. When sessions
have budget the governed stop is clean (well-budgeted cell: 30/30 converged, zero
false stops). The honest limit is the stop rule: on a deliberately budget-tight
cell it false-stopped 13 of 30 loops, 9 of them one session before the fix
would have landed — a hardcoded 2-consecutive-STALLING terminal count that is too
impatient when session-boundary progress is bursty. Raising that count from 2 to 5
halves false stops (14 → 7) while catching every genuine non-converging grind. The
savings a governed stop buys depend entirely on the baseline: ~78% against a bare
Ralph loop with no completion check, but only ~19% against a loop that already stops
on success.
Each trial is a generated scratch repo with seeded bugs and a pytest suite that acts as the spec (27–51 tests, 7–21 failing at the start). One loop runs like this:
- Spawn a real headless agent session. A minimal raw-API worker
(
claude_min_worker.py/oai_worker.py) gets two tools —bashandwrite_file— and a bounded turn budget, with one instruction: run the tests, fix the code, never touch the tests. - The harness computes the error signal, not the worker. When the session ends, the harness restores the test files from git and runs pytest itself. The failing count is the error signal. The worker never grades its own work.
LoopGain.observe(failing_count)reads the count; every iteration is a git commit, so "roll back to the best iteration" is a real SHA.- Repeat to a cap of 10 sessions — the outer-loop cap an uninstrumented loop would hit — so a governed stop can be compared against the full run's real cost.
Three cells, run on purpose:
| cell | n | session budget | design intent |
|---|---|---|---|
| convergent | 30 | 10 turns | budget comfortably exceeds the work — baseline behavior |
| plateau | 30 | 8 turns | base bugs plus a genuinely hard module (recur / calcexpr) |
| regression | 30 | 6 turns | budget-tight coupled repos — a fix often can't finish in one session |
A benchmark has to hold the driver constant at the simplest, most transparent layer. The worker here runs on the raw API, one inference per turn — not inside a production agent CLI. The reason is that "a turn" has to mean the same amount of work from one run to the next. A production CLI spends turns on orientation and tool planning, so its "turn" is not one inference, and turn-budget comparisons across different drivers are not comparing equivalent units.
We learned this the expensive way. The first run of this study used the Claude
Code CLI (claude -p) as the worker, and the budget-tight cell looked dramatic:
24 of 30 false stops. It was an artifact. At a 6-turn budget the CLI was starving
the worker — 262 of 267 of that cell's sessions hit the turn cap without finishing
— which manufactured exactly the long flat trajectories that trip the stall rule. Re-run
on the raw API at the identical cells, seeds, and budgets, the same cell mostly
converged and false stops dropped to 13/30. The 24/30 measured claude -p's turn
semantics, not anything about loops. Production-harness behavior is a real and useful
question — it is just a different, separately-labeled question. See
results/RAW-API-MAINRUN-VERDICT.md.
- Coherence: 0 violations in 90 loops. The central claim, and it holds.
- Well-budgeted cell (10 turns): 30/30 converged, 0 false stops.
- Budget-tight cell (6 turns): 13/30 false stops, 9 recoverable. 54% of that cell's sessions left the failing count unchanged — session-boundary progress is a coarse, decimated sample of the real fix trajectory, so a single flat reading is weak evidence.
- Stall-count sweep (K = 2→5): false stops fall 14 → 7; true catches stay flat at 5. Patience trades false stops for nothing on genuine grinds.
Recomputed over the cliraw cohort with two baselines
(analyze_savings_table.py). Every quantity is the
trial's mean per-session cost times a session count, so the table reflects the stop
rule, not run-to-run price noise.
| stop rule | vs a bare cap-10 loop (no completion check) | vs an until-green loop (stops on success) |
|---|---|---|
| shipped behavior (K = 2) | saves 78.3% | saves 19.4% |
| patient variant (K = 5) | saves 75.4% | saves 8.5% |
The first column is the real Ralph loop a lot of people run — grinding to a fixed cap with no success check. The honest column is the second: against a loop that already quits on success, a governed stop's marginal value is just the spend on loops that were never going to converge. Real, worth having, not the headline.
Same minimal driver, same two tools, identical budgets, same repos and bugs, two
models. The findings replicate: 0 coherence violations in all 36 loops, a clean
12-for-12 in the well-budgeted cells, and the same near-breakthrough false stops.
What differs is capability, not budget — Haiku converged 14 of 18 loops to
gpt-5-mini's 10 — but at a 6-turn budget both models reached a file edit on every
fresh session, and late no-edit sessions appeared at the same rate (~69% each). The
matched cells cost $1.26 (Haiku) vs $0.32 (gpt-5-mini) for identical work. See
results/MATCHED-HARNESS-VERDICT.md.
results/trials/ holds every record. Identify a cohort by its id prefix:
| prefix | cohort | role |
|---|---|---|
cliraw- |
90 raw-API Haiku loops | the main study |
cmin- / gptm- |
36 matched raw-API loops (Haiku / gpt-5-mini) | cross-model |
| (no prefix) | 89 claude -p CLI loops |
the superseded first run — kept as the cautionary contrast, never a headline |
gpt- |
18 early unmatched gpt loops | retained for provenance; not used in any reported result |
The loopgain library is pure-Python with zero runtime dependencies.
pip install loopgain pytest # into the interpreter you'll run with
export ANTHROPIC_API_KEY=... # and/or OPENAI_API_KEY for the gpt worker
# (or put them in a local .env)
# regenerate the analyses from the committed trial records (no API spend):
python3 analyze_driver_swap.py # main-study driver-swap aggregates
python3 analyze_matched.py # cross-model aggregates
python3 analyze_savings_table.py # the savings table
python3 sweep_kill_rule.py # the K = 2..5 stall-count sweep
# re-run the loops from scratch (spends real API dollars):
python3 validate_families.py # pre-flight: prove the bug catalogs
python3 run_fulltest.py --parallel 5Set LOOPGAIN_PY to point pytest and the worker subprocesses at a specific
interpreter if you are not running everything from one environment.
- These are scratch repos with seeded bugs, sized so hundreds of real agent sessions stay affordable — not production codebases.
- The workers are mini-tier models. Absolute dollars are small by design; the percentages are the point.
- The savings number that matters is the modest one (~19% against a loop that already stops on success). The big one is against a baseline that barely tries.
- The plateau cell stalled less than intended — the workers beat the "hard" modules more often than expected.
- A convergence monitor inherits its verifier's blind spots: a loop that converges on the wrong tests stops confidently on a wrong answer. We measure that wrong-fixed-point rate at ~4.5% in related work.
The way to know what your loop wastes is to measure your loop. This benchmark is how we measure ours.