Commit 4b0545a

mikasenghaas, hallerite, cursoragent, and willccbb authored
resume evals (#803)
* attempt 1
* stateful load/save
* functional
* simpler
* remove old stuff
* less git diff
* fix
* update toml config
* refactor to use callbacks consistently
* correct usage of callbacks
* deprecate use_tqdm
* add docs
* fix group increments and progress init
* fix error rate by computing in metadata
* to not trigger assert
* remove hf ref
* do not show tqdm in gepa
* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata
* fix style
* allow increased num_examples
* Fix typo: 'evaluaton' -> 'evaluation' in resume log message

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Remove unused self.logger from GenerateOutputsBuilder

  The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Reuse metadata from build_metadata() instead of calling it twice per iteration

  The build_metadata() method was called twice per iteration in the as_completed loop: once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Make eval `--resume` optional and auto-detect latest incomplete run (#842)

  * Add optional --resume auto-detection for eval runs
  * Fix resume=false handling and dedupe output path resolution
  * Harden eval results path validation to require files
  * Fix append handling corrupt outputs
  * Fix resume append corruption
  * Fix resume output appending
  * Fix resume append and typing errors
  * set path create time directly
  * use -R shorthand for resume, -i for independent scoring

---------

Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>
1 parent 3270821 commit 4b0545a

14 files changed

Lines changed: 1021 additions & 275 deletions

docs/evaluation.md

Lines changed: 52 additions & 2 deletions
@@ -11,6 +11,7 @@ This section explains how to run evaluations with Verifiers environments. See [E
 - [Evaluation Scope](#evaluation-scope)
 - [Concurrency](#concurrency)
 - [Output and Saving](#output-and-saving)
+- [Resuming Evaluations](#resuming-evaluations)
 - [Environment Defaults](#environment-defaults)
 - [Multi-Environment Evaluation](#multi-environment-evaluation)
 - [TOML Configuration](#toml-configuration)
@@ -124,6 +125,7 @@ Multiple rollouts per example enable metrics like pass@k and help measure varian
 | `--max-concurrent-generation` || same as `-c` | Concurrent generation requests |
 | `--max-concurrent-scoring` || same as `-c` | Concurrent scoring requests |
 | `--no-interleave-scoring` | `-N` | false | Disable interleaved scoring |
+| `--independent-scoring` | `-i` | false | Score each rollout individually instead of by group |
 | `--max-retries` || 0 | Retries per rollout on transient `InfraError` |
 
 By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
@@ -138,12 +140,60 @@ The `--max-retries` flag enables automatic retry with exponential backoff when r
 | `--tui` | `-u` | false | Use alternate screen mode (TUI) for display |
 | `--debug` | `-d` | false | Disable Rich display; use normal logging and tqdm progress |
 | `--save-results` | `-s` | false | Save results to disk |
-| `--save-every` | `-f` | -1 | Save checkpoint every N rollouts |
+| `--resume [PATH]` | `-R` | | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
 | `--state-columns` | `-C` || Extra state columns to save (comma-separated) |
 | `--save-to-hf-hub` | `-H` | false | Push results to Hugging Face Hub |
 | `--hf-hub-dataset-name` | `-D` || Dataset name for HF Hub |
 
-Results are saved to `./outputs/evals/{env_id}--{model}/` as a Hugging Face dataset.
+Results are saved to `./outputs/evals/{env_id}--{model}/{run_id}/`, containing:
+
+- `results.jsonl` — rollout outputs, one per line
+- `metadata.json` — evaluation configuration and aggregate metrics
+
+### Resuming Evaluations
+
+Long-running evaluations can be interrupted and resumed using checkpointing. When `--save-results` is enabled, results are saved incrementally after each completed group of rollouts. Use `--resume` to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.
+
+**Running with checkpoints:**
+
+```bash
+prime eval run my-env -n 1000 -s
+```
+
+With `-s` (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.
+
+**Resuming from a checkpoint:**
+
+```bash
+prime eval run my-env -n 1000 -s --resume ./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/abc12345
+```
+
+When a resume path is provided, it must point to a valid evaluation results directory containing both `results.jsonl` and `metadata.json`. With `--resume` and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching `env_id`, `model`, and `rollouts_per_example` where saved `num_examples` is less than or equal to the current run. When resuming:
+
+1. Existing completed rollouts are loaded from the checkpoint
+2. Remaining rollouts are computed based on the example ids and group size
+3. Only incomplete rollouts are executed
+4. New results are appended to the existing checkpoint
+
+If all rollouts are already complete, the evaluation returns immediately with the existing results.
+
+**Configuration compatibility:**
+
+When resuming, the current run configuration should match the original run. Mismatches in parameters like `--model`, `--env-args`, or `--rollouts-per-example` can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing `--num-examples` if you need additional rollouts beyond the original target.
+
+**Example workflow:**
+
+```bash
+# Start a large evaluation with checkpointing
+prime eval run math-python -n 500 -r 3 -s
+
+# If interrupted, find the run directory
+ls ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/
+
+# Resume from the checkpoint
+prime eval run math-python -n 500 -r 3 -s \
+  --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
+```
 
 The `--state-columns` flag allows saving environment-specific state fields that your environment stores during rollouts:
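The resume flow documented above can be sanity-checked by hand. Below is a minimal sketch that inspects a checkpoint directory, assuming only the `results.jsonl`/`metadata.json` layout and metadata fields described in these docs; the path reuses the placeholder run id from the example workflow:

```python
import json
from pathlib import Path

run_dir = Path(
    "./environments/math_python/outputs/evals/"
    "math-python--openai--gpt-4.1-mini/abc12345"  # placeholder run id
)

meta = json.loads((run_dir / "metadata.json").read_text(encoding="utf-8"))
target = meta["num_examples"] * meta["rollouts_per_example"]

# One rollout per line; the hardened resume logic tolerates a partially
# written last line, so skip anything that fails to parse.
done = 0
for line in (run_dir / "results.jsonl").read_text(encoding="utf-8").splitlines():
    try:
        json.loads(line)
        done += 1
    except json.JSONDecodeError:
        continue

print(f"{done}/{target} rollouts completed; resumable: {done < target}")
```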

docs/reference.md

Lines changed: 1 addition & 2 deletions
@@ -598,11 +598,10 @@ class EvalConfig(BaseModel):
     independent_scoring: bool = False
     extra_env_kwargs: dict = {}
     max_retries: int = 0
-    print_results: bool = False
     verbose: bool = False
     state_columns: list[str] | None = None
     save_results: bool = False
-    save_every: int = -1
+    resume_path: Path | None = None
     save_to_hf_hub: bool = False
     hf_hub_dataset_name: str | None = None
 ```
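For orientation, `resume_path` replaces the removed `save_every` interval: checkpoints are now written after every completed group, so the only remaining knob is where to resume from. A hedged sketch of setting the field programmatically; the import location and the `env_id`/`model` constructor arguments are assumptions, not taken from this diff:

```python
from pathlib import Path

from verifiers.types import EvalConfig  # assumed import location

config = EvalConfig(
    env_id="math-python",  # assumed field, not shown in this hunk
    model="openai/gpt-4.1-mini",  # assumed field, not shown in this hunk
    save_results=True,  # checkpoints only exist on disk when saving is enabled
    resume_path=Path(
        "./environments/math_python/outputs/evals/"
        "math-python--openai--gpt-4.1-mini/abc12345"  # placeholder run id
    ),
)
```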

tests/test_environment_extra.py

Lines changed: 32 additions & 0 deletions
@@ -12,6 +12,7 @@
 from __future__ import annotations
 
 import asyncio
+import json
 from typing import Callable
 
 import pytest
@@ -222,3 +223,34 @@ def test_make_dataset_basic_without_tools(make_metadata, make_output):
     results = GenerateOutputs(outputs=[make_output()], metadata=make_metadata())
     ds = build_dataset(results)
     assert len(ds) == 1 and "foo" in ds.column_names
+
+
+@pytest.mark.asyncio
+async def test_generate_resume_raises_on_metadata_mismatch(
+    tmp_path, mock_openai_client, make_dummy_env, make_input
+):
+    env = make_dummy_env(mock_openai_client)
+
+    results_path = tmp_path / "resume"
+    results_path.mkdir()
+    (results_path / "results.jsonl").write_text("", encoding="utf-8")
+    (results_path / "metadata.json").write_text(
+        json.dumps(
+            {
+                "env_id": env.env_id,
+                "model": "test-model",
+                "num_examples": 2,
+                "rollouts_per_example": 1,
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    inputs = [make_input(example_id=0)]
+    with pytest.raises(ValueError, match="metadata mismatch"):
+        await env.generate(
+            inputs=inputs,
+            client=mock_openai_client,
+            model="test-model",
+            results_path=results_path,
+        )
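The test pins down the failure mode ("metadata mismatch") without showing the check itself. A sketch of the kind of validation it implies, using only the metadata fields the test writes; the helper name and exact comparison rules are assumptions, not code from this commit:

```python
import json
from pathlib import Path


def validate_resume_metadata(
    results_path: Path,
    env_id: str,
    model: str,
    num_examples: int,
    rollouts_per_example: int,
) -> dict:
    """Hypothetical check: refuse to resume when the saved run's identity
    fields disagree with the current run."""
    saved = json.loads(
        (results_path / "metadata.json").read_text(encoding="utf-8")
    )
    expected = {
        "env_id": env_id,
        "model": model,
        "rollouts_per_example": rollouts_per_example,
    }
    mismatched = [k for k, v in expected.items() if saved.get(k) != v]
    # Growing --num-examples is documented as allowed, so only flag a
    # current target smaller than the saved one.
    if saved.get("num_examples", 0) > num_examples:
        mismatched.append("num_examples")
    if mismatched:
        raise ValueError(f"metadata mismatch: {', '.join(mismatched)}")
    return saved
```

In the test above, the saved target (`num_examples: 2`) exceeds the single input provided, which is one way such a check would trip.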

tests/test_eval_cli.py

Lines changed: 94 additions & 0 deletions
@@ -1,5 +1,7 @@
 import argparse
+import os
 import tempfile
+import time
 from pathlib import Path
 from types import SimpleNamespace
 
@@ -42,6 +44,7 @@ def _run_cli(monkeypatch, overrides, capture_all_configs: bool = False):
         "no_interleave_scoring": False,
         "state_columns": [],
         "save_results": False,
+        "resume": None,
         "save_every": -1,
         "save_to_hf_hub": False,
         "hf_hub_dataset_name": "",
@@ -459,3 +462,94 @@ def test_load_toml_config_invalid_global_field():
         f.flush()
     with pytest.raises(ValueError):
         load_toml_config(Path(f.name))
+
+
+def test_cli_resume_explicit_path(monkeypatch, run_cli, tmp_path: Path):
+    """--resume with explicit path sets resume_path."""
+    resume_dir = tmp_path / "resume"
+    resume_dir.mkdir(parents=True)
+    (resume_dir / "results.jsonl").write_text("", encoding="utf-8")
+    (resume_dir / "metadata.json").write_text("{}", encoding="utf-8")
+
+    captured = run_cli(
+        monkeypatch,
+        {
+            "resume": str(resume_dir),
+        },
+    )
+
+    assert captured["configs"][0].resume_path == resume_dir
+
+
+def test_cli_resume_auto_detects_latest_incomplete(
+    monkeypatch, run_cli, tmp_path: Path
+):
+    """--resume with no path auto-detects latest matching incomplete run."""
+    env_id = "dummy-env"
+    model = "gpt-4.1-mini"
+    run_base = tmp_path / "outputs" / "evals" / f"{env_id}--{model.replace('/', '--')}"
+    old_run = run_base / "oldrun"
+    new_run = run_base / "newrun"
+    old_run.mkdir(parents=True)
+    new_run.mkdir(parents=True)
+
+    metadata = (
+        '{"env_id":"dummy-env","model":"gpt-4.1-mini",'
+        '"num_examples":4,"rollouts_per_example":1}'
+    )
+    (old_run / "metadata.json").write_text(metadata, encoding="utf-8")
+    (new_run / "metadata.json").write_text(metadata, encoding="utf-8")
+
+    (old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
+    (new_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
+    )
+    now = time.time()
+    os.utime(old_run, (now, now))
+    os.utime(new_run, (now + 1, now + 1))
+
+    monkeypatch.chdir(tmp_path)
+    captured = run_cli(
+        monkeypatch,
+        {
+            "resume": True,
+            "num_examples": 4,
+            "rollouts_per_example": 1,
+            "env_dir_path": str(tmp_path / "environments"),
+        },
+    )
+
+    assert captured["configs"][0].resume_path is not None
+    assert captured["configs"][0].resume_path.resolve() == new_run.resolve()
+
+
+def test_cli_toml_resume_false_disables_global_resume(monkeypatch, run_cli):
+    """Per-eval resume=false overrides global resume=true in TOML configs."""
+    with tempfile.NamedTemporaryFile(suffix=".toml", delete=False, mode="w") as f:
+        f.write(
+            "resume = true\n"
+            "\n"
+            "[[eval]]\n"
+            'env_id = "env-a"\n'
+            "\n"
+            "[[eval]]\n"
+            'env_id = "env-b"\n'
+            "resume = false\n"
+        )
+        f.flush()
+    captured = run_cli(
+        monkeypatch,
+        {
+            "env_id_or_config": f.name,
+            "num_examples": 1,
+            "rollouts_per_example": 1,
+            "env_dir_path": "./environments",
+        },
+    )
+
+    configs = captured["configs"]
+    assert len(configs) == 2
+    assert configs[0].env_id == "env-a"
+    assert configs[0].resume_path is None
+    assert configs[1].env_id == "env-b"
+    assert configs[1].resume_path is None
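Taken together, the three tests constrain how the `resume` argument must resolve to a `resume_path`. A sketch of that resolution under those constraints, built on the real `verifiers.utils.path_utils` helpers exercised in the next file; the wrapper itself is illustrative, not the CLI's actual code:

```python
from pathlib import Path

from verifiers.utils.path_utils import (
    find_latest_incomplete_eval_results_path,
    is_valid_eval_results_path,
)


def resolve_resume_path(
    resume: bool | str | None,
    env_id: str,
    model: str,
    num_examples: int,
    rollouts_per_example: int,
    env_dir_path: str,
) -> Path | None:
    if resume in (None, False):  # flag absent, or per-eval resume = false
        return None
    if isinstance(resume, str):  # explicit path: must be a valid run dir
        path = Path(resume)
        if not is_valid_eval_results_path(path):
            raise ValueError(f"not a valid eval results directory: {path}")
        return path
    # resume = True: auto-detect the latest matching incomplete run,
    # which may legitimately be None when nothing matches.
    return find_latest_incomplete_eval_results_path(
        env_id=env_id,
        model=model,
        num_examples=num_examples,
        rollouts_per_example=rollouts_per_example,
        env_dir_path=env_dir_path,
    )
```

The `None` branch explains the TOML test: both evals end with `resume_path is None`, env-b because `resume = false` disables resume outright, and env-a because auto-detection finds no matching run.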

tests/test_path_utils.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+import os
+from pathlib import Path
+
+from verifiers.utils.path_utils import (
+    find_latest_incomplete_eval_results_path,
+    is_valid_eval_results_path,
+)
+
+
+def test_find_latest_incomplete_eval_results_path_picks_newest_matching(
+    tmp_path: Path, monkeypatch
+):
+    env_id = "dummy-env"
+    model = "openai/gpt-4.1-mini"
+    runs_dir = (
+        tmp_path
+        / "outputs"
+        / "evals"
+        / f"{env_id}--{model.replace('/', '--')}"
+    )
+
+    old_run = runs_dir / "11111111"
+    new_run = runs_dir / "22222222"
+    complete_run = runs_dir / "33333333"
+    for run in [old_run, new_run, complete_run]:
+        run.mkdir(parents=True)
+
+    metadata = (
+        '{"env_id":"dummy-env","model":"openai/gpt-4.1-mini",'
+        '"num_examples":4,"rollouts_per_example":1}'
+    )
+    for run in [old_run, new_run, complete_run]:
+        (run / "metadata.json").write_text(metadata, encoding="utf-8")
+
+    (old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
+    (new_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
+    )
+    (complete_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n{"example_id":2}\n{"example_id":3}\n',
+        encoding="utf-8",
+    )
+
+    os.utime(old_run, (1, 1))
+    os.utime(new_run, (2, 2))
+    os.utime(complete_run, (3, 3))
+
+    monkeypatch.chdir(tmp_path)
+
+    result = find_latest_incomplete_eval_results_path(
+        env_id=env_id,
+        model=model,
+        num_examples=4,
+        rollouts_per_example=1,
+        env_dir_path=str(tmp_path / "environments"),
+    )
+
+    assert result is not None
+    assert result.resolve() == new_run.resolve()
+
+
+def test_find_latest_incomplete_eval_results_path_returns_none_when_no_match(
+    tmp_path: Path, monkeypatch
+):
+    monkeypatch.chdir(tmp_path)
+
+    result = find_latest_incomplete_eval_results_path(
+        env_id="dummy-env",
+        model="openai/gpt-4.1-mini",
+        num_examples=4,
+        rollouts_per_example=1,
+        env_dir_path=str(tmp_path / "environments"),
+    )
+    assert result is None
+
+
+def test_is_valid_eval_results_path_requires_files(tmp_path: Path):
+    run_dir = tmp_path / "run"
+    run_dir.mkdir()
+
+    (run_dir / "results.jsonl").mkdir()
+    (run_dir / "metadata.json").mkdir()
+
+    assert not is_valid_eval_results_path(run_dir)
+
+
+def test_is_valid_eval_results_path_accepts_expected_layout(tmp_path: Path):
+    run_dir = tmp_path / "run"
+    run_dir.mkdir()
+
+    (run_dir / "results.jsonl").write_text("", encoding="utf-8")
+    (run_dir / "metadata.json").write_text("{}", encoding="utf-8")
+
+    assert is_valid_eval_results_path(run_dir)
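These tests define the auto-detection contract: scan the run directory for the current env/model pair, keep only valid, metadata-matching, incomplete runs, and return the newest by mtime (or `None`). A sketch of that contract, ignoring the `env_dir_path` resolution the real helper also receives; this illustrates the tested behavior and is not the actual implementation in `verifiers.utils.path_utils`:

```python
import json
from pathlib import Path


def sketch_find_latest_incomplete(
    env_id: str, model: str, num_examples: int, rollouts_per_example: int
) -> Path | None:
    runs_dir = Path("outputs") / "evals" / f"{env_id}--{model.replace('/', '--')}"
    candidates = []
    for run in (runs_dir.iterdir() if runs_dir.is_dir() else []):
        metadata_file = run / "metadata.json"
        results_file = run / "results.jsonl"
        # Mirrors is_valid_eval_results_path: both entries must be files.
        if not (metadata_file.is_file() and results_file.is_file()):
            continue
        meta = json.loads(metadata_file.read_text(encoding="utf-8"))
        if meta.get("env_id") != env_id or meta.get("model") != model:
            continue
        if meta.get("rollouts_per_example") != rollouts_per_example:
            continue
        if meta.get("num_examples", 0) > num_examples:
            continue  # the saved target may only be <= the current run's
        lines = results_file.read_text(encoding="utf-8").splitlines()
        done = sum(1 for line in lines if line.strip())
        if done >= meta.get("num_examples", 0) * rollouts_per_example:
            continue  # complete runs are never resumed
        candidates.append(run)
    # Newest by directory mtime, matching the os.utime ordering above.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```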
