resume evals by mikasenghaas · Pull Request #803 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-01-29T15:17:25Z

Description

This PR implements incremental saving (new rollouts are appended to a file instead of rewriting the whole file with all rollouts on every save) and resumable evals. Incremental saving avoids unnecessary I/O and makes resuming evals much safer, because accidental data loss is less likely. Resumable evals are useful for long-running evals and synthetic data generation runs, especially against flaky APIs.
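To make the difference between the two saving strategies concrete, here is a minimal sketch. It is not the code from this PR: `append_rollout` and `rewrite_all` are hypothetical helper names, and the rollout fields are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def append_rollout(results_path: Path, rollout: dict) -> None:
    """Append one rollout as a single JSON line; earlier lines are never rewritten."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(rollout) + "\n")
        f.flush()


def rewrite_all(results_path: Path, rollouts: list[dict]) -> None:
    """The overwrite strategy, for contrast: every save rewrites the whole file."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("w", encoding="utf-8") as f:
        for rollout in rollouts:
            f.write(json.dumps(rollout) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"
    # Hypothetical rollout records; the real schema is defined by the library.
    for i in range(3):
        append_rollout(path, {"example_id": i, "reward": 1.0})
    saved = [json.loads(line) for line in path.read_text().splitlines()]
```

With appending, a crash during a save can at worst leave one partial line at the end of the file, whereas a crash during a full rewrite can lose rollouts that were already saved.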

Main changes:

Introduces a --resume (-R) flag on vf-eval which, by default, resumes the latest matching incomplete run, or resumes the run at a specified output directory
Deprecates --save-every, because incremental saving already saves after every rollout/group by default
Deprecates --use-tqdm as an eval arg. Users can still disable tqdm by passing null callback functions when they call generate directly

Example

Run an evaluation and save its results

uv run vf-eval gsm8k -n5 -r1 -s

If it finished properly, resuming the run with identical parameters will finish without generating any new rollouts

uv run vf-eval gsm8k -n5 -r1 -s -R environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/9a1d0326

If more rollouts are required to finish the eval (notice how we have -n10 now), the eval resumes the state of the previous run and only generates the remaining rollouts.

uv run vf-eval gsm8k -n10 -r1 -s --resume environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/9a1d0326

In practice, a run is most likely resumed after a crash, in which case resuming with the same n and r will also produce the missing rollouts.
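The three resume cases above (finished run, increased -n, crashed run) can be sketched as a simple bookkeeping step. This is an illustration under assumptions, not the PR's implementation: `remaining_rollouts` is a hypothetical function, and it assumes examples are identified by integer ids 0..n-1 and that each saved rollout records its example id.

```python
from collections import Counter


def remaining_rollouts(completed_example_ids, num_examples, rollouts_per_example):
    """Return how many rollouts are still missing for each example id.

    Assumes example ids are 0..num_examples-1 (an assumption for this sketch).
    """
    done = Counter(completed_example_ids)
    missing = {}
    for example_id in range(num_examples):
        need = rollouts_per_example - done.get(example_id, 0)
        if need > 0:
            missing[example_id] = need
    return missing


# A finished -n5 -r1 run resumed with the same settings: nothing left to do.
finished = remaining_rollouts([0, 1, 2, 3, 4], num_examples=5, rollouts_per_example=1)

# The same run resumed with -n10: only examples 5..9 still need a rollout.
extended = remaining_rollouts([0, 1, 2, 3, 4], num_examples=10, rollouts_per_example=1)

# A run that crashed after three rollouts, resumed with identical -n5 -r1.
crashed = remaining_rollouts([0, 3, 1], num_examples=5, rollouts_per_example=1)
```

Counting per example id rather than relying on file order means the sketch does not care in which order rollouts completed before the crash.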

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

Note

Medium Risk
Touches core evaluation execution/saving and changes callback/type signatures; bugs could lead to incorrect skipping/duplication of rollouts or corrupted saved results during resume.

Overview
Adds resumable evaluations with incremental checkpointing. prime eval now writes each completed rollout/group by appending to results.jsonl and updating metadata.json, and can restart from an existing run directory (skipping already-completed rollouts) after validating the saved metadata matches the current config.
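A rough sketch of what restarting from an existing run directory involves: read the rollouts already in results.jsonl, tolerate a partial final line left by a crash, and compare saved metadata with the current config. The function names and the metadata keys (`env_id`, `model`) are assumptions made for illustration; the actual helpers and the set of compared fields live in the PR's code.

```python
import json
import tempfile
from pathlib import Path


def load_completed(results_path: Path) -> list[dict]:
    """Read saved rollouts, dropping a malformed final line left by a crash."""
    lines = results_path.read_text(encoding="utf-8").splitlines()
    rollouts = []
    for index, line in enumerate(lines):
        if not line.strip():
            continue
        try:
            rollouts.append(json.loads(line))
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                break  # partial write at the tail: ignore it
            raise  # corruption in the middle is not safe to skip
    return rollouts


def config_mismatches(saved: dict, current: dict, keys: tuple[str, ...]) -> list[str]:
    """Return the keys whose saved value differs from the current config."""
    return [k for k in keys if saved.get(k) != current.get(k)]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"
    # Two complete lines followed by a truncated third line.
    path.write_text('{"example_id": 0}\n{"example_id": 1}\n{"example_id": 2, "rew')
    recovered = load_completed(path)

# Hypothetical metadata keys, chosen only for this example.
mismatches = config_mismatches(
    {"env_id": "gsm8k", "model": "gpt-4.1-mini"},
    {"env_id": "gsm8k", "model": "other-model"},
    keys=("env_id", "model"),
)
```

Only a malformed last line is skipped here; a malformed line in the middle of the file is treated as real corruption and raised, which is one possible policy rather than necessarily the one the PR adopts.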

Introduces a --resume [PATH] CLI flag (and TOML resume/legacy resume_path) that either resumes from an explicit results directory or auto-detects the newest incomplete matching run via new helpers in path_utils. Removes the --save-every and use_tqdm plumbing in favor of callback-driven progress reporting, updating callback signatures and the TUI to consume rolling aggregates (avg_reward, avg_metrics, new avg_error, usage). Documentation and tests are expanded to cover resume path validation, auto-detection, and recovery from malformed trailing JSONL lines.
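Auto-detecting the newest incomplete run could look roughly like the sketch below. It is not the `path_utils` code: `find_latest_incomplete_run` is a hypothetical name, "newest" is approximated by directory modification time, and the check that a run matches the current config is left out to keep the example short.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_incomplete_run(runs_dir: Path, expected_rollouts: int) -> Optional[Path]:
    """Pick the most recently modified run directory with fewer saved rollouts than expected."""
    candidates = []
    for run_dir in runs_dir.iterdir():
        results = run_dir / "results.jsonl"
        metadata = run_dir / "metadata.json"
        # Require both files, so half-created directories are ignored.
        if not (results.is_file() and metadata.is_file()):
            continue
        with results.open(encoding="utf-8") as f:
            saved = sum(1 for line in f if line.strip())
        if saved < expected_rollouts:
            candidates.append(run_dir)
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def make_run(root: Path, name: str, num_lines: int, mtime: float) -> None:
    """Create a fake run directory with a fixed modification time (test helper)."""
    run_dir = root / name
    run_dir.mkdir()
    (run_dir / "results.jsonl").write_text('{"reward": 1.0}\n' * num_lines)
    (run_dir / "metadata.json").write_text("{}")
    os.utime(run_dir, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_run(root, "aaaa", 5, 300)  # complete, so never a resume candidate
    make_run(root, "bbbb", 2, 100)  # incomplete, older
    make_run(root, "cccc", 3, 200)  # incomplete, newer
    latest = find_latest_incomplete_run(root, expected_rollouts=5)
    latest_name = latest.name if latest else None
```

Returning `None` when nothing qualifies leaves the caller free to start a fresh run instead of failing.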

Written by Cursor Bugbot for commit 723c4bd. This will update automatically on new commits.


willccbb · 2026-02-03T04:36:53Z

Approach looks nice + sensible, would wanna maybe add some test cases + confirm it works well with some dogfooding but otherwise LGTM when CI is green


willccbb

LGTM pending Konstantin pinging me about some load tests.

Saving every rollout feels a bit much given that we're rewriting the file and not just appending. If rollouts are finishing in rapid succession on med-large evals, this could cause some contention/bottlenecks potentially? How important is saving in sorted order?


CLAassistant · 2026-02-06T09:14:51Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ mikasenghaas
✅ hallerite
✅ willccbb
❌ cursoragent


…iteration The build_metadata() method was called twice per iteration in the as_completed loop—once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation. Co-authored-by: will brown <willccbb@users.noreply.github.com>


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


verifiers/scripts/eval.py

mikasenghaas changed the base branch from main to env-server January 29, 2026 15:17

mikasenghaas force-pushed the resume-evals branch 2 times, most recently from c72f002 to f7485d4 Compare January 29, 2026 16:04

mikasenghaas marked this pull request as ready for review January 29, 2026 17:08

mikasenghaas requested review from rasdani and willccbb January 29, 2026 17:17

cursor bot reviewed Jan 29, 2026

verifiers/utils/eval_utils.py

verifiers/envs/environment.py (outdated)

verifiers/utils/save_utils.py

verifiers/envs/environment.py (outdated)

cursor bot reviewed Jan 29, 2026

verifiers/utils/eval_utils.py

verifiers/envs/environment.py (outdated)

mikasenghaas marked this pull request as draft January 29, 2026 17:47

willccbb changed the base branch from env-server to main January 30, 2026 03:02

mikasenghaas added 8 commits February 2, 2026 16:53

attempt 1

ea8d2de

stateful load/save

394dc00

functional

8642aac

simpler

c5cebdb

remove old stuff

4b0ea24

less git diff

ef51ce6

fix

3999037

update toml config

e73288a

mikasenghaas force-pushed the resume-evals branch from 0aeb614 to e73288a Compare February 2, 2026 16:53

mikasenghaas added 4 commits February 2, 2026 18:22

refactor to use callbacks consistently

c6d50a5

correct usage of callbacks

2cf9e62

deprecate use_tqdm

a94a622

add docs

a854aba

mikasenghaas marked this pull request as ready for review February 2, 2026 19:38

cursor bot reviewed Feb 2, 2026

verifiers/utils/eval_utils.py (outdated)

verifiers/envs/environment.py

verifiers/envs/environment.py (outdated)

mikasenghaas added 4 commits February 2, 2026 19:53

fix group increments and progress init

c03d7e2

fix error rate by computing in metadata

9ffd82b

to not trigger assert

6b36e9e

remove hf ref

0f6ec75

cursor bot reviewed Feb 2, 2026

verifiers/envs/environment.py

verifiers/gepa/adapter.py

do not show tqdm in gepa

2b171c1

cursor bot reviewed Feb 2, 2026

verifiers/utils/save_utils.py

willccbb approved these changes Feb 3, 2026

rasdani approved these changes Feb 3, 2026

estsauver mentioned this pull request Feb 4, 2026

feat request: resume evaluation from an output dir #762

Closed

hallerite added 3 commits February 5, 2026 22:12

Merge remote-tracking branch 'origin/main' into resume-evals

8c34eb3

fix(eval): harden resume by tolerating partial JSONL tail and validating metadata

721fb30

fix style

a486abd

cursor bot reviewed Feb 5, 2026

verifiers/envs/environment.py (outdated)

verifiers/utils/save_utils.py

allow increased num_examples

15d2e21

cursor bot reviewed Feb 6, 2026

verifiers/envs/environment.py (outdated)

verifiers/utils/save_utils.py (outdated)

willccbb approved these changes Feb 6, 2026

Fix typo: 'evaluaton' -> 'evaluation' in resume log message

49fd285

Co-authored-by: will brown <willccbb@users.noreply.github.com>

cursoragent and others added 2 commits February 6, 2026 09:15

Remove unused self.logger from GenerateOutputsBuilder

800d891

The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging. Co-authored-by: will brown <willccbb@users.noreply.github.com>

cursor bot reviewed Feb 6, 2026

verifiers/utils/save_utils.py

willccbb added 2 commits February 6, 2026 05:25

Make eval --resume optional and auto-detect latest incomplete run (#842)

c588afd

* Add optional --resume auto-detection for eval runs
* Fix resume=false handling and dedupe output path resolution
* Harden eval results path validation to require files

mc

355b998

cursor bot reviewed Feb 6, 2026

verifiers/scripts/eval.py

willccbb and others added 6 commits February 6, 2026 05:35

Fix append handling corrupt outputs

78f31b6

Fix resume append corruption

59c02f1

Fix resume output appending

2d2737f

Fix resume append and typing errors

eb54360

set path create time directly

723c4bd

use -R shorthand for resume, -i for independent scoring

e6276bc

mikasenghaas merged commit 4b0545a into main Feb 6, 2026
6 checks passed


6 participants
