resume evals#803

Merged
mikasenghaas merged 32 commits into main from resume-evals
Feb 6, 2026

@mikasenghaas mikasenghaas commented Jan 29, 2026

Description

this pr implements incremental saving (i.e. we append new rollouts to a file instead of rewriting the whole file with all rollouts on every save) and resumable evals. the former avoids unnecessary i/o and makes resuming evals much safer because accidental data loss is less likely. the latter is useful for long-running evals and synthetic data generation runs, especially against flaky apis.
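To illustrate the append-only idea, here is a minimal sketch and not the code from this PR. The helper names `append_rollout` and `load_rollouts` and the record fields are hypothetical; only the JSONL-style append is taken from the description above.

```python
import json
from pathlib import Path


def append_rollout(results_path: Path, rollout: dict) -> None:
    """Append one rollout as a single JSON line (hypothetical helper)."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file, so a crash cannot
    # destroy rollouts that were written earlier.
    with results_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(rollout) + "\n")
        f.flush()


def load_rollouts(results_path: Path) -> list[dict]:
    """Read back every rollout written so far (empty list if no file yet)."""
    if not results_path.exists():
        return []
    with results_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each save costs one line of i/o regardless of how many rollouts were already stored, whereas rewriting the file grows with the size of the run.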

main changes:

  • introduces a --resume (-R) flag on vf-eval which by default resumes the latest matching, incomplete run, or resumes a run at a specified output directory
  • deprecates --save-every because we save for every rollout/group by default via incremental saving
  • deprecates --use-tqdm as an eval arg. users can still disable tqdm by passing null callback functions when they call generate directly
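One way the "latest incomplete run" lookup could work is sketched below. This is an assumption-laden illustration, not the helper added in this PR: the function name `find_latest_incomplete_run`, the use of file modification time for "latest", and the line-count test for "incomplete" are all hypothetical, and the config matching step is left out.

```python
from pathlib import Path
from typing import Optional


def find_latest_incomplete_run(evals_dir: Path, expected_rollouts: int) -> Optional[Path]:
    """Return the most recently written run directory that has fewer saved
    rollouts than expected, or None if there is no such run."""
    if not evals_dir.is_dir():
        return None
    candidates = []
    for run_dir in evals_dir.iterdir():
        results = run_dir / "results.jsonl"
        if not results.is_file():
            continue  # not a run directory, or nothing saved yet
        with results.open(encoding="utf-8") as f:
            saved = sum(1 for line in f if line.strip())
        if saved < expected_rollouts:
            candidates.append((results.stat().st_mtime, run_dir))
    if not candidates:
        return None
    # Assumption: "latest" means the newest modification time.
    return max(candidates, key=lambda c: c[0])[1]
```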

Example

Run an evaluation and save its results

uv run vf-eval gsm8k -n5 -r1 -s

If it finished properly, resuming the run with identical parameters will finish without generating any new rollouts

uv run vf-eval gsm8k -n5 -r1 -s -R environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/9a1d0326

If more rollouts are required to finish the eval (notice how we have -n10 now), the eval resumes the state of the previous run and only generates the remaining rollouts.

uv run vf-eval gsm8k -n10 -r1 -s --resume-path environments/gsm8k/outputs/evals/gsm8k--openai--gpt-4.1-mini/9a1d0326

In practice, a run is most likely resumed after a crash, in which case resuming with the same n and r also produces the missing rollouts.
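Both cases above reduce to the same computation: subtract what is already saved from what the current n and r ask for. The sketch below assumes rollouts can be identified by an (example index, rollout index) pair, which is a simplification and may not match how the PR tracks them; `remaining_rollouts` is a hypothetical name.

```python
def remaining_rollouts(
    num_examples: int,
    rollouts_per_example: int,
    completed: list[tuple[int, int]],
) -> list[tuple[int, int]]:
    """Return the (example_idx, rollout_idx) pairs that still need to run."""
    done = set(completed)
    return [
        (e, r)
        for e in range(num_examples)
        for r in range(rollouts_per_example)
        if (e, r) not in done
    ]
```

With five saved rollouts from `-n5 -r1`, calling it with `num_examples=5` yields nothing to do, while `num_examples=10` yields only the five new examples.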

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes


Note

Medium Risk
Touches core evaluation execution/saving and changes callback/type signatures; bugs could lead to incorrect skipping/duplication of rollouts or corrupted saved results during resume.

Overview
Adds resumable evaluations with incremental checkpointing. prime eval now writes each completed rollout/group by appending to results.jsonl and updating metadata.json, and can restart from an existing run directory (skipping already-completed rollouts) after validating the saved metadata matches the current config.
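The metadata check could look roughly like the sketch below. The function name `config_mismatches` and the field names in `RESUME_KEYS` are hypothetical; the PR does not state here which fields are compared. The number of examples is deliberately excluded, since the example above resumes a `-n5` run with `-n10`.

```python
import json
from pathlib import Path

# Assumed set of fields that must agree between the saved run and the
# current config; the real list may differ.
RESUME_KEYS = ("env_id", "model", "rollouts_per_example")


def config_mismatches(run_dir: Path, current: dict) -> list[str]:
    """Return the keys whose saved value differs from the current config."""
    saved = json.loads((run_dir / "metadata.json").read_text(encoding="utf-8"))
    return [k for k in RESUME_KEYS if saved.get(k) != current.get(k)]
```

An empty result means the run is safe to resume; anything else can be reported to the user before any rollout is skipped.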

Introduces a --resume [PATH] CLI flag (and TOML resume/legacy resume_path) that either resumes from an explicit results directory or auto-detects the newest incomplete matching run via new helpers in path_utils. Removes the --save-every and use_tqdm plumbing in favor of callback-driven progress reporting, updating callback signatures and the TUI to consume rolling aggregates (avg_reward, avg_metrics, new avg_error, usage). Documentation and tests are expanded to cover resume path validation, auto-detection, and recovery from malformed trailing JSONL lines.
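Recovery from a malformed trailing JSONL line can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `load_completed` is a hypothetical name, and the policy of tolerating a bad line only when it is the last one is a guess at what "recovery" means here.

```python
import json
from pathlib import Path


def load_completed(results_path: Path) -> list[dict]:
    """Load saved rollouts, ignoring a single truncated final line."""
    lines = results_path.read_text(encoding="utf-8").splitlines()
    rows = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if i == len(lines) - 1:
                # A crash mid-write leaves a partial last line; drop it so
                # that rollout is simply generated again.
                break
            raise  # corruption in the middle of the file is not expected
    return rows
```

A real implementation would probably also truncate the partial tail before appending again, otherwise the next record would be glued onto the broken line.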

Written by Cursor Bugbot for commit 723c4bd. This will update automatically on new commits.

@mikasenghaas mikasenghaas changed the base branch from main to env-server January 29, 2026 15:17
@mikasenghaas mikasenghaas force-pushed the resume-evals branch 2 times, most recently from c72f002 to f7485d4 Compare January 29, 2026 16:04
@mikasenghaas mikasenghaas marked this pull request as ready for review January 29, 2026 17:08
@mikasenghaas mikasenghaas marked this pull request as draft January 29, 2026 17:47
@willccbb willccbb changed the base branch from env-server to main January 30, 2026 03:02
@mikasenghaas mikasenghaas marked this pull request as ready for review February 2, 2026 19:38

willccbb commented Feb 3, 2026

Approach looks nice + sensible, would wanna maybe add some test cases + confirm it works well with some dogfooding but otherwise LGTM when CI is green

@willccbb willccbb left a comment

LGTM pending Konstantin pinging me about some load tests.

Saving every rollout feels a bit much given that we're rewriting the file and not just appending. If rollouts are finishing in rapid succession on med-large evals, this could cause some contention/bottlenecks potentially? How important is saving in sorted order?

Co-authored-by: will brown <willccbb@users.noreply.github.com>

CLAassistant commented Feb 6, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ mikasenghaas
✅ hallerite
✅ willccbb
❌ cursoragent

cursoragent and others added 2 commits February 6, 2026 09:15
The constructor created self.logger but it was never used in any method.
The module-level logger is used elsewhere in the file for all logging.

Co-authored-by: will brown <willccbb@users.noreply.github.com>
…iteration

The build_metadata() method was called twice per iteration in the
as_completed loop—once to pass to on_progress, and again to save.
Since build_metadata() computes averages over all accumulated outputs,
this duplication was wasteful. Now the metadata computed for on_progress
is reused for the save operation.

Co-authored-by: will brown <willccbb@users.noreply.github.com>
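The change described in the commit above (compute the metadata once per completed rollout and reuse it) can be sketched like this. The names `build_metadata` and `on_progress` come from the commit message, but their signatures, the `save` callback, and the `process_completed` wrapper are assumptions made for illustration.

```python
from typing import Callable


def build_metadata(outputs: list[dict]) -> dict:
    """Aggregate over all accumulated outputs (assumed shape)."""
    rewards = [o["reward"] for o in outputs]
    avg = sum(rewards) / len(rewards) if rewards else 0.0
    return {"num_rollouts": len(outputs), "avg_reward": avg}


def process_completed(
    outputs: list[dict],
    new_output: dict,
    on_progress: Callable[[dict], None],
    save: Callable[[dict, dict], None],
) -> dict:
    """Handle one finished rollout, building the metadata only once."""
    outputs.append(new_output)
    metadata = build_metadata(outputs)  # single pass over all outputs
    on_progress(metadata)               # progress reporting reuses it
    save(new_output, metadata)          # and so does the save step
    return metadata
```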
…842)

* Add optional --resume auto-detection for eval runs

* Fix resume=false handling and dedupe output path resolution

* Harden eval results path validation to require files

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@mikasenghaas mikasenghaas merged commit 4b0545a into main Feb 6, 2026
6 checks passed

6 participants