Conversation
c72f002 to
f7485d4
Compare
0aeb614 to
e73288a
Compare
|
Approach looks nice + sensible, would wanna maybe add some test cases + confirm it works well with some dogfooding but otherwise LGTM when CI is green |
willccbb
left a comment
There was a problem hiding this comment.
LGTM pending Konstantin pinging me about some load tests.
Saving every rollout feels a bit much given that we're rewriting the file and not just appending. If rollouts are finishing in rapid succession on med-large evals, this could cause some contention/bottlenecks potentially? How important is saving in sorted order?
Co-authored-by: will brown <willccbb@users.noreply.github.com>
|
|
The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging. Co-authored-by: will brown <willccbb@users.noreply.github.com>
…iteration The build_metadata() method was called twice per iteration in the as_completed loop—once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation. Co-authored-by: will brown <willccbb@users.noreply.github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Description
this pr implements incremental saving (i.e. we save new rollouts by appending to a file, instead of overwriting the whole file with all rollouts all the time) and resumable evals. the former is useful to save unnecessary i/o and makes resuming evals a whole lot safer because accidental data loss is less likely. the latter is useful for long-running evals and synthetic data gen runs, especially against flaky apis.
main changes:
--resume(-R) flag onvf-evalwhich by default resumes the latest matching, incompleted run or a run at a specified output directory--save-everybecause we save for every rollout/group by default via incremental saving--use-tqdmas an eval arg. users can still disable tqdm by passing null callback functions when they call generate directlyExample
Run an evaluation and save its results
If it finished properly, resuming the run with identical paramters will finish without generating any new rollouts
If more rollouts are required to finish the eval (notice how we have
-n10now), the eval resumes the state of the previous run and only generates the remaining rollouts.In practice, a run is likely resumed because of a crash in which case resuming with identical configuration of
nandrwill also produce the missing rollouts.Type of Change
Testing
uv run pytestlocally.Checklist
Additional Notes
Note
Medium Risk
Touches core evaluation execution/saving and changes callback/type signatures; bugs could lead to incorrect skipping/duplication of rollouts or corrupted saved results during resume.
Overview
Adds resumable evaluations with incremental checkpointing.
prime evalnow writes each completed rollout/group by appending toresults.jsonland updatingmetadata.json, and can restart from an existing run directory (skipping already-completed rollouts) after validating the saved metadata matches the current config.Introduces a
--resume [PATH]CLI flag (and TOMLresume/legacyresume_path) that either resumes from an explicit results directory or auto-detects the newest incomplete matching run via new helpers inpath_utils. Removes the--save-everyanduse_tqdmplumbing in favor of callback-driven progress reporting, updating callback signatures and the TUI to consume rolling aggregates (avg_reward,avg_metrics, newavg_error,usage). Documentation and tests are expanded to cover resume path validation, auto-detection, and recovery from malformed trailing JSONL lines.Written by Cursor Bugbot for commit 723c4bd. This will update automatically on new commits. Configure here.