* attempt 1
* stateful load/save
* functional
* simpler
* remove old stuff
* less git diff
* fix
* update toml config
* refactor to use callbacks consistently
* correct usage of callbacks
* deprecate use_tqdm
* add docs
* fix group increments and progress init
* fix error rate by computing in metadata
* to not trigger assert
* remove hf ref
* do not show tqdm in gepa
* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata
* fix style
* allow increased num_examples
* Fix typo: 'evaluaton' -> 'evaluation' in resume log message
Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Remove unused self.logger from GenerateOutputsBuilder
The constructor created self.logger but it was never used in any method.
The module-level logger is used elsewhere in the file for all logging.
Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Reuse metadata from build_metadata() instead of calling it twice per iteration
The build_metadata() method was called twice per iteration in the
as_completed loop—once to pass to on_progress, and again to save.
Since build_metadata() computes averages over all accumulated outputs,
this duplication was wasteful. Now the metadata computed for on_progress
is reused for the save operation.
Co-authored-by: will brown <willccbb@users.noreply.github.com>
* Make eval `--resume` optional and auto-detect latest incomplete run (#842)
* Add optional --resume auto-detection for eval runs
* Fix resume=false handling and dedupe output path resolution
* Harden eval results path validation to require files
* Fix append handling corrupt outputs
* Fix resume append corruption
* Fix resume output appending
* Fix resume append and typing errors
* set path create time directly
* use -R shorthand for resume, -i for independent scoring
---------
Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>
|`--independent-scoring`|`-i`| false | Score each rollout individually instead of by group |
|`--max-retries`| — | 0 | Retries per rollout on transient `InfraError` |

By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
|`--tui`|`-u`| false | Use alternate screen mode (TUI) for display |
|`--debug`|`-d`| false | Disable Rich display; use normal logging and tqdm progress |
|`--save-results`|`-s`| false | Save results to disk |
|`--resume [PATH]`|`-R`| — | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
|`--state-columns`|`-C`| — | Extra state columns to save (comma-separated) |
|`--save-to-hf-hub`|`-H`| false | Push results to Hugging Face Hub |
|`--hf-hub-dataset-name`|`-D`| — | Dataset name for HF Hub |

Results are saved to `./outputs/evals/{env_id}--{model}/{run_id}/`, containing:
- `results.jsonl` — rollout outputs, one per line
- `metadata.json` — evaluation configuration and aggregate metrics

### Resuming Evaluations

Long-running evaluations can be interrupted and resumed using checkpointing. When `--save-results` is enabled, results are saved incrementally after each completed group of rollouts. Use `--resume` to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.

**Running with checkpoints:**

```bash
prime eval run my-env -n 1000 -s
```

With `-s` (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.
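Because an interruption can kill the process mid-write, the last line of `results.jsonl` may be truncated. A resume loader therefore needs to tolerate a partial JSONL tail. The following is a minimal illustrative sketch of that idea, not the actual verifiers implementation (the function name `load_jsonl_tolerant` is hypothetical):

```python
import json

def load_jsonl_tolerant(path: str) -> list[dict]:
    """Load checkpointed rollouts, ignoring a truncated final line
    left behind by an interrupted write."""
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial tail: stop here; resume will re-run the rest
    return rows
```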
**Resuming from a checkpoint:**

```bash
prime eval run my-env -n 1000 -s --resume ./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/abc12345
```

When a resume path is provided, it must point to a valid evaluation results directory containing both `results.jsonl` and `metadata.json`. With `--resume` and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching `env_id`, `model`, and `rollouts_per_example` where saved `num_examples` is less than or equal to the current run. When resuming:

1. Existing completed rollouts are loaded from the checkpoint
2. Remaining rollouts are computed based on the example ids and group size
3. Only incomplete rollouts are executed
4. New results are appended to the existing checkpoint
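Steps 2 and 3 amount to a set difference over (example id, rollout index) pairs. A simplified sketch of that computation (an illustrative reconstruction, not the actual verifiers code):

```python
def remaining_rollouts(
    completed: dict[int, int],  # example_id -> rollouts already saved
    num_examples: int,
    rollouts_per_example: int,
) -> list[tuple[int, int]]:
    """Return the (example_id, rollout_index) pairs still to be run."""
    todo = []
    for example_id in range(num_examples):
        done = completed.get(example_id, 0)
        # only the indices beyond what the checkpoint holds are executed
        for idx in range(done, rollouts_per_example):
            todo.append((example_id, idx))
    return todo
```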
If all rollouts are already complete, the evaluation returns immediately with the existing results.
**Configuration compatibility:**

When resuming, the current run configuration should match the original run. Mismatches in parameters like `--model`, `--env-args`, or `--rollouts-per-example` can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing `--num-examples` if you need additional rollouts beyond the original target.

**Example workflow:**

```bash
# Start a large evaluation with checkpointing
prime eval run math-python -n 500 -r 3 -s

# If interrupted, find the run directory
ls ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/