Commit 4b0545a

mikasenghaas, hallerite, cursoragent, and willccbb authored
resume evals (#803)
* attempt 1
* stateful load/save
* functional
* simpler
* remove old stuff
* less git diff
* fix
* update toml config
* refactor to use callbacks consistently
* correct usage of callbacks
* deprecate use_tqdm
* add docs
* fix group increments and progress init
* fix error rate by computing in metadata
* to not trigger assert
* remove hf ref
* do not show tqdm in gepa
* fix(eval): harden resume by tolerating partial JSONL tail and validating metadata
* fix style
* allow increased num_examples
* Fix typo: 'evaluaton' -> 'evaluation' in resume log message

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Remove unused self.logger from GenerateOutputsBuilder

  The constructor created self.logger but it was never used in any method. The module-level logger is used elsewhere in the file for all logging.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Reuse metadata from build_metadata() instead of calling it twice per iteration

  The build_metadata() method was called twice per iteration in the as_completed loop: once to pass to on_progress, and again to save. Since build_metadata() computes averages over all accumulated outputs, this duplication was wasteful. Now the metadata computed for on_progress is reused for the save operation.

  Co-authored-by: will brown <willccbb@users.noreply.github.com>

* Make eval `--resume` optional and auto-detect latest incomplete run (#842)

  * Add optional --resume auto-detection for eval runs
  * Fix resume=false handling and dedupe output path resolution
  * Harden eval results path validation to require files
  * Fix append handling corrupt outputs
  * Fix resume append corruption
  * Fix resume output appending
  * Fix resume append and typing errors
  * set path create time directly
  * use -R shorthand for resume, -i for independent scoring

---------

Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: will brown <willccbb@users.noreply.github.com>
Co-authored-by: will brown <williambrown97@gmail.com>
1 parent 3270821 commit 4b0545a

14 files changed

Lines changed: 1021 additions & 275 deletions

docs/evaluation.md

Lines changed: 52 additions & 2 deletions
@@ -11,6 +11,7 @@ This section explains how to run evaluations with Verifiers environments. See [E
 - [Evaluation Scope](#evaluation-scope)
 - [Concurrency](#concurrency)
 - [Output and Saving](#output-and-saving)
+- [Resuming Evaluations](#resuming-evaluations)
 - [Environment Defaults](#environment-defaults)
 - [Multi-Environment Evaluation](#multi-environment-evaluation)
 - [TOML Configuration](#toml-configuration)
@@ -124,6 +125,7 @@ Multiple rollouts per example enable metrics like pass@k and help measure varian
 | `--max-concurrent-generation` || same as `-c` | Concurrent generation requests |
 | `--max-concurrent-scoring` || same as `-c` | Concurrent scoring requests |
 | `--no-interleave-scoring` | `-N` | false | Disable interleaved scoring |
+| `--independent-scoring` | `-i` | false | Score each rollout individually instead of by group |
 | `--max-retries` || 0 | Retries per rollout on transient `InfraError` |
 
 By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
@@ -138,12 +140,60 @@ The `--max-retries` flag enables automatic retry with exponential backoff when r
 | `--tui` | `-u` | false | Use alternate screen mode (TUI) for display |
 | `--debug` | `-d` | false | Disable Rich display; use normal logging and tqdm progress |
 | `--save-results` | `-s` | false | Save results to disk |
-| `--save-every` | `-f` | -1 | Save checkpoint every N rollouts |
+| `--resume [PATH]` | `-R` | | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
 | `--state-columns` | `-C` || Extra state columns to save (comma-separated) |
 | `--save-to-hf-hub` | `-H` | false | Push results to Hugging Face Hub |
 | `--hf-hub-dataset-name` | `-D` || Dataset name for HF Hub |
 
-Results are saved to `./outputs/evals/{env_id}--{model}/` as a Hugging Face dataset.
+Results are saved to `./outputs/evals/{env_id}--{model}/{run_id}/`, containing:
+
+- `results.jsonl` — rollout outputs, one per line
+- `metadata.json` — evaluation configuration and aggregate metrics
+
+### Resuming Evaluations
+
+Long-running evaluations can be interrupted and resumed using checkpointing. When `--save-results` is enabled, results are saved incrementally after each completed group of rollouts. Use `--resume` to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.
+
+**Running with checkpoints:**
+
+```bash
+prime eval run my-env -n 1000 -s
+```
+
+With `-s` (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.
+
+**Resuming from a checkpoint:**
+
+```bash
+prime eval run my-env -n 1000 -s --resume ./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/abc12345
+```
+
+When a resume path is provided, it must point to a valid evaluation results directory containing both `results.jsonl` and `metadata.json`. With `--resume` and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching `env_id`, `model`, and `rollouts_per_example` where saved `num_examples` is less than or equal to the current run. When resuming:
+
+1. Existing completed rollouts are loaded from the checkpoint
+2. Remaining rollouts are computed based on the example ids and group size
+3. Only incomplete rollouts are executed
+4. New results are appended to the existing checkpoint
+
+If all rollouts are already complete, the evaluation returns immediately with the existing results.
+
+**Configuration compatibility:**
+
+When resuming, the current run configuration should match the original run. Mismatches in parameters like `--model`, `--env-args`, or `--rollouts-per-example` can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing `--num-examples` if you need additional rollouts beyond the original target.
+
+**Example workflow:**
+
+```bash
+# Start a large evaluation with checkpointing
+prime eval run math-python -n 500 -r 3 -s
+
+# If interrupted, find the run directory
+ls ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/
+
+# Resume from the checkpoint
+prime eval run math-python -n 500 -r 3 -s \
+  --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
+```
 
 The `--state-columns` flag allows saving environment-specific state fields that your environment stores during rollouts:
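The resume flow documented above can be sanity-checked by hand. Below is a minimal sketch that inspects a checkpoint directory, assuming only the `results.jsonl`/`metadata.json` layout and metadata fields described in these docs; the path reuses the placeholder run id from the example workflow:

```python
import json
from pathlib import Path

run_dir = Path(
    "./environments/math_python/outputs/evals/"
    "math-python--openai--gpt-4.1-mini/abc12345"  # placeholder run id
)

meta = json.loads((run_dir / "metadata.json").read_text(encoding="utf-8"))
target = meta["num_examples"] * meta["rollouts_per_example"]

# One rollout per line; the hardened resume logic tolerates a partially
# written last line, so skip anything that fails to parse.
done = 0
for line in (run_dir / "results.jsonl").read_text(encoding="utf-8").splitlines():
    try:
        json.loads(line)
        done += 1
    except json.JSONDecodeError:
        continue

print(f"{done}/{target} rollouts completed; resumable: {done < target}")
```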

docs/reference.md

Lines changed: 1 addition & 2 deletions
@@ -598,11 +598,10 @@ class EvalConfig(BaseModel):
     independent_scoring: bool = False
     extra_env_kwargs: dict = {}
     max_retries: int = 0
-    print_results: bool = False
     verbose: bool = False
     state_columns: list[str] | None = None
     save_results: bool = False
-    save_every: int = -1
+    resume_path: Path | None = None
     save_to_hf_hub: bool = False
     hf_hub_dataset_name: str | None = None
 ```
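For orientation, `resume_path` replaces the removed `save_every` interval: checkpoints are now written after every completed group, so the only remaining knob is where to resume from. A hedged sketch of setting the field programmatically; the import location and the `env_id`/`model` constructor arguments are assumptions, not taken from this diff:

```python
from pathlib import Path

from verifiers.types import EvalConfig  # assumed import location

config = EvalConfig(
    env_id="math-python",  # assumed field, not shown in this hunk
    model="openai/gpt-4.1-mini",  # assumed field, not shown in this hunk
    save_results=True,  # checkpoints only exist on disk when saving is enabled
    resume_path=Path(
        "./environments/math_python/outputs/evals/"
        "math-python--openai--gpt-4.1-mini/abc12345"  # placeholder run id
    ),
)
```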

tests/test_environment_extra.py

Lines changed: 32 additions & 0 deletions
@@ -12,6 +12,7 @@
 from __future__ import annotations
 
 import asyncio
+import json
 from typing import Callable
 
 import pytest
@@ -222,3 +223,34 @@ def test_make_dataset_basic_without_tools(make_metadata, make_output):
     results = GenerateOutputs(outputs=[make_output()], metadata=make_metadata())
     ds = build_dataset(results)
     assert len(ds) == 1 and "foo" in ds.column_names
+
+
+@pytest.mark.asyncio
+async def test_generate_resume_raises_on_metadata_mismatch(
+    tmp_path, mock_openai_client, make_dummy_env, make_input
+):
+    env = make_dummy_env(mock_openai_client)
+
+    results_path = tmp_path / "resume"
+    results_path.mkdir()
+    (results_path / "results.jsonl").write_text("", encoding="utf-8")
+    (results_path / "metadata.json").write_text(
+        json.dumps(
+            {
+                "env_id": env.env_id,
+                "model": "test-model",
+                "num_examples": 2,
+                "rollouts_per_example": 1,
+            }
+        ),
+        encoding="utf-8",
+    )
+
+    inputs = [make_input(example_id=0)]
+    with pytest.raises(ValueError, match="metadata mismatch"):
+        await env.generate(
+            inputs=inputs,
+            client=mock_openai_client,
+            model="test-model",
+            results_path=results_path,
+        )
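The test pins down the failure mode ("metadata mismatch") without showing the check itself. A sketch of the kind of validation it implies, using only the metadata fields the test writes; the helper name and exact comparison rules are assumptions, not code from this commit:

```python
import json
from pathlib import Path


def validate_resume_metadata(
    results_path: Path,
    env_id: str,
    model: str,
    num_examples: int,
    rollouts_per_example: int,
) -> dict:
    """Hypothetical check: refuse to resume when the saved run's identity
    fields disagree with the current run."""
    saved = json.loads(
        (results_path / "metadata.json").read_text(encoding="utf-8")
    )
    expected = {
        "env_id": env_id,
        "model": model,
        "rollouts_per_example": rollouts_per_example,
    }
    mismatched = [k for k, v in expected.items() if saved.get(k) != v]
    # Growing --num-examples is documented as allowed, so only flag a
    # current target smaller than the saved one.
    if saved.get("num_examples", 0) > num_examples:
        mismatched.append("num_examples")
    if mismatched:
        raise ValueError(f"metadata mismatch: {', '.join(mismatched)}")
    return saved
```

In the test above, the saved target (`num_examples: 2`) exceeds the single input provided, which is one way such a check would trip.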

tests/test_eval_cli.py

Lines changed: 94 additions & 0 deletions
@@ -1,5 +1,7 @@
 import argparse
+import os
 import tempfile
+import time
 from pathlib import Path
 from types import SimpleNamespace
 
@@ -42,6 +44,7 @@ def _run_cli(monkeypatch, overrides, capture_all_configs: bool = False):
         "no_interleave_scoring": False,
         "state_columns": [],
         "save_results": False,
+        "resume": None,
         "save_every": -1,
         "save_to_hf_hub": False,
         "hf_hub_dataset_name": "",
@@ -459,3 +462,94 @@ def test_load_toml_config_invalid_global_field():
         f.flush()
     with pytest.raises(ValueError):
         load_toml_config(Path(f.name))
+
+
+def test_cli_resume_explicit_path(monkeypatch, run_cli, tmp_path: Path):
+    """--resume with explicit path sets resume_path."""
+    resume_dir = tmp_path / "resume"
+    resume_dir.mkdir(parents=True)
+    (resume_dir / "results.jsonl").write_text("", encoding="utf-8")
+    (resume_dir / "metadata.json").write_text("{}", encoding="utf-8")
+
+    captured = run_cli(
+        monkeypatch,
+        {
+            "resume": str(resume_dir),
+        },
+    )
+
+    assert captured["configs"][0].resume_path == resume_dir
+
+
+def test_cli_resume_auto_detects_latest_incomplete(
+    monkeypatch, run_cli, tmp_path: Path
+):
+    """--resume with no path auto-detects latest matching incomplete run."""
+    env_id = "dummy-env"
+    model = "gpt-4.1-mini"
+    run_base = tmp_path / "outputs" / "evals" / f"{env_id}--{model.replace('/', '--')}"
+    old_run = run_base / "oldrun"
+    new_run = run_base / "newrun"
+    old_run.mkdir(parents=True)
+    new_run.mkdir(parents=True)
+
+    metadata = (
+        '{"env_id":"dummy-env","model":"gpt-4.1-mini",'
+        '"num_examples":4,"rollouts_per_example":1}'
+    )
+    (old_run / "metadata.json").write_text(metadata, encoding="utf-8")
+    (new_run / "metadata.json").write_text(metadata, encoding="utf-8")
+
+    (old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
+    (new_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
+    )
+    now = time.time()
+    os.utime(old_run, (now, now))
+    os.utime(new_run, (now + 1, now + 1))
+
+    monkeypatch.chdir(tmp_path)
+    captured = run_cli(
+        monkeypatch,
+        {
+            "resume": True,
+            "num_examples": 4,
+            "rollouts_per_example": 1,
+            "env_dir_path": str(tmp_path / "environments"),
+        },
+    )
+
+    assert captured["configs"][0].resume_path is not None
+    assert captured["configs"][0].resume_path.resolve() == new_run.resolve()
+
+
+def test_cli_toml_resume_false_disables_global_resume(monkeypatch, run_cli):
+    """Per-eval resume=false overrides global resume=true in TOML configs."""
+    with tempfile.NamedTemporaryFile(suffix=".toml", delete=False, mode="w") as f:
+        f.write(
+            "resume = true\n"
+            "\n"
+            "[[eval]]\n"
+            'env_id = "env-a"\n'
+            "\n"
+            "[[eval]]\n"
+            'env_id = "env-b"\n'
+            "resume = false\n"
+        )
+        f.flush()
+    captured = run_cli(
+        monkeypatch,
+        {
+            "env_id_or_config": f.name,
+            "num_examples": 1,
+            "rollouts_per_example": 1,
+            "env_dir_path": "./environments",
+        },
+    )
+
+    configs = captured["configs"]
+    assert len(configs) == 2
+    assert configs[0].env_id == "env-a"
+    assert configs[0].resume_path is None
+    assert configs[1].env_id == "env-b"
+    assert configs[1].resume_path is None
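Taken together, the three tests constrain how the `resume` argument must resolve to a `resume_path`. A sketch of that resolution under those constraints, built on the real `verifiers.utils.path_utils` helpers exercised in the next file; the wrapper itself is illustrative, not the CLI's actual code:

```python
from pathlib import Path

from verifiers.utils.path_utils import (
    find_latest_incomplete_eval_results_path,
    is_valid_eval_results_path,
)


def resolve_resume_path(
    resume: bool | str | None,
    env_id: str,
    model: str,
    num_examples: int,
    rollouts_per_example: int,
    env_dir_path: str,
) -> Path | None:
    if resume in (None, False):  # flag absent, or per-eval resume = false
        return None
    if isinstance(resume, str):  # explicit path: must be a valid run dir
        path = Path(resume)
        if not is_valid_eval_results_path(path):
            raise ValueError(f"not a valid eval results directory: {path}")
        return path
    # resume = True: auto-detect the latest matching incomplete run,
    # which may legitimately be None when nothing matches.
    return find_latest_incomplete_eval_results_path(
        env_id=env_id,
        model=model,
        num_examples=num_examples,
        rollouts_per_example=rollouts_per_example,
        env_dir_path=env_dir_path,
    )
```

The `None` branch explains the TOML test: both evals end with `resume_path is None`, env-b because `resume = false` disables resume outright, and env-a because auto-detection finds no matching run.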

tests/test_path_utils.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+import os
+from pathlib import Path
+
+from verifiers.utils.path_utils import (
+    find_latest_incomplete_eval_results_path,
+    is_valid_eval_results_path,
+)
+
+
+def test_find_latest_incomplete_eval_results_path_picks_newest_matching(
+    tmp_path: Path, monkeypatch
+):
+    env_id = "dummy-env"
+    model = "openai/gpt-4.1-mini"
+    runs_dir = (
+        tmp_path
+        / "outputs"
+        / "evals"
+        / f"{env_id}--{model.replace('/', '--')}"
+    )
+
+    old_run = runs_dir / "11111111"
+    new_run = runs_dir / "22222222"
+    complete_run = runs_dir / "33333333"
+    for run in [old_run, new_run, complete_run]:
+        run.mkdir(parents=True)
+
+    metadata = (
+        '{"env_id":"dummy-env","model":"openai/gpt-4.1-mini",'
+        '"num_examples":4,"rollouts_per_example":1}'
+    )
+    for run in [old_run, new_run, complete_run]:
+        (run / "metadata.json").write_text(metadata, encoding="utf-8")
+
+    (old_run / "results.jsonl").write_text('{"example_id":0}\n', encoding="utf-8")
+    (new_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n', encoding="utf-8"
+    )
+    (complete_run / "results.jsonl").write_text(
+        '{"example_id":0}\n{"example_id":1}\n{"example_id":2}\n{"example_id":3}\n',
+        encoding="utf-8",
+    )
+
+    os.utime(old_run, (1, 1))
+    os.utime(new_run, (2, 2))
+    os.utime(complete_run, (3, 3))
+
+    monkeypatch.chdir(tmp_path)
+
+    result = find_latest_incomplete_eval_results_path(
+        env_id=env_id,
+        model=model,
+        num_examples=4,
+        rollouts_per_example=1,
+        env_dir_path=str(tmp_path / "environments"),
+    )
+
+    assert result is not None
+    assert result.resolve() == new_run.resolve()
+
+
+def test_find_latest_incomplete_eval_results_path_returns_none_when_no_match(
+    tmp_path: Path, monkeypatch
+):
+    monkeypatch.chdir(tmp_path)
+
+    result = find_latest_incomplete_eval_results_path(
+        env_id="dummy-env",
+        model="openai/gpt-4.1-mini",
+        num_examples=4,
+        rollouts_per_example=1,
+        env_dir_path=str(tmp_path / "environments"),
+    )
+    assert result is None
+
+
+def test_is_valid_eval_results_path_requires_files(tmp_path: Path):
+    run_dir = tmp_path / "run"
+    run_dir.mkdir()
+
+    (run_dir / "results.jsonl").mkdir()
+    (run_dir / "metadata.json").mkdir()
+
+    assert not is_valid_eval_results_path(run_dir)
+
+
+def test_is_valid_eval_results_path_accepts_expected_layout(tmp_path: Path):
+    run_dir = tmp_path / "run"
+    run_dir.mkdir()
+
+    (run_dir / "results.jsonl").write_text("", encoding="utf-8")
+    (run_dir / "metadata.json").write_text("{}", encoding="utf-8")
+
+    assert is_valid_eval_results_path(run_dir)
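These tests define the auto-detection contract: scan the run directory for the current env/model pair, keep only valid, metadata-matching, incomplete runs, and return the newest by mtime (or `None`). A sketch of that contract, ignoring the `env_dir_path` resolution the real helper also receives; this illustrates the tested behavior and is not the actual implementation in `verifiers.utils.path_utils`:

```python
import json
from pathlib import Path


def sketch_find_latest_incomplete(
    env_id: str, model: str, num_examples: int, rollouts_per_example: int
) -> Path | None:
    runs_dir = Path("outputs") / "evals" / f"{env_id}--{model.replace('/', '--')}"
    candidates = []
    for run in (runs_dir.iterdir() if runs_dir.is_dir() else []):
        metadata_file = run / "metadata.json"
        results_file = run / "results.jsonl"
        # Mirrors is_valid_eval_results_path: both entries must be files.
        if not (metadata_file.is_file() and results_file.is_file()):
            continue
        meta = json.loads(metadata_file.read_text(encoding="utf-8"))
        if meta.get("env_id") != env_id or meta.get("model") != model:
            continue
        if meta.get("rollouts_per_example") != rollouts_per_example:
            continue
        if meta.get("num_examples", 0) > num_examples:
            continue  # the saved target may only be <= the current run's
        lines = results_file.read_text(encoding="utf-8").splitlines()
        done = sum(1 for line in lines if line.strip())
        if done >= meta.get("num_examples", 0) * rollouts_per_example:
            continue  # complete runs are never resumed
        candidates.append(run)
    # Newest by directory mtime, matching the os.utime ordering above.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```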
