ds4-eval: fix three answer-grader false negatives + golden self-tests by rinaldofesta · Pull Request #319 · antirez/ds4

rinaldofesta · 2026-06-01T14:37:14Z

The ds4-eval answer grader has three reachable bugs that make it under-report model accuracy — it marks correct answers wrong — on the embedded eval set. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). Split into two commits so the unambiguous fixes are separate from the one with a design trade-off.

Commit 1 — integer + regrade (unambiguous)

find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor:

"Answer: m+n = 256+37 = 293"     ->  256   (should be 293)
"Answer: a+b+c = 12+25+25 = 62"  ->  12    (should be 62)
"Answer: 2*7 + 3*6 + ... = 81"   ->  2     (should be 81)

Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.
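For illustration only, here is a minimal sketch of the rule described above. It is not the code in ds4_eval.c: the helper name `sketch_integer_from_answer_line` and its signature are hypothetical, it assumes the caller has already located the answer marker and passes the text that follows it, and it assumes non-negative answers. It does not model the loose fallback.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the ds4_eval.c implementation): extract the
 * graded integer from the text that follows an answer marker such as
 * "Answer:". Returns 1 and fills `out` on success, 0 if no integer is
 * found on the answer line. */
static int sketch_integer_from_answer_line(const char *after_marker,
                                           char *out, size_t outlen) {
    assert(after_marker != NULL && out != NULL && outlen >= 2);

    /* Bind the scan to the answer line: stop at the first line break. */
    const char *p = after_marker;
    const char *end = after_marker + strcspn(after_marker, "\r\n");

    /* If the line shows arithmetic, start reading after the last '='. */
    for (const char *q = after_marker; q < end; q++)
        if (*q == '=') p = q + 1;

    /* Advance to the first digit in the remaining window. */
    while (p < end && !isdigit((unsigned char)*p)) p++;
    if (p == end) return 0;

    /* Drop leading zeros ("082" -> "82") but keep a lone "0". */
    while (p + 1 < end && *p == '0' && isdigit((unsigned char)p[1])) p++;

    /* Copy the digit run, truncating if the buffer is too small. */
    size_t n = 0;
    while (p < end && isdigit((unsigned char)*p) && n + 1 < outlen)
        out[n++] = *p++;
    out[n] = '\0';
    return 1;
}
```

With this sketch, " m+n = 256+37 = 293" yields "293" and " 082" yields "82"; a line with no '=' falls back to the first integer on that line.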

regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.
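The status filter can be sketched roughly as follows. This is an assumption-laden illustration, not the body of regrade_case_outcome(): the struct, both function names, and the use of NULL or an empty string for a legacy status-less trace are all hypothetical, and the changed counter is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tally, mirroring the counters named in the description. */
struct sketch_counts {
    int passed;
    int failed;
    int not_graded;
};

/* Only completed cases are graded. NULL or "" stands in for a legacy
 * trace that carries no status (an assumption of this sketch). */
static int sketch_status_is_gradable(const char *status) {
    if (status == NULL || status[0] == '\0') return 1;
    return strcmp(status, "PASSED") == 0 || strcmp(status, "FAILED") == 0;
}

/* `answer_correct` is the result of re-running the extractor on the
 * stored MODEL_OUTPUT; it is ignored for interrupted cases. */
static void sketch_tally(struct sketch_counts *c, const char *status,
                         int answer_correct) {
    assert(c != NULL);
    if (!sketch_status_is_gradable(status)) {
        c->not_graded++;   /* STOPPED / SKIPPED / SWITCHED / ERROR */
        return;
    }
    if (answer_correct) c->passed++;
    else c->failed++;
}
```

Feeding one PASSED case and two STOPPED cases through `sketch_tally` gives passed=1, failed=0, not_graded=2, regardless of what the partial output of the interrupted cases happens to contain.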

Commit 2 — multiple choice (with a known limitation)

find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A–J) cases a leading English pronoun/article became the answer:

"Answer: I think it is C"          ->  I   (should be C)
"Answer: I'll go with C."          ->  I   (should be C)
"Answer: A careful reading ... C"  ->  A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.
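One way to read the forward/reverse rule is sketched below. It is an interpretation, not the code in ds4_eval.c: every `sketch_` name is hypothetical, the scan is simplified to a single line after the marker, "begins a word" is approximated as "followed by whitespace and a lowercase letter", and only the ASCII apostrophe is treated as a contraction.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A capital is boundary-isolated when no letter or digit touches it. */
static int sketch_is_isolated(const char *s, const char *p, const char *end) {
    int left_ok = (p == s) || !isalnum((unsigned char)p[-1]);
    int right_ok = (p + 1 == end) || !isalnum((unsigned char)p[1]);
    return left_ok && right_ok;
}

/* "I'll", "I'm", ...: capital, apostrophe, letter. */
static int sketch_is_contraction(const char *p, const char *end) {
    return p + 2 < end && p[1] == '\'' && isalpha((unsigned char)p[2]);
}

/* "I think", "A careful": capital, blanks, then a lowercase word. */
static int sketch_starts_word(const char *p, const char *end) {
    const char *q = p + 1;
    if (q >= end || (*q != ' ' && *q != '\t')) return 0;
    while (q < end && (*q == ' ' || *q == '\t')) q++;
    return q < end && islower((unsigned char)*q);
}

/* Returns the chosen letter in 'A'..max_letter, or 0 if none is found. */
static char sketch_answer_letter(const char *after_marker, char max_letter) {
    assert(after_marker != NULL && max_letter >= 'A' && max_letter <= 'Z');
    const char *s = after_marker;
    const char *end = s + strcspn(s, "\r\n");
    const char *p;

    /* Forward scan: first isolated capital that does not read as prose. */
    for (p = s; p < end; p++) {
        if (*p < 'A' || *p > max_letter) continue;
        if (!sketch_is_isolated(s, p, end)) continue;
        if (sketch_is_contraction(p, end) || sketch_starts_word(p, end)) continue;
        return *p;
    }
    /* Reverse scan: recover a skipped capital if nothing else matched. */
    for (p = end; p > s; ) {
        p--;
        if (*p < 'A' || *p > max_letter) continue;
        if (sketch_is_isolated(s, p, end) && !sketch_is_contraction(p, end))
            return *p;
    }
    return 0;
}
```

Under these assumptions, " I think it is C" and " I'll go with C." both yield C, " A is correct" yields A through the reverse scan, and " I." yields I. A line such as " rules out C, leaving D" still yields C, because nothing in this rule distinguishes a rejected option from the chosen one.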

Known limitation (kept in a separate commit on purpose): a distractor explicitly rejected before the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing. Happy to drop or rework this commit if you'd rather handle it differently.

Testing

make                              # Metal: ds4_eval.c compiles clean with -Wall -Wextra
make cpu                          # builds; only pre-existing ds4.c warnings, none from this change
./ds4-eval --self-test-extractors # passes (extractor + regrade cases)

The new self-tests fail on main and pass with the fixes (red → green). A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash).

Two model-free correctness fixes in the answer grader, locked by new --self-test-extractors cases (no model/GPU needed):

- find_integer_answer: the answer-marker path took the first integer in the window, so an answer line that shows its arithmetic ("Answer: m+n = 256+37 = 293") was graded as the first summand (256) instead of the stated total (293) -> false negative. Reachable on the many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound to the answer line and, when it shows arithmetic, reads the value after the last '='. "Final answer: 082" -> "82" and the loose fallback are preserved.

- regrade_trace_file: every case with a MODEL_OUTPUT block was graded, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious "changed" drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; others are reported via a new not_graded counter.

Tested: make (Metal) and make cpu build clean (-Wall -Wextra); ./ds4-eval --self-test-extractors passes; regrade of a 3-case trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 (was passed=2 failed=1 changed=1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

find_answer_letter returned the first boundary-isolated in-range capital after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or article was graded as the choice:

"Answer: I think it is C"          ->  I   (should be C)
"Answer: I'll go with C."          ->  I   (should be C)
"Answer: A careful reading ... C"  ->  A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.

Known limitation: a distractor explicitly rejected *before* the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing and is left for discussion.

Locked by new --self-test-extractors cases (model-free). All prior self-tests and the integer/regrade cases continue to pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rinaldofesta · 2026-06-01T14:43:19Z

The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion: #321 — the extractor still grabs a distractor stated before the chosen letter ("Answer: rules out C, leaving D" → C). Kept out of this PR because every naive fix trades one failure mode for another; happy to fold a resolution in here if you'd prefer.

rinaldofesta and others added 2 commits June 1, 2026 16:34

rinaldofesta mentioned this pull request Jun 1, 2026

ds4-eval: answer-letter extractor mis-grades a distractor stated before the chosen option (follow-up to #319) #321

Open

rinaldofesta wants to merge 2 commits into antirez:main from rinaldofesta:fix/eval-grader-false-negatives
