ds4-eval: fix three answer-grader false negatives + golden self-tests #319

Open
rinaldofesta wants to merge 2 commits into antirez:main from rinaldofesta:fix/eval-grader-false-negatives

@rinaldofesta

The ds4-eval answer grader has three reachable bugs that make it under-report model accuracy on the embedded eval set by marking correct answers wrong. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). The work is split into two commits so the unambiguous fixes are separate from the one with a design trade-off.

Commit 1 — integer + regrade (unambiguous)

find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor:

"Answer: m+n = 256+37 = 293"     ->  256   (should be 293)
"Answer: a+b+c = 12+25+25 = 62"  ->  12    (should be 62)
"Answer: 2*7 + 3*6 + ... = 81"   ->  2     (should be 81)

Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.
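The "last =" rule can be sketched as below. This is an illustration only, not the code in ds4_eval.c: sketch_integer_answer and sketch_integer_selftest are hypothetical names, the sketch matches only the literal marker "Answer:" (the real grader also accepts other markers such as "Final answer:"), and it leaves out the loose fallback.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the real find_integer_answer.
 * Writes the graded integer (leading zeros stripped) into out.
 * Returns 1 on success, 0 if there is no marker or no integer.
 * Assumption: answers are non-negative, so no sign handling. */
int sketch_integer_answer(const char *text, char *out, size_t outlen) {
    const char *marker = "Answer:";
    const char *p = strstr(text, marker);
    if (p == NULL || outlen < 2) return 0;
    p += strlen(marker);

    /* Bound the scan to the answer line. */
    const char *end = p + strcspn(p, "\n");

    /* If the line shows arithmetic, start after the last '='. */
    for (const char *q = end; q > p; q--) {
        if (q[-1] == '=') { p = q; break; }
    }

    /* Take the first run of digits in the remaining window. */
    while (p < end && !isdigit((unsigned char)*p)) p++;
    if (p == end) return 0;

    /* Strip leading zeros but keep a lone "0". */
    while (p + 1 < end && *p == '0' && isdigit((unsigned char)p[1])) p++;

    size_t n = 0;
    while (p < end && isdigit((unsigned char)*p) && n + 1 < outlen) out[n++] = *p++;
    out[n] = '\0';
    return 1;
}

/* Golden cases copied from the PR description. */
void sketch_integer_selftest(void) {
    char buf[32];
    assert(sketch_integer_answer("Answer: m+n = 256+37 = 293", buf, sizeof buf));
    assert(strcmp(buf, "293") == 0);
    assert(sketch_integer_answer("Answer: a+b+c = 12+25+25 = 62", buf, sizeof buf));
    assert(strcmp(buf, "62") == 0);
    assert(sketch_integer_answer("Answer: 082", buf, sizeof buf));
    assert(strcmp(buf, "82") == 0);
}
```

Bounding the window to the answer line first matters: without it, an '=' on a later line could move the scan start past the stated answer.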

regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.
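A rough sketch of the status filter follows. The types, field names, and the exact meaning of changed (taken here as "the regraded verdict differs from the recorded PASSED/FAILED status") are assumptions for illustration; the real regrade_case_outcome() may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types; the real code in ds4_eval.c may differ. */
typedef enum { OUTCOME_PASSED, OUTCOME_FAILED, OUTCOME_NOT_GRADED } sketch_outcome;

typedef struct {
    const char *status;   /* recorded status, or NULL for legacy traces */
    int answer_correct;   /* grader verdict on the MODEL_OUTPUT block */
} sketch_case;

typedef struct { int passed, failed, not_graded, changed; } sketch_counts;

/* Only PASSED/FAILED and status-less traces are graded. */
sketch_outcome sketch_case_outcome(const sketch_case *c) {
    int gradable = c->status == NULL ||
                   strcmp(c->status, "PASSED") == 0 ||
                   strcmp(c->status, "FAILED") == 0;
    if (!gradable) return OUTCOME_NOT_GRADED;
    return c->answer_correct ? OUTCOME_PASSED : OUTCOME_FAILED;
}

sketch_counts sketch_regrade(const sketch_case *cases, size_t n) {
    sketch_counts k = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        const char *s = cases[i].status;
        switch (sketch_case_outcome(&cases[i])) {
        case OUTCOME_NOT_GRADED:
            k.not_graded++;
            break;
        case OUTCOME_PASSED:
            k.passed++;
            if (s != NULL && strcmp(s, "FAILED") == 0) k.changed++;
            break;
        case OUTCOME_FAILED:
            k.failed++;
            if (s != NULL && strcmp(s, "PASSED") == 0) k.changed++;
            break;
        }
    }
    return k;
}

/* The crafted 3-case trace from the Testing section: 1 PASSED, 2 STOPPED. */
void sketch_regrade_selftest(void) {
    sketch_case trace[] = {
        { "PASSED",  1 },
        { "STOPPED", 1 },   /* partial output that happens to grade as correct */
        { "STOPPED", 0 },   /* partial output that grades as wrong */
    };
    sketch_counts k = sketch_regrade(trace, 3);
    assert(k.passed == 1 && k.failed == 0);
    assert(k.not_graded == 2 && k.changed == 0);
}
```

Interrupted cases never reach the grader in this shape, so their partial output cannot move passed, failed, or changed.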

Commit 2 — multiple choice (with a known limitation)

find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A–J) cases a leading English pronoun/article became the answer:

"Answer: I think it is C"          ->  I   (should be C)
"Answer: I'll go with C."          ->  I   (should be C)
"Answer: A careful reading ... C"  ->  A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.

Known limitation (kept in a separate commit on purpose): a distractor explicitly rejected before the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing. Happy to drop or rework this commit if you'd rather handle it differently.
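For discussion, here is one way the forward-skip plus reverse-recover behaviour could look. It is a sketch under assumptions, not the patch itself: the function names are hypothetical, only the literal marker "Answer:" is matched, and "begins a same-line word" is approximated as "followed by a space and a lowercase letter".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A capital in 'A'..max with no alphanumeric neighbour on either side. */
static int is_candidate(const char *line, const char *p, const char *end, char max) {
    if (*p < 'A' || *p > max) return 0;
    if (p > line && isalnum((unsigned char)p[-1])) return 0;
    if (p + 1 < end && isalnum((unsigned char)p[1])) return 0;
    return 1;
}

/* Assumed heuristic for a capital that opens prose rather than answers. */
static int starts_prose(const char *p, const char *end) {
    /* Contraction such as "I'll" or "I'm". */
    if (p + 2 < end && p[1] == '\'' && isalpha((unsigned char)p[2])) return 1;
    /* Pronoun or article before a lowercase word: "I think", "A careful". */
    if (p + 2 < end && p[1] == ' ' && islower((unsigned char)p[2])) return 1;
    return 0;
}

/* Hypothetical sketch of the letter extractor; returns 0 if nothing is found. */
char sketch_answer_letter(const char *text, char max) {
    const char *marker = "Answer:";
    const char *line = strstr(text, marker);
    if (line == NULL) return 0;
    line += strlen(marker);
    const char *end = line + strcspn(line, "\n");

    /* Forward scan: first candidate that does not open prose. */
    for (const char *p = line; p < end; p++)
        if (is_candidate(line, p, end, max) && !starts_prose(p, end)) return *p;

    /* Reverse scan: recover a skipped capital if it is all there is. */
    for (const char *p = end; p > line; p--)
        if (is_candidate(line, p - 1, end, max)) return p[-1];
    return 0;
}

/* Golden cases from the PR description, plus the known limitation. */
void sketch_letter_selftest(void) {
    assert(sketch_answer_letter("Answer: I think it is C", 'J') == 'C');
    assert(sketch_answer_letter("Answer: I'll go with C.", 'J') == 'C');
    assert(sketch_answer_letter("Answer: A careful reading ... C", 'J') == 'C');
    assert(sketch_answer_letter("Answer: A is correct", 'J') == 'A');
    assert(sketch_answer_letter("Answer: I.", 'J') == 'I');
    /* Known limitation: the rejected distractor still wins (should be D). */
    assert(sketch_answer_letter("Answer: rules out C, leaving D", 'J') == 'C');
}
```

The last assertion pins the limitation rather than hiding it: the heuristic looks only at the characters right after a capital, so it cannot tell a rejected letter from a chosen one.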

Testing

make                              # Metal: ds4_eval.c compiles clean with -Wall -Wextra
make cpu                          # builds; only pre-existing ds4.c warnings, none from this change
./ds4-eval --self-test-extractors # passes (extractor + regrade cases)

The new self-tests fail on main and pass with the fixes (red → green). A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash).

rinaldofesta and others added 2 commits June 1, 2026 16:34
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):

- find_integer_answer: the answer-marker path took the first integer in
  the window, so an answer line that shows its arithmetic
  ("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
  instead of the stated total (293) -> false negative. Reachable on the
  many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
  to the answer line and, when it shows arithmetic, reads the value after
  the last '='. "Final answer: 082" -> "82" and the loose fallback are
  preserved.

- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
  including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
  partial output inflated passed/failed and raised spurious "changed"
  drift. Grading is now factored into regrade_case_outcome(), which only
  grades PASSED/FAILED (and legacy status-less) traces; others are
  reported via a new not_graded counter.

Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:

    "Answer: I think it is C"        -> I   (should be C)
    "Answer: I'll go with C."        -> I   (should be C)
    "Answer: A careful reading ... C" -> A   (should be C)

24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.

Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.

Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rinaldofesta
Author

The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion in #321: the extractor still grabs a distractor stated before the chosen letter ("Answer: rules out C, leaving D" is graded as C). Kept out of this PR because every naive fix trades one failure mode for another; happy to fold a resolution in here if you'd prefer.
