ds4-eval: fix three answer-grader false negatives + golden self-tests #319
Open
rinaldofesta wants to merge 2 commits into
Two model-free correctness fixes in the answer grader, locked by new
--self-test-extractors cases (no model/GPU needed):
- find_integer_answer: the answer-marker path took the first integer in
the window, so an answer line that shows its arithmetic
("Answer: m+n = 256+37 = 293") was graded as the first summand (256)
instead of the stated total (293) -> false negative. Reachable on the
many embedded AIME2025 "Find m+n / a+b+c" cases. The scan is now bound
to the answer line and, when it shows arithmetic, reads the value after
the last '='. "Final answer: 082" -> "82" and the loose fallback are
preserved.
- regrade_trace_file: every case with a MODEL_OUTPUT block was graded,
including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their
partial output inflated passed/failed and raised spurious "changed"
drift. Grading is now factored into regrade_case_outcome(), which only
grades PASSED/FAILED (and legacy status-less) traces; others are
reported via a new not_graded counter.
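The answer-line rule from the first item can be sketched as below. This is an illustrative reading of the description, not the ds4-eval code: the names sketch_first_int and sketch_integer_from_answer_line are hypothetical, the input is assumed to start just past the answer marker, and signs and the loose fallback are ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first run of digits in [p, end) into out,
 * dropping leading zeros ("082" -> "82"). Returns 1 if a number was found. */
static int sketch_first_int(const char *p, const char *end,
                            char *out, size_t outlen) {
    while (p < end && !isdigit((unsigned char)*p)) p++;
    if (p == end) return 0;
    /* Skip a leading zero only while another digit follows it. */
    while (p + 1 < end && *p == '0' && isdigit((unsigned char)p[1])) p++;
    size_t n = 0;
    while (p < end && isdigit((unsigned char)*p) && n + 1 < outlen)
        out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}

/* Hypothetical sketch: `text` is assumed to point just past an answer
 * marker such as "Answer:". The scan never leaves the answer line. */
int sketch_integer_from_answer_line(const char *text, char *out, size_t outlen) {
    assert(text != NULL && out != NULL && outlen > 1);
    const char *end = strchr(text, '\n');
    if (end == NULL) end = text + strlen(text);

    /* If the line shows arithmetic, prefer the value after the last '='. */
    const char *last_eq = NULL;
    for (const char *p = text; p < end; p++)
        if (*p == '=') last_eq = p;
    if (last_eq != NULL && sketch_first_int(last_eq + 1, end, out, outlen))
        return 1;

    /* Otherwise take the first integer on the line, as before. */
    return sketch_first_int(text, end, out, outlen);
}
```

Under these assumptions, the text " m+n = 256+37 = 293" yields "293" rather than "256", and " 082" still yields "82".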
Tested: make (Metal) and make cpu build clean (-Wall -Wextra);
./ds4-eval --self-test-extractors passes; regrade of a 3-case trace
(1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0
(was passed=2 failed=1 changed=1).
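The status filter behind those numbers can be sketched roughly as follows. The struct and function names are hypothetical, the status is assumed to be available as a plain string, and the changed counter is left out; the real regrade_case_outcome() may be organised differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tally; the real tool also tracks "changed". */
typedef struct {
    int passed;
    int failed;
    int not_graded;
} sketch_counts;

/* Only completed runs are re-graded. A missing or empty status is treated
 * as a legacy trace and graded, as the commit message describes. */
int sketch_status_is_gradable(const char *status) {
    if (status == NULL || status[0] == '\0') return 1;
    return strcmp(status, "PASSED") == 0 || strcmp(status, "FAILED") == 0;
}

/* regrade_ok is the grader's verdict on the MODEL_OUTPUT block; it is
 * ignored for interrupted cases (STOPPED/SKIPPED/SWITCHED/ERROR). */
void sketch_tally(sketch_counts *c, const char *status, int regrade_ok) {
    assert(c != NULL);
    if (!sketch_status_is_gradable(status)) {
        c->not_graded++;
        return;
    }
    if (regrade_ok) c->passed++;
    else c->failed++;
}
```

Feeding this one PASSED case and two STOPPED cases gives passed=1, failed=0, not_graded=2, whatever the partial output of the stopped cases happens to grade as.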
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
find_answer_letter returned the first boundary-isolated in-range capital
after "Answer:", so on 10-choice (A-J) cases a leading English pronoun or
article was graded as the choice:
"Answer: I think it is C" -> I (should be C)
"Answer: I'll go with C." -> I (should be C)
"Answer: A careful reading ... C" -> A (should be C)
24 embedded cases are 10-choice, so this is reachable. The forward scan
now skips a standalone capital that begins a same-line word or a
contraction; the reverse scan still recovers it when it is the only
candidate ("Answer: A is correct"), and a genuine standalone answer
("Answer: I.") is unchanged.
Known limitation: a distractor explicitly rejected *before* the chosen
letter on the same line ("... rules out C, leaving D") is still misread;
that needs sentence-level parsing and is left for discussion.
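One possible reading of the forward/reverse scan is sketched below. It is not the find_answer_letter implementation. The function names are hypothetical, the input is assumed to start just past "Answer:", and "begins a same-line word or a contraction" is interpreted here as "followed by a space and a lowercase letter, or by an apostrophe and a letter".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A candidate is an in-range capital with no letter or digit on either side. */
static int sketch_is_candidate(const char *s, size_t i, size_t n, char max_letter) {
    char c = s[i];
    if (c < 'A' || c > max_letter) return 0;
    if (i > 0 && isalnum((unsigned char)s[i - 1])) return 0;
    if (i + 1 < n && isalnum((unsigned char)s[i + 1])) return 0;
    return 1;
}

/* Assumed heuristic: "I'll ..." (contraction) or "I think", "A careful". */
static int sketch_looks_like_prose(const char *s, size_t i, size_t n) {
    if (i + 2 < n && s[i + 1] == '\'' && isalpha((unsigned char)s[i + 2])) return 1;
    if (i + 2 < n && s[i + 1] == ' ' && islower((unsigned char)s[i + 2])) return 1;
    return 0;
}

/* Returns the chosen letter, or '\0' if the line holds no candidate.
 * max_letter is 'J' for a 10-choice case, 'D' for a 4-choice one. */
char sketch_answer_letter(const char *text, char max_letter) {
    assert(text != NULL && max_letter >= 'A' && max_letter <= 'Z');
    const char *nl = strchr(text, '\n');
    size_t n = nl ? (size_t)(nl - text) : strlen(text);

    /* Forward scan: first candidate that does not read as prose. */
    for (size_t i = 0; i < n; i++)
        if (sketch_is_candidate(text, i, n, max_letter) &&
            !sketch_looks_like_prose(text, i, n))
            return text[i];

    /* Reverse scan: recover a skipped letter when nothing else qualified. */
    for (size_t i = n; i-- > 0;)
        if (sketch_is_candidate(text, i, n, max_letter))
            return text[i];
    return '\0';
}
```

With max_letter set to 'J', this sketch returns C for " I think it is C", A for " A is correct", and I for " I.". It also reproduces the known limitation: " rules out C, leaving D" still returns C.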
Locked by new --self-test-extractors cases (model-free). All prior
self-tests and the integer/regrade cases continue to pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
The known limitation of the multiple-choice fix (commit 2) is now tracked separately for discussion: #321. The extractor still grabs a distractor stated before the chosen letter.
The ds4-eval answer grader has three reachable bugs that make it under-report model accuracy on the embedded eval set: it marks correct answers wrong. Each is fixed with a minimal change and locked by new cases in the existing model-free --self-test-extractors path (no model or GPU needed). The PR is split into two commits so that the unambiguous fixes are separate from the one with a design trade-off.

Commit 1: integer + regrade (unambiguous)

find_integer_answer took the first integer in the answer-marker window, so an answer line that shows its arithmetic was graded as the first summand/factor:
"Answer: m+n = 256+37 = 293" -> 256 (should be 293)
Reachable on the many embedded AIME2025 "Find $m+n$ / $a+b+c$" cases. The scan is now bound to the answer line and, when the line shows arithmetic, reads the value after the last =. "Final answer: 082" -> "82" and the loose fallback are preserved.

regrade_trace_file graded every case carrying a MODEL_OUTPUT block, including interrupted ones (STOPPED/SKIPPED/SWITCHED/ERROR). Their partial output inflated passed/failed and raised spurious changed drift. Grading is now factored into regrade_case_outcome(), which only grades PASSED/FAILED (and legacy status-less) traces; the rest are reported via a new not_graded counter.

Commit 2: multiple choice (with a known limitation)

find_answer_letter returned the first boundary-isolated in-range capital after Answer:, so on 10-choice (A-J) cases a leading English pronoun/article became the answer:
"Answer: I think it is C" -> I (should be C)
"Answer: I'll go with C." -> I (should be C)
"Answer: A careful reading ... C" -> A (should be C)
24 embedded cases are 10-choice, so this is reachable. The forward scan now skips a standalone capital that begins a same-line word or a contraction; the reverse scan still recovers it when it is the only candidate ("Answer: A is correct"), and a genuine standalone answer ("Answer: I.") is unchanged.

Known limitation (kept in a separate commit on purpose): a distractor explicitly rejected before the chosen letter on the same line ("... rules out C, leaving D") is still misread; that needs sentence-level parsing. Happy to drop or rework this commit if you'd rather handle it differently.

Testing

The new self-tests fail on main and pass with the fixes (red → green). A crafted 3-case regrade trace (1 PASSED, 2 STOPPED) reports passed=1 not_graded=2 changed=0 after the fix (was passed=2 failed=1 changed=1). No inference-path code is touched, so there is no speed impact. Tested on macOS / Metal (IQ2_XXS DeepSeek-V4-Flash).