Add aime2026 dataset to load_example_dataset #1399
Open
vedthebear wants to merge 1 commit into
Registers MathArena/aime_2026 (30 problems, 2026 AIME I+II released after most current model training cutoffs, making it useful as a held-out math benchmark when running evaluations on top-of-the-line models). Uses the existing ``temp_answer`` rename hook in ``load_example_dataset`` so the int->str cast on the source dataset's ``answer`` column (typed int64) survives ``.map``'s type inference. Same workaround the mmlu preprocessor already uses for its int-typed answer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
Registers `MathArena/aime_2026` (30 problems from the 2026 AIME I + II) in `verifiers/utils/data_utils.py` alongside the existing `aime2024`/`aime2025` entries.
Why
AIME 2026 was released after most current models' training cutoffs, which makes it a useful held-out math benchmark for evaluating top-tier models. Including it here means anyone running a math env via verifiers can grab it with the standard `load_example_dataset` one-liner.
How
Two small additions matching the surrounding pattern:
- `get_preprocess_fn`: new `aime2026` branch returning `{"question": x["problem"], "temp_answer": str(x["answer"])}`.
- `load_example_dataset`: new `aime2026` branch calling `load_dataset("MathArena/aime_2026")["train"]`.
Implementation note
The source dataset's `answer` column is typed `int64`. The new preprocessor returns the renamed key `temp_answer` (which gets renamed back to `answer` by the existing hook at the end of `load_example_dataset`) so the int-to-string cast survives `.map`'s type inference. This is the same workaround the existing `mmlu` preprocessor already uses for its int-typed answer column, so no new mechanism is introduced.
Verification
Ruff passes on the modified file (`uv run ruff check verifiers/utils/data_utils.py`).
Note
Low Risk
Low risk: small, additive dataset registration and preprocessing changes with no impact to core evaluation logic beyond enabling a new dataset name.
Overview
Adds a new `aime2026` option to `load_example_dataset` that loads `MathArena/aime_2026` (default `train`) and wires it into `get_preprocess_fn`.
The new preprocessor maps `problem` to `question` and stringifies the int-typed `answer` via a `temp_answer` field that is later renamed back to `answer` to preserve the cast through `.map`.
Reviewed by Cursor Bugbot for commit 663e714.
Note
Add `aime2026` dataset support to `load_example_dataset`
Adds the `MathArena/aime_2026` dataset as a loadable option in `data_utils.py`. The preprocessor maps `problem` to `question` and casts the integer `answer` to a string via a `temp_answer` intermediate column, which is then renamed to `answer` post-map to avoid type conflicts during dataset mapping.
Macroscope summarized 663e714.
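For readers unfamiliar with the `temp_answer` pattern, here is a minimal pure-Python sketch of the data flow this PR describes. It is not the code from `verifiers/utils/data_utils.py`: the helper names `preprocess_aime2026` and `rename_temp_answer` are hypothetical, plain dicts stand in for Hugging Face dataset rows, the sample rows are made up, and the real hook may handle leftover columns differently.

```python
# Hypothetical, simplified stand-ins for illustration only. The real
# functions live in verifiers/utils/data_utils.py and operate on
# Hugging Face datasets rather than lists of dicts.


def preprocess_aime2026(x: dict) -> dict:
    """Return the question/temp_answer shape described in the PR."""
    return {"question": x["problem"], "temp_answer": str(x["answer"])}


def rename_temp_answer(row: dict) -> dict:
    """Stand-in for the rename hook: move temp_answer back to answer."""
    out = {k: v for k, v in row.items() if k not in ("answer", "temp_answer")}
    out["answer"] = row["temp_answer"]
    return out


# Made-up rows, not actual AIME 2026 problems.
source_rows = [
    {"problem": "What is 6 * 7?", "answer": 42},
    {"problem": "What is 10 - 3?", "answer": 7},
]

# Mimic a map step: the preprocessor output is merged into each row, so the
# original int `answer` sits next to the new string `temp_answer`.
mapped = [{**row, **preprocess_aime2026(row)} for row in source_rows]

# Mimic the post-map rename: `answer` now holds the string value.
final_rows = [rename_temp_answer(row) for row in mapped]

for row in final_rows:
    print(row["question"], "->", repr(row["answer"]))
```

Writing the string to a fresh column name means the map step never has to reconcile a string value with the existing integer `answer` column; only the later rename puts the string under the `answer` name.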