[codex] Forward reasoning kwargs from eval TOML sampling sections#1404
willccbb wants to merge 1 commit into
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 86f2024.
`reasoning_effort` and `enable_thinking` from `[sampling]` are forwarded through
`extra_body.chat_template_kwargs` for OpenAI-compatible servers that read chat
template options there.
Missing skill update for evaluation workflow change
Low Severity
The PR adds a new [sampling] TOML config section and documents it in docs/evaluation.md, but skills/evaluate-environments/SKILL.md is not updated to reflect this user-facing workflow change. The skills update rule requires that changes to docs/evaluation.md be mirrored in the corresponding skill file.
Triggered by project rule: BugBot Instructions
the current implementation will break what we did with #1338


Summary
- Accept `[sampling]` tables in eval TOML configs and normalize them into `sampling_args`.
- Forward `reasoning_effort` and `enable_thinking` from eval TOML sampling sections into `extra_body.chat_template_kwargs`.

Validation
- `uv run pytest tests/test_eval_cli.py -q`
- `uv run ruff check verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md`
- `uv run pre-commit run --files verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md`
- `git push` pre-push hooks: ruff check, ruff format, sync AGENTS skipped, ty passed

Note
Medium Risk
Touches TOML config parsing/validation for eval runs, so malformed normalization could change sampling parameters passed to model APIs. Risk is moderated by added unit tests covering global/per-eval sampling behavior and error cases.
Overview
Adds support for TOML `[sampling]`/`[eval.sampling]` tables by normalizing them into `sampling_args`, including validation that `sampling` and `sampling_args` aren’t both set.
`reasoning_effort` and `enable_thinking` are now automatically moved under `sampling_args.extra_body.chat_template_kwargs` (merging with any existing `extra_body`/`chat_template_kwargs`) to support OpenAI-compatible servers that read these options there.
Updates docs to describe the new TOML sampling shape and adds CLI/config parsing tests to ensure the thinking-related fields are piped through correctly.
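The normalization and merge behavior summarized above could look roughly like the sketch below. It is a sketch under stated assumptions, not the code in `eval_utils.py`: `normalize_sampling` and `_require_table` are hypothetical names, the error messages are invented, and it assumes a top-level key overrides a same-named key already present in `chat_template_kwargs`, which the PR text does not specify.

```python
from copy import deepcopy

TEMPLATE_KEYS = ("reasoning_effort", "enable_thinking")


def _require_table(value, name):
    """Raise ValueError unless the value is a TOML table (a dict once parsed)."""
    if not isinstance(value, dict):
        raise ValueError(f"{name} must be a table, got {type(value).__name__}")
    return value


def normalize_sampling(config: dict) -> dict:
    """Fold `sampling` into `sampling_args` without mutating the input (hypothetical)."""
    if "sampling" in config and "sampling_args" in config:
        raise ValueError("set either sampling or sampling_args, not both")
    out = {k: v for k, v in config.items() if k != "sampling"}
    if "sampling" not in config:
        return out
    args = deepcopy(_require_table(config["sampling"], "sampling"))
    moved = {k: args.pop(k) for k in TEMPLATE_KEYS if k in args}
    if moved:
        # Merge into any existing extra_body / chat_template_kwargs tables.
        extra_body = _require_table(args.setdefault("extra_body", {}), "extra_body")
        kwargs = _require_table(
            extra_body.setdefault("chat_template_kwargs", {}), "chat_template_kwargs"
        )
        kwargs.update(moved)  # assumption: top-level values win on conflict
    out["sampling_args"] = args
    return out


cfg = {
    "sampling": {
        "enable_thinking": False,
        "extra_body": {"top_k": 20, "chat_template_kwargs": {"custom_flag": 1}},
    }
}
print(normalize_sampling(cfg))
```

Existing entries such as `top_k` and `custom_flag` (both made up for this example) survive the merge, and a config that sets both `sampling` and `sampling_args` is rejected before any rewriting happens.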
Reviewed by Cursor Bugbot for commit 86f2024.
Note
Forward `reasoning_effort` and `enable_thinking` from eval TOML `[sampling]` into `extra_body.chat_template_kwargs`
- Adds a `[sampling]` table as an alternative to `sampling_args` in eval TOML configs (global, per-eval, and ablation sections), which `load_toml_config` converts to `sampling_args`.
- Adds `normalize_sampling_section` and `normalize_sampling_config` in `eval_utils.py` to move `reasoning_effort` and `enable_thinking` from the top-level sampling dict into `extra_body.chat_template_kwargs`, as required by OpenAI-compatible servers.
- Documents the `[sampling]` section with examples in `docs/evaluation.md`.
- Configs that set both `sampling` and `sampling_args`, or use non-table values for `extra_body`/`chat_template_kwargs`, now raise `ValueError`.

Macroscope summarized 86f2024.