[codex] Forward reasoning kwargs from eval TOML sampling sections by willccbb · Pull Request #1404 · PrimeIntellect-ai/verifiers

willccbb · 2026-05-17T22:28:33Z

Summary

Accept [sampling] tables in eval TOML configs and normalize them into sampling_args.
Fold reasoning_effort and enable_thinking from eval TOML sampling sections into extra_body.chat_template_kwargs.
Document the eval TOML sampling shape and add coverage for global and per-eval sampling sections.

Validation

uv run pytest tests/test_eval_cli.py -q
uv run ruff check verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md
uv run pre-commit run --files verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md
git push pre-push hooks: ruff check, ruff format, sync AGENTS skipped, ty passed

Note

Medium Risk
Touches TOML config parsing/validation for eval runs, so malformed normalization could change sampling parameters passed to model APIs. Risk is moderated by added unit tests covering global/per-eval sampling behavior and error cases.

Overview
Adds support for TOML [sampling] / [eval.sampling] tables by normalizing them into sampling_args, including validation that sampling and sampling_args aren’t both set.

reasoning_effort and enable_thinking are now automatically moved under sampling_args.extra_body.chat_template_kwargs (merging with any existing extra_body/chat_template_kwargs) to support OpenAI-compatible servers that read these options there.

Updates docs to describe the new TOML sampling shape and adds CLI/config parsing tests to ensure the thinking-related fields are piped through correctly.

^{Reviewed by Cursor Bugbot for commit 86f2024. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Forward `reasoning_effort` and `enable_thinking` from eval TOML `[sampling]` into `extra_body.chat_template_kwargs`

Adds a [sampling] table as an alternative to sampling_args in eval TOML configs (global, per-eval, and ablation sections), which load_toml_config converts to sampling_args.
Introduces normalize_sampling_section and normalize_sampling_config in eval_utils.py to move reasoning_effort and enable_thinking from the top-level sampling dict into extra_body.chat_template_kwargs, as required by OpenAI-compatible servers.
Documents the new [sampling] section with examples in docs/evaluation.md.
Risk: configs that supply both sampling and sampling_args, or use non-table values for extra_body/chat_template_kwargs, now raise ValueError.

^{Macroscope summarized 86f2024.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 86f2024. Configure here.}

cursor · 2026-05-17T22:34:50Z

+
+`reasoning_effort` and `enable_thinking` from `[sampling]` are forwarded through
+`extra_body.chat_template_kwargs` for OpenAI-compatible servers that read chat
+template options there.


Missing skill update for evaluation workflow change

Low Severity

The PR adds a new [sampling] TOML config section and documents it in docs/evaluation.md, but skills/evaluate-environments/SKILL.md is not updated to reflect this user-facing workflow change. The skills update rule requires that changes to docs/evaluation.md be mirrored in the corresponding skill file.

^{Triggered by project rule: BugBot Instructions}

^{Reviewed by Cursor Bugbot for commit 86f2024. Configure here.}

xeophon · 2026-05-18T11:20:37Z

the current implementation will break what we did with #1338

Forward reasoning kwargs from eval TOML sampling sections

86f2024

cursor Bot reviewed May 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Forward reasoning kwargs from eval TOML sampling sections#1404

[codex] Forward reasoning kwargs from eval TOML sampling sections#1404
willccbb wants to merge 1 commit into
mainfrom
codex/pipe-eval-thinking-settings

willccbb commented May 17, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 17, 2026

Uh oh!

xeophon commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

willccbb commented May 17, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Forward reasoning_effort and enable_thinking from eval TOML [sampling] into extra_body.chat_template_kwargs

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 17, 2026

Choose a reason for hiding this comment

Missing skill update for evaluation workflow change

Uh oh!

xeophon commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

willccbb commented May 17, 2026 •

edited by macroscopeapp Bot

Loading

Forward `reasoning_effort` and `enable_thinking` from eval TOML `[sampling]` into `extra_body.chat_template_kwargs`

xeophon commented May 18, 2026 •

edited

Loading