Skip to content

[codex] Forward reasoning kwargs from eval TOML sampling sections#1404

Draft
willccbb wants to merge 1 commit into
mainfrom
codex/pipe-eval-thinking-settings
Draft

[codex] Forward reasoning kwargs from eval TOML sampling sections#1404
willccbb wants to merge 1 commit into
mainfrom
codex/pipe-eval-thinking-settings

Conversation

@willccbb
Copy link
Copy Markdown
Member

@willccbb willccbb commented May 17, 2026

Summary

  • Accept [sampling] tables in eval TOML configs and normalize them into sampling_args.
  • Fold reasoning_effort and enable_thinking from eval TOML sampling sections into extra_body.chat_template_kwargs.
  • Document the eval TOML sampling shape and add coverage for global and per-eval sampling sections.

Validation

  • uv run pytest tests/test_eval_cli.py -q
  • uv run ruff check verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md
  • uv run pre-commit run --files verifiers/utils/eval_utils.py tests/test_eval_cli.py docs/evaluation.md
  • git push pre-push hooks: ruff check, ruff format, sync AGENTS skipped, ty passed

Note

Medium Risk
Touches TOML config parsing/validation for eval runs, so malformed normalization could change sampling parameters passed to model APIs. Risk is moderated by added unit tests covering global/per-eval sampling behavior and error cases.

Overview
Adds support for TOML [sampling] / [eval.sampling] tables by normalizing them into sampling_args, including validation that sampling and sampling_args aren’t both set.

reasoning_effort and enable_thinking are now automatically moved under sampling_args.extra_body.chat_template_kwargs (merging with any existing extra_body/chat_template_kwargs) to support OpenAI-compatible servers that read these options there.

Updates docs to describe the new TOML sampling shape and adds CLI/config parsing tests to ensure the thinking-related fields are piped through correctly.

Reviewed by Cursor Bugbot for commit 86f2024. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Forward reasoning_effort and enable_thinking from eval TOML [sampling] into extra_body.chat_template_kwargs

  • Adds a [sampling] table as an alternative to sampling_args in eval TOML configs (global, per-eval, and ablation sections), which load_toml_config converts to sampling_args.
  • Introduces normalize_sampling_section and normalize_sampling_config in eval_utils.py to move reasoning_effort and enable_thinking from the top-level sampling dict into extra_body.chat_template_kwargs, as required by OpenAI-compatible servers.
  • Documents the new [sampling] section with examples in docs/evaluation.md.
  • Risk: configs that supply both sampling and sampling_args, or use non-table values for extra_body/chat_template_kwargs, now raise ValueError.

Macroscope summarized 86f2024.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 86f2024. Configure here.

Comment thread docs/evaluation.md

`reasoning_effort` and `enable_thinking` from `[sampling]` are forwarded through
`extra_body.chat_template_kwargs` for OpenAI-compatible servers that read chat
template options there.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing skill update for evaluation workflow change

Low Severity

The PR adds a new [sampling] TOML config section and documents it in docs/evaluation.md, but skills/evaluate-environments/SKILL.md is not updated to reflect this user-facing workflow change. The skills update rule requires that changes to docs/evaluation.md be mirrored in the corresponding skill file.

Fix in Cursor Fix in Web

Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit 86f2024. Configure here.

@xeophon
Copy link
Copy Markdown
Member

xeophon commented May 18, 2026

the current implementation will break what we did with #1338

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants