
Per-role GenerationConfig and backend plumbing #41

Open

alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from alexrs-cohere:feat/per-role-generation-config

Conversation

alexrs-cohere commented Apr 29, 2026

Summary

Reworks the CLI surface so every knob that can affect generated text is exposed independently for the three runtime roles — model A, model B, and the LLM judge — and is recorded explicitly in the run.

Public CLI surface

For each of A, B, judge the parser now accepts:

--temperature_<role>   --top_p_<role>   --top_k_<role>   --seed_<role>
--max_out_tokens_<role>   --max_model_len_<role>   --chat_template_<role>
--engine_kwargs_<role>

The legacy non-role-aware flags — --max_out_tokens_models, --max_model_len, --chat_template, --engine_kwargs — keep working as deprecated aliases that fan out to the appropriate roles with a DeprecationWarning. Per-role flags always win when both are set.
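
A minimal sketch of how this flag matrix and the fan-out could be wired; the option names match the PR, while the helper names (`add_per_role_flags`, `fan_out_deprecated`) and argument types are illustrative, not copied from the repo:

```python
import argparse
import warnings

ROLES = ("A", "B", "judge")
PER_ROLE_FLAGS = {
    "temperature": float, "top_p": float, "top_k": int, "seed": int,
    "max_out_tokens": int, "max_model_len": int,
    "chat_template": str, "engine_kwargs": str,
}

def add_per_role_flags(parser: argparse.ArgumentParser) -> None:
    # One flag per (knob, role) pair; default None means "not set by the user".
    for role in ROLES:
        for name, typ in PER_ROLE_FLAGS.items():
            parser.add_argument(f"--{name}_{role}", type=typ, default=None)
    # Legacy alias, kept only for backward compatibility.
    parser.add_argument("--max_out_tokens_models", type=int, default=None)

def fan_out_deprecated(args: argparse.Namespace) -> None:
    # The legacy flag fans out to the battle models only, and a per-role
    # flag always wins over the alias.
    if args.max_out_tokens_models is not None:
        warnings.warn(
            "--max_out_tokens_models is deprecated; use --max_out_tokens_A/_B",
            DeprecationWarning,
        )
        for role in ("A", "B"):
            if getattr(args, f"max_out_tokens_{role}") is None:
                setattr(args, f"max_out_tokens_{role}", args.max_out_tokens_models)
```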

Internals

  • GenerationConfig (frozen dataclass in cli_common) packages the eight knobs above. BaseCliArgs now holds three of them (gen_A / gen_B / gen_judge) instead of the previous flat list of max_out_tokens_models / max_out_tokens_judge / max_model_len / chat_template / engine_kwargs.
  • resolve_generation_configs turns a parsed argparse.Namespace into the per-role configs.
  • gen_config_to_invoke_kwargs flattens a config into the kwargs that make_model / generate_* accept (the dataclass and this helper are sketched after this list).
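
A sketch of those two pieces, assuming the dataclass fields mirror the eight flags (the real cli_common may differ in detail):

```python
from dataclasses import dataclass, fields
from typing import Any

@dataclass(frozen=True)
class GenerationConfig:
    # The eight knobs; None means "leave the backend default in place".
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    max_out_tokens: int | None = None
    max_model_len: int | None = None
    chat_template: str | None = None
    engine_kwargs: dict[str, Any] | None = None

def gen_config_to_invoke_kwargs(cfg: GenerationConfig) -> dict[str, Any]:
    # Skip unset fields so backend defaults survive instead of being
    # pinned by the CLI layer (the behaviour the test plan asserts).
    return {f.name: getattr(cfg, f.name)
            for f in fields(cfg) if getattr(cfg, f.name) is not None}
```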

Backends

  • ChatVLLM accepts temperature, top_p, top_k, seed and writes them through to SamplingParams (it used to hard-code temperature=0.6 / top_p=0.95); a new set_temperature method lets MT-Bench's category-aware temperature switching keep working without reaching into private attributes.
  • make_model extracts the cross-backend sampling fields from engine_kwargs and routes them to the provider-appropriate constructor argument: ChatOpenAI / OpenRouter / Together / LlamaCpp all see temperature / top_p / seed directly; top_k is tunneled through model_kwargs for OpenAI-compatible backends. Both this routing and the ChatVLLM wrapper are sketched after this list.
  • DummyModel records every kwarg the constructor sees on init_kwargs so tests can assert that per-role configs reach the model layer.
  • do_inference gains an optional out_metadata argument that collects per-call provider response metadata (system_fingerprint, model_name, etc.) when present.
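
A condensed sketch of the vLLM path and the provider routing. It assumes vLLM's SamplingParams and langchain-openai's ChatOpenAI; the real wrappers in the repo carry more state (model names, engine handles, retries):

```python
from vllm import SamplingParams          # vLLM's sampling container
from langchain_openai import ChatOpenAI  # assumed OpenAI-compatible wrapper

class ChatVLLM:
    """Sketch: sampling knobs become constructor arguments, not constants."""

    def __init__(self, temperature: float = 0.6, top_p: float = 0.95,
                 top_k: int = -1, seed: int | None = None):
        # top_k=-1 disables top-k sampling in vLLM.
        self._sampling = SamplingParams(
            temperature=temperature, top_p=top_p, top_k=top_k, seed=seed)

    def set_temperature(self, temperature: float) -> None:
        # Public hook for MT-Bench's category-aware switching, so callers
        # never have to reach into private attributes.
        self._sampling = SamplingParams(
            temperature=temperature, top_p=self._sampling.top_p,
            top_k=self._sampling.top_k, seed=self._sampling.seed)

def make_model(provider: str, **engine_kwargs):
    # Pull the cross-backend sampling fields out and re-route per provider.
    top_k = engine_kwargs.pop("top_k", None)
    if provider == "vllm":
        return ChatVLLM(top_k=-1 if top_k is None else top_k, **engine_kwargs)
    # OpenAI-compatible APIs take temperature/top_p/seed as first-class
    # constructor arguments but have no top-level top_k, so it is tunneled
    # through model_kwargs instead.
    model_kwargs = {} if top_k is None else {"top_k": top_k}
    return ChatOpenAI(model_kwargs=model_kwargs, **engine_kwargs)
```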

Call sites

generate_and_evaluate.main, estimate_elo_ratings.main, and mt_bench/mt_bench_utils._generate_mt_bench_completions now build their generators from args.gen_A / args.gen_B / args.gen_judge via gen_config_to_invoke_kwargs. MT-Bench keeps its category-aware temperature defaults but only when the user has not explicitly pinned a role-level temperature; otherwise the CLI override wins. The judge call in MT-Bench keeps its historical temperature=0.0 default but accepts overrides from --temperature_judge.
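
The precedence rule reads roughly like the sketch below; the category temperatures are illustrative values in the spirit of MT-Bench's defaults, and the exact table presumably lives near mt_bench_utils:

```python
# Illustrative subset of MT-Bench-style category defaults (not the repo's table).
CATEGORY_TEMPERATURE = {"writing": 0.7, "math": 0.0, "coding": 0.1}

def effective_temperature(category: str, cli_temperature: float | None) -> float:
    if cli_temperature is not None:  # an explicit --temperature_<role> wins
        return cli_temperature
    return CATEGORY_TEMPERATURE.get(category, 0.7)
```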

Backward compatibility & out of scope

The cache-key format and run-metadata.v1.json schema are intentionally unchanged in this PR; follow-up PRs in the reproducibility-hardening stack bump the schema, content-address the cache key off GenerationConfig, and record the resolved per-role configs in the run metadata.

Why

Today vLLM's sampling parameters are hard-coded (temperature=0.6, top_p=0.95) at the wrapper layer, hosted providers ignore the seed, and the same --max_out_tokens_models value is forced on both battle models even when you'd want to ablate them independently. This PR is the foundation for everything else: once a run is described by three explicit GenerationConfigs, the cache key, the run metadata, and the --rerun helper all have something concrete to hash and replay.

Test plan

  • uv run pytest -q — 83/83 pass on this branch in isolation; 117/117 pass when stacked with the rest of the reproducibility-hardening work.
  • tests/test_seed_plumbing.py (new) — --seed_A / --temperature_judge land on the correct GenerationConfig; DummyModel.init_kwargs captures the values flowing through make_model; gen_config_to_invoke_kwargs skips unset fields (see the sketch after this list).
  • tests/test_cli.py — new tests for per-role temperature/seed flags, deprecated --engine_kwargs fan-out behaviour, per-role override winning over the deprecation alias, and --max_out_tokens_models fanning out to A and B but not judge.
  • CI runs the full suite.
  • Manual smoke run against a vLLM model with --seed_A / --temperature_A 0.0 to confirm the seed reaches SamplingParams.
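
The seed-plumbing assertions have roughly the shape below; parse_args, make_model, and the "dummy" provider string follow the prose above rather than the repo's exact API:

```python
# Hypothetical shape of one test_seed_plumbing.py assertion.
def test_seed_a_reaches_model_layer():
    args = parse_args(["--seed_A", "1234", "--temperature_judge", "0.0"])
    assert args.gen_A.seed == 1234             # lands on the right role
    assert args.gen_B.seed is None             # and does not leak to B
    assert args.gen_judge.temperature == 0.0

    model = make_model("dummy", **gen_config_to_invoke_kwargs(args.gen_A))
    assert model.init_kwargs["seed"] == 1234   # DummyModel recorded the kwarg
```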
