
Per-role GenerationConfig and backend plumbing #41

Open

alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from alexrs-cohere:feat/per-role-generation-config

Conversation

alexrs-cohere commented Apr 29, 2026

Summary

Reworks the CLI surface so every knob that can affect generated text is exposed independently for the three runtime roles — model A, model B, and the LLM judge — and is recorded explicitly in the run.

Public CLI surface

For each of A, B, judge the parser now accepts:

--temperature_<role>   --top_p_<role>   --top_k_<role>   --seed_<role>
--max_out_tokens_<role>   --max_model_len_<role>   --chat_template_<role>
--engine_kwargs_<role>

The legacy non-role-aware flags — --max_out_tokens_models, --max_model_len, --chat_template, --engine_kwargs — keep working as deprecated aliases that fan out to the appropriate roles with a DeprecationWarning. Per-role flags always win when both are set.
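
A minimal sketch of how this flag matrix and the fan-out could be wired; the option names match the PR, while the helper names (`add_per_role_flags`, `fan_out_deprecated`) and argument types are illustrative, not copied from the repo:

```python
import argparse
import warnings

ROLES = ("A", "B", "judge")
PER_ROLE_FLAGS = {
    "temperature": float, "top_p": float, "top_k": int, "seed": int,
    "max_out_tokens": int, "max_model_len": int,
    "chat_template": str, "engine_kwargs": str,
}

def add_per_role_flags(parser: argparse.ArgumentParser) -> None:
    # One flag per (knob, role) pair; default None means "not set by the user".
    for role in ROLES:
        for name, typ in PER_ROLE_FLAGS.items():
            parser.add_argument(f"--{name}_{role}", type=typ, default=None)
    # Legacy alias, kept only for backward compatibility.
    parser.add_argument("--max_out_tokens_models", type=int, default=None)

def fan_out_deprecated(args: argparse.Namespace) -> None:
    # The legacy flag fans out to the battle models only, and a per-role
    # flag always wins over the alias.
    if args.max_out_tokens_models is not None:
        warnings.warn(
            "--max_out_tokens_models is deprecated; use --max_out_tokens_A/_B",
            DeprecationWarning,
        )
        for role in ("A", "B"):
            if getattr(args, f"max_out_tokens_{role}") is None:
                setattr(args, f"max_out_tokens_{role}", args.max_out_tokens_models)
```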

Internals

  • GenerationConfig (frozen dataclass in cli_common) packages the eight knobs above. BaseCliArgs now holds three of them (gen_A / gen_B / gen_judge) instead of the previous flat list of max_out_tokens_models / max_out_tokens_judge / max_model_len / chat_template / engine_kwargs.
  • resolve_generation_configs turns a parsed argparse.Namespace into the per-role configs.
  • gen_config_to_invoke_kwargs flattens a config into the kwargs that make_model / generate_* accept (the dataclass and this helper are sketched after this list).
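
A sketch of those two pieces, assuming the dataclass fields mirror the eight flags (the real cli_common may differ in detail):

```python
from dataclasses import dataclass, fields
from typing import Any

@dataclass(frozen=True)
class GenerationConfig:
    # The eight knobs; None means "leave the backend default in place".
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    max_out_tokens: int | None = None
    max_model_len: int | None = None
    chat_template: str | None = None
    engine_kwargs: dict[str, Any] | None = None

def gen_config_to_invoke_kwargs(cfg: GenerationConfig) -> dict[str, Any]:
    # Skip unset fields so backend defaults survive instead of being
    # pinned by the CLI layer (the behaviour the test plan asserts).
    return {f.name: getattr(cfg, f.name)
            for f in fields(cfg) if getattr(cfg, f.name) is not None}
```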

Backends

  • ChatVLLM accepts temperature, top_p, top_k, seed and writes them through to SamplingParams (it used to hard-code temperature=0.6 / top_p=0.95); a new set_temperature method lets MT-Bench's category-aware temperature switching keep working without reaching into private attributes.
  • make_model extracts the cross-backend sampling fields from engine_kwargs and routes them to the provider-appropriate constructor argument: ChatOpenAI / OpenRouter / Together / LlamaCpp all see temperature / top_p / seed directly; top_k is tunneled through model_kwargs for OpenAI-compatible backends. Both this routing and the ChatVLLM wrapper are sketched after this list.
  • DummyModel records every kwarg the constructor sees on init_kwargs so tests can assert that per-role configs reach the model layer.
  • do_inference gains an optional out_metadata argument that collects per-call provider response metadata (system_fingerprint, model_name, etc.) when present.
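
A condensed sketch of the vLLM path and the provider routing. It assumes vLLM's SamplingParams and langchain-openai's ChatOpenAI; the real wrappers in the repo carry more state (model names, engine handles, retries):

```python
from vllm import SamplingParams          # vLLM's sampling container
from langchain_openai import ChatOpenAI  # assumed OpenAI-compatible wrapper

class ChatVLLM:
    """Sketch: sampling knobs become constructor arguments, not constants."""

    def __init__(self, temperature: float = 0.6, top_p: float = 0.95,
                 top_k: int = -1, seed: int | None = None):
        # top_k=-1 disables top-k sampling in vLLM.
        self._sampling = SamplingParams(
            temperature=temperature, top_p=top_p, top_k=top_k, seed=seed)

    def set_temperature(self, temperature: float) -> None:
        # Public hook for MT-Bench's category-aware switching, so callers
        # never have to reach into private attributes.
        self._sampling = SamplingParams(
            temperature=temperature, top_p=self._sampling.top_p,
            top_k=self._sampling.top_k, seed=self._sampling.seed)

def make_model(provider: str, **engine_kwargs):
    # Pull the cross-backend sampling fields out and re-route per provider.
    top_k = engine_kwargs.pop("top_k", None)
    if provider == "vllm":
        return ChatVLLM(top_k=-1 if top_k is None else top_k, **engine_kwargs)
    # OpenAI-compatible APIs take temperature/top_p/seed as first-class
    # constructor arguments but have no top-level top_k, so it is tunneled
    # through model_kwargs instead.
    model_kwargs = {} if top_k is None else {"top_k": top_k}
    return ChatOpenAI(model_kwargs=model_kwargs, **engine_kwargs)
```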

Call sites

generate_and_evaluate.main, estimate_elo_ratings.main, and mt_bench/mt_bench_utils._generate_mt_bench_completions now build their generators from args.gen_A / args.gen_B / args.gen_judge via gen_config_to_invoke_kwargs. MT-Bench keeps its category-aware temperature defaults but only when the user has not explicitly pinned a role-level temperature; otherwise the CLI override wins. The judge call in MT-Bench keeps its historical temperature=0.0 default but accepts overrides from --temperature_judge.
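
The precedence rule reads roughly like the sketch below; the category temperatures are illustrative values in the spirit of MT-Bench's defaults, and the exact table presumably lives near mt_bench_utils:

```python
# Illustrative subset of MT-Bench-style category defaults (not the repo's table).
CATEGORY_TEMPERATURE = {"writing": 0.7, "math": 0.0, "coding": 0.1}

def effective_temperature(category: str, cli_temperature: float | None) -> float:
    if cli_temperature is not None:  # an explicit --temperature_<role> wins
        return cli_temperature
    return CATEGORY_TEMPERATURE.get(category, 0.7)
```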

Backward compatibility & out of scope

The cache-key format and run-metadata.v1.json schema are intentionally unchanged in this PR; follow-up PRs in the reproducibility-hardening stack bump the schema, content-address the cache key off GenerationConfig, and record the resolved per-role configs in the run metadata.

Why

Today vLLM's sampling parameters are hard-coded (temperature=0.6, top_p=0.95) at the wrapper layer, hosted providers ignore the seed, and the same --max_out_tokens_models value is forced on both battle models even when you'd want to ablate them independently. This PR is the foundation for everything else: once a run is described by three explicit GenerationConfigs, the cache key, the run metadata, and the --rerun helper all have something concrete to hash and replay.

Test plan

  • uv run pytest -q — 83/83 pass on this branch in isolation; 117/117 pass when stacked with the rest of the reproducibility-hardening work.
  • tests/test_seed_plumbing.py (new) — --seed_A / --temperature_judge land on the correct GenerationConfig; DummyModel.init_kwargs captures the values flowing through make_model; gen_config_to_invoke_kwargs skips unset fields (see the sketch after this list).
  • tests/test_cli.py — new tests for per-role temperature/seed flags, deprecated --engine_kwargs fan-out behaviour, per-role override winning over the deprecation alias, and --max_out_tokens_models fanning out to A and B but not judge.
  • CI runs the full suite.
  • Manual smoke run against a vLLM model with --seed_A / --temperature_A 0.0 to confirm the seed reaches SamplingParams.
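
The seed-plumbing assertions have roughly the shape below; parse_args, make_model, and the "dummy" provider string follow the prose above rather than the repo's exact API:

```python
# Hypothetical shape of one test_seed_plumbing.py assertion.
def test_seed_a_reaches_model_layer():
    args = parse_args(["--seed_A", "1234", "--temperature_judge", "0.0"])
    assert args.gen_A.seed == 1234             # lands on the right role
    assert args.gen_B.seed is None             # and does not leak to B
    assert args.gen_judge.temperature == 0.0

    model = make_model("dummy", **gen_config_to_invoke_kwargs(args.gen_A))
    assert model.init_kwargs["seed"] == 1234   # DummyModel recorded the kwarg
```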
