Fix partial metric re-runs #59
Draft

fanny-riols wants to merge 4 commits into pr/fr/response_speed_decomposition
## Commits
- When re-running a subset of metrics (e.g. `--metrics response_speed --force-rerun-metrics`), the summary now aggregates `per_metric` for all metrics found across records rather than only the re-run ones. Also merges `metric_errors` and `pass_at_k_config` from the existing file so unrelated fields are not lost.
- Add a `strict_llm` param to `apply_env_overrides`; pass `strict_llm=False` when `--force-rerun-metrics` is set so metrics-only re-runs on runs whose LLM deployment is no longer in `EVA_MODEL_LIST` don't fail.
- Turns with missing audio timestamps store `None` in `per_turn_latency`; guard against this in `_compute_speed_stats` and the main latency loop. Also rename the section header to "Diagnostic & Validation Metrics".
- `from_existing_run` now loads the saved config using only `init_settings` (no env vars / `.env` file), preventing the saved model config from being contaminated by the current environment's pipeline mode vars. Also skip the pipeline mode conflict check in `_strip_other_mode_fields` when `--force-rerun-metrics` is set, as the model config is unused.
## What

Fixes four issues that made partial metric re-runs (e.g. `--metrics response_speed --force-rerun-metrics`) unreliable.

## Changes
### `metrics_summary.json` no longer overwritten on partial re-run

When re-running a subset of metrics, `per_metric` now covers all metrics found across records rather than only the ones being re-run. `metric_errors` and `pass_at_k_config` are merged from the existing file so unrelated fields are not lost.
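A minimal sketch of this merge strategy. The field names `per_metric`, `metric_errors`, and `pass_at_k_config` come from this PR; the helper name, the `records` shape, and the mean aggregation are illustrative assumptions, not the repo's actual code:

```python
import json
from pathlib import Path


def merge_metrics_summary(run_dir: Path, records: list[dict], new_summary: dict) -> dict:
    """Merge a partial re-run into the existing metrics_summary.json."""
    path = run_dir / "metrics_summary.json"
    existing = json.loads(path.read_text()) if path.exists() else {}

    # Unrelated top-level fields (pass_at_k_config, ...) survive from the
    # existing file; the re-run overwrites only the keys it produced.
    merged = {**existing, **new_summary}

    # per_metric is rebuilt from every metric found across records,
    # not only the metrics that were just re-run.
    per_metric: dict[str, list[float]] = {}
    for record in records:
        for name, value in record.get("metrics", {}).items():
            per_metric.setdefault(name, []).append(value)
    merged["per_metric"] = {
        name: sum(values) / len(values) for name, values in per_metric.items()
    }

    # metric_errors from both runs are kept: old entries for metrics that
    # were not re-run, fresh entries for those that were.
    merged["metric_errors"] = {
        **existing.get("metric_errors", {}),
        **new_summary.get("metric_errors", {}),
    }
    return merged
```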
### LLM deployment check skipped for metrics-only re-runs

`apply_env_overrides` gains a `strict_llm` param. Passing `strict_llm=False` when `--force-rerun-metrics` is set lets re-runs succeed on runs whose LLM deployment is no longer in `EVA_MODEL_LIST`: no simulation is needed, so the check is unnecessary.
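A sketch of how the new parameter could look. `apply_env_overrides`, `EVA_MODEL_LIST`, and the flag are from this PR; the function body, config shape, and error type are assumptions for illustration:

```python
import os


def apply_env_overrides(config: dict, strict_llm: bool = True) -> dict:
    """Apply EVA_* environment overrides to a loaded run config.

    strict_llm=False (used with --force-rerun-metrics) tolerates a saved
    LLM deployment that is no longer listed in EVA_MODEL_LIST, since a
    metrics-only re-run never calls the model.
    """
    allowed = {m for m in os.environ.get("EVA_MODEL_LIST", "").split(",") if m}
    deployment = config.get("llm_deployment")
    if strict_llm and deployment and deployment not in allowed:
        raise ValueError(f"LLM deployment {deployment!r} is not in EVA_MODEL_LIST")
    # ... remaining env overrides are unchanged by this PR ...
    return config
```

At the call site this would look something like `apply_env_overrides(config, strict_llm=not args.force_rerun_metrics)`, keeping the strict behaviour for normal runs.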
### `None` latency values handled gracefully in `response_speed`

Turns with missing audio timestamps store `None` in `per_turn_latency`. `response_speed` now skips them instead of crashing with a `TypeError`.
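A minimal sketch of the guard in `_compute_speed_stats`; the exact statistics returned are an assumption, the `None`-skipping is the fix described here:

```python
from statistics import mean, median


def _compute_speed_stats(per_turn_latency: list[float | None]) -> dict | None:
    """Summarize per-turn latencies, skipping turns whose audio
    timestamps were missing (stored as None)."""
    valid = [latency for latency in per_turn_latency if latency is not None]
    if not valid:  # every turn lacked timestamps
        return None
    return {
        "mean_latency": mean(valid),
        "median_latency": median(valid),
        "num_turns": len(valid),
        "num_skipped": len(per_turn_latency) - len(valid),
    }
```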
### RunConfig loading no longer conflicts with current env pipeline mode

`from_existing_run` used `model_validate_json`, which in pydantic-settings v2 merges env vars and `.env` on top of the saved JSON, causing a conflict if e.g. `EVA_MODEL__LLM` is set in the environment but the saved run used S2S. Fixed by loading with a local `settings_customise_sources` override that reads only from init kwargs. Also skips the pipeline mode conflict check in `_strip_other_mode_fields` when `--force-rerun-metrics` is set.
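The `settings_customise_sources` hook is real pydantic-settings v2 API; the model fields and loader below are a simplified sketch of the pattern described above, not the project's actual `RunConfig`:

```python
import json

from pydantic_settings import BaseSettings, PydanticBaseSettingsSource


class RunConfig(BaseSettings):
    """Stand-in for the project's real run config model."""

    pipeline_mode: str = "s2s"


def from_existing_run(config_path: str) -> RunConfig:
    """Rebuild the saved RunConfig without letting current env vars or a
    .env file override the values stored in the run's JSON."""

    class _InitOnly(RunConfig):
        @classmethod
        def settings_customise_sources(
            cls,
            settings_cls,
            init_settings: PydanticBaseSettingsSource,
            env_settings: PydanticBaseSettingsSource,
            dotenv_settings: PydanticBaseSettingsSource,
            file_secret_settings: PydanticBaseSettingsSource,
        ) -> tuple[PydanticBaseSettingsSource, ...]:
            # Only init kwargs; env and dotenv sources are dropped on purpose.
            return (init_settings,)

    with open(config_path) as f:
        return _InitOnly(**json.load(f))
```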