
Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls#57

Open
fanny-riols wants to merge 9 commits into main from pr/fr/response_speed_decomposition

Conversation


@fanny-riols fanny-riols commented Apr 14, 2026

Summary

Adds a with/without tool call breakdown to the response_speed metric so latency can be compared between turns that required a tool call and turns that didn't.

Instead of separate registered metrics, the breakdown is computed as sub-fields within response_speed:

  • Latency data for the breakdown comes from turn_taking per_turn_latency (keyed by turn_id), rather than the Pipecat UserBotLatencyObserver data used previously.
  • conversation_trace is checked to classify each turn as with/without tool calls.
  • The response_speed details dict gains two optional sub-dicts: with_tool_calls and no_tool_calls, each with mean_speed_seconds, max_speed_seconds, num_turns, and per_turn_speeds.
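The classification and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the shapes of `per_turn_latency` (turn_id → seconds) and `conversation_trace` entries (a `type` field and a `turn_id`) are assumptions based on this description.

```python
# Illustrative sketch only: data shapes for per_turn_latency and
# conversation_trace are assumed, not taken from the actual codebase.
from statistics import mean

def split_response_speeds(per_turn_latency: dict, conversation_trace: list) -> dict:
    """Split per-turn latencies by whether the turn contained a tool call."""
    # Collect the turn_ids that produced at least one tool call.
    tool_call_turns = {
        e["turn_id"] for e in conversation_trace if e.get("type") == "tool_call"
    }
    buckets = {"with_tool_calls": [], "no_tool_calls": []}
    for turn_id, latency in per_turn_latency.items():
        key = "with_tool_calls" if turn_id in tool_call_turns else "no_tool_calls"
        buckets[key].append(latency)
    return {
        key: {
            "mean_speed_seconds": mean(vals),
            "max_speed_seconds": max(vals),
            "num_turns": len(vals),
            "per_turn_speeds": vals,
        }
        for key, vals in buckets.items()
        if vals  # sub-dicts are optional: omitted when no turns qualify
    }
```

Because latencies are keyed by turn_id, no positional alignment between the latency list and the trace is needed.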

Example results across 150 records:

|                            | Mean    | Min    | Max     |
|----------------------------|---------|--------|---------|
| response_speed (all turns) | 10.48 s | 4.57 s | 14.59 s |
| with_tool_calls            | 11.67 s | 5.32 s | 16.37 s |
| no_tool_calls              | 8.51 s  | 3.40 s | 12.38 s |

Tool-call turns were ~3.2 s slower on average in this example.

Metrics summary

metrics_summary.json: per_metric.response_speed now includes nested with_tool_calls and no_tool_calls aggregate entries (mean/min/max/count).
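For illustration, the nested shape might look like the fragment below. The mean/min/max values are the example numbers from the table above; the per-bucket counts and exact field names are illustrative, not taken from an actual metrics_summary.json.

```json
{
  "per_metric": {
    "response_speed": {
      "mean": 10.48,
      "min": 4.57,
      "max": 14.59,
      "count": 150,
      "with_tool_calls": {"mean": 11.67, "min": 5.32, "max": 16.37, "count": 92},
      "no_tool_calls": {"mean": 8.51, "min": 3.40, "max": 12.38, "count": 58}
    }
  }
}
```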

Analysis app

  • response_speed_with_tool_calls and response_speed_no_tool_calls appear as columns in the Diagnostic table, next to response_speed, in both the single-run and cross-run views.
  • Both are treated as non-normalized (rendered in seconds).

Also included

  • apply_env_overrides: deployments with redacted secrets that aren't in the current EVA_MODEL_LIST now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
  • _build_history: added _resolve_path() so pipecat_logs.jsonl / elevenlabs_events.jsonl fall back to output_dir/<filename> when the stored path no longer exists — fixes metric reruns after a run directory is moved.

…etrics

Splits the existing response_speed diagnostic metric into two filtered
variants based on whether the assistant made a tool call in the turn.
Parses conversation_trace to map each latency to its turn and checks
for tool_call entries on that turn_id.

Shared logic (sanity filtering, mean/max, MetricScore construction) is
extracted into a _ResponseSpeedBase class; each variant only implements
_get_latencies(). Bumps metrics_version to 0.1.2.
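The base-class split might look roughly like this. The `_ResponseSpeedBase` and `_get_latencies()` names come from the commit message; everything else (the sanity threshold, the context shape, the returned fields) is an assumed sketch, not the project's code.

```python
# Hedged sketch of the shared-base pattern; the sanity bound and context
# access are illustrative assumptions.
from statistics import mean

SANITY_MAX_SECONDS = 60.0  # assumed sanity threshold; the real bound may differ

class _ResponseSpeedBase:
    """Shared logic: sanity filtering, mean/max aggregation."""

    def _get_latencies(self, context) -> list:
        raise NotImplementedError  # each variant selects its own latencies

    def compute(self, context) -> dict:
        # Drop non-positive or implausibly large latencies before aggregating.
        latencies = [
            l for l in self._get_latencies(context) if 0.0 < l < SANITY_MAX_SECONDS
        ]
        if not latencies:
            return {}
        return {
            "mean_speed_seconds": mean(latencies),
            "max_speed_seconds": max(latencies),
            "num_turns": len(latencies),
        }

class ResponseSpeedWithToolCalls(_ResponseSpeedBase):
    def _get_latencies(self, context):
        return context["with_tool_calls"]

class ResponseSpeedNoToolCalls(_ResponseSpeedBase):
    def _get_latencies(self, context):
        return context["no_tool_calls"]
```

Keeping filtering and aggregation in the base class means each variant is a one-method subclass.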
…DEL_LIST

When restoring redacted secrets in apply_env_overrides, skip deployments
that are not present in the current environment's EVA_MODEL_LIST rather
than raising a ValueError. Only raise if the missing deployment is the
active LLM for this run. This allows metrics-only reruns in environments
that don't have every deployment from the original run configured.
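The warn-and-skip behavior can be sketched as below. The function name and parameters (`restore_redacted_secret`, `eva_model_list`, `active_llm`) are hypothetical stand-ins for illustration; only the decision logic follows the commit message.

```python
# Illustrative sketch; names are hypothetical, only the skip/raise
# decision mirrors the behavior described in the commit message.
import warnings

def restore_redacted_secret(deployment: str, eva_model_list: set,
                            active_llm: str) -> bool:
    """Return True if the deployment's redacted secret should be restored."""
    if deployment in eva_model_list:
        return True
    if deployment == active_llm:
        # The run cannot proceed without its active LLM configured.
        raise ValueError(
            f"Active LLM deployment {deployment!r} is missing from EVA_MODEL_LIST"
        )
    # Non-active deployments are skipped with a warning so that
    # metrics-only reruns still work.
    warnings.warn(f"Skipping deployment {deployment!r}: not in EVA_MODEL_LIST")
    return False
```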
Adds _resolve_path() helper that returns the stored path if it exists on
disk, otherwise falls back to output_dir/<filename>. Used in _build_history
for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns
work correctly when a run directory has been moved from its original location.
…in analysis app

Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as
standalone seconds bar charts alongside response_speed. Category grouping,
color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics

The filtered variants now read metrics/turn_taking/details/per_turn_latency
from the record's metrics.json instead of using context.response_speed_latencies.
This gives a direct turn_id → latency mapping, avoiding the index-based
alignment that was previously needed to correlate latencies with tool calls.

The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
…NoToolCallsMetric

Tests cover: missing output_dir, missing metrics.json, missing turn_taking
data, no tool-call turns, all tool-call turns, mixed turns (correct split),
invalid latency filtering, and an exhaustiveness check that with_tool +
no_tool latencies together equal the full per_turn_latency set.
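The exhaustiveness property the tests check can be stated as a small predicate. This is a sketch of the invariant only; the helper name is hypothetical and the real tests are in the repository.

```python
# Hypothetical helper expressing the invariant the tests assert: the
# with-tool and no-tool buckets together must equal the full latency set.
def is_exhaustive(per_turn_latency: dict, with_tool: list, no_tool: list) -> bool:
    """True when the two buckets partition per_turn_latency's values."""
    return sorted(with_tool + no_tool) == sorted(per_turn_latency.values())
```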
@fanny-riols fanny-riols marked this pull request as ready for review April 15, 2026 13:08
@fanny-riols fanny-riols requested a review from gabegma April 15, 2026 13:53
