
Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls#57

Open
fanny-riols wants to merge 9 commits into main from pr/fr/response_speed_decomposition

Conversation


@fanny-riols fanny-riols commented Apr 14, 2026

Summary

Adds a with/without tool call breakdown to the response_speed metric so latency can be compared between turns that required a tool call and turns that didn't.

Instead of separate registered metrics, the breakdown is computed as sub-fields within response_speed:

  • Latency data for the breakdown comes from turn_taking per_turn_latency (keyed by turn_id), rather than the Pipecat UserBotLatencyObserver data used previously.
  • conversation_trace is checked to classify each turn as with/without tool calls.
  • The response_speed details dict gains two optional sub-dicts: with_tool_calls and no_tool_calls, each with mean_speed_seconds, max_speed_seconds, num_turns, and per_turn_speeds.
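The classification and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the shapes of `per_turn_latency` (turn_id → seconds) and `conversation_trace` entries (a `type` field and a `turn_id`) are assumptions based on this description.

```python
# Illustrative sketch only: data shapes for per_turn_latency and
# conversation_trace are assumed, not taken from the actual codebase.
from statistics import mean

def split_response_speeds(per_turn_latency: dict, conversation_trace: list) -> dict:
    """Split per-turn latencies by whether the turn contained a tool call."""
    # Collect the turn_ids that produced at least one tool call.
    tool_call_turns = {
        e["turn_id"] for e in conversation_trace if e.get("type") == "tool_call"
    }
    buckets = {"with_tool_calls": [], "no_tool_calls": []}
    for turn_id, latency in per_turn_latency.items():
        key = "with_tool_calls" if turn_id in tool_call_turns else "no_tool_calls"
        buckets[key].append(latency)
    return {
        key: {
            "mean_speed_seconds": mean(vals),
            "max_speed_seconds": max(vals),
            "num_turns": len(vals),
            "per_turn_speeds": vals,
        }
        for key, vals in buckets.items()
        if vals  # sub-dicts are optional: omitted when no turns qualify
    }
```

Because latencies are keyed by turn_id, no positional alignment between the latency list and the trace is needed.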

Example results across 150 records:

|                            | Mean    | Min    | Max     |
|----------------------------|---------|--------|---------|
| response_speed (all turns) | 10.48 s | 4.57 s | 14.59 s |
| with_tool_calls            | 11.67 s | 5.32 s | 16.37 s |
| no_tool_calls              | 8.51 s  | 3.40 s | 12.38 s |

Tool-call turns were ~3.2 s slower on average in this example.

Metrics summary

metrics_summary.json: per_metric.response_speed now includes nested with_tool_calls and no_tool_calls aggregate entries (mean/min/max/count).
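For illustration, the nested shape might look like the fragment below. The mean/min/max values are the example numbers from the table above; the per-bucket counts and exact field names are illustrative, not taken from an actual metrics_summary.json.

```json
{
  "per_metric": {
    "response_speed": {
      "mean": 10.48,
      "min": 4.57,
      "max": 14.59,
      "count": 150,
      "with_tool_calls": {"mean": 11.67, "min": 5.32, "max": 16.37, "count": 92},
      "no_tool_calls": {"mean": 8.51, "min": 3.40, "max": 12.38, "count": 58}
    }
  }
}
```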

Analysis app

  • response_speed_with_tool_calls and response_speed_no_tool_calls appear as columns in the Diagnostic table, next to response_speed, in both the single-run and cross-run views.
  • Both are treated as non-normalized (rendered in seconds).

Also included

  • apply_env_overrides: deployments with redacted secrets that aren't in the current EVA_MODEL_LIST now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
  • _build_history: added _resolve_path() so pipecat_logs.jsonl / elevenlabs_events.jsonl fall back to output_dir/<filename> when the stored path no longer exists — fixes metric reruns after a run directory is moved.

…etrics

Splits the existing response_speed diagnostic metric into two filtered
variants based on whether the assistant made a tool call in the turn.
Parses conversation_trace to map each latency to its turn and checks
for tool_call entries on that turn_id.

Shared logic (sanity filtering, mean/max, MetricScore construction) is
extracted into a _ResponseSpeedBase class; each variant only implements
_get_latencies(). Bumps metrics_version to 0.1.2.
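The base-class split might look roughly like this. The `_ResponseSpeedBase` and `_get_latencies()` names come from the commit message; everything else (the sanity threshold, the context shape, the returned fields) is an assumed sketch, not the project's code.

```python
# Hedged sketch of the shared-base pattern; the sanity bound and context
# access are illustrative assumptions.
from statistics import mean

SANITY_MAX_SECONDS = 60.0  # assumed sanity threshold; the real bound may differ

class _ResponseSpeedBase:
    """Shared logic: sanity filtering, mean/max aggregation."""

    def _get_latencies(self, context) -> list:
        raise NotImplementedError  # each variant selects its own latencies

    def compute(self, context) -> dict:
        # Drop non-positive or implausibly large latencies before aggregating.
        latencies = [
            l for l in self._get_latencies(context) if 0.0 < l < SANITY_MAX_SECONDS
        ]
        if not latencies:
            return {}
        return {
            "mean_speed_seconds": mean(latencies),
            "max_speed_seconds": max(latencies),
            "num_turns": len(latencies),
        }

class ResponseSpeedWithToolCalls(_ResponseSpeedBase):
    def _get_latencies(self, context):
        return context["with_tool_calls"]

class ResponseSpeedNoToolCalls(_ResponseSpeedBase):
    def _get_latencies(self, context):
        return context["no_tool_calls"]
```

Keeping filtering and aggregation in the base class means each variant is a one-method subclass.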
…DEL_LIST

When restoring redacted secrets in apply_env_overrides, skip deployments
that are not present in the current environment's EVA_MODEL_LIST rather
than raising a ValueError. Only raise if the missing deployment is the
active LLM for this run. This allows metrics-only reruns in environments
that don't have every deployment from the original run configured.
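The warn-and-skip behavior can be sketched as below. The function name and parameters (`restore_redacted_secret`, `eva_model_list`, `active_llm`) are hypothetical stand-ins for illustration; only the decision logic follows the commit message.

```python
# Illustrative sketch; names are hypothetical, only the skip/raise
# decision mirrors the behavior described in the commit message.
import warnings

def restore_redacted_secret(deployment: str, eva_model_list: set,
                            active_llm: str) -> bool:
    """Return True if the deployment's redacted secret should be restored."""
    if deployment in eva_model_list:
        return True
    if deployment == active_llm:
        # The run cannot proceed without its active LLM configured.
        raise ValueError(
            f"Active LLM deployment {deployment!r} is missing from EVA_MODEL_LIST"
        )
    # Non-active deployments are skipped with a warning so that
    # metrics-only reruns still work.
    warnings.warn(f"Skipping deployment {deployment!r}: not in EVA_MODEL_LIST")
    return False
```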
Adds _resolve_path() helper that returns the stored path if it exists on
disk, otherwise falls back to output_dir/<filename>. Used in _build_history
for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns
work correctly when a run directory has been moved from its original location.
…in analysis app

Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as
standalone seconds bar charts alongside response_speed. Category grouping,
color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics

The filtered variants now read metrics/turn_taking/details/per_turn_latency
from the record's metrics.json instead of using context.response_speed_latencies.
This gives a direct turn_id → latency mapping, avoiding the index-based
alignment that was previously needed to correlate latencies with tool calls.

The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
…NoToolCallsMetric

Tests cover: missing output_dir, missing metrics.json, missing turn_taking
data, no tool-call turns, all tool-call turns, mixed turns (correct split),
invalid latency filtering, and an exhaustiveness check that with_tool +
no_tool latencies together equal the full per_turn_latency set.
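The exhaustiveness property the tests check can be stated as a small predicate. This is a sketch of the invariant only; the helper name is hypothetical and the real tests are in the repository.

```python
# Hypothetical helper expressing the invariant the tests assert: the
# with-tool and no-tool buckets together must equal the full latency set.
def is_exhaustive(per_turn_latency: dict, with_tool: list, no_tool: list) -> bool:
    """True when the two buckets partition per_turn_latency's values."""
    return sorted(with_tool + no_tool) == sorted(per_turn_latency.values())
```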
@fanny-riols fanny-riols marked this pull request as ready for review April 15, 2026 13:08
@fanny-riols fanny-riols requested a review from gabegma April 15, 2026 13:53
