fix(evaluators): exclude NaN from aggregate score in FaithfulnessEvaluator and ContextRelevanceEvaluator by Aftabbs · Pull Request #11510 · deepset-ai/haystack

Aftabbs · 2026-06-04T10:42:20Z

Summary

When raise_on_failure=False and an LLM call fails, FaithfulnessEvaluator and ContextRelevanceEvaluator each mark that query's score as float("nan"). Both evaluators then compute the aggregate score over the full list, NaN entries included: FaithfulnessEvaluator with np_mean and ContextRelevanceEvaluator with statistics.mean. Because NaN propagates through both functions, a single failed query silently turns the entire batch score into NaN, with no warning.

Fixes #11383

Root Cause

faithfulness.py line 179 (and the equivalent line in context_relevance.py line 185):

result["score"] = np_mean([res["score"] for res in result["results"]])
# → nan if any element is nan — no warning, no explanation
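
The propagation can be reproduced without the evaluators. The snippet below is a minimal sketch using only the standard library; the `scores` list is made-up illustration data, not output from either evaluator.

```python
import math
import statistics

# Illustrative per-query scores; the middle query stands in for a failed LLM call.
scores = [1.0, float("nan"), 1.0]

# statistics.mean does not skip NaN, so the aggregate itself becomes NaN.
aggregate = statistics.mean(scores)
print(math.isnan(aggregate))

# NaN never compares equal, even to itself, so an equality check on the
# aggregate cannot detect the problem; math.isnan is required.
print(aggregate == aggregate)
```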

Changes

| File | What changed |
| --- | --- |
| `haystack/components/evaluators/faithfulness.py` | Filter NaN scores before `np_mean`; log a warning with the count of excluded queries; import `math` and `logging` |
| `haystack/components/evaluators/context_relevance.py` | Same fix using `statistics.mean`; import `math` and `logging` |
| `test/components/evaluators/test_faithfulness_evaluator.py` | Update existing partial-failure test to assert `score == 1.0` (valid scores only); add `test_run_all_failed_returns_nan_score` for the all-fail edge case |
| `test/components/evaluators/test_context_relevance_evaluator.py` | Same test updates |
| `releasenotes/notes/fix-evaluator-nan-aggregate-score-*.yaml` | Release note |

Behaviour after this fix

| Scenario | `score` before | `score` after |
| --- | --- | --- |
| 1 of 3 queries fail | `nan` | `mean([s1, s3])` (correct average over the valid scores) |
| All queries fail | `nan` | `nan` (unchanged) |
| No failures | unchanged | unchanged |

individual_scores and results[i]["score"] still preserve NaN for failed queries so callers can inspect per-query status — this is intentional and compatible with the evaluation_statuses output added by PR #11333.
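
The aggregation behaviour described in this section can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `aggregate_score` is a hypothetical helper name, the warning text is invented, and the sketch uses `statistics.mean` for both cases rather than `np_mean`.

```python
import logging
import math
from statistics import mean

logger = logging.getLogger(__name__)


def aggregate_score(scores: list[float]) -> float:
    """Hypothetical helper: mean of the non-NaN scores, or NaN if none are valid."""
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    if skipped:
        # Surface the exclusion instead of silently changing the denominator.
        logger.warning(
            "Excluded %d of %d queries with NaN scores from the aggregate score.",
            skipped,
            len(scores),
        )
    # When every query failed there is nothing to average, so NaN is kept.
    return mean(valid) if valid else float("nan")


nan = float("nan")
partial = aggregate_score([1.0, nan, 0.5])   # one of three queries failed
all_failed = aggregate_score([nan, nan])     # every query failed
no_failures = aggregate_score([0.0, 1.0])    # nothing excluded
```

The per-query list is left untouched in this sketch, so a caller can still find the failed entries with `math.isnan`.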

Testing

- Existing tests updated to reflect correct new behaviour
- New test: test_run_all_failed_returns_nan_score, which verifies score stays NaN when all queries fail
- Manually verified the reproduction case from the issue returns the correct aggregate score

Impact

Users running batch evaluations where some LLM calls fail intermittently no longer get a silent nan score that breaks downstream dashboards or comparisons.

…uator and ContextRelevanceEvaluator

When raise_on_failure=False and one or more LLM calls fail, np_mean/mean over a list containing NaN silently returns NaN for the aggregate score. This means a single failed query poisons the whole batch result with no warning, breaking any downstream code that compares or reports the score.

Filter out NaN entries before computing the mean so failed queries are excluded from the aggregate. Log a warning with the count of skipped queries. If all queries fail the aggregate remains NaN (unchanged). Individual scores in individual_scores and results are preserved as NaN for per-query transparency.

Fixes deepset-ai#11383

vercel · 2026-06-04T10:42:27Z

@Aftabbs is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

davidsbatista · 2026-06-04T11:01:35Z

Can you please use the template we provide for opening PRs?

Aftabbs requested a review from a team as a code owner June 4, 2026 10:42

Aftabbs requested review from davidsbatista and removed request for a team June 4, 2026 10:42

github-actions Bot added the topic:tests label Jun 4, 2026

Aftabbs wants to merge 1 commit into deepset-ai:main from Aftabbs:fix/evaluator-nan-aggregate-score