fix(evaluators): exclude NaN from aggregate score in FaithfulnessEvaluator and ContextRelevanceEvaluator#11510

Open
Aftabbs wants to merge 1 commit into
deepset-ai:mainfrom
Aftabbs:fix/evaluator-nan-aggregate-score

Conversation

@Aftabbs
Contributor

@Aftabbs Aftabbs commented Jun 4, 2026

Summary

When raise_on_failure=False and an LLM call fails, FaithfulnessEvaluator and ContextRelevanceEvaluator each mark that query's score as float("nan"). Both evaluators then compute the aggregate score with np_mean / statistics.mean over the full list, including the NaN entries. Because NaN propagates through both functions, a single failed query silently turns the entire batch score into NaN, with no warning.

Fixes #11383

Root Cause

faithfulness.py, line 179 (and the equivalent line 185 in context_relevance.py):

result["score"] = np_mean([res["score"] for res in result["results"]])
# → nan if any element is nan — no warning, no explanation
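
The propagation described above can be reproduced with the standard library alone. This is a minimal sketch with made-up scores, not code from the evaluators. It uses statistics.mean, the function named for context_relevance.py; NumPy's mean propagates NaN in the same way.

```python
import math
import statistics

# Hypothetical per-query scores: the second query's LLM call failed
# and was recorded as NaN.
scores = [1.0, float("nan"), 0.5]

# One NaN in the input is enough to make the whole mean NaN.
aggregate = statistics.mean(scores)
print(math.isnan(aggregate))  # True

# NaN never compares equal to anything, itself included, so downstream
# checks such as `score == expected` or `score > threshold` are always False.
print(aggregate == aggregate)  # False
```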

Changes

| File | What changed |
| --- | --- |
| haystack/components/evaluators/faithfulness.py | Filter NaN scores before np_mean; log a warning with the count of excluded queries; import math and logging |
| haystack/components/evaluators/context_relevance.py | Same fix using statistics.mean; import math and logging |
| test/components/evaluators/test_faithfulness_evaluator.py | Update existing partial-failure test to assert score == 1.0 (valid only); add test_run_all_failed_returns_nan_score for the all-fail edge case |
| test/components/evaluators/test_context_relevance_evaluator.py | Same test updates |
| releasenotes/notes/fix-evaluator-nan-aggregate-score-*.yaml | Release note |

Behaviour after this fix

| Scenario | score before | score after |
| --- | --- | --- |
| 1 of 3 queries fail | nan | mean([s1, s3]), the correct average |
| All queries fail | nan | nan (unchanged) |
| No failures | unchanged | unchanged |
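
The three scenarios can be sketched as a standalone function. This is an illustration under assumptions, not the actual diff: aggregate_score is a hypothetical helper name, the warning text is invented, and the standard-library logging module stands in for whatever logger the evaluators use.

```python
import logging
import math
import statistics

logger = logging.getLogger(__name__)


def aggregate_score(scores: list[float]) -> float:
    """Mean of the valid scores; NaN only when no valid score exists."""
    # Failed queries are recorded as NaN, so drop them before averaging.
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    if skipped:
        # Hypothetical message: report how many queries were excluded.
        logger.warning(
            "Excluded %d of %d queries with NaN scores from the aggregate.",
            skipped,
            len(scores),
        )
    if not valid:
        # Every query failed, so the aggregate stays NaN as before.
        return float("nan")
    return statistics.mean(valid)


print(aggregate_score([1.0, float("nan"), 0.5]))  # 0.75
print(aggregate_score([1.0, 1.0]))                # 1.0
```

Returning NaN in the all-fail case keeps the pre-existing behaviour for callers that already check for it, while the partial-failure case now yields a usable number.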

individual_scores and results[i]["score"] still preserve NaN for failed queries so callers can inspect per-query status. This is intentional and compatible with the evaluation_statuses output added by PR #11333.
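
Because the per-query values are preserved, a caller can still locate the failed queries after the aggregate has been computed. A minimal sketch, assuming a hypothetical result dictionary that contains only the score and individual_scores keys mentioned in this PR, with made-up values:

```python
import math

# Hypothetical evaluator output after the fix: the aggregate ignores the
# failed query, but its NaN is still visible in individual_scores.
result = {
    "score": 0.75,
    "individual_scores": [1.0, float("nan"), 0.5],
}

# Indices of queries whose LLM call failed.
failed = [i for i, s in enumerate(result["individual_scores"]) if math.isnan(s)]
print(failed)  # [1]
```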

Testing

  • Existing tests updated to reflect correct new behaviour
  • New test: test_run_all_failed_returns_nan_score — verifies score stays NaN when all queries fail
  • Manually verified the reproduction case from the issue returns the correct aggregate score

Impact

Users running batch evaluations where some LLM calls fail intermittently no longer get a silent nan score that breaks downstream dashboards or comparisons.

…uator and ContextRelevanceEvaluator

When raise_on_failure=False and one or more LLM calls fail, np_mean/mean
over a list containing NaN silently returns NaN for the aggregate score.
This means a single failed query poisons the whole batch result with no
warning, breaking any downstream code that compares or reports the score.

Filter out NaN entries before computing the mean so failed queries are
excluded from the aggregate. Log a warning with the count of skipped
queries. If all queries fail the aggregate remains NaN (unchanged).
Individual scores in individual_scores and results are preserved as NaN
for per-query transparency.

Fixes deepset-ai#11383
@Aftabbs Aftabbs requested a review from a team as a code owner June 4, 2026 10:42
@Aftabbs Aftabbs requested review from davidsbatista and removed request for a team June 4, 2026 10:42
@vercel

vercel Bot commented Jun 4, 2026

@Aftabbs is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@davidsbatista
Contributor

Can you please use the template we provide for opening PRs?

Development

Successfully merging this pull request may close these issues.

bug: FaithfulnessEvaluator / ContextRelevanceEvaluator silently return NaN when an LLM call fails

2 participants