
Which evaluation models are best suited for minimizing score variability in contextual relevance assessments? #2302

@pwnr00t

Description


Problem Description

When using different evaluation models (e.g., Qwen3-32B, Qwen-plus, Qwen3-max) with the same test cases and dataset, the contextual relevance scores calculated by DeepEval (as relevant_statements / total_verdicts) vary significantly. For example:

Model A returns a score of 0.85, while Model B returns 0.62 for identical inputs.
Debugging shows that the discrepancy arises from differences in how models judge relevance (i.e., "yes"/"no" verdicts in self.verdicts_list).

Current Setup

DeepEval Version: 3.6.9
Evaluation Models Tested: Qwen3-32B, Qwen-plus, Qwen3-max
Key Metric Used: ContextualRelevancyMetric
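
For reference, a minimal sketch of how I ran the comparison (simplified from my actual harness; `QwenEvalModel` is a placeholder for a custom `DeepEvalBaseLLM` wrapper around the Qwen endpoints, and the test case content is illustrative):

```python
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Common side effects include stomach upset and heartburn.",
    retrieval_context=[
        "Aspirin can cause gastrointestinal irritation, stomach upset, and heartburn.",
        "Aspirin was first synthesized in 1897.",
    ],
)

# Same test case, three different evaluator models.
for name in ["Qwen3-32B", "Qwen-plus", "Qwen3-max"]:
    metric = ContextualRelevancyMetric(
        model=QwenEvalModel(name),  # placeholder: custom DeepEvalBaseLLM wrapper
        include_reason=True,
    )
    metric.measure(test_case)
    # Scores differ by 0.2+ across models because each one returns different
    # "yes"/"no" relevance verdicts for the same context statements.
    print(name, metric.score)
```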

Could you please advise on:

Which evaluation models are best suited for minimizing score variability in contextual relevance assessments?
Whether DeepEval has pre-configured models or best practices for this specific use case (e.g., using GPT-4 as the "gold standard" evaluator)?
