Problem Description
When using different evaluation models (e.g., Qwen3-32B, Qwen-plus, Qwen3-max) on the same test cases and dataset, the contextual relevancy scores calculated by DeepEval (via relevant_statements / total_verdicts) vary significantly. For example:
Model A returns a score of 0.85, while Model B returns 0.62 for identical inputs.
Debugging shows that the discrepancy arises from differences in how the models judge relevance (i.e., the "yes"/"no" verdicts in self.verdicts_list).
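Because the score is derived directly from those per-statement verdicts, even a few flipped judgments shift the final number noticeably. A minimal sketch of that calculation (the verdict lists below are illustrative, not taken from an actual run):

```python
# Illustrative only: each evaluator model emits one "yes"/"no" verdict per
# retrieved statement, and the score is relevant verdicts / total verdicts.
verdicts_model_a = ["yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "yes"]
verdicts_model_b = ["yes", "no", "yes", "no", "yes", "no", "yes", "yes", "no", "yes"]

def contextual_relevancy_score(verdicts: list[str]) -> float:
    relevant = sum(1 for v in verdicts if v.lower() == "yes")
    return relevant / len(verdicts)

print(contextual_relevancy_score(verdicts_model_a))  # 0.8
print(contextual_relevancy_score(verdicts_model_b))  # 0.6
```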
Current Setup
DeepEval Version: 3.6.9
Evaluation Models Tested: [Qwen3-32B, Qwen-plus, Qwen3-max]
Key Metric Used: ContextualRelevancyMetric
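For reference, this is roughly how the metric is invoked. The test-case contents are placeholders, and each Qwen model is wrapped as a custom evaluator following the DeepEvalBaseLLM pattern from DeepEval's custom-model docs; the OpenAI-compatible endpoint URL and API-key environment variable below are assumptions about my setup:

```python
import os

from openai import OpenAI
from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase


class QwenEvaluator(DeepEvalBaseLLM):
    """Sketch of a wrapper exposing a Qwen model via an OpenAI-compatible API."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        # Assumed: DashScope's OpenAI-compatible endpoint and a DASHSCOPE_API_KEY env var.
        self.client = OpenAI(
            base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
            api_key=os.environ["DASHSCOPE_API_KEY"],
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def a_generate(self, prompt: str) -> str:
        # Synchronous fallback for brevity; a real wrapper would use AsyncOpenAI.
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


metric = ContextualRelevancyMetric(
    threshold=0.7,
    model=QwenEvaluator("qwen-plus"),  # swapped for Qwen3-32B / Qwen3-max between runs
    include_reason=True,
)

test_case = LLMTestCase(
    input="What is the capital of France?",             # placeholder
    actual_output="The capital of France is Paris.",    # placeholder
    retrieval_context=["Paris is the capital and largest city of France."],
)

evaluate(test_cases=[test_case], metrics=[metric])
```

Only the evaluator model changes between runs; the test cases and dataset stay identical.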
Could you please recommend:
Which evaluation models are best suited for minimizing score variability in contextual relevance assessments?
Whether DeepEval has pre-configured models or best practices for this specific use case (e.g., using GPT-4 as the "gold standard" evaluator)?