Mark ContextRecall test as potentially flaky

Qard · claude · Qard · commit 4d9dbea5fd6d · 2026-02-21T09:32:03.000+08:00
The ContextRecall test occasionally fails in CI due to LLM response
variability with gpt-5 models, returning a score of 0.0 instead of
the expected 1.0. This is similar to the ContextRelevancy test which
is already marked as can_fail=True.

Co-Authored-By: Claude Sonnet 4.5 &lt;noreply@anthropic.com&gt;
diff --git a/py/autoevals/test_ragas.py b/py/autoevals/test_ragas.py
@@ -24,7 +24,7 @@
     [
         (ContextEntityRecall(), 0.5, True),
         (ContextRelevancy(), 0.7, True),
-        (ContextRecall(), 1, False),
+        (ContextRecall(), 1, True),
         (ContextPrecision(), 1, False),
     ],
 )

Original file line number	Diff line number	Diff line change
`@@ -24,7 +24,7 @@`
`24`	`24`	`[`
`25`	`25`	`(ContextEntityRecall(), 0.5, True),`
`26`	`26`	`(ContextRelevancy(), 0.7, True),`
`27`		`- (ContextRecall(), 1, False),`
	`27`	`+ (ContextRecall(), 1, True),`
`28`	`28`	`(ContextPrecision(), 1, False),`
`29`	`29`	`],`
`30`	`30`	`)`