test: stabilize two LLM-flake tests surfaced after PR #1469 #1774

Merged
nicoloboschi merged 2 commits into main from fix/stabilize-llm-flaky-tests
May 27, 2026
@nicoloboschi
Collaborator

Summary

Follow-up to #1469. Two LLM-dependent tests went red on that PR's CI run; both look like the bucket split exposing pre-existing flake rather than regressions. This PR addresses the root causes rather than just adding reruns.

1. test_high_skepticism_response_is_more_hedged_than_low (hs_llm_core)

The stored claim was "Sam is *supposedly* the most productive engineer ...". The built-in hedge ("supposedly") primes both low- and high-skepticism reflects to echo it — shrinking the gap the judge has to detect. With ≤1 word of variance the judge sometimes calls them equivalent.

Fix: rephrase the claim as a direct assertion ("Sam is the most productive engineer on the team by a wide margin."). High-skepticism now has room to hedge; low-skepticism states it directly. Existing @pytest.mark.flaky(reruns=2) retained.
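
The priming effect described above can be illustrated with a crude word-level proxy. This is only a sketch: the real test relies on an LLM judge, not word counts, and the `HEDGE_WORDS` list and `count_hedges` helper are assumptions made up for this illustration. The old claim is kept truncated exactly as quoted.

```python
import re

# Illustrative hedge list; an assumption for this sketch, not the project's judge.
HEDGE_WORDS = {"supposedly", "allegedly", "reportedly", "apparently", "perhaps", "possibly"}


def count_hedges(text: str) -> int:
    """Count words in `text` that appear in HEDGE_WORDS (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(word in HEDGE_WORDS for word in words)


# Old claim, truncated exactly as quoted in the PR description.
old_claim = "Sam is supposedly the most productive engineer ..."
new_claim = "Sam is the most productive engineer on the team by a wide margin."

print(count_hedges(old_claim))  # 1: the stored fact already carries a hedge
print(count_hedges(new_claim))  # 0: any hedging must now come from the disposition
```

With the hedge baked into the stored fact, both dispositions start from the same hedged wording; with the direct assertion, any hedge in the output is attributable to the skepticism setting.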

2. test_comprehensive_multi_dimension (was hs_llm_mat)

The module-level marker is hs_llm_core, but this method overrode it with hs_llm_mat, which sends it through the full 6-provider matrix, including bedrock/us.amazon.nova-2-lite-v1:0. PR #1469's own description acknowledges that nova-lite "consistently merges all three facts into a single observation", i.e. it is the weakest matrix model and can't be expected to meet quality bars.

The test asserts that BOTH the emotional AND preferential dimensions are preserved, which is a quality assertion, not a provider-compatibility check. Removing the per-method @pytest.mark.hs_llm_mat lets it inherit the module-level hs_llm_core marker so it runs only on the single strong provider (vertexai/gemini-2.5-flash-lite). This matches the pattern PR #1469 used for similar quality tests (test_consolidation_keeps_different_people_separate, etc.).
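
A minimal sketch of the marker layout after this change, assuming the module declares its default tier through pytest's standard `pytestmark` variable. The class name `TestFactExtractionQuality` and the empty test body are placeholders, and how the CI buckets select on these markers is not shown in this PR.

```python
import pytest

# Module-level marker: every test in this file defaults to the core tier.
pytestmark = pytest.mark.hs_llm_core


class TestFactExtractionQuality:  # hypothetical class name
    # Before the fix, a per-method @pytest.mark.hs_llm_mat decorator sat here
    # and routed this test into the multi-provider matrix. With it removed,
    # the test carries only the module-level hs_llm_core marker.
    def test_comprehensive_multi_dimension(self):
        ...  # the real test asserts emotional AND preferential dimensions
```

The core tier could then be selected with something like `pytest -m hs_llm_core`; the exact selection expression used by the CI jobs is an assumption here.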

Test plan

  • Core LLM tests (test-api-llm-core) pass; this job exercises both fixed tests
  • test-api deterministic bucket still green
  • Remaining hs_llm_mat matrix providers still pass test_fact_extraction_quality.py (test #2 is no longer in this bucket, so this is just regression-checking the rest)

🤖 Generated with Claude Code

1. test_high_skepticism_response_is_more_hedged_than_low (hs_llm_core):
   The source claim was "Sam is *supposedly* the most productive engineer
   ...". The built-in hedge ("supposedly") primes both low- and
   high-skepticism reflects to echo it, shrinking the gap the judge has
   to detect. Rephrasing the claim as a direct assertion gives the
   disposition room to matter — high-skepticism should now hedge while
   low-skepticism states it directly.

2. test_comprehensive_multi_dimension (was hs_llm_mat):
   Module-level marker is hs_llm_core; this method was overriding to
   hs_llm_mat, which sent it through the bedrock/nova-2-lite weak model.
   That model consistently drops one of the two required dimensions
   (emotional or preferential) and fails the judge. This is a quality
   assertion, not a provider-compatibility check, so it belongs in the
   single-strong-provider tier (matching the pattern PR #1469 used).

CI on the first fix attempt still failed identically: both low- and
high-skepticism reflects produced "Sam is considered the most productive
engineer..." on gemini-2.5-flash-lite. Root cause: with a single
assertive claim and no contradicting signal, skepticism has nothing to
express. The disposition trait can only show up when there's tension
between facts to weigh differently.

Add one piece of contradicting evidence ("Sam's manager noted Sam had
missed two deadlines last quarter."). Now skepticism=5 should
acknowledge the tension while skepticism=1 should defer to the headline
claim. Updated the judge criteria and context accordingly.
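
A rough sketch of how the revised scenario and judge context might be assembled. The two facts are quoted from the commit message; the `STORED_FACTS` layout, the `build_judge_prompt` helper, and the criteria wording are all assumptions for illustration, not the project's actual judge setup.

```python
# Facts as quoted in the commit message; the list layout is an assumption.
STORED_FACTS = [
    "Sam is the most productive engineer on the team by a wide margin.",
    "Sam's manager noted Sam had missed two deadlines last quarter.",
]


def build_judge_prompt(low_response: str, high_response: str) -> str:
    """Hypothetical helper: give the judge both facts plus both responses."""
    facts = "\n".join(f"- {fact}" for fact in STORED_FACTS)
    return (
        "Context: the stored facts below are in tension with each other.\n"
        f"{facts}\n\n"
        f"Response A (skepticism=1):\n{low_response}\n\n"
        f"Response B (skepticism=5):\n{high_response}\n\n"
        "Criteria: pass only if Response B acknowledges the tension between "
        "the facts more than Response A, which may defer to the headline claim."
    )


prompt = build_judge_prompt("<low-skepticism output>", "<high-skepticism output>")
print(prompt)
```

Putting both facts into the judge's context lets it check whether the high-skepticism response actually weighs the contradicting evidence, rather than only comparing surface wording.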
@nicoloboschi nicoloboschi merged commit d7d41e7 into main May 27, 2026
71 of 72 checks passed