test: stabilize two LLM-flake tests surfaced after PR #1469 #1774
Merged
1. test_high_skepticism_response_is_more_hedged_than_low (hs_llm_core):
The source claim was "Sam is *supposedly* the most productive engineer
...". The built-in hedge ("supposedly") primes both low- and
high-skepticism reflects to echo it, shrinking the gap the judge has
to detect. Rephrasing the claim as a direct assertion gives the
disposition room to matter — high-skepticism should now hedge while
low-skepticism states it directly.
2. test_comprehensive_multi_dimension (was hs_llm_mat):
Module-level marker is hs_llm_core; this method was overriding to
hs_llm_mat, which sent it through the bedrock/nova-2-lite weak model.
That model consistently drops one of the two required dimensions
(emotional or preferential) and fails the judge. This is a quality
assertion, not a provider-compatibility check, so it belongs in the
single-strong-provider tier (matching the pattern PR #1469 used).
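A minimal sketch of the marker layout item 2 describes, assuming the test is a method on a class (the class name here is hypothetical) and that the tiers are ordinary `pytest.mark` markers. It shows only the post-fix shape; how CI maps `hs_llm_core` and `hs_llm_mat` to providers is outside the sketch.

```python
import pytest

# Module-level marker: applies to every test collected from this file.
pytestmark = pytest.mark.hs_llm_core


class TestMultiDimension:  # hypothetical class name, for illustration only
    # Before the fix this method also carried @pytest.mark.hs_llm_mat.
    # With that decorator removed, the only marker pytest sees on it is
    # the module-level hs_llm_core one.
    def test_comprehensive_multi_dimension(self):
        # Placeholder body; the real test asserts that both the emotional
        # and the preferential dimension are preserved.
        pass
```

Keeping the tier on the module and leaving quality tests unmarked means a new quality test lands in the strong-provider bucket by default.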
CI on the first fix attempt still failed identically — both low- and
high-skepticism reflects produced "Sam is considered the most productive
engineer..." on gemini-2.5-flash-lite. Root cause: with a single
assertive claim and no contradicting signal, skepticism has nothing to
express. The disposition trait can only show up when there's tension
between facts to weigh differently.
Add one piece of contradicting evidence ("Sam's manager noted Sam had
missed two deadlines last quarter."). Now skepticism=5 should
acknowledge the tension while skepticism=1 should defer to the headline
claim. Updated the judge criteria and context accordingly.
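One way to picture the reworked setup is as plain data: two stored facts in tension, plus a judge instruction keyed on how each disposition handles them. The sketch below is an assumption about shape only; `SkepticismCase`, `build_case`, and the exact criteria wording are hypothetical and are not taken from the test file.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SkepticismCase:  # hypothetical container, not from the repo
    facts: Tuple[str, ...]
    skepticism_levels: Tuple[int, int]
    judge_criteria: str


def build_case() -> SkepticismCase:
    return SkepticismCase(
        facts=(
            # Headline claim, stated as a direct assertion (no built-in hedge).
            "Sam is the most productive engineer on the team by a wide margin.",
            # Contradicting evidence that gives skepticism something to weigh.
            "Sam's manager noted Sam had missed two deadlines last quarter.",
        ),
        skepticism_levels=(1, 5),
        # Paraphrase of the intent in the commit message; wording is illustrative.
        judge_criteria=(
            "The skepticism=5 response should acknowledge the tension between "
            "the two facts; the skepticism=1 response should defer to the "
            "headline claim."
        ),
    )


case = build_case()
```

With only the first fact present, both dispositions have nothing to disagree about, which is the failure the second commit describes.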
Summary
Follow-up to #1469. Two LLM-dependent tests went red on that PR's CI run; both look like the bucket split exposing pre-existing flake rather than regressions. This PR addresses the root causes rather than just adding reruns.
1. `test_high_skepticism_response_is_more_hedged_than_low` (`hs_llm_core`)

   The stored claim was `"Sam is *supposedly* the most productive engineer ..."`. The built-in hedge ("supposedly") primes both low- and high-skepticism reflects to echo it, shrinking the gap the judge has to detect. With ≤1 word of variance the judge sometimes calls them equivalent.

   Fix: rephrase the claim as a direct assertion (`"Sam is the most productive engineer on the team by a wide margin."`). High-skepticism now has room to hedge; low-skepticism states it directly. Existing `@pytest.mark.flaky(reruns=2)` retained.

2. `test_comprehensive_multi_dimension` (was `hs_llm_mat`)

   Module-level marker is `hs_llm_core`; this method was overriding to `hs_llm_mat`, which sends it through the full 6-provider matrix including `bedrock/us.amazon.nova-2-lite-v1:0`. PR #1469's own description acknowledges that nova-lite "consistently merges all three facts into a single observation", i.e. it's the weakest matrix model and can't be expected to meet quality bars.

   The test asserts BOTH emotional AND preferential dimensions are preserved: a quality assertion, not a provider-compatibility check. Removing the per-method `@pytest.mark.hs_llm_mat` lets it inherit the module-level `hs_llm_core` marker so it runs only on the single strong provider (`vertexai/gemini-2.5-flash-lite`). Matches the pattern PR #1469 used for similar quality tests (`test_consolidation_keeps_different_people_separate`, etc).

Test plan
- Core LLM tests (`test-api-llm-core`) passes; exercises both fixed tests
- `test-api` deterministic bucket still green
- `hs_llm_mat` matrix providers still pass `test_fact_extraction_quality.py` (test #2 is no longer in this bucket so this is just regression-checking the rest)

🤖 Generated with Claude Code