test: stabilize two LLM-flake tests surfaced after PR #1469 by nicoloboschi · Pull Request #1774 · vectorize-io/hindsight

nicoloboschi · 2026-05-27T08:22:35Z

Summary

Follow-up to #1469. Two LLM-dependent tests went red on that PR's CI run; both look like the bucket split exposing pre-existing flake rather than regressions. This PR addresses the root causes rather than just adding reruns.

1. `test_high_skepticism_response_is_more_hedged_than_low` (hs_llm_core)

The stored claim was "Sam is *supposedly* the most productive engineer ...". The built-in hedge ("supposedly") primes both low- and high-skepticism reflects to echo it — shrinking the gap the judge has to detect. With ≤1 word of variance the judge sometimes calls them equivalent.

Fix: rephrase the claim as a direct assertion ("Sam is the most productive engineer on the team by a wide margin."). High-skepticism now has room to hedge; low-skepticism states it directly. Existing @pytest.mark.flaky(reruns=2) retained.
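To illustrate why a hedge baked into the stored claim shrinks the gap, here is a rough sketch that counts hedge words in two responses. The `HEDGE_WORDS` set, the `hedge_count` helper, and all four sample responses are invented for illustration; the actual test relies on an LLM judge, not a word count.

```python
import re

# Hypothetical hedge vocabulary; the real test uses an LLM judge instead.
HEDGE_WORDS = {"supposedly", "reportedly", "allegedly", "apparently", "might", "may"}


def hedge_count(text: str) -> int:
    """Count hedge words in a response (case-insensitive, whole words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if w in HEDGE_WORDS)


# Invented sample responses, not real model output.
# Old claim: both dispositions echo the stored "supposedly".
old_low = "Sam is supposedly the most productive engineer on the team."
old_high = "Sam is supposedly the most productive engineer, though that may be overstated."

# New claim: only the high-skepticism response has a reason to hedge.
new_low = "Sam is the most productive engineer on the team by a wide margin."
new_high = "Sam is reportedly the most productive engineer, though that may be overstated."

old_gap = hedge_count(old_high) - hedge_count(old_low)
new_gap = hedge_count(new_high) - hedge_count(new_low)
```

In this toy setup the old phrasing leaves a one-word gap while the direct assertion leaves a wider one, which is the effect the rephrasing is aiming for.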

2. `test_comprehensive_multi_dimension` (was hs_llm_mat)

Module-level marker is hs_llm_core; this method was overriding to hs_llm_mat, which sends it through the full 6-provider matrix including bedrock/us.amazon.nova-2-lite-v1:0. PR #1469's own description acknowledges that nova-lite "consistently merges all three facts into a single observation" — i.e. it's the weakest matrix model and can't be expected to meet quality bars.

The test asserts BOTH emotional AND preferential dimensions are preserved — a quality assertion, not a provider-compatibility check. Removing the per-method @pytest.mark.hs_llm_mat lets it inherit the module-level hs_llm_core marker so it runs only on the single strong provider (vertexai/gemini-2.5-flash-lite). Matches the pattern PR #1469 used for similar quality tests (test_consolidation_keeps_different_people_separate, etc).
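The tier change can be pictured with a small, self-contained sketch. It assumes that a method-level tier marker takes precedence over the module-level one, as the description above implies; `effective_tier` and the `TIERS` mapping are hypothetical and are not the repository's actual conftest logic. Only the providers named in this PR are listed.

```python
# Providers named in this PR. The full matrix has six entries; the others
# are not listed in the PR text, so they are left out here.
CORE_PROVIDER = "vertexai/gemini-2.5-flash-lite"
KNOWN_MATRIX_PROVIDERS = ["bedrock/us.amazon.nova-2-lite-v1:0"]

# Hypothetical mapping from tier marker to the providers a test runs against.
TIERS = {
    "hs_llm_core": [CORE_PROVIDER],
    "hs_llm_mat": KNOWN_MATRIX_PROVIDERS,
}


def effective_tier(module_markers, method_markers):
    """Return the tier marker that applies, preferring the method-level one.

    This precedence rule is an assumption made for illustration.
    """
    for marker in method_markers:
        if marker in TIERS:
            return marker
    for marker in module_markers:
        if marker in TIERS:
            return marker
    return None


# Before this PR: the method-level marker sends the test to the matrix tier.
before = effective_tier(["hs_llm_core"], ["hs_llm_mat"])

# After this PR: with no method-level marker, the module-level tier applies.
after = effective_tier(["hs_llm_core"], [])
```

Under these assumptions, removing the per-method marker is enough to move the test; nothing in the test body has to change.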

Test plan

Core LLM tests (test-api-llm-core) pass; this job exercises both fixed tests
test-api deterministic bucket still green
Remaining hs_llm_mat matrix providers still pass test_fact_extraction_quality.py (test #2 above is no longer in this bucket, so this only regression-checks the rest)

🤖 Generated with Claude Code

1. test_high_skepticism_response_is_more_hedged_than_low (hs_llm_core): The source claim was "Sam is *supposedly* the most productive engineer ...". The built-in hedge ("supposedly") primes both low- and high-skepticism reflects to echo it, shrinking the gap the judge has to detect. Rephrasing the claim as a direct assertion gives the disposition room to matter: high-skepticism should now hedge while low-skepticism states it directly.

2. test_comprehensive_multi_dimension (was hs_llm_mat): Module-level marker is hs_llm_core; this method was overriding to hs_llm_mat, which sent it through the bedrock/nova-2-lite weak model. That model consistently drops one of the two required dimensions (emotional or preferential) and fails the judge. This is a quality assertion, not a provider-compatibility check, so it belongs in the single-strong-provider tier (matching the pattern PR #1469 used).

CI on the first fix attempt still failed identically: both low- and high-skepticism reflects produced "Sam is considered the most productive engineer..." on gemini-2.5-flash-lite. Root cause: with a single assertive claim and no contradicting signal, skepticism has nothing to express. The disposition trait can only show up when there is tension between facts that the two settings can weigh differently. Fix: add one piece of contradicting evidence ("Sam's manager noted Sam had missed two deadlines last quarter."). Now skepticism=5 should acknowledge the tension while skepticism=1 should defer to the headline claim. Updated the judge criteria and context accordingly.
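A minimal sketch of what the revised scenario might look like as data follows. The `Scenario` dataclass and `build_judge_context` helper are hypothetical, the judge wording is paraphrased from this commit message rather than copied from the test, and the headline claim is assumed to be the direct assertion from the first fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical container for the two facts stored before reflecting."""
    headline_claim: str
    contradicting_evidence: str


def build_judge_context(scenario: Scenario) -> str:
    """Assemble illustrative judge context; not the test's real criteria text."""
    return "\n".join([
        "Stored facts:",
        f"- {scenario.headline_claim}",
        f"- {scenario.contradicting_evidence}",
        "Expected: the skepticism=5 response acknowledges the tension between these facts;",
        "the skepticism=1 response defers to the headline claim.",
    ])


scenario = Scenario(
    # Assumed to match the direct assertion introduced by the first fix.
    headline_claim="Sam is the most productive engineer on the team by a wide margin.",
    # Quoted from the commit message above.
    contradicting_evidence="Sam's manager noted Sam had missed two deadlines last quarter.",
)

judge_context = build_judge_context(scenario)
```

Giving the judge both facts as context lets it check for acknowledged tension rather than a bare count of hedge words.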

nicoloboschi added 2 commits May 27, 2026 10:21

nicoloboschi merged commit d7d41e7 into main May 27, 2026
71 of 72 checks passed

nicoloboschi merged 2 commits into main from fix/stabilize-llm-flaky-tests
