feat(ai): A/B eval — crew vs single-call on goldens (AI-046) by mrviduus · Pull Request #341 · mrviduus/textstack

mrviduus · 2026-06-16T04:50:35Z

AI-046 — A/B eval: crew vs single-call (Phase 7 DoD gate)

The measurement that justifies the whole multi-agent phase. Runs the SAME brief+source through (A) one single LLM call (BaselineFieldAgent) and (B) the full FieldCrew (researcher→drafter→critic→editor), judges both with an independent gpt-4.1 RubricEvaluator (absolute 1–5), and reports the crew's lift % + cost ratio.

Gate: Passed = liftPct ≥ 0.10 AND costRatio ≤ 2.0. Persists a crew_ab EvalRun (Score = liftPct). No schema change.
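The gate above can be sketched as a pure predicate. This is only an illustration in TypeScript: the thresholds come from the PR text, while `AbGateInput`, `gatePassed`, and the constant names are hypothetical and not taken from the repository.

```typescript
// Thresholds as stated in the PR description; identifier names are hypothetical.
const MIN_LIFT_PCT = 0.10;  // crew must score at least 10% above the baseline
const MAX_COST_RATIO = 2.0; // crew may cost at most 2x the baseline

interface AbGateInput {
  liftPct: number;   // fractional lift, e.g. 0.15 means +15%
  costRatio: number; // crew cost divided by baseline cost
}

// Both conditions must hold; a large lift cannot buy back an over-budget crew.
function gatePassed({ liftPct, costRatio }: AbGateInput): boolean {
  return liftPct >= MIN_LIFT_PCT && costRatio <= MAX_COST_RATIO;
}

console.log(gatePassed({ liftPct: 0.15, costRatio: 1.8 })); // true
console.log(gatePassed({ liftPct: 0.30, costRatio: 2.5 })); // false (cost gate)
console.log(gatePassed({ liftPct: 0.05, costRatio: 1.0 })); // false (lift gate)
```

Because the comparison is a plain `<=`, an infinite cost ratio fails the gate without any special casing.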

Honest by construction (verified adversarially — 0 P1/P2)

A is not a strawman. BaselineFieldPrompt folds the identical brief contract the crew enforces — length bounds, banned phrases, style guide, language — through the same shared BriefConstraints. A reads the raw source; the crew's researcher distills that same source. The only difference is orchestration (1 call vs 4), not prompt quality or input. Same final-prose token budget (500) both sides.
No judge label-leak. The judge scores an anonymous candidate against the source — no "crew"/"baseline" token, no ordering tell, A and B in separate same-rubric calls. Rubric scores grounding / tone / bounds-aware completeness — not raw length, so the lift isn't a verbosity artifact.
B's cost is the true 4-call sum (CostUsdTotal aggregated across all sub-agents), surfaced via FieldResult.CostUsd. Division by zero is guarded: a worthless baseline (avgA==0 → lift 0) or a free one (sumCostA==0 → ratio +∞, which fails the cost gate) cannot hand the crew a pass.
Same nano generator for A and B isolates orchestration (not model tier); the judge is a stronger independent model.
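A minimal sketch of the aggregation described above, under stated assumptions: lift is taken to be the relative difference of mean judge scores, `(avgB - avgA) / avgA`, which the PR implies but does not spell out. `PairResult`, `summarize`, and the field names are hypothetical; only the guard behaviour (avgA==0 → lift 0, sumCostA==0 → ratio +∞) and the 4-call cost sum come from the text.

```typescript
// One golden fixture judged on both sides. All names here are hypothetical.
interface PairResult {
  scoreA: number;           // judge score for the single-call baseline (1-5)
  scoreB: number;           // judge score for the crew output (1-5)
  costA: number;            // USD cost of the one baseline call
  subAgentCostsB: number[]; // USD cost of each crew sub-agent call
}

const sum = (xs: number[]): number => xs.reduce((acc, x) => acc + x, 0);
const mean = (xs: number[]): number => (xs.length === 0 ? 0 : sum(xs) / xs.length);

function summarize(pairs: PairResult[]) {
  const avgA = mean(pairs.map((p) => p.scoreA));
  const avgB = mean(pairs.map((p) => p.scoreB));
  const sumCostA = sum(pairs.map((p) => p.costA));
  // B is charged for every sub-agent call, not just the final editor call.
  const sumCostB = sum(pairs.map((p) => sum(p.subAgentCostsB)));

  // Guard 1: a worthless baseline yields zero lift instead of a division by zero.
  const liftPct = avgA === 0 ? 0 : (avgB - avgA) / avgA;
  // Guard 2: a free baseline yields an infinite ratio, which can never pass a <= gate.
  const costRatio = sumCostA === 0 ? Number.POSITIVE_INFINITY : sumCostB / sumCostA;

  return { avgA, avgB, liftPct, costRatio };
}

const summary = summarize([
  { scoreA: 4, scoreB: 5, costA: 0.5, subAgentCostsB: [0.25, 0.25, 0.25, 0.25] },
]);
console.log(summary.liftPct, summary.costRatio); // 0.25 2
```

In this worked example the crew lifts the score by 25% at exactly twice the cost, so it would sit right on the cost boundary of the gate.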

Split

Deterministic half (fake generator + fake judge through the real RubricEvaluator/parser) runs in CI with no key. Live run is admin-triggered (new Evals-tab button → lift% / cost ratio / Avg A (baseline) / Avg B (crew) / win-rate / PASS-FAIL; ~1–2 min). Golden crew_ab.json is N=10 edition/description, // TODO grow to 50 — the in-repo gate is a smoke/plumbing check; the owner-triggered 50-fixture prod run is the real DoD evidence.
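To illustrate how a keyless deterministic half can still exercise real parsing, here is a rough sketch: a canned fake judge returns raw text, and the same parser that would handle a live reply turns it into a score. Everything here is an assumption made for illustration, including the JSON reply shape, the prompt wording, and the names `JudgeModel`, `buildJudgePrompt`, `parseScore`, and `evaluate`; the actual RubricEvaluator is C# and may differ.

```typescript
// A judge is modelled as "prompt in, raw text out" so a fake can replace the live model.
type JudgeModel = (prompt: string) => string;

// The prompt carries only source and candidate: no side label, no ordering hint.
function buildJudgePrompt(source: string, candidate: string): string {
  return [
    "Score the candidate against the source on a 1-5 rubric.",
    "SOURCE:",
    source,
    "CANDIDATE:",
    candidate,
    'Reply as JSON: {"score": <1-5>}',
  ].join("\n");
}

// Parser shared by fake and live paths; rejects anything outside the 1-5 scale.
function parseScore(raw: string): number {
  const parsed = JSON.parse(raw) as { score?: unknown };
  const score = parsed.score;
  if (typeof score !== "number" || score < 1 || score > 5) {
    throw new Error(`judge reply out of range: ${raw}`);
  }
  return score;
}

function evaluate(judge: JudgeModel, source: string, candidate: string): number {
  return parseScore(judge(buildJudgePrompt(source, candidate)));
}

// Fake judge: deterministic, no network, keyed on a marker in the candidate text.
const fakeJudge: JudgeModel = (prompt) =>
  prompt.includes("[good]") ? '{"score": 5}' : '{"score": 3}';

console.log(evaluate(fakeJudge, "source text", "a summary [good]")); // 5
console.log(evaluate(fakeJudge, "source text", "a summary"));        // 3
```

Each side is scored by a separate `evaluate` call with the same prompt template, so the judge has no way to tell which candidate came from the crew.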

Tests — 8 (full AiEvals suite 49 pass / 5 live-key skip)

lift/cost math, cost-gate independent of lift, negative lift → fail, crew-halt null-EditedText → B=0, both div-by-zero guards (lift→0, ratio→+∞→fail, breakdown costRatio:null), persistence (Feature/Score/BreakdownJson/N/JudgeModelId). StudyBuddy set-equality + AutoPublish/Seo green (FieldResult.CostUsd is purely additive); no ITool leaked.
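Two of the cases listed above lend themselves to a short sketch: a halted crew (null edited text) scoring 0 for B, and an infinite cost ratio being written as `null` in the breakdown, since JSON has no representation for infinity. The TypeScript names below (`FieldResultLike`, `scoreCandidate`, `breakdownJson`) are hypothetical stand-ins for the C# types the PR mentions, and the breakdown shape is assumed.

```typescript
// Hypothetical mirror of the C# FieldResult: EditedText is null when the crew halts.
interface FieldResultLike {
  editedText: string | null;
  costUsd: number;
}

// A halted run is never sent to the judge; it simply scores 0.
function scoreCandidate(result: FieldResultLike, judge: (text: string) => number): number {
  return result.editedText === null ? 0 : judge(result.editedText);
}

// Non-finite ratios are mapped to null explicitly so the stored JSON stays valid.
function breakdownJson(liftPct: number, costRatio: number): string {
  return JSON.stringify({
    liftPct,
    costRatio: Number.isFinite(costRatio) ? costRatio : null,
  });
}

const alwaysFour = (_text: string): number => 4; // stand-in judge for the example

console.log(scoreCandidate({ editedText: null, costUsd: 0.01 }, alwaysFour));   // 0
console.log(scoreCandidate({ editedText: "final", costUsd: 0.02 }, alwaysFour)); // 4
console.log(breakdownJson(0, Number.POSITIVE_INFINITY)); // {"liftPct":0,"costRatio":null}
```

Note that the halted run still carries its cost, so a crew that spends money and produces nothing is penalised on both score and cost.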

Verify

dotnet test tests/TextStack.AiEvals → 49 pass / 5 skip (deterministic half, no key)
dotnet test tests/TextStack.UnitTests → 407 pass
dotnet format --verify-no-changes → clean
pnpm -C apps/admin exec tsc --noEmit + build → clean

Closes Phase 7 (AI-040…046). Admin button build-verified; live click owner-triggered (needs prod key + admin session).

🤖 Generated with Claude Code

The Phase 7 DoD gate. Runs the SAME brief+source through (A) one single LLM call (BaselineFieldAgent) and (B) the full FieldCrew (researcher→drafter→critic→editor), judges both with an independent gpt-4.1 RubricEvaluator, and reports crew lift% + cost ratio. Passed = liftPct >= 0.10 AND costRatio <= 2.0. Persists a crew_ab eval_run.

Honest by construction (verified adversarially):
- A folds the IDENTICAL brief contract the crew enforces (length/banned/style/lang via shared BriefConstraints); only orchestration differs, not prompt quality or input. Not a strawman.
- Judge sees an anonymous candidate vs source, no crew/baseline label, A and B in separate same-rubric calls, so no label-leak; rubric scores grounding/tone/bounds-aware-completeness, not raw length.
- B's cost = true sum of all 4 sub-agent calls (CostUsdTotal); div-by-zero guarded (free/worthless baseline can't hand the crew a pass).

Same nano generator for A and B isolates orchestration. Deterministic half (fake gen + fake judge through the real RubricEvaluator) runs in CI with no key; live run admin-triggered (new Evals-tab button: lift%/cost ratio/PASS-FAIL). Golden N=10 edition/description, grows to 50 (owner-triggered prod run = the real DoD evidence). Reuses EvalRun, no migration.

8 deterministic tests (lift/cost math, gate independence, negative lift, null-edit→0, div-by-zero guards, persistence).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mrviduus merged commit 4010e03 into main Jun 16, 2026
5 checks passed

mrviduus deleted the ai-046-crew-ab-eval branch June 16, 2026 05:00

mrviduus merged 1 commit into main from ai-046-crew-ab-eval
