feat(ai): A/B eval — crew vs single-call on goldens (AI-046) #341

Merged
mrviduus merged 1 commit into main from ai-046-crew-ab-eval
Jun 16, 2026

@mrviduus

Owner

AI-046 — A/B eval: crew vs single-call (Phase 7 DoD gate)

The measurement that justifies the whole multi-agent phase. Runs the SAME brief+source through (A) one single LLM call (BaselineFieldAgent) and (B) the full FieldCrew (researcher→drafter→critic→editor), judges both with an independent gpt-4.1 RubricEvaluator (absolute 1–5), and reports the crew's lift % + cost ratio.

Gate: Passed = liftPct ≥ 0.10 AND costRatio ≤ 2.0. Persists a crew_ab EvalRun (Score = liftPct). No schema change.
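
The gate above can be sketched as follows. This is a minimal TypeScript illustration, not the repo's C# implementation: the function names are hypothetical, and it assumes liftPct is the relative difference between the two mean judge scores, expressed as a fraction (0.10 = 10%).

```typescript
// Illustrative sketch; names and the lift formula are assumptions,
// not the repo's C# implementation.
const MIN_LIFT = 0.10;      // gate: liftPct >= 0.10
const MAX_COST_RATIO = 2.0; // gate: costRatio <= 2.0

// Relative lift of the crew's mean judge score over the baseline's.
// Assumes avgA > 0 (zero-baseline handling is out of scope here).
function relativeLift(avgA: number, avgB: number): number {
  return (avgB - avgA) / avgA;
}

// Both conditions must hold, so a large lift cannot buy back an over-budget crew.
function gatePassed(liftPct: number, costRatio: number): boolean {
  return liftPct >= MIN_LIFT && costRatio <= MAX_COST_RATIO;
}

// Made-up numbers: baseline averages 3.0, crew averages 3.6.
const exampleLift = relativeLift(3.0, 3.6);
console.log(gatePassed(exampleLift, 1.8)); // lift is about 0.20 and cost is under the cap
console.log(gatePassed(exampleLift, 2.5)); // same lift, but cost is over the cap
```

Keeping the two thresholds as a plain conjunction is what makes the cost gate independent of the lift.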

Honest by construction (verified adversarially — 0 P1/P2)

  • A is not a strawman. BaselineFieldPrompt folds the identical brief contract the crew enforces — length bounds, banned phrases, style guide, language — through the same shared BriefConstraints. A reads the raw source; the crew's researcher distills that same source. The only difference is orchestration (1 call vs 4), not prompt quality or input. Same final-prose token budget (500) both sides.
  • No judge label-leak. The judge scores an anonymous candidate against the source — no "crew"/"baseline" token, no ordering tell, A and B in separate same-rubric calls. Rubric scores grounding / tone / bounds-aware completeness — not raw length, so the lift isn't a verbosity artifact.
  • B's cost is the true 4-call sum (CostUsdTotal aggregated across all sub-agents), surfaced via FieldResult.CostUsd. Div-by-zero guarded: a free or worthless baseline (avgA==0 → lift 0; sumCostA==0 → ratio +∞) cannot hand the crew a pass.
  • Same nano generator for A and B isolates orchestration (not model tier); the judge is a stronger independent model.
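
A sketch of the cost aggregation and the two div-by-zero guards described above, in TypeScript for illustration. The interface and function names are hypothetical (the PR only names CostUsdTotal and FieldResult.CostUsd), and the relative-lift formula is an assumption.

```typescript
// Hypothetical shapes; not the repo's C# types.
interface SubAgentCall {
  role: "researcher" | "drafter" | "critic" | "editor";
  costUsd: number;
}

// B's cost is the sum over every sub-agent call, not just the final editor call.
function crewCostUsd(calls: SubAgentCall[]): number {
  return calls.reduce((sum, call) => sum + call.costUsd, 0);
}

// A worthless baseline (avgA == 0) yields lift 0 rather than an unbounded lift.
function guardedLift(avgA: number, avgB: number): number {
  return avgA === 0 ? 0 : (avgB - avgA) / avgA;
}

// A free baseline (sumCostA == 0) yields +Infinity, which can never satisfy "<= 2.0".
function guardedCostRatio(sumCostA: number, sumCostB: number): number {
  return sumCostA === 0 ? Number.POSITIVE_INFINITY : sumCostB / sumCostA;
}
```

Both guards resolve toward failure: a lift of 0 is below the 0.10 threshold, and an infinite ratio is above the 2.0 cap.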

Split

The deterministic half (fake generator + fake judge through the real RubricEvaluator/parser) runs in CI with no key. The live run is admin-triggered via a new Evals-tab button that reports lift% / cost ratio / Avg A (baseline) / Avg B (crew) / win-rate / PASS-FAIL and takes ~1–2 min. The golden crew_ab.json holds N=10 edition/description fixtures (TODO: grow to 50), so the in-repo gate is a smoke/plumbing check; the owner-triggered 50-fixture prod run is the real DoD evidence.

Tests — 8 (full AiEvals suite 49 pass / 5 live-key skip)

  • lift/cost math
  • cost gate independent of lift
  • negative lift → fail
  • crew halt with null EditedText → B=0
  • both div-by-zero guards (lift→0, ratio→+∞→fail, breakdown costRatio:null)
  • persistence (Feature/Score/BreakdownJson/N/JudgeModelId)

StudyBuddy set-equality and AutoPublish/Seo are green (FieldResult.CostUsd is purely additive); no ITool leaked.
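
Two of the tested edge cases, the halted crew and the null costRatio in the breakdown, might look like this in a TypeScript sketch. The function names are hypothetical, and skipping the judge on a null edit is an assumption of this sketch; only the resulting values (B=0, costRatio:null) come from the test list.

```typescript
// Hypothetical names; not the repo's C# implementation.

// A halted crew leaves no edited text. In this sketch that run scores 0
// and the judge is not called (an assumption).
function scoreCrewOutput(
  editedText: string | null,
  judge: (text: string) => number,
): number {
  return editedText === null ? 0 : judge(editedText);
}

// A non-finite ratio is written as null so the breakdown stays valid JSON.
function breakdownJson(liftPct: number, costRatio: number): string {
  return JSON.stringify({
    liftPct,
    costRatio: Number.isFinite(costRatio) ? costRatio : null,
  });
}
```

Mapping the infinite ratio to null explicitly avoids relying on serializer defaults, which differ between runtimes.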

Verify

  • dotnet test tests/TextStack.AiEvals → 49 pass / 5 skip (deterministic half, no key)
  • dotnet test tests/TextStack.UnitTests → 407 pass
  • dotnet format --verify-no-changes → clean
  • pnpm -C apps/admin exec tsc --noEmit + build → clean

Closes Phase 7 (AI-040…046). Admin button build-verified; live click owner-triggered (needs prod key + admin session).

🤖 Generated with Claude Code

The Phase 7 DoD gate. Runs the SAME brief+source through (A) one single
LLM call (BaselineFieldAgent) and (B) the full FieldCrew (researcher→
drafter→critic→editor), judges both with an independent gpt-4.1
RubricEvaluator, and reports crew lift% + cost ratio.
Passed = liftPct >= 0.10 AND costRatio <= 2.0. Persists a crew_ab eval_run.

Honest by construction (verified adversarially):
- A folds the IDENTICAL brief contract the crew enforces (length/banned/
  style/lang via shared BriefConstraints) — only orchestration differs,
  not prompt quality or input. Not a strawman.
- Judge sees an anonymous candidate vs source, no crew/baseline label,
  A and B in separate same-rubric calls — no label-leak; rubric scores
  grounding/tone/bounds-aware-completeness, not raw length.
- B's cost = true sum of all 4 sub-agent calls (CostUsdTotal); div-by-
  zero guarded (free/worthless baseline can't hand the crew a pass).

Same nano generator for A and B isolates orchestration. Deterministic
half (fake gen + fake judge through the real RubricEvaluator) runs in CI
with no key; live run admin-triggered (new Evals-tab button: lift%/cost
ratio/PASS-FAIL). Golden N=10 edition/description, grows to 50 (owner-
triggered prod run = the real DoD evidence). Reuses EvalRun, no migration.

8 deterministic tests (lift/cost math, gate independence, negative lift,
null-edit→0, div-by-zero guards, persistence).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mrviduus merged commit 4010e03 into main on Jun 16, 2026
5 checks passed
@mrviduus deleted the ai-046-crew-ab-eval branch on June 16, 2026 05:00