feat(verifier): add rubric verifier engine #2133
Open
miguelg719 wants to merge 9 commits into
🦋 Changeset detected. Latest commit: b725247. The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
Contributor
1 issue found across 19 files
Confidence score: 3/5
- There is a concrete stability risk in `packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts`: unbounded range expansion can trigger OOM or hangs when parsing malformed LLM output.
- Because this is severity 7/10 with high confidence (8/10) and can directly impact runtime behavior, the merge risk is moderate until a guardrail is added.
- Pay close attention to `packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts`: cap per-range expansion (for example, a maximum number of steps) to prevent runaway memory/CPU use.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts">
<violation number="1" location="packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts:131">
P1: Unbounded range expansion can cause OOM or hang on malformed LLM output. Add a cap on the number of elements generated from a single range segment (e.g., limit to 1000 steps).</violation>
</file>
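For context on the suggested guardrail, here is a minimal TypeScript sketch of what a capped range expansion could look like. The function name, input shape, and the 1000-step cap are illustrative assumptions, not the actual code in `firstPointOfFailure.ts`.

```ts
// Hypothetical sketch: expand a step-range segment like "3-7" from LLM output
// into explicit step numbers, with a hard cap so a malformed range (e.g. "1-1e9")
// cannot exhaust memory or hang the process.
const MAX_STEPS_PER_RANGE = 1000; // illustrative cap, not the repo's value

function expandRangeSegment(segment: string): number[] {
  const match = segment.trim().match(/^(\d+)\s*-\s*(\d+)$/);
  if (!match) {
    // Not a range: treat as a single step number if it parses, otherwise drop it.
    const single = Number.parseInt(segment, 10);
    return Number.isFinite(single) ? [single] : [];
  }
  const start = Number.parseInt(match[1], 10);
  const end = Number.parseInt(match[2], 10);
  if (!Number.isFinite(start) || !Number.isFinite(end) || end < start) return [];
  // Guardrail: bound how many elements a single range segment may produce.
  const span = Math.min(end - start + 1, MAX_STEPS_PER_RANGE);
  return Array.from({ length: span }, (_, i) => start + i);
}
```

Whether oversized ranges are truncated (as above) or rejected outright is a policy choice; the review only asks that some bound exists.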
Architecture diagram
sequenceDiagram
participant Client as V3Evaluator (public API)
participant Verifier as RubricVerifier (internal)
participant LLM as LLMClient (provider-managed)
participant Evidence as evidence.ts (collector)
participant Prompts as prompt modules
participant Taxonomy as errorTaxonomy.ts
participant Disk as Trajectory on disk
Note over Client,Disk: Primary entry: V3Evaluator.verify(trajectory, taskSpec)
Client->>Verifier: verify(trajectory, taskSpec)
alt STAGEHAND_EVALUATOR_BACKEND=verifier
Client->>Client: NEW: instantiate RubricVerifier(getClient)
else legacy
Client->>Client: delegate to LegacyV3Evaluator (unchanged)
end
Verifier->>Disk: collectCanonicalEvidence(trajectory)
Note over Verifier,Disk: NEW: rehydrate tier-1 agent images from imagePath<br/>NEW: rehydrate tier-2 probe screenshots from screenshotPath
Disk-->>Verifier: canonicalImages[], canonicalText[]
Verifier->>Taxonomy: getTaxonomyText(1,6,3)
Taxonomy-->>Verifier: rendered taxonomy block
Verifier->>Prompts: render RUBRIC_GENERATION_PROMPT(task, urlContext)
alt taskSpec has precomputedRubric
Verifier->>Verifier: use precomputed rubric (skip generation)
else no precomputed rubric
Verifier->>LLM: NEW: call getClient().generate() with rubric prompt
LLM-->>Verifier: parsed rubric JSON (items: criterion[])
end
Verifier->>Verifier: NEW: select approach via VERIFIER_APPROACH env var
alt Approach B (default) — fused judgment
Verifier->>Evidence: groupTopKByCriterion(numCriteria, relevanceScores, topK)
Evidence-->>Verifier: Map<criterionIdx, evidenceIdx[]>
Verifier->>Prompts: build evidence manifest for all criteria
Verifier->>LLM: NEW: FUSED_JUDGMENT_PROMPT (single multi-modal call)
Note over Verifier,LLM: Grades ALL criteria + outcome + optional failure analysis + task validity
LLM-->>Verifier: fused JSON (outcome, per_criterion[], optional failure_point, optional task_validity)
Verifier->>Verifier: mapFusedPerCriterionToScores(rubric, perCriterion)
Verifier->>Verifier: build Verdict from fused response
else Approach A — per-criterion + fused outcome
Verifier->>LLM: loop N times (NEW: MM_BATCHED_RELEVANCE_PROMPT in batches)
Note over Verifier,LLM: Batched relevance: one call per batch of B evidence points
LLM-->>Verifier: relevance scores per evidence per criterion
Verifier->>Evidence: groupTopKByCriterion(...)
Verifier->>LLM: loop N times (NEW: MM_PER_CRITERION_SCORE_PROMPT — one call per criterion)
Note over Verifier,LLM: Per-criterion analysis + score (parallelizable)
LLM-->>Verifier: per-criterion earned_points, justification
Verifier->>Prompts: build rubric_summary from per-criterion scores
Verifier->>LLM: NEW: FUSED_OUTCOME_PROMPT (consumes pre-scored rubric)
LLM-->>Verifier: outcome verdict + optional failure analysis + task validity
end
opt Optional steps not folded (VERIFIER_OPTIONAL_STEPS=separate)
Verifier->>LLM: NEW: FIRST_POINT_OF_FAILURE_PROMPT (Step 9a separate call)
LLM-->>Verifier: failure analysis JSON
Verifier->>Verifier: parseFailureStepNumbers() for each failure point
Verifier->>LLM: NEW: TASK_VALIDITY_PROMPT (Step 10 separate call)
LLM-->>Verifier: task validity JSON
end
Verifier-->>Client: Verdict { success, processScore, outcome, failurePoint, taskValidity, criterionScores }
Note over Client: generateRubric(taskSpec) also uses RubricVerifier
Client->>Verifier: generateRubric(taskSpec)
alt STAGEHAND_EVALUATOR_BACKEND=verifier
Verifier->>LLM: RUBRIC_GENERATION_PROMPT (same as Step 0a)
LLM-->>Verifier: rubric items
Verifier-->>Client: Rubric { items: criterion[] }
else legacy
Client->>Client: return single legacyTaskCompletionCriterion (unchanged)
end
Note over Client: ask() / batchAsk() blocked on verifier backend
Client->>Client: throw: "verifier backend only supports verify() and generateRubric()"
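To make the diagram's final hand-off concrete, here is a rough TypeScript sketch of the verdict shape it describes. The field names come from the diagram's `Verdict` line; the types and the nested criterion-score shape are assumptions, not the definitions in `verifier/types.ts`.

```ts
// Sketch of the verdict returned to V3Evaluator, inferred from the diagram.
// Field types and the CriterionScoreSketch shape are assumptions.
interface CriterionScoreSketch {
  earnedPoints: number;   // "earned_points" in the per-criterion LLM response
  justification: string;
}

interface VerdictSketch {
  success: boolean;
  processScore: number;                  // assumed numeric rubric score
  outcome: string;                       // fused/outcome verdict from the LLM
  failurePoint?: string;                 // present when failure analysis ran
  taskValidity?: string;                 // present when task validity was checked
  criterionScores: CriterionScoreSketch[];
}
```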
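Both approaches route evidence through `groupTopKByCriterion(numCriteria, relevanceScores, topK)` before any grading call. The diagram only names the function; the following self-contained sketch assumes `relevanceScores` is indexed as `[evidenceIdx][criterionIdx]` and is not the collector's actual code.

```ts
// Sketch: for each criterion, keep the indices of the K most relevant evidence items.
function groupTopKByCriterionSketch(
  numCriteria: number,
  relevanceScores: number[][], // assumed shape: relevanceScores[evidenceIdx][criterionIdx]
  topK: number,
): Map<number, number[]> {
  const byCriterion = new Map<number, number[]>();
  for (let c = 0; c < numCriteria; c++) {
    const ranked = relevanceScores
      .map((scores, evidenceIdx) => ({ evidenceIdx, score: scores[c] ?? 0 }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((entry) => entry.evidenceIdx);
    byCriterion.set(c, ranked);
  }
  return byCriterion;
}

// Example: 2 criteria, 3 evidence items, keep top 2 per criterion.
// groupTopKByCriterionSketch(2, [[0.9, 0.1], [0.4, 0.8], [0.7, 0.2]], 2)
//   -> Map { 0 => [0, 2], 1 => [1, 2] }
```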
Why
With trajectory capture and canonical evidence in place, the verifier needs the actual rubric-based judgment engine. This PR adds the new verifier backend while keeping the legacy `ask()` and `batchAsk()` paths isolated behind the backend flag.
What Changed
- Adds `RubricVerifier` with Approach A and Approach B verifier paths.
- Shared types live in `verifier/types.ts`; implementation modules import those types directly.
- Routes `STAGEHAND_EVALUATOR_BACKEND=verifier` through `V3Evaluator.verify()` and `generateRubric()` (see the sketch below).
- Keeps `ask()` and `batchAsk()` on the legacy backend only.
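A minimal sketch of the backend gating described above; the class internals, stand-in types, and exact error message are assumptions drawn from the architecture diagram, not the real `V3Evaluator` code.

```ts
// Hypothetical gating sketch (not the actual V3Evaluator implementation).
type Trajectory = unknown;
type TaskSpec = unknown;

interface EvaluatorBackend {
  verify(trajectory: Trajectory, taskSpec: TaskSpec): Promise<unknown>;
  ask(question: string): Promise<unknown>;
}

class V3EvaluatorSketch {
  constructor(
    private readonly legacy: EvaluatorBackend,        // stand-in for LegacyV3Evaluator
    private readonly rubricVerifier: EvaluatorBackend // stand-in for RubricVerifier
  ) {}

  private get useVerifierBackend(): boolean {
    return process.env.STAGEHAND_EVALUATOR_BACKEND === "verifier";
  }

  async verify(trajectory: Trajectory, taskSpec: TaskSpec) {
    // New path delegates to the rubric verifier; otherwise fall through to legacy.
    const backend = this.useVerifierBackend ? this.rubricVerifier : this.legacy;
    return backend.verify(trajectory, taskSpec);
  }

  async ask(question: string) {
    if (this.useVerifierBackend) {
      // Mirrors the diagram: ask()/batchAsk() are blocked on the verifier backend.
      throw new Error("verifier backend only supports verify() and generateRubric()");
    }
    return this.legacy.ask(question);
  }
}
```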
Tests
- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm --filter @browserbasehq/stagehand run build:esm`
- `pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-failure-step-parser.test.js packages/core/dist/esm/tests/unit/verifier-trajectory.test.js`
- `pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent`
- `git diff --check`