feat(verifier): add rubric verifier engine #2133

Open
miguelg719 wants to merge 9 commits into miguelgonzalez/verifier-04-evidence-normalization from miguelgonzalez/verifier-05-core-engine

Conversation

Collaborator

@miguelg719 miguelg719 commented May 15, 2026

Why

With trajectory capture and canonical evidence in place, the verifier needs the actual rubric-based judgment engine. This PR adds the new verifier backend while keeping legacy ask() and batchAsk() isolated behind the backend flag.
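The backend isolation described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `resolveBackend`, `assertMethodSupported`, and the two type aliases are hypothetical names, and only the `STAGEHAND_EVALUATOR_BACKEND` variable, the four method names, and the error message come from this PR.

```typescript
// Hypothetical sketch of gating legacy-only methods behind the backend flag.
// Only STAGEHAND_EVALUATOR_BACKEND and the method names come from the PR.
type EvaluatorBackend = "legacy" | "verifier";
type EvaluatorMethod = "ask" | "batchAsk" | "verify" | "generateRubric";

// Assumption: any value other than "verifier" falls back to the legacy backend.
function resolveBackend(
  env: Record<string, string | undefined>,
): EvaluatorBackend {
  return env.STAGEHAND_EVALUATOR_BACKEND === "verifier" ? "verifier" : "legacy";
}

// Throws when a legacy-only method is called while the verifier backend is active.
function assertMethodSupported(
  backend: EvaluatorBackend,
  method: EvaluatorMethod,
): void {
  const legacyOnly = method === "ask" || method === "batchAsk";
  if (backend === "verifier" && legacyOnly) {
    throw new Error(
      "verifier backend only supports verify() and generateRubric()",
    );
  }
}
```

Under these assumptions, `verify()` and `generateRubric()` pass the check on either backend, while `ask()` and `batchAsk()` pass only on the legacy backend.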

What Changed

  • Added RubricVerifier with Approach A and Approach B verifier paths.
  • Added prompt modules for rubric generation, relevance, per-criterion scoring, fused judgment, fused outcome, failure analysis, and task validity.
  • Added task validity and error taxonomy support.
  • Consolidated engine option, taxonomy, and failure-step parser types into verifier/types.ts; implementation modules import those types directly.
  • Aligned raw verifier result types with the engine output for empty trajectories and task-validity reasoning.
  • Extended trajectory asset path containment to externalized tier-1 image paths.
  • Bounded first-point-of-failure step-range parsing so malformed model output cannot trigger unbounded range expansion.
  • Wired STAGEHAND_EVALUATOR_BACKEND=verifier through V3Evaluator.verify() and generateRubric().
  • Kept ask() and batchAsk() on the legacy backend only.
  • Removed upstream verifier references from comments.
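For readers unfamiliar with path containment, the check mentioned in the list above can be illustrated with a small sketch. It assumes a hypothetical helper `resolveContained` and a hypothetical trajectory root directory. The actual implementation in this PR may differ, and this version does not resolve symlinks.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolves `candidate` against `rootDir` and rejects
// any result that would land outside `rootDir`. Symlinks are not resolved.
function resolveContained(rootDir: string, candidate: string): string {
  const root = path.resolve(rootDir);
  const resolved = path.resolve(root, candidate);
  const rel = path.relative(root, resolved);
  // A relative path that climbs out of the root starts with "..";
  // an absolute result means a different drive or root on some platforms.
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`asset path escapes trajectory directory: ${candidate}`);
  }
  return resolved;
}
```

With this sketch, a relative path such as `images/step-1.png` resolves inside the root, while `../secret.png` or an absolute path outside the root is rejected.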

Tests

  • pnpm --filter @browserbasehq/stagehand run typecheck
  • pnpm --filter @browserbasehq/stagehand run build:esm
  • pnpm --filter @browserbasehq/stagehand run test:core -- packages/core/dist/esm/tests/unit/verifier-failure-step-parser.test.js packages/core/dist/esm/tests/unit/verifier-trajectory.test.js
  • pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
  • git diff --check


changeset-bot Bot commented May 15, 2026

🦋 Changeset detected

Latest commit: b725247

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@browserbasehq/stagehand Patch
@browserbasehq/stagehand-evals Patch
@browserbasehq/stagehand-server-v3 Patch
@browserbasehq/stagehand-server-v4 Patch

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 19 files

Confidence score: 3/5

  • There is a concrete stability risk in packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts: unbounded range expansion can trigger OOM or hangs when parsing malformed LLM output.
  • Because this is severity 7/10 with high confidence (8/10) and can directly impact runtime behavior, the merge risk is moderate until a guardrail is added.
  • Pay close attention to packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts - cap per-range expansion (for example, max steps) to prevent runaway memory/CPU use.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts">

<violation number="1" location="packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts:131">
P1: Unbounded range expansion can cause OOM or hang on malformed LLM output. Add a cap on the number of elements generated from a single range segment (e.g., limit to 1000 steps).</violation>
</file>
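One way to apply the suggested cap is sketched below. This is an illustration only: `parseStepNumbersSketch` and `MAX_STEPS_PER_RANGE` are hypothetical names, the input format ("3, 5-7") is assumed, and dropping oversized ranges instead of clamping them is an assumption, not necessarily what `firstPointOfFailure.ts` does.

```typescript
// Assumed cap, taken from the "limit to 1000 steps" suggestion above.
const MAX_STEPS_PER_RANGE = 1000;

// Hypothetical parser for strings like "3, 5-7" -> [3, 5, 6, 7].
// Malformed segments and ranges wider than the cap are skipped.
function parseStepNumbersSketch(
  input: string,
  maxPerRange: number = MAX_STEPS_PER_RANGE,
): number[] {
  const steps = new Set<number>();
  for (const segment of input.split(",")) {
    const match = /^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/.exec(segment);
    if (!match) continue; // not "N" or "N-M"
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (!Number.isSafeInteger(start) || !Number.isSafeInteger(end)) continue;
    if (end < start) continue; // reversed range
    if (end - start + 1 > maxPerRange) continue; // would expand past the cap
    for (let step = start; step <= end; step++) steps.add(step);
  }
  return Array.from(steps).sort((a, b) => a - b);
}
```

Because the size check runs before the loop, a segment such as `1-999999999` costs constant time and memory instead of allocating one entry per step.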
Architecture diagram
sequenceDiagram
    participant Client as V3Evaluator (public API)
    participant Verifier as RubricVerifier (internal)
    participant LLM as LLMClient (provider-managed)
    participant Evidence as evidence.ts (collector)
    participant Prompts as prompt modules
    participant Taxonomy as errorTaxonomy.ts
    participant Disk as Trajectory on disk

    Note over Client,Disk: Primary entry: V3Evaluator.verify(trajectory, taskSpec)

    Client->>Verifier: verify(trajectory, taskSpec)
    alt STAGEHAND_EVALUATOR_BACKEND=verifier
        Client->>Client: NEW: instantiate RubricVerifier(getClient)
    else legacy
        Client->>Client: delegate to LegacyV3Evaluator (unchanged)
    end

    Verifier->>Disk: collectCanonicalEvidence(trajectory)
    Note over Verifier,Disk: NEW: rehydrate tier-1 agent images from imagePath<br/>NEW: rehydrate tier-2 probe screenshots from screenshotPath
    Disk-->>Verifier: canonicalImages[], canonicalText[]

    Verifier->>Taxonomy: getTaxonomyText(1,6,3)
    Taxonomy-->>Verifier: rendered taxonomy block

    Verifier->>Prompts: render RUBRIC_GENERATION_PROMPT(task, urlContext)
    alt taskSpec has precomputedRubric
        Verifier->>Verifier: use precomputed rubric (skip generation)
    else no precomputed rubric
        Verifier->>LLM: NEW: call getClient().generate() with rubric prompt
        LLM-->>Verifier: parsed rubric JSON (items: criterion[])
    end

    Verifier->>Verifier: NEW: select approach via VERIFIER_APPROACH env var

    alt Approach B (default) — fused judgment
        Verifier->>Evidence: groupTopKByCriterion(numCriteria, relevanceScores, topK)
        Evidence-->>Verifier: Map<criterionIdx, evidenceIdx[]>
        Verifier->>Prompts: build evidence manifest for all criteria
        Verifier->>LLM: NEW: FUSED_JUDGMENT_PROMPT (single multi-modal call)
        Note over Verifier,LLM: Grades ALL criteria + outcome + optional failure analysis + task validity
        LLM-->>Verifier: fused JSON (outcome, per_criterion[], optional failure_point, optional task_validity)
        Verifier->>Verifier: mapFusedPerCriterionToScores(rubric, perCriterion)
        Verifier->>Verifier: build Verdict from fused response

    else Approach A — per-criterion + fused outcome
        Verifier->>LLM: loop N times (NEW: MM_BATCHED_RELEVANCE_PROMPT in batches)
        Note over Verifier,LLM: Batched relevance: one call per batch of B evidence points
        LLM-->>Verifier: relevance scores per evidence per criterion
        Verifier->>Evidence: groupTopKByCriterion(...)
        Verifier->>LLM: loop N times (NEW: MM_PER_CRITERION_SCORE_PROMPT — one call per criterion)
        Note over Verifier,LLM: Per-criterion analysis + score (parallelizable)
        LLM-->>Verifier: per-criterion earned_points, justification
        Verifier->>Prompts: build rubric_summary from per-criterion scores
        Verifier->>LLM: NEW: FUSED_OUTCOME_PROMPT (consumes pre-scored rubric)
        LLM-->>Verifier: outcome verdict + optional failure analysis + task validity
    end

    opt Optional steps not folded (VERIFIER_OPTIONAL_STEPS=separate)
        Verifier->>LLM: NEW: FIRST_POINT_OF_FAILURE_PROMPT (Step 9a separate call)
        LLM-->>Verifier: failure analysis JSON
        Verifier->>Verifier: parseFailureStepNumbers() for each failure point
        Verifier->>LLM: NEW: TASK_VALIDITY_PROMPT (Step 10 separate call)
        LLM-->>Verifier: task validity JSON
    end

    Verifier-->>Client: Verdict { success, processScore, outcome, failurePoint, taskValidity, criterionScores }

    Note over Client: generateRubric(taskSpec) also uses RubricVerifier
    Client->>Verifier: generateRubric(taskSpec)
    alt STAGEHAND_EVALUATOR_BACKEND=verifier
        Verifier->>LLM: RUBRIC_GENERATION_PROMPT (same as Step 0a)
        LLM-->>Verifier: rubric items
        Verifier-->>Client: Rubric { items: criterion[] }
    else legacy
        Client->>Client: return single legacyTaskCompletionCriterion (unchanged)
    end

    Note over Client: ask() / batchAsk() blocked on verifier backend
    Client->>Client: throw: "verifier backend only supports verify() and generateRubric()"
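The `groupTopKByCriterion(numCriteria, relevanceScores, topK)` step in the diagram above returns a map from criterion index to evidence indices. A minimal sketch of that kind of selection follows. The function name `groupTopKSketch`, the score layout `relevanceScores[evidenceIdx][criterionIdx]`, and the tie-breaking rule are all assumptions, since the diagram does not specify them.

```typescript
// Hypothetical top-K grouping. Assumed layout:
// relevanceScores[evidenceIdx][criterionIdx] is a numeric relevance score.
function groupTopKSketch(
  numCriteria: number,
  relevanceScores: number[][],
  topK: number,
): Map<number, number[]> {
  const grouped = new Map<number, number[]>();
  for (let c = 0; c < numCriteria; c++) {
    const ranked = relevanceScores
      // Missing scores are treated as 0 (assumption).
      .map((scores, evidenceIdx) => ({ evidenceIdx, score: scores[c] ?? 0 }))
      // Highest score first; ties keep the earlier evidence item first.
      .sort((a, b) => b.score - a.score || a.evidenceIdx - b.evidenceIdx)
      .slice(0, Math.max(0, topK))
      .map((entry) => entry.evidenceIdx);
    grouped.set(c, ranked);
  }
  return grouped;
}
```

Each criterion then sees at most `topK` evidence items, which bounds the size of the evidence manifest passed to the judgment prompt.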


Comment thread packages/core/lib/v3/verifier/prompts/firstPointOfFailure.ts Outdated
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from 3fee3cb to 7d010ed (May 15, 2026 21:23)
miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 163db47 to ebe60bf (May 15, 2026 21:23)
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from 7d010ed to cfa9a9a (May 15, 2026 21:45)
miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from ebe60bf to 191904b (May 15, 2026 21:45)
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from cfa9a9a to a152252 (May 15, 2026 22:33)
miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 191904b to 62cb8db (May 15, 2026 22:33)
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from a152252 to fc5a9f7 (May 15, 2026 23:27)
miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch 2 times, most recently from a6ee702 to 2e7ff0f (May 16, 2026 04:40)
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from fc5a9f7 to be24a26 (May 16, 2026 04:40)
miguelg719 force-pushed the miguelgonzalez/verifier-05-core-engine branch from 2e7ff0f to b725247 (May 16, 2026 05:50)
miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from be24a26 to 4e9c26e (May 16, 2026 05:50)