feat(verifier): normalize canonical evidence by miguelg719 · Pull Request #2132 · browserbase/stagehand

miguelg719 · 2026-05-15T20:58:43Z

Why

The verifier needs a mode-neutral evidence layer before scoring. DOM/Hybrid tasks can only be judged correctly if text, JSON, and tool-output evidence are considered alongside screenshots and ARIA evidence, while core installs should not pick up image-processing runtime dependencies.

What Changed

Added canonical screenshot loading, deduplication, resizing, and step-index mapping.
Added canonical text evidence for ARIA snippets, agent text, agent JSON, and native tool output.
Added combined chronological canonical evidence collection.
Moved canonical evidence and evidence-load result types into verifier/types.ts; implementation files import those types directly.
Kept image reduction behind a dynamic sharp import so core remains dependency-light; sharp is owned by evals where verifier tooling runs.
Added a smoke check proving probe-aria, agent-text, agent-json, and tool-output evidence are collected.
Removed upstream verifier references from comments.
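
As a rough illustration of the dynamic-import point above, the sketch below shows one way an optional image library can be resolved at call time instead of being a declared dependency. It is not the code from this PR: `loadOptionalModule` and `describeReductionMode` are hypothetical names, and only the `"no-dedup"` label is taken from the review diagram.

```typescript
// Hypothetical helper (not from the PR): resolve an optional package at call time.
// Resolves to null instead of throwing when the package is not installed.
async function loadOptionalModule(specifier: string): Promise<unknown> {
  try {
    // A non-literal specifier keeps the import out of static analysis,
    // so installs without the package still type-check and bundle.
    const mod: any = await import(specifier);
    return mod.default ?? mod;
  } catch {
    return null;
  }
}

// Hypothetical: report which screenshot path a run would take.
// "dedup-and-resize" is an illustrative label, not a name from the PR.
async function describeReductionMode(
  specifier: string = "sharp",
): Promise<"dedup-and-resize" | "no-dedup"> {
  const imageLib = await loadOptionalModule(specifier);
  return imageLib === null ? "no-dedup" : "dedup-and-resize";
}
```

With this shape, a core-only install would take the `"no-dedup"` path, while an environment that installs `sharp` (evals, per the description) would get reduction.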

Tests

pnpm --filter @browserbasehq/stagehand run typecheck
pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent
Canonical evidence smoke: collectCanonicalEvidence produced probe-aria,agent-text,agent-json,tool-output
git diff --check
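
The smoke check above can be pictured with a minimal sketch along these lines. The four source labels come from the PR description; everything else (`StepSketch`, its field names, and `collectTextEvidenceSketch`) is an assumption made for illustration and may not match the real `collectCanonicalEvidence` signature.

```typescript
type TextSource = "probe-aria" | "agent-text" | "agent-json" | "tool-output";

interface TextEvidenceSketch {
  source: TextSource;
  stepIndex: number;
  text: string;
}

// Hypothetical step shape; the field names are assumptions, not the PR's types.
interface StepSketch {
  ariaSnippet?: string;
  agentText?: string;
  agentJson?: unknown;
  toolOutput?: string;
}

// Walk the steps in order and emit one evidence item per non-empty text source.
function collectTextEvidenceSketch(steps: StepSketch[]): TextEvidenceSketch[] {
  const out: TextEvidenceSketch[] = [];
  steps.forEach((step, stepIndex) => {
    const push = (source: TextSource, text: string | undefined): void => {
      if (text !== undefined && text.trim() !== "") {
        out.push({ source, stepIndex, text });
      }
    };
    push("probe-aria", step.ariaSnippet);
    push("agent-text", step.agentText);
    push(
      "agent-json",
      step.agentJson === undefined ? undefined : JSON.stringify(step.agentJson),
    );
    push("tool-output", step.toolOutput);
  });
  return out;
}

// Toy trajectory with made-up content, covering all four sources.
const smokeSteps: StepSketch[] = [
  { ariaSnippet: 'button "Submit"', agentText: "Clicking submit" },
  { agentJson: { done: true }, toolOutput: "exit 0" },
];

// Distinct sources in first-seen order.
const smokeSources: string = collectTextEvidenceSketch(smokeSteps)
  .map((e) => e.source)
  .filter((s, i, all) => all.indexOf(s) === i)
  .join(",");
```

For this toy input, `smokeSources` is `probe-aria,agent-text,agent-json,tool-output`, the same list of sources the smoke line reports.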

changeset-bot · 2026-05-15T20:58:57Z

🦋 Changeset detected

Latest commit: 4e9c26e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages

Name	Type
@browserbasehq/stagehand	Patch
@browserbasehq/stagehand-evals	Patch
@browserbasehq/stagehand-server-v3	Patch
@browserbasehq/stagehand-server-v4	Patch


cubic-dev-ai

1 issue found across 3 files

Confidence score: 2/5

High-confidence, high-severity risk: adding sharp in packages/core/package.json introduces a binary/node-gyp-backed dependency in core, which violates the stated core-lib constraint and is likely to cause build/portability regressions.
Because this is a concrete policy and runtime/build compatibility concern (severity 9/10, confidence 10/10), this is not a low-risk merge as-is.
Pay close attention to packages/core/package.json - sharp should be removed, replaced, or isolated outside core to avoid binary dependency risk.

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/package.json">

<violation number="1" location="packages/core/package.json:113">
P0: Custom agent: **Flag any imports of packages that use node-gyp or embed binaries (e.g. sharp)**

`sharp` (a binary/node-gyp-backed package) was added to `packages/core/package.json` dependencies, violating the core-library restriction on native/binary npm packages.</violation>
</file>
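
The rule behind this violation could be approximated by a small manifest check like the one below. This is only a sketch under stated assumptions: `findNativeDependencies` and `ManifestSketch` are hypothetical names, the denylist holds only `sharp` because that is the one package named in the review, and treating `optionalDependencies` as in scope is a guess rather than the reviewer's documented policy.

```typescript
// Hypothetical subset of a package.json manifest.
interface ManifestSketch {
  name?: string;
  dependencies?: Record<string, string>;
  optionalDependencies?: Record<string, string>;
}

// Return the denylisted packages that appear among the runtime dependencies.
function findNativeDependencies(
  manifest: ManifestSketch,
  denylist: string[] = ["sharp"],
): string[] {
  const declared = Object.keys({
    ...manifest.dependencies,
    ...manifest.optionalDependencies,
  });
  return denylist.filter((name) => declared.indexOf(name) !== -1);
}

// Placeholder manifests; the package names and version ranges are made up.
const flaggedExample = findNativeDependencies({
  name: "example-core",
  dependencies: { sharp: "*", "example-lib": "*" },
});
const cleanExample = findNativeDependencies({
  name: "example-core",
  dependencies: { "example-lib": "*" },
});
```

Here `flaggedExample` contains `"sharp"` and `cleanExample` is empty, which is the distinction the review comment draws for `packages/core/package.json`.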

Architecture diagram

sequenceDiagram
    participant Verifier as Evidence Verifier
    participant Loader as loadAndReduceScreenshots
    participant Sharp as Sharp (Optional)
    participant TextCollector as collectCanonicalEvidence
    participant Trajectory as Trajectory Data

    Note over Verifier,Trajectory: Canonical Evidence Collection Flow

    Verifier->>Trajectory: Read trajectory steps
    Verifier->>Loader: Call loadAndReduceScreenshots(trajectory)

    Loader->>Loader: Parse env thresholds (SSIM, MSE, resize)

    loop Each trajectory step
        Loader->>Trajectory: Check probeEvidence.screenshot buffer
        alt Screenshot exists
            Loader->>Loader: Add to rawFrames list
        else No screenshot
            Loader->>Loader: Skip step
        end
    end

    alt Sharp available
        Loader->>Sharp: Dynamic import sharp
        Sharp-->>Loader: sharp instance

        loop Dedup and resize frames
            Loader->>Sharp: calculateMSE(prev.bytes, frame.bytes)
            Sharp-->>Loader: MSE value

            alt MSE >= threshold
                Loader->>Sharp: calculateSSIM(prev.bytes, frame.bytes)
                Sharp-->>Loader: SSIM value

                alt SSIM < threshold
                    Loader->>Loader: Keep frame (keptReason: "diverges")
                else SSIM >= threshold
                    Loader->>Loader: Drop duplicate frame
                end
            else MSE < threshold
                Loader->>Loader: Drop duplicate frame (fast path)
            end

            alt First or last frame
                Loader->>Loader: Always keep (keptReason: "first"/"last")
            end
        end

        Loader->>Sharp: Resize kept frames (resizeFactor)
        Sharp-->>Loader: Resized Buffer
    else Sharp unavailable
        Loader->>Loader: Keep all frames native size (keptReason: "no-dedup")
    end

    Loader-->>Verifier: EvidenceLoadResult (canonical screenshots)

    Verifier->>TextCollector: Call collectCanonicalEvidence(trajectory)

    loop Each trajectory step
        TextCollector->>Trajectory: Extract aria snippet
        TextCollector->>Trajectory: Extract agent text
        TextCollector->>Trajectory: Extract agent JSON
        TextCollector->>Trajectory: Extract tool output

        alt Text evidence exists
            TextCollector->>TextCollector: Create CanonicalTextEvidence with source type
        end
    end

    TextCollector-->>Verifier: Array<CanonicalTextEvidence>

    Verifier->>Verifier: Merge screenshots + text into chronological CanonicalEvidence[]
    
    Note over Verifier: StepIndex → CanonicalIndex mapping preserved

    Verifier-->>Verifier: Evidence ready for relevance scoring (Step 2)
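
The deduplication branch of the diagram can be sketched as a pure function with the image metrics injected. This is an illustration, not the PR's implementation: `dedupFramesSketch`, `FrameSketch`, and `MetricsSketch` are hypothetical names, the thresholds are parameters rather than the env-derived values, and comparing against the previously kept frame is an assumption, since the diagram only says `prev`.

```typescript
type KeptReason = "first" | "last" | "diverges";

interface FrameSketch {
  stepIndex: number;
  bytes: Uint8Array;
}

interface KeptFrameSketch extends FrameSketch {
  keptReason: KeptReason;
}

// Injected metrics, so the decision logic can run without an image library.
interface MetricsSketch {
  mse(a: Uint8Array, b: Uint8Array): number;
  ssim(a: Uint8Array, b: Uint8Array): number;
}

function dedupFramesSketch(
  frames: FrameSketch[],
  metrics: MetricsSketch,
  mseThreshold: number,
  ssimThreshold: number,
): { kept: KeptFrameSketch[]; stepToCanonical: Record<number, number> } {
  const kept: KeptFrameSketch[] = [];
  const stepToCanonical: Record<number, number> = {};

  frames.forEach((frame, i) => {
    const prev = kept[kept.length - 1];
    let reason: KeptReason | null = null;

    if (prev === undefined) {
      reason = "first"; // nothing kept yet, so this is the first frame
    } else if (i === frames.length - 1) {
      reason = "last"; // the final frame is always kept
    } else if (
      // Fast path: MSE below the threshold drops the frame without computing SSIM.
      metrics.mse(prev.bytes, frame.bytes) >= mseThreshold &&
      metrics.ssim(prev.bytes, frame.bytes) < ssimThreshold
    ) {
      reason = "diverges";
    }

    if (reason !== null) {
      // Record where this step lands in the canonical (deduplicated) list.
      stepToCanonical[frame.stepIndex] = kept.length;
      kept.push({ ...frame, keptReason: reason });
    }
  });

  return { kept, stepToCanonical };
}
```

Keeping the metrics behind an interface means the keep/drop decisions and the step-index mapping can be exercised with toy metrics, independent of whether `sharp` is available.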


cubic-dev-ai Bot reviewed May 15, 2026


Comment thread packages/core/package.json Outdated

miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from 3fee3cb to 7d010ed on May 15, 2026 21:23

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch 2 times, most recently from da0c152 to d77e596, on May 15, 2026 21:45

miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch 3 times, most recently from a152252 to fc5a9f7, on May 15, 2026 23:27

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from 83b4e86 to fd043bc on May 16, 2026 04:40

miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from fc5a9f7 to be24a26 on May 16, 2026 04:40

miguelg719 added 5 commits May 15, 2026 22:49

af07ca1 feat(verifier): normalize canonical evidence
591e873 chore: add evidence normalization changeset
8d387fd fix(core): keep sharp out of verifier install path
6dae7da refactor(verifier): consolidate evidence types
4e9c26e refactor(verifier): keep evidence types in types module

miguelg719 force-pushed the miguelgonzalez/verifier-03-trajectory-recorder branch from fd043bc to 635b3d2 on May 16, 2026 05:50

miguelg719 force-pushed the miguelgonzalez/verifier-04-evidence-normalization branch from be24a26 to 4e9c26e on May 16, 2026 05:50

miguelg719 wants to merge 5 commits into miguelgonzalez/verifier-03-trajectory-recorder from miguelgonzalez/verifier-04-evidence-normalization
