feat(verifier): normalize canonical evidence #2132
Open
miguelg719 wants to merge 5 commits into
🦋 Changeset detected
Latest commit: 4e9c26e
The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
1 issue found across 3 files
Confidence score: 2/5
- High-confidence, high-severity risk: adding `sharp` in `packages/core/package.json` introduces a binary/node-gyp-backed dependency in core, which violates the stated core-lib constraint and is likely to cause build/portability regressions.
- Because this is a concrete policy and runtime/build compatibility concern (severity 9/10, confidence 10/10), this is not a low-risk merge as-is.
- Pay close attention to `packages/core/package.json`: `sharp` should be removed, replaced, or isolated outside core to avoid binary dependency risk.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/package.json">
<violation number="1" location="packages/core/package.json:113">
P0: Custom agent: **Flag any imports of packages that use node-gyp or embed binaries (e.g. sharp)**
`sharp` (a binary/node-gyp-backed package) was added to `packages/core/package.json` dependencies, violating the core-library restriction on native/binary npm packages.</violation>
</file>
Architecture diagram
sequenceDiagram
participant Verifier as Evidence Verifier
participant Loader as loadAndReduceScreenshots
participant Sharp as Sharp (Optional)
participant TextCollector as collectCanonicalEvidence
participant Trajectory as Trajectory Data
Note over Verifier,Trajectory: Canonical Evidence Collection Flow
Verifier->>Trajectory: Read trajectory steps
Verifier->>Loader: Call loadAndReduceScreenshots(trajectory)
Loader->>Loader: Parse env thresholds (SSIM, MSE, resize)
loop Each trajectory step
Loader->>Trajectory: Check probeEvidence.screenshot buffer
alt Screenshot exists
Loader->>Loader: Add to rawFrames list
else No screenshot
Loader->>Loader: Skip step
end
end
alt Sharp available
Loader->>Sharp: Dynamic import sharp
Sharp-->>Loader: sharp instance
loop Dedup and resize frames
Loader->>Sharp: calculateMSE(prev.bytes, frame.bytes)
Sharp-->>Loader: MSE value
alt MSE >= threshold
Loader->>Sharp: calculateSSIM(prev.bytes, frame.bytes)
Sharp-->>Loader: SSIM value
alt SSIM < threshold
Loader->>Loader: Keep frame (keptReason: "diverges")
else SSIM >= threshold
Loader->>Loader: Drop duplicate frame
end
else MSE < threshold
Loader->>Loader: Drop duplicate frame (fast path)
end
alt First or last frame
Loader->>Loader: Always keep (keptReason: "first"/"last")
end
end
Loader->>Sharp: Resize kept frames (resizeFactor)
Sharp-->>Loader: Resized Buffer
else Sharp unavailable
Loader->>Loader: Keep all frames native size (keptReason: "no-dedup")
end
Loader-->>Verifier: EvidenceLoadResult (canonical screenshots)
Verifier->>TextCollector: Call collectCanonicalEvidence(trajectory)
loop Each trajectory step
TextCollector->>Trajectory: Extract aria snippet
TextCollector->>Trajectory: Extract agent text
TextCollector->>Trajectory: Extract agent JSON
TextCollector->>Trajectory: Extract tool output
alt Text evidence exists
TextCollector->>TextCollector: Create CanonicalTextEvidence with source type
end
end
TextCollector-->>Verifier: Array<CanonicalTextEvidence>
Verifier->>Verifier: Merge screenshots + text into chronological CanonicalEvidence[]
Note over Verifier: StepIndex → CanonicalIndex mapping preserved
Verifier-->>Verifier: Evidence ready for relevance scoring (Step 2)
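The dedup branch of the diagram above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the type shapes, the `dedupFramesSketch` name, the byte-wise `calculateMSE`, and the choice to compare each frame against the last kept frame are all assumptions, and the SSIM function is injected so the sketch does not depend on `sharp` or any other image library.

```typescript
// Hypothetical shapes: names echo the diagram, but the fields are assumptions.
type KeptReason = "first" | "last" | "diverges" | "no-dedup";

interface RawFrame {
  stepIndex: number;
  bytes: Uint8Array; // assumed: decoded pixels of equal length per frame
}

interface KeptFrame extends RawFrame {
  keptReason: KeptReason;
}

interface DedupOptions {
  mseThreshold: number;
  ssimThreshold: number;
  // Injected comparator; when absent, the sketch falls back to "no-dedup".
  calculateSSIM?: (a: Uint8Array, b: Uint8Array) => number;
}

// Mean squared error over raw bytes (a stand-in for a real image metric).
function calculateMSE(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length === 0) return Number.POSITIVE_INFINITY;
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    sum += d * d;
  }
  return sum / a.length;
}

function dedupFramesSketch(frames: RawFrame[], opts: DedupOptions): KeptFrame[] {
  const ssim = opts.calculateSSIM;
  if (!ssim) {
    // Image tooling unavailable: keep every frame untouched.
    return frames.map((f): KeptFrame => ({ ...f, keptReason: "no-dedup" }));
  }
  const kept: KeptFrame[] = [];
  frames.forEach((frame, i) => {
    if (i === 0) {
      kept.push({ ...frame, keptReason: "first" });
      return;
    }
    if (i === frames.length - 1) {
      kept.push({ ...frame, keptReason: "last" });
      return;
    }
    const prev = kept[kept.length - 1];
    if (!prev) return;
    // Fast path: a low MSE means a near-identical frame, so drop it.
    if (calculateMSE(prev.bytes, frame.bytes) < opts.mseThreshold) return;
    // Slow path: keep the frame only if SSIM confirms it diverges.
    if (ssim(prev.bytes, frame.bytes) < opts.ssimThreshold) {
      kept.push({ ...frame, keptReason: "diverges" });
    }
  });
  return kept;
}
```

Passing the comparator in as an option mirrors the "Sharp available / Sharp unavailable" split: a caller that cannot load an image library simply omits `calculateSSIM` and gets every frame back with `keptReason: "no-dedup"`.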
Why
The verifier needs a mode-neutral evidence layer before scoring. DOM/Hybrid tasks can only be judged correctly if text, JSON, and tool-output evidence are considered alongside screenshots and ARIA evidence, while core installs should not pick up image-processing runtime dependencies.
What Changed
- `verifier/types.ts`; implementation files import those types directly.
- `sharp` import so core remains dependency-light; `sharp` is owned by evals where verifier tooling runs.
- `probe-aria`, `agent-text`, `agent-json`, and `tool-output` evidence are collected.
Tests
- `pnpm --filter @browserbasehq/stagehand run typecheck`
- `pnpm -w exec prettier --check packages/core/lib/v3/verifier packages/core/tests/unit/verifier-failure-step-parser.test.ts packages/core/tests/unit/verifier-trajectory.test.ts packages/evals/cli.ts packages/evals/framework packages/evals/tests/framework/rubricCache.test.ts packages/evals/tests/framework/verifierAdapter.test.ts packages/evals/tests/tui/commandTree.test.ts packages/evals/tui packages/evals/tasks/bench/agent`
- `collectCanonicalEvidence` produced `probe-aria`, `agent-text`, `agent-json`, `tool-output`
- `git diff --check`
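For readers unfamiliar with the four text sources named above, here is a rough sketch of how per-step text evidence might be collected and merged with screenshots into one chronological list. The step fields (`ariaSnippet`, `agentText`, `agentJson`, `toolOutput`), the function names, and the evidence shapes are hypothetical and chosen only for illustration; the actual types live in the PR's `verifier/types.ts` and may differ.

```typescript
type TextSource = "probe-aria" | "agent-text" | "agent-json" | "tool-output";

interface TextEvidence {
  kind: "text";
  stepIndex: number;
  source: TextSource;
  text: string;
}

interface ScreenshotEvidence {
  kind: "screenshot";
  stepIndex: number;
  keptReason: string;
}

type Evidence = TextEvidence | ScreenshotEvidence;

// Hypothetical trajectory step shape (an assumption, not the PR's type).
interface StepSketch {
  ariaSnippet?: string;
  agentText?: string;
  agentJson?: unknown;
  toolOutput?: string;
}

function collectTextEvidenceSketch(steps: StepSketch[]): TextEvidence[] {
  const out: TextEvidence[] = [];
  steps.forEach((step, stepIndex) => {
    const json =
      step.agentJson === undefined ? undefined : JSON.stringify(step.agentJson);
    const candidates: Array<[TextSource, string | undefined]> = [
      ["probe-aria", step.ariaSnippet],
      ["agent-text", step.agentText],
      ["agent-json", json],
      ["tool-output", step.toolOutput],
    ];
    for (const [source, text] of candidates) {
      // Only emit evidence when the step actually produced non-blank text.
      if (text && text.trim().length > 0) {
        out.push({ kind: "text", stepIndex, source, text });
      }
    }
  });
  return out;
}

function mergeEvidenceSketch(
  screenshots: ScreenshotEvidence[],
  text: TextEvidence[],
): { evidence: Evidence[]; stepToCanonical: Map<number, number[]> } {
  // Array.prototype.sort is stable, so screenshots precede text within a step.
  const evidence: Evidence[] = [...screenshots, ...text].sort(
    (a, b) => a.stepIndex - b.stepIndex,
  );
  // Record which canonical indices each trajectory step maps to.
  const stepToCanonical = new Map<number, number[]>();
  evidence.forEach((item, canonicalIndex) => {
    const list = stepToCanonical.get(item.stepIndex) ?? [];
    list.push(canonicalIndex);
    stepToCanonical.set(item.stepIndex, list);
  });
  return { evidence, stepToCanonical };
}
```

Keeping the step-to-canonical map alongside the merged list is one way to preserve the StepIndex to CanonicalIndex mapping, so that a later scoring stage can trace any evidence item back to the trajectory step that produced it.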