Add classifier support to eval by Qard · Pull Request #114 · braintrustdata/braintrust-sdk-java

Stephen Belanger (Qard) · 2026-05-26T14:35:26Z

Summary

Ports classifier support from the Ruby SDK (braintrust-sdk-ruby#154), following the canonical contract in braintrust-spec/docs/features/classifiers.md.

Classifiers categorize and label eval outputs. Unlike scorers (numeric 0–1), they return structured Classification items with an id, optional label, and optional metadata. They run alongside scorers; their failures are non-fatal; and at least one of scorers or classifiers is now required (relaxes the prior "scorers required" check).
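
The relaxed requirement can be pictured with a minimal sketch. The class `EvalConfigSketch`, its `validate` method, and the error message are hypothetical stand-ins for illustration; they are not the SDK's actual `Eval.Builder` code.

```java
import java.util.List;

// Hypothetical stand-in for the builder check; not the SDK's Eval.Builder.
final class EvalConfigSketch {
    private EvalConfigSketch() {}

    /** Throws unless at least one scorer or classifier is configured. */
    static void validate(List<?> scorers, List<?> classifiers) {
        if (scorers.isEmpty() && classifiers.isEmpty()) {
            // Illustrative message; the real wording may differ.
            throw new IllegalStateException("at least one scorer or classifier is required");
        }
    }

    public static void main(String[] args) {
        validate(List.of("exact_match"), List.of()); // scorers only: accepted
        validate(List.of(), List.of("sentiment"));   // classifiers only: accepted
        try {
            validate(List.of(), List.of());          // neither: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```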

New public types

Classification — record carrying optional name, required id, optional label, optional metadata.
Classifier<INPUT, OUTPUT> — interface with classify(TaskResult); static factories Classifier.of(name, fn) (list-returning) and Classifier.single(name, fn) (single-result convenience). Validates non-blank id with the spec wording.
TracedClassifier<INPUT, OUTPUT> — parallel to TracedScorer, gives classifiers access to the BrainstoreTrace.
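
As a rough picture of how these types fit together, here is a simplified sketch. The names `ClassificationSketch` and `ClassifierSketch` are hypothetical, the classifier takes the output directly instead of a TaskResult, and the validation message is invented; the real signatures in the SDK may differ.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical shapes inferred from the description above; not the SDK's real types.
record ClassificationSketch(String name, String id, String label, Map<String, Object> metadata) {
    ClassificationSketch {
        // id is the only required field; name, label, and metadata may be null.
        if (id == null || id.isBlank()) {
            throw new IllegalArgumentException("classification id must be non-blank");
        }
    }
}

interface ClassifierSketch<OUTPUT> {
    String name();

    List<ClassificationSketch> classify(OUTPUT output);

    // List-returning factory.
    static <O> ClassifierSketch<O> of(String name, Function<O, List<ClassificationSketch>> fn) {
        return new ClassifierSketch<O>() {
            @Override public String name() { return name; }
            @Override public List<ClassificationSketch> classify(O output) { return fn.apply(output); }
        };
    }

    // Single-result convenience: wraps the one classification in a list.
    static <O> ClassifierSketch<O> single(String name, Function<O, ClassificationSketch> fn) {
        return of(name, output -> List.of(fn.apply(output)));
    }
}

final class ClassifierSketchDemo {
    public static void main(String[] args) {
        ClassifierSketch<String> sentiment = ClassifierSketch.single(
                "sentiment",
                out -> new ClassificationSketch(
                        null, out.contains("great") ? "positive" : "neutral", null, Map.of()));
        System.out.println(sentiment.classify("great answer"));
    }
}
```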

Eval integration

Eval.Builder gains classifiers(...); the builder rejects "no scorers and no classifiers" with a clear runtime error.
New runClassifier helper emits a classifier span per classifier with type=classifier, purpose=scorer, name=<resolved>, dispatching to TracedClassifier when applicable.
Per-case classifications aggregate onto the root eval span as braintrust.classifications (only when non-empty, per spec).
Classifier exceptions are recorded on the classifier span and merged into the root span's braintrust.metadata.classifier_errors — other classifiers and scorers continue to run; the eval does not abort.
Classifiers only run when the task succeeds (matches Ruby; spec defines no task-exception fallback).
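
The per-case flow described in this list can be sketched roughly as follows. Everything here is an assumption made for illustration: the `ClassifierRunSketch` class and its records are hypothetical, spans are omitted, classifications are grouped by classifier name, and the error map is flattened into a single attribute key instead of being merged into a metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the per-case flow; spans and SDK types are omitted.
final class ClassifierRunSketch {
    record Item(String id) {}
    record Named(String name, Function<String, List<Item>> fn) {}
    record Outcome(Map<String, List<Item>> classifications, Map<String, String> errors) {}

    static Outcome runAll(List<Named> classifiers, String output) {
        Map<String, List<Item>> grouped = new LinkedHashMap<>();
        Map<String, String> errors = new LinkedHashMap<>();
        for (Named classifier : classifiers) {
            List<Item> items;
            try {
                items = classifier.fn().apply(output);
            } catch (RuntimeException e) {
                // Non-fatal: record the failure and move on to the next classifier.
                errors.put(classifier.name(), String.valueOf(e.getMessage()));
                continue;
            }
            if (!items.isEmpty()) {
                grouped.computeIfAbsent(classifier.name(), k -> new ArrayList<>()).addAll(items);
            }
        }
        return new Outcome(grouped, errors);
    }

    // Builds root-span attributes, attaching each key only when there is something to report.
    static Map<String, Object> rootAttributes(Outcome outcome) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        if (!outcome.classifications().isEmpty()) {
            attrs.put("braintrust.classifications", outcome.classifications());
        }
        if (!outcome.errors().isEmpty()) {
            // Assumption: flattened key for brevity.
            attrs.put("braintrust.metadata.classifier_errors", outcome.errors());
        }
        return attrs;
    }
}
```

In this sketch a throwing classifier contributes only an entry to the error map, while the remaining classifiers still run and their results are still attached.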

Drive-by fix

TestHarness.ensureRemoteDataset's post-rebuild verify check had an inverted condition: it threw "failed to ensure expected dataset" when the rebuild succeeded. The bug only surfaced during cassette re-recording against a fresh project.
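
For readers unfamiliar with this kind of bug, a reduced illustration follows. `VerifySketch` and the `datasetMatches` flag are hypothetical; this is not the actual TestHarness code, only the general shape of an inverted check and its fix.

```java
// Hypothetical reduction of an inverted verify check; not the actual TestHarness code.
final class VerifySketch {
    private VerifySketch() {}

    // Buggy shape: throws exactly when verification succeeded.
    static void verifyBuggy(boolean datasetMatches) {
        if (datasetMatches) {
            throw new IllegalStateException("failed to ensure expected dataset");
        }
    }

    // Fixed shape: throws only when the dataset still does not match.
    static void verifyFixed(boolean datasetMatches) {
        if (!datasetMatches) {
            throw new IllegalStateException("failed to ensure expected dataset");
        }
    }
}
```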

Cassette churn

To produce a coherent VCR scenario chain for the new ClassifierEvalTest integration tests, the 118 braintrust cassettes were re-recorded from scratch via VCR_MODE=record ./gradlew :braintrust-sdk:test --max-workers=1. WireMock numbers scenarios starting from 1 each recording session, so adding cassettes piecemeal causes name collisions with the existing chain — re-recording in one shot is the only clean path. Third-party AI provider cassettes (anthropic/openai/google/bedrock) are untouched.

Test plan

VCR_MODE=replay ./gradlew :braintrust-sdk:test — green
VCR_MODE=replay ./gradlew test (full repo incl. instrumentation + smoke tests) — green
./gradlew :braintrust-sdk:spotlessCheck :examples:spotlessCheck :test-harness:spotlessCheck — green
Smoke ClassifiersExample against a real Braintrust project and verify the experiment shows classifier spans + grouped classifications

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 671c0ccce3

chatgpt-codex-connector · 2026-05-26T14:40:50Z

+            for (var item : classifications) {
+                var itemName = item.name();
+                if (itemName == null || itemName.isBlank()) {


Keep classifier post-processing failures non-fatal

runClassifier only catches exceptions thrown by classifier.classify(...), but it processes returned items outside that try/catch. A custom Classifier implementation (which this commit explicitly supports) can return a list containing null or otherwise malformed items, and item.name() will throw here, escaping runClassifier and aborting the eval case. That breaks the intended contract that classifier failures are non-fatal and should be recorded under classifier_errors instead.
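
One way to read this suggestion is to move the item post-processing inside the same guard as the classify call. The sketch below is hypothetical: `GuardedClassifierSketch`, its records, and the fallback of unnamed items to the classifier's name are assumptions for illustration, not the PR's actual runClassifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the reviewer's suggestion; names and types are illustrative only.
final class GuardedClassifierSketch {
    record Item(String name, String id) {}
    record Result(List<Item> items, String error) {}

    static Result run(String classifierName, Function<String, List<Item>> classify, String output) {
        try {
            List<Item> resolved = new ArrayList<>();
            for (Item item : classify.apply(output)) {
                // item.name() throws NullPointerException when the list contains null.
                String itemName = item.name();
                if (itemName == null || itemName.isBlank()) {
                    // Assumption: unnamed items inherit the classifier's name.
                    itemName = classifierName;
                }
                resolved.add(new Item(itemName, item.id()));
            }
            return new Result(resolved, null);
        } catch (RuntimeException e) {
            // Failures in classify() and malformed items both land here,
            // so the eval case can continue and the error can be recorded.
            return new Result(List.of(), e.getClass().getSimpleName());
        }
    }
}
```

Collecting into a local list before returning means a classifier that fails halfway through contributes no partial results in this sketch.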

I don't think this is valid. A classifier returning a list with null values seems like a contract breach. We could explicitly document that you're not allowed to do this, but it seems so unlikely that I wouldn't say it's necessary.

Andrew Kent (realark) · 2026-05-27T01:07:29Z

Stephen Belanger (@Qard) could you only record fresh cassettes for ClassifierEvalTest.java to cut down on the diff size? something like VCR_MODE=record ./gradlew :braintrust-sdk:test --tests '*ClassifierEvalTest*' should do the trick

Andrew Kent (realark)

LGTM -- requesting smaller cassettes (I should probably make an AGENTS.md file telling agents not to do a full re-record)

Mirrors the Ruby SDK's classifier port (braintrustdata/braintrust-sdk-ruby#154) and the canonical classifier spec at braintrust-spec/docs/features/classifiers.md.

Classifiers return structured Classification items (id, optional label, optional metadata) instead of numeric scores. They run alongside scorers, their failures are non-fatal, and at least one of scorers/classifiers is required (relaxes the prior scorers-required check).

New public types: Classification, Classifier (+ Classifier.of / .single factories), TracedClassifier.

Eval gains a classifiers(...) builder method and a runClassifier helper that emits classifier spans with type=classifier, purpose=scorer; per-case classifications aggregate onto the root eval span as braintrust.classifications, and classifier exceptions land in braintrust.metadata.classifier_errors.

Also fixes an inverted-condition bug in TestHarness.ensureRemoteDataset's post-rebuild verify check (threw when datasets matched).

New cassettes were recorded only for ClassifierEvalTest (VCR_MODE=record ... --tests '*ClassifierEvalTest*') against the same Braintrust SDKs org used by the existing cassettes, so the rest of the cassette set is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stephen Belanger (Qard) · 2026-05-27T12:04:35Z

Andrew Kent (@realark) Fixed. Just new cassette files now. 🙂

Agreed on the AGENTS.md note. Claude didn't seem to understand how to record only the new files; it needed some guidance to do it right.

Stephen Belanger (Qard) requested a review from Andrew Kent (realark) May 26, 2026 14:37

Stephen Belanger (Qard) self-assigned this May 26, 2026

Stephen Belanger (Qard) added the enhancement New feature or request label May 26, 2026

chatgpt-codex-connector Bot reviewed May 26, 2026

Andrew Kent (realark) reviewed May 27, 2026

Stephen Belanger (Qard) force-pushed the classifiers branch from 671c0cc to 26b798a Compare May 27, 2026 12:01

Stephen Belanger (Qard) requested a review from Andrew Kent (realark) May 27, 2026 12:02

Andrew Kent (realark) approved these changes May 27, 2026

Andrew Kent (realark) merged commit 92826c9 into main May 27, 2026
1 check passed

Andrew Kent (realark) deleted the classifiers branch May 27, 2026 17:37

Andrew Kent (realark) merged 1 commit into main from classifiers
