Skip to content

Add classifier support to eval#114

Merged
Andrew Kent (realark) merged 1 commit into
mainfrom
classifiers
May 27, 2026
Merged

Add classifier support to eval#114
Andrew Kent (realark) merged 1 commit into
mainfrom
classifiers

Conversation

@Qard
Copy link
Copy Markdown
Contributor

Summary

Ports classifier support from the Ruby SDK (braintrust-sdk-ruby#154), following the canonical contract in braintrust-spec/docs/features/classifiers.md.

Classifiers categorize and label eval outputs. Unlike scorers (numeric 0–1), they return structured Classification items with an id, optional label, and optional metadata. They run alongside scorers; their failures are non-fatal; and at least one of scorers or classifiers is now required (relaxes the prior "scorers required" check).

New public types

  • Classification — record carrying optional name, required id, optional label, optional metadata.
  • Classifier<INPUT, OUTPUT> — interface with classify(TaskResult); static factories Classifier.of(name, fn) (list-returning) and Classifier.single(name, fn) (single-result convenience). Validates non-blank id with the spec wording.
  • TracedClassifier<INPUT, OUTPUT> — parallel to TracedScorer, gives classifiers access to the BrainstoreTrace.

Eval integration

  • Eval.Builder gains classifiers(...); the builder rejects "no scorers and no classifiers" with a clear runtime error.
  • New runClassifier helper emits a classifier span per classifier with type=classifier, purpose=scorer, name=<resolved>, dispatching to TracedClassifier when applicable.
  • Per-case classifications aggregate onto the root eval span as braintrust.classifications (only when non-empty, per spec).
  • Classifier exceptions are recorded on the classifier span and merged into the root span's braintrust.metadata.classifier_errors — other classifiers and scorers continue to run; the eval does not abort.
  • Classifiers only run when the task succeeds (matches Ruby; spec defines no task-exception fallback).

Drive-by fix

TestHarness.ensureRemoteDataset's post-rebuild verify check had an inverted condition (threw failed to ensure expected dataset when the rebuild succeeded). Only surfaced during cassette re-recording against a fresh project.

Cassette churn

To produce a coherent VCR scenario chain for the new ClassifierEvalTest integration tests, the 118 braintrust cassettes were re-recorded from scratch via VCR_MODE=record ./gradlew :braintrust-sdk:test --max-workers=1. WireMock numbers scenarios starting from 1 each recording session, so adding cassettes piecemeal causes name collisions with the existing chain — re-recording in one shot is the only clean path. Third-party AI provider cassettes (anthropic/openai/google/bedrock) are untouched.

Test plan

  • VCR_MODE=replay ./gradlew :braintrust-sdk:test — green
  • VCR_MODE=replay ./gradlew test (full repo incl. instrumentation + smoke tests) — green
  • ./gradlew :braintrust-sdk:spotlessCheck :examples:spotlessCheck :test-harness:spotlessCheck — green
  • Smoke ClassifiersExample against a real Braintrust project and verify the experiment shows classifier spans + grouped classifications

🤖 Generated with Claude Code

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 671c0ccce3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment on lines +324 to +326
for (var item : classifications) {
var itemName = item.name();
if (itemName == null || itemName.isBlank()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep classifier post-processing failures non-fatal

runClassifier only catches exceptions thrown by classifier.classify(...), but it processes returned items outside that try/catch. A custom Classifier implementation (which this commit explicitly supports) can return a list containing null or otherwise malformed items, and item.name() will throw here, escaping runClassifier and aborting the eval case. That breaks the intended contract that classifier failures are non-fatal and should be recorded under classifier_errors instead.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is valid. A classifier returning a list with null values seems like a contract breech. We could explicitly doc that you're not allowed to do this, but that seems so unlikely I wouldn't say it's necessary

@realark
Copy link
Copy Markdown
Collaborator

Stephen Belanger (@Qard) could you only record fresh cassettes for ClassifierEvalTest.java to cut down on the diff size? something like VCR_MODE=record ./gradlew :braintrust-sdk:test --tests '*ClassifierEvalTest*' should do the trick

Copy link
Copy Markdown
Collaborator

@realark Andrew Kent (realark) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- requesting smaller cassettes (I should probably make an AGENTS.md file telling agents not to do a full re-record)

Mirrors the Ruby SDK's classifier port (braintrustdata/braintrust-sdk-ruby#154)
and the canonical classifier spec at braintrust-spec/docs/features/classifiers.md.

Classifiers return structured Classification items (id, optional label,
optional metadata) instead of numeric scores. They run alongside scorers,
their failures are non-fatal, and at least one of scorers/classifiers is
required (relaxes the prior scorers-required check).

New public types: Classification, Classifier (+ Classifier.of / .single
factories), TracedClassifier. Eval gains a classifiers(...) builder method
and a runClassifier helper that emits classifier spans with
type=classifier, purpose=scorer; per-case classifications aggregate onto
the root eval span as braintrust.classifications, and classifier exceptions
land in braintrust.metadata.classifier_errors.

Also fixes an inverted-condition bug in TestHarness.ensureRemoteDataset's
post-rebuild verify check (threw when datasets matched). New cassettes
were recorded only for ClassifierEvalTest (VCR_MODE=record ... --tests
'*ClassifierEvalTest*') against the same Braintrust SDKs org used by the
existing cassettes, so the rest of the cassette set is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Qard
Copy link
Copy Markdown
Contributor Author

Andrew Kent (@realark) Fixed. Just new cassette files now. 🙂

Agreed with the AGENTS.md note, Claude didn't seem to understand how to only record new files, it needed some guidance to do it right.

@realark Andrew Kent (realark) merged commit 92826c9 into main May 27, 2026
1 check passed
@realark Andrew Kent (realark) deleted the classifiers branch May 27, 2026 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants