Add classifier support to eval#114
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 671c0ccce3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
| for (var item : classifications) { | ||
| var itemName = item.name(); | ||
| if (itemName == null || itemName.isBlank()) { |
There was a problem hiding this comment.
Keep classifier post-processing failures non-fatal
runClassifier only catches exceptions thrown by classifier.classify(...), but it processes returned items outside that try/catch. A custom Classifier implementation (which this commit explicitly supports) can return a list containing null or otherwise malformed items, and item.name() will throw here, escaping runClassifier and aborting the eval case. That breaks the intended contract that classifier failures are non-fatal and should be recorded under classifier_errors instead.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I don't think this is valid. A classifier returning a list with null values seems like a contract breech. We could explicitly doc that you're not allowed to do this, but that seems so unlikely I wouldn't say it's necessary
|
Stephen Belanger (@Qard) could you only record fresh cassettes for ClassifierEvalTest.java to cut down on the diff size? something like |
Andrew Kent (realark)
left a comment
There was a problem hiding this comment.
LGTM -- requesting smaller cassettes (I should probably make an AGENTS.md file telling agents not to do a full re-record)
Mirrors the Ruby SDK's classifier port (braintrustdata/braintrust-sdk-ruby#154) and the canonical classifier spec at braintrust-spec/docs/features/classifiers.md. Classifiers return structured Classification items (id, optional label, optional metadata) instead of numeric scores. They run alongside scorers, their failures are non-fatal, and at least one of scorers/classifiers is required (relaxes the prior scorers-required check). New public types: Classification, Classifier (+ Classifier.of / .single factories), TracedClassifier. Eval gains a classifiers(...) builder method and a runClassifier helper that emits classifier spans with type=classifier, purpose=scorer; per-case classifications aggregate onto the root eval span as braintrust.classifications, and classifier exceptions land in braintrust.metadata.classifier_errors. Also fixes an inverted-condition bug in TestHarness.ensureRemoteDataset's post-rebuild verify check (threw when datasets matched). New cassettes were recorded only for ClassifierEvalTest (VCR_MODE=record ... --tests '*ClassifierEvalTest*') against the same Braintrust SDKs org used by the existing cassettes, so the rest of the cassette set is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
671c0cc to
26b798a
Compare
|
Andrew Kent (@realark) Fixed. Just new cassette files now. 🙂 Agreed with the AGENTS.md note, Claude didn't seem to understand how to only record new files, it needed some guidance to do it right. |
Summary
Ports classifier support from the Ruby SDK (braintrust-sdk-ruby#154), following the canonical contract in braintrust-spec/docs/features/classifiers.md.
Classifiers categorize and label eval outputs. Unlike scorers (numeric 0–1), they return structured
Classificationitems with anid, optionallabel, and optionalmetadata. They run alongside scorers; their failures are non-fatal; and at least one ofscorersorclassifiersis now required (relaxes the prior "scorers required" check).New public types
Classification— record carrying optionalname, requiredid, optionallabel, optionalmetadata.Classifier<INPUT, OUTPUT>— interface withclassify(TaskResult); static factoriesClassifier.of(name, fn)(list-returning) andClassifier.single(name, fn)(single-result convenience). Validates non-blankidwith the spec wording.TracedClassifier<INPUT, OUTPUT>— parallel toTracedScorer, gives classifiers access to theBrainstoreTrace.Eval integration
Eval.Buildergainsclassifiers(...); the builder rejects "no scorers and no classifiers" with a clear runtime error.runClassifierhelper emits a classifier span per classifier withtype=classifier,purpose=scorer,name=<resolved>, dispatching toTracedClassifierwhen applicable.braintrust.classifications(only when non-empty, per spec).braintrust.metadata.classifier_errors— other classifiers and scorers continue to run; the eval does not abort.Drive-by fix
TestHarness.ensureRemoteDataset's post-rebuild verify check had an inverted condition (threwfailed to ensure expected datasetwhen the rebuild succeeded). Only surfaced during cassette re-recording against a fresh project.Cassette churn
To produce a coherent VCR scenario chain for the new
ClassifierEvalTestintegration tests, the 118 braintrust cassettes were re-recorded from scratch viaVCR_MODE=record ./gradlew :braintrust-sdk:test --max-workers=1. WireMock numbers scenarios starting from 1 each recording session, so adding cassettes piecemeal causes name collisions with the existing chain — re-recording in one shot is the only clean path. Third-party AI provider cassettes (anthropic/openai/google/bedrock) are untouched.Test plan
VCR_MODE=replay ./gradlew :braintrust-sdk:test— greenVCR_MODE=replay ./gradlew test(full repo incl. instrumentation + smoke tests) — green./gradlew :braintrust-sdk:spotlessCheck :examples:spotlessCheck :test-harness:spotlessCheck— greenClassifiersExampleagainst a real Braintrust project and verify the experiment shows classifier spans + grouped classifications🤖 Generated with Claude Code