Add classifier support to eval #114
Merged
37 changes: 37 additions & 0 deletions
braintrust-sdk/src/main/java/dev/braintrust/eval/Classification.java
@@ -0,0 +1,37 @@

```java
package dev.braintrust.eval;

import java.util.Map;
import javax.annotation.Nullable;

/**
 * A single structured classification produced by a {@link Classifier}.
 *
 * <p>Unlike a {@link Score} (numeric 0-1), a Classification carries a stable id, an optional
 * display label, and optional metadata. The {@code name} acts as the grouping key in the aggregated
 * result map; when {@code name} is {@code null} or blank, the owning classifier's resolved name is
 * used instead.
 *
 * @param name optional grouping key; defaults to the owning classifier's resolved name when null or
 *     blank
 * @param id stable identifier for the classification (required)
 * @param label optional display label
 * @param metadata optional arbitrary metadata
 */
public record Classification(
        @Nullable String name,
        String id,
        @Nullable String label,
        @Nullable Map<String, Object> metadata) {

    public static Classification of(String id) {
        return new Classification(null, id, null, null);
    }

    public static Classification of(String id, String label) {
        return new Classification(null, id, label, null);
    }

    public static Classification of(String name, String id, String label) {
        return new Classification(name, id, label, null);
    }
}
```
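The Javadoc for `Classification` describes a name fallback: a null or blank `name` is replaced by the owning classifier's resolved name. The following is a minimal sketch of that rule, not code from this PR. The record is re-declared locally without `@Nullable` so the snippet compiles on its own, and `resolveName` is a hypothetical helper introduced only for illustration.

```java
import java.util.Map;

public class ClassificationSketch {
    // Local stand-in mirroring the record in the diff (annotations omitted).
    public record Classification(
            String name, String id, String label, Map<String, Object> metadata) {

        public static Classification of(String id) {
            return new Classification(null, id, null, null);
        }

        public static Classification of(String id, String label) {
            return new Classification(null, id, label, null);
        }

        public static Classification of(String name, String id, String label) {
            return new Classification(name, id, label, null);
        }
    }

    // Hypothetical helper: picks the grouping key as the Javadoc describes.
    // A null or blank name falls back to the classifier's name.
    public static String resolveName(Classification c, String classifierName) {
        return (c.name() == null || c.name().isBlank()) ? classifierName : c.name();
    }

    public static void main(String[] args) {
        // No explicit name: grouped under the classifier name.
        System.out.println(resolveName(Classification.of("toxic"), "safety")); // prints "safety"
        // Explicit name: the classification's own name wins.
        System.out.println(resolveName(Classification.of("tone", "rude", "Rude"), "safety")); // prints "tone"
    }
}
```

Where the real runner applies this fallback is not visible in this excerpt; the sketch only restates the documented behavior.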
98 changes: 98 additions & 0 deletions
braintrust-sdk/src/main/java/dev/braintrust/eval/Classifier.java
@@ -0,0 +1,98 @@

```java
package dev.braintrust.eval;

import java.util.List;
import java.util.function.Function;

/**
 * A classifier categorizes and labels eval outputs, producing zero or more structured {@link
 * Classification} items.
 *
 * <p>Classifiers run independently from {@link Scorer}s. Each Classifier exposes a name (used as
 * the span name and as the default grouping key for classifications whose own {@code name} is
 * blank).
 *
 * @param <INPUT> type of the input data
 * @param <OUTPUT> type of the output data
 */
public interface Classifier<INPUT, OUTPUT> {
    String INVALID_CLASSIFICATION_MESSAGE =
            "When returning structured classifier results, each classification must be a non-empty"
                    + " object.";

    String getName();

    /**
     * Classifies the result of a successful task execution.
     *
     * @param taskResult the task output and originating dataset case
     * @return zero or more classifications. An empty list means "no classifications for this case".
     */
    List<Classification> classify(TaskResult<INPUT, OUTPUT> taskResult);

    /**
     * Creates a classifier from a function that returns a (possibly empty or null) list of
     * classifications.
     *
     * <p>A {@code null} return value is treated as no classifications. Each returned {@link
     * Classification} must have a non-blank {@code id}; otherwise the classifier throws an
     * exception (which the eval runner records but does not abort on).
     */
    static <INPUT, OUTPUT> Classifier<INPUT, OUTPUT> of(
            String classifierName,
            Function<TaskResult<INPUT, OUTPUT>, List<Classification>> classifierFn) {
        return new Classifier<>() {
            @Override
            public String getName() {
                return classifierName;
            }

            @Override
            public List<Classification> classify(TaskResult<INPUT, OUTPUT> taskResult) {
                var result = classifierFn.apply(taskResult);
                if (result == null) {
                    return List.of();
                }
                for (var item : result) {
                    validate(item);
                }
                return result;
            }
        };
    }

    /**
     * Creates a classifier from a function that returns a single classification.
     *
     * <p>A {@code null} return value is treated as no classifications.
     */
    static <INPUT, OUTPUT> Classifier<INPUT, OUTPUT> single(
            String classifierName,
            Function<TaskResult<INPUT, OUTPUT>, Classification> classifierFn) {
        return new Classifier<>() {
            @Override
            public String getName() {
                return classifierName;
            }

            @Override
            public List<Classification> classify(TaskResult<INPUT, OUTPUT> taskResult) {
                var item = classifierFn.apply(taskResult);
                if (item == null) {
                    return List.of();
                }
                validate(item);
                return List.of(item);
            }
        };
    }

    /**
     * Validates a single classification: it must have a non-blank id. Throws with the spec-mandated
     * wording on failure.
     */
    private static void validate(Classification item) {
        if (item == null || item.id() == null || item.id().isBlank()) {
            throw new IllegalArgumentException(INVALID_CLASSIFICATION_MESSAGE + " Got: " + item);
        }
    }
}
```
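As a rough illustration of how the `single` factory behaves, the sketch below re-implements a cut-down version of it and exercises the three documented outcomes: a normal result, a `null` result, and a blank `id`. It rests on assumptions. `TaskResult` is not shown in this diff, so the two-field record here is a hypothetical stand-in; `Classification` is reduced to three fields; and `lengthClassifier` and `blankIdClassifier` are made-up examples.

```java
import java.util.List;
import java.util.function.Function;

public class ClassifierSketch {
    // Hypothetical stand-in: the real TaskResult is not shown in this diff.
    public record TaskResult<I, O>(I input, O output) {}

    // Simplified stand-in for the Classification record (metadata omitted).
    public record Classification(String name, String id, String label) {}

    public interface Classifier<I, O> {
        String getName();

        List<Classification> classify(TaskResult<I, O> taskResult);

        // Mirrors Classifier.single from the diff: null -> empty list, blank id -> exception.
        static <I, O> Classifier<I, O> single(
                String name, Function<TaskResult<I, O>, Classification> fn) {
            return new Classifier<>() {
                @Override
                public String getName() {
                    return name;
                }

                @Override
                public List<Classification> classify(TaskResult<I, O> taskResult) {
                    Classification item = fn.apply(taskResult);
                    if (item == null) {
                        return List.of();
                    }
                    if (item.id() == null || item.id().isBlank()) {
                        throw new IllegalArgumentException("invalid classification: " + item);
                    }
                    return List.of(item);
                }
            };
        }
    }

    // Made-up example: labels an output "long" or "short", or nothing when output is null.
    public static Classifier<String, String> lengthClassifier() {
        return Classifier.<String, String>single("length", tr -> {
            if (tr.output() == null) {
                return null; // treated as "no classifications"
            }
            String id = tr.output().length() > 10 ? "long" : "short";
            return new Classification(null, id, null);
        });
    }

    // Made-up example: always returns a blank id, which validation rejects.
    public static Classifier<String, String> blankIdClassifier() {
        return Classifier.<String, String>single("broken", tr -> new Classification(null, " ", null));
    }

    public static void main(String[] args) {
        Classifier<String, String> c = lengthClassifier();
        System.out.println(c.classify(new TaskResult<>("q", "hi")).get(0).id()); // prints "short"
        System.out.println(c.classify(new TaskResult<String, String>("q", null)).isEmpty()); // prints "true"
    }
}
```

Calling `blankIdClassifier().classify(...)` throws `IllegalArgumentException`, which, per the Javadoc in the diff, the eval runner is expected to record rather than abort on.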
41 changes: 41 additions & 0 deletions
braintrust-sdk/src/main/java/dev/braintrust/eval/TracedClassifier.java
@@ -0,0 +1,41 @@

```java
package dev.braintrust.eval;

import dev.braintrust.trace.BrainstoreTrace;
import java.util.List;

/**
 * A classifier that receives access to the full distributed trace of the task that was evaluated.
 *
 * <p>Implement this interface when your classifier needs to examine intermediate LLM calls, tool
 * invocations, or other spans produced during task execution — not just the final {@link
 * TaskResult}.
 *
 * @param <INPUT> type of the input data
 * @param <OUTPUT> type of the output data
 */
public interface TracedClassifier<INPUT, OUTPUT> extends Classifier<INPUT, OUTPUT> {

    /**
     * Classifies the task result using the distributed trace for additional context. Called instead
     * of {@link Classifier#classify(TaskResult)} when a {@link BrainstoreTrace} is available.
     *
     * @param taskResult the task output and originating dataset case
     * @param trace lazy access to the distributed trace spans for this eval case
     * @return zero or more classifications
     */
    List<Classification> classify(TaskResult<INPUT, OUTPUT> taskResult, BrainstoreTrace trace);

    /**
     * {@inheritDoc}
     *
     * <p>When used inside an {@link Eval}, this overload is never called — {@link
     * #classify(TaskResult, BrainstoreTrace)} is dispatched instead. This default implementation
     * throws {@link UnsupportedOperationException} to surface any accidental direct calls.
     */
    @Override
    default List<Classification> classify(TaskResult<INPUT, OUTPUT> taskResult) {
        throw new UnsupportedOperationException(
                "traced classifier classify method directly called. This is likely an accident. If"
                        + " you wish to support this, your implementation must override this method.");
    }
}
```
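The Javadoc states that the two-argument overload is dispatched when a trace is available, but the runner code that performs the dispatch is not visible in this excerpt. The sketch below shows one plausible shape for it, under stated assumptions: `Trace` with its `spanNames()` method is a hypothetical stand-in for `BrainstoreTrace` (whose API is not shown here), `dispatch`, `traceOf`, and `toolUse` are illustrative names, and the other types are simplified local copies.

```java
import java.util.List;

public class TracedDispatchSketch {
    public record TaskResult<I, O>(I input, O output) {}

    public record Classification(String id) {}

    // Hypothetical stand-in for BrainstoreTrace; its real API is not shown in this diff.
    public interface Trace {
        List<String> spanNames();
    }

    public interface Classifier<I, O> {
        List<Classification> classify(TaskResult<I, O> taskResult);
    }

    public interface TracedClassifier<I, O> extends Classifier<I, O> {
        List<Classification> classify(TaskResult<I, O> taskResult, Trace trace);

        // Same idea as the default in the diff: direct calls without a trace fail loudly.
        @Override
        default List<Classification> classify(TaskResult<I, O> taskResult) {
            throw new UnsupportedOperationException("traced classifier called without a trace");
        }
    }

    // Assumed dispatch rule: prefer the traced overload when a trace exists.
    public static <I, O> List<Classification> dispatch(
            Classifier<I, O> classifier, TaskResult<I, O> result, Trace trace) {
        if (trace != null && classifier instanceof TracedClassifier) {
            return ((TracedClassifier<I, O>) classifier).classify(result, trace);
        }
        return classifier.classify(result);
    }

    // Illustrative helper that builds a fixed trace.
    public static Trace traceOf(String... names) {
        return () -> List.of(names);
    }

    // Made-up traced classifier: reports whether any span was named "tool_call".
    public static TracedClassifier<String, String> toolUse() {
        return (result, trace) ->
                trace.spanNames().contains("tool_call")
                        ? List.of(new Classification("used_tool"))
                        : List.of(new Classification("no_tool"));
    }

    public static void main(String[] args) {
        TaskResult<String, String> result = new TaskResult<>("question", "answer");
        System.out.println(dispatch(toolUse(), result, traceOf("llm", "tool_call")).get(0).id());
        // prints "used_tool"
    }
}
```

Because the single-argument overload has a default body, `TracedClassifier` has exactly one abstract method in this sketch, which is why a lambda can implement it.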
`runClassifier` only catches exceptions thrown by `classifier.classify(...)`, but it processes returned items outside that `try/catch`. A custom `Classifier` implementation (which this commit explicitly supports) can return a list containing `null` or otherwise malformed items, and `item.name()` will throw here, escaping `runClassifier` and aborting the eval case. That breaks the intended contract that classifier failures are non-fatal and should be recorded under `classifier_errors` instead.
I don't think this is valid. A classifier returning a list with null values seems like a contract breach. We could explicitly document that you're not allowed to do this, but it seems so unlikely that I wouldn't say it's necessary.
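For readers following this thread, the reviewer's concern can be sketched as below. The actual `runClassifier` body is not visible in this excerpt, so everything here is hypothetical: the method signature, the `grouped` and `classifierErrors` maps, the `fixed` helper, and the simplified `Classifier` interface that takes a plain `String` output. The sketch shows the alternative being suggested, where item post-processing sits inside the same `try/catch` as the `classify` call, so that a `null` item is recorded as an error rather than propagating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RunClassifierSketch {
    public record Classification(String name, String id) {}

    // Simplified, hypothetical classifier shape for this sketch only.
    public interface Classifier {
        String getName();

        List<Classification> classify(String output);
    }

    // Hypothetical runner step: both the classify call and the per-item
    // grouping happen inside one try block.
    public static void runClassifier(
            Classifier classifier,
            String output,
            Map<String, List<Classification>> grouped,
            Map<String, String> classifierErrors) {
        try {
            List<Classification> items = classifier.classify(output);
            if (items == null) {
                return;
            }
            // Group into a local map first so a failure leaves `grouped` untouched.
            Map<String, List<Classification>> local = new LinkedHashMap<>();
            for (Classification item : items) {
                // item.name() throws NullPointerException when the list contains null.
                String key =
                        (item.name() == null || item.name().isBlank())
                                ? classifier.getName()
                                : item.name();
                local.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
            }
            local.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).addAll(v));
        } catch (RuntimeException e) {
            // Non-fatal: record the failure under the classifier's name.
            classifierErrors.put(classifier.getName(), String.valueOf(e));
        }
    }

    // Illustrative helper: a classifier that always returns the given list.
    public static Classifier fixed(String name, List<Classification> items) {
        return new Classifier() {
            @Override
            public String getName() {
                return name;
            }

            @Override
            public List<Classification> classify(String output) {
                return items;
            }
        };
    }

    // Runs a classifier whose list contains null and returns the recorded errors.
    public static Map<String, String> demo() {
        // Arrays.asList permits null elements; List.of would reject them up front.
        Classifier bad = fixed("bad", Arrays.asList(new Classification(null, "a"), null));
        Map<String, List<Classification>> grouped = new LinkedHashMap<>();
        Map<String, String> errors = new LinkedHashMap<>();
        runClassifier(bad, "output", grouped, errors);
        return errors;
    }
}
```

Whether this extra guarding is worth having is exactly what the thread disagrees on; the sketch takes no position beyond showing what the broader `try` block would look like.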