Skip to content

Add RFC 0007: Scorer Presets with Customization and Persistence#13

Open
Nehanth wants to merge 11 commits into
mlflow:mainfrom
Nehanth:rfc-0007-scorer-presets
Open

Add RFC 0007: Scorer Presets with Customization and Persistence#13
Nehanth wants to merge 11 commits into
mlflow:mainfrom
Nehanth:rfc-0007-scorer-presets

Conversation

@Nehanth

@Nehanth Nehanth commented Apr 28, 2026

Copy link
Copy Markdown

Summary

This RFC proposes a Preset class for MLflow that packages a named, immutable collection of scorers for common evaluation patterns. MLflow ships three built-in presets (Rag, Agent, ConversationalAgent) as starting points. Users can create custom presets, persist them to the MLflow server, and share them across teams and sessions.

from mlflow.genai.scorers import Agent, Guidelines

# Use a built-in preset
mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Agent()],
)

# Create and persist a custom preset
my_preset = Preset("my_team_eval", scorers=[Safety(), Fluency(), my_custom_scorer])
my_preset.register()

# Load it later
preset = get_preset(name="my_team_eval")
mlflow.genai.evaluate(data=eval_dataset, scorers=[preset])

Based on mlflow/mlflow#21445.

Co-Authored-By: Claude noreply@anthropic.com

Proposes a Preset class that packages a named collection of scorers
for common evaluation patterns (RAG, agent, conversational-agent,
safety, quality). Presets can be passed directly in the scorers list
alongside individual scorers, with automatic deduplication.

Based on mlflow/mlflow#21445.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Nehanth and others added 3 commits April 28, 2026 11:04
Each built-in preset is now a subclass (Agent, Rag, ConversationalAgent,
SafetyPreset, Quality) that creates fresh scorer instances on each call.
Eliminates shared mutable state and enables future preset-specific
configuration.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
return len(self._scorers)

def __add__(self, other):
if isinstance(other, (Preset, list)):

@mprahl mprahl May 5, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to deduplicate scorers when combining presets especially since some have overlap? Edit: I see this is addressed below. It might be good to mention it here too.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've added deduplication in add so duplicates are removed when presets are combined.

`validate_scorers()` deduplicates by scorer type after flattening:

```python
def _deduplicate_scorers(scorers: list[Scorer]) -> list[Scorer]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep Preset deduplicated whenever scorers are added instead of relying on validate_scorers to do it IMO.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have both, add (I added the deduplication there as well) handles deduplication when a user combines presets or scorers using the + operator, and validate_scorers() handles it when multiple presets are passed directly in the scorers list.

@@ -0,0 +1,682 @@
---

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFC has too many implementation details. The RFC should be scoped to requirements, APIs, database schema changes, etc. This is delving into PR level implementation.

Can we keep this scoped to the desired UX, the proposed groupings, and the class definition?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed it down!

- **Type checking.** `isinstance(preset, Agent)` works — code can distinguish which preset is being used.
- **Custom control flow.** Each preset can override methods for preset-specific validation or behavior.

### Deduplication

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does deduplication account for scorers with different configurations (e.g. ConversationalGuidelines with separate configs)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deduplication works by (type(scorer), scorer.name), so if you use different names for your
ConversationalGuidelines scorers, they'll be kept. If two scorers have the same type and name, one gets removed.


MLflow ships five built-in preset subclasses. Each call creates fresh scorer instances.

> **Note:** **`TaskSuccess`** is a new scorer proposed in [mlflow/mlflow#22972](https://github.com/mlflow/mlflow/issues/22972). It evaluates whether an agent successfully accomplished the user's task without requiring ground truth data — unlike `Correctness`, which requires an `expectations` column. This scorer would be added to `Agent`, `ConversationalAgent`, and `Quality`. This work can be part of this RFC or be a future addition after this RFC is completed.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this relevant for this RFC? It seems out of scope.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it's out of scope for this RFC. I'll remove it. I did create a separate issue for it mlflow#22972, once it's implemented it would be added to the Agent, ConversationalAgent, and Quality presets.comm


### Built-in Presets as Subclasses

Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see a benefit to these presets if you can't configure the scorers. If it's merely to categorize different scorers, there are better ways to do this such as documentation or a function like get_agent_scorers() which returns the classes for discoverability.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I’m going in this direction is because, after this RFC is approved, subclassing opens up more than just categorization. For example, each preset could accept different judge models (e.g, Agent(model="openai:/gpt-4o")), different inference parameters, or other preset-specific configurations.

If this RFC gets approved, I can also update it to use the function-based approach I proposed in the alternatives section, if that’s preferred.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Matt about customization capability. The class-based approach is more ergonomic to me but I think it should come with a good customizability and persistence.

def __init__(self):
super().__init__("rag", [
RetrievalRelevance(),
RetrievalSufficiency(),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this require expected_response or expected_facts?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, RetrievalSufficiency does require ground truth, so I'm removing it from this preset. We want presets to work out of the box without requiring ground truth for easy evaluation.

Nehanth and others added 2 commits May 6, 2026 16:48
- Add deduplication to Preset __init__ and __add__
- Remove TaskSuccess from presets (out of scope)
- Remove RetrievalSufficiency from Rag preset (requires ground truth)
- Trim implementation details per reviewer feedback

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
@Nehanth Nehanth requested a review from mprahl May 8, 2026 15:37
def __len__(self):
return len(self._scorers)

def __add__(self, other):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it'd be less surprising to implement __or__ for "set" union behavior since you are deduplicating?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Just did that!

@mprahl mprahl left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not convinced this should be a feature in its current form. Without a way to configure or cleanly override scorers inside a preset, the value seems limited to boilerplate reduction. At the same time, if built-in preset membership changes across releases, users will get silent behavior and cost changes on upgrade. That combination makes presets feel more like unstable convenience sugar than a strong API surface.

@B-Step62 B-Step62 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the RFC submission @Nehanth!

The idea of bundling multiple scorers into sharable preset sounds very useful to me. My main feedback is to support persistence of presets in MLflow server and provide great customization UX. I can see built-in presets are useful for new users to getting started and solves "which scorer should I start with?" question, but many teams need customization for adopting them for the real use cases.

)
```

```python

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need to persist presets more than a single python session. Otherwise L65 is essentially same as defining a normal python list of scorers. Team sharing benefit mentioned below only holds with persistence in a shared server.

self._scorers = tuple(self._deduplicate(scorers))

@staticmethod
def _deduplicate(scorers):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Instead of deduplication, I would block adding an existing scorer in the preset.


### Built-in Presets as Subclasses

Each built-in preset is a subclass of `Preset` that hardcodes its scorer list. This means each call creates **fresh scorer instances** (no shared mutable singletons) and opens the door for preset-specific configuration and control flow in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Matt about customization capability. The class-based approach is more ergonomic to me but I think it should come with a good customizability and persistence.

Comment on lines +240 to +253
class SafetyPreset(Preset):
def __init__(self):
super().__init__("safety", [
Safety(),
ConversationalSafety(),
])

class Quality(Preset):
def __init__(self):
super().__init__("quality", [
RelevanceToQuery(),
Fluency(),
Completeness(),
])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAG/Agent/Conversation presets look great, the other two look less useful or intuitive to me. Safety one just bundles two scorers with 'safety' in its name. "Quality" sounds a bit too vague imo, technically most of scorers aims to evaluate some aspect of the agent quality. Shall we start with the first three + great customization story?

@etirelli

Copy link
Copy Markdown
Contributor

I fully agree with @mprahl on this:

if built-in preset membership changes across releases, users will get silent behavior and cost changes on upgrade. That combination makes presets feel more like unstable convenience sugar than a strong API surface.

If this is not handled properly, this feature would end up being a burden for us to maintain, and create friction for users on upgrades, rather than being a benefit.
One way to solve this issue is to implement persistence as suggested by @B-Step62 , as that would empower users to control how they use and manage presets across versions. A function based implementation would in such cases be more flexible and efficient than a class hierarchy based one.

subclassing opens up more than just categorization. For example, each preset could accept different judge models (e.g, Agent(model="openai:/gpt-4o")), different inference parameters, or other preset-specific configurations.

A function based implementation does not prevent such customizations.

Nehanth and others added 2 commits June 1, 2026 10:30
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
@Nehanth Nehanth force-pushed the rfc-0007-scorer-presets branch from 8e31d00 to 9710c79 Compare June 1, 2026 14:59
Nehanth and others added 2 commits June 1, 2026 11:08
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
@Nehanth

Nehanth commented Jun 1, 2026

Copy link
Copy Markdown
Author

Sorry for the delay in addressing comments.

Thanks for the feedback @mprahl @B-Step62 @etirelli. I updated the RFC to address the concern around this being a burden to maintain.

As @B-Step62 suggested, I dropped the Safety and Quality presets and kept Rag, Agent, and ConversationalAgent as starting points for new users. I also added customization and persistence so teams can create custom presets, save them to the MLflow server, and share them across sessions. This addresses the version stability concern since persisted presets are user-controlled and don't change on upgrade. Please take a look.

@Nehanth Nehanth requested review from B-Step62 and mprahl June 1, 2026 15:32
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
@Nehanth Nehanth changed the title Add RFC 0007: Scorer Presets for Common Evaluation Patterns Add RFC 0007: Scorer Presets with Customization and Persistence Jun 8, 2026
@jwm4 jwm4 mentioned this pull request Jun 8, 2026
5 tasks

@jwm4 jwm4 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFC is on the right track. The core idea of bundling scorers into named, shareable presets is solid.

My main feedback is that the persistence story needs more user-facing specification (see my comment on that section), and that the RFC still leans heavily into implementation detail (private method bodies, deduplication internals, or/ror implementations). An RFC should nail down the requirements and the user-facing API. Implementation choices like how deduplication is coded or how subclass constructors work are better left to the PR.


### Persistence

Presets can be registered to the MLflow server so teams can share them across sessions. This leverages the existing scorer registration infrastructure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persistence looks very useful to me. The API surface (register() / get_preset()) is clear, but a few user-facing behavioral questions should be answered at the RFC level:

  1. Custom scorer portability. The example shows my_custom_scorer in a persisted preset. When a teammate calls get_preset("my_team_eval"), what do they get back? Do they need the same custom scorer code available in their environment? What happens if they don't? This is central to the team-sharing story.
  2. Scope/namespace. The experiment_id parameter suggests presets are scoped to experiments, but the default behavior (no experiment_id) isn't specified. Are presets workspace-global by default? If I register to experiment A, can someone working in experiment B see it?
  3. Discovery. There's get_preset(name=...) for retrieving a known preset, but no way to ask "what presets are available here?" Something like list_presets() seems like a natural companion, especially for the team-sharing use case.

These don't need implementation details, just a description of the intended user experience.

def __init__(self):
super().__init__()
# Add team-specific scorers
self._scorers = self._scorers + (Fluency(), my_compliance_scorer)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation reassigns self._scorers after super().init() has already run _validate_no_duplicates, so the added scorers bypass the duplicate check. Consider having the constructor accept extra scorers so the full list goes through validation, e.g.:

class MyAgent(Agent):
    def __init__(self):
        super().__init__(extra_scorers=[Fluency(), my_compliance_scorer])

Alternatively, this level of implementation detail could be dropped from the RFC entirely. The RFC really should be focused more on the specification and not the code.

def __or__(self, other):
if isinstance(other, (Preset, list)):
combined = list(self) + list(other)
return self._deduplicate(combined)

@jwm4 jwm4 Jun 11, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_deduplicate returns a plain list, so Agent() | Rag() gives you a list, not a Preset. That means you can't chain (Agent() | Rag() | [my_scorer]), can't call register() on the result, and the return type silently changes from Preset to list. This should probably return a new Preset.

Alternatively, this level of implementation detail could be dropped from the RFC, with just the requirement that | combines presets with deduplication and produces a Preset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants