Eval-First

If you change prompts by feel, you ship regressions by feel.

ruby_llm-contract works best when you treat evals as the source of truth:

1. Capture real failures from production.
2. Turn them into eval cases.
3. Change the prompt.
4. Re-run the same eval.
5. Merge only if the eval says quality improved or stayed safe.

That is the practical version of eval-first.

Core Rule

Do not start with the prompt. Start with the eval.

In this gem, that means:

ClassifyTicket.define_eval("regression") do
  add_case "billing dispute",
           input: "I was charged twice this month",
           expected: { priority: "high", category: "billing" }

  add_case "outage",
           input: "Database is down for all customers",
           expected: { priority: "urgent", category: "technical" }
end

Then and only then:

add or change system
tighten rule
add example
change validate
compare prompt versions
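To make the `expected:` hashes above concrete, here is a minimal plain-Ruby sketch of one way a case could be checked against model output. It assumes a subset match (only the keys listed in `expected` are compared), which the source does not confirm. `matches_expected?` is a hypothetical helper, not part of the gem.

```ruby
# Illustrative only: a plain-Ruby subset match, not the gem's evaluator.
# A case passes when every key in `expected` has the same value in `actual`.
# Keys that `expected` does not mention are ignored (an assumption).
def matches_expected?(actual, expected)
  expected.all? { |key, value| actual[key] == value }
end

# Hypothetical model output for the "billing dispute" case.
SAMPLE_OUTPUT = { priority: "high", category: "billing", confidence: 0.9 }.freeze

puts matches_expected?(SAMPLE_OUTPUT, { priority: "high", category: "billing" }) # true
puts matches_expected?(SAMPLE_OUTPUT, { category: "billing" })                   # true
puts matches_expected?(SAMPLE_OUTPUT, { priority: "urgent" })                    # false
```

Under this assumption, a case can pin only the fields that matter for it, which is how the shorter cases later in this guide are written.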

The Right Mental Model

Use the gem in three layers:

1. `smoke`

Fast, local, often offline.

Purpose:

verify that the step still parses
verify the schema and validate rules
catch obvious contract breakage

ClassifyTicket.define_eval("smoke") do
  default_input "My invoice is wrong"
  sample_response({ priority: "high", category: "billing" })
end

sample_response is good here.

It is not your main quality signal.
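As a rough picture of what a smoke layer checks, the sketch below parses a canned response and verifies its shape with no API call. It is plain Ruby, not the gem's internals. `smoke_problems`, `REQUIRED_KEYS`, and the allowed priority values are assumptions made for illustration.

```ruby
require "json"

# Hypothetical contract for illustration; the allowed values are assumptions.
ALLOWED_PRIORITIES = %w[low medium high urgent].freeze
REQUIRED_KEYS = %i[priority category].freeze

# Returns a list of problems; an empty list means the canned response passes.
def smoke_problems(raw_json)
  data = JSON.parse(raw_json, symbolize_names: true)
  return ["response is not a JSON object"] unless data.is_a?(Hash)

  problems = REQUIRED_KEYS.reject { |key| data.key?(key) }
                          .map { |key| "missing key: #{key}" }
  if data.key?(:priority) && !ALLOWED_PRIORITIES.include?(data[:priority])
    problems << "unknown priority: #{data[:priority]}"
  end
  problems
rescue JSON::ParserError => e
  ["response does not parse: #{e.message}"]
end

puts smoke_problems('{"priority":"high","category":"billing"}').inspect # []
puts smoke_problems('{"priority":"high"}').inspect # ["missing key: category"]
```

A check like this catches contract breakage. It says nothing about whether a real model picks the right category, which is why it is not the main quality signal.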

2. `regression`

This is your real eval-first dataset.

Purpose:

represent real user traffic
capture known failures and expensive mistakes
gate merges and prompt changes

Good sources:

support tickets
bad completions from logs
incidents
edge cases found in QA
cases where a human had to correct the output

Every time the model fails in production, the default response should be:

add_case, then fix.
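One way to make "add_case, then fix" cheap is to generate the case line from the failure record itself. The sketch below assumes a hypothetical `FailureRecord` with a label, the original input, and the human-corrected output. Those field names are not from the gem or from any real log format.

```ruby
# Hypothetical shape of a logged failure; the field names are assumptions.
FailureRecord = Struct.new(:label, :input, :corrected_output, keyword_init: true)

# Renders an add_case line in the style used in this guide,
# ready to paste into a define_eval block.
def to_add_case(record)
  expected = record.corrected_output
                   .map { |key, value| "#{key}: #{value.inspect}" }
                   .join(", ")
  "add_case #{record.label.inspect}, input: #{record.input.inspect}, " \
    "expected: { #{expected} }"
end

record = FailureRecord.new(
  label: "refund request",
  input: "Refund me",
  corrected_output: { category: "billing" }
)
puts to_add_case(record)
# add_case "refund request", input: "Refund me", expected: { category: "billing" }
```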

3. `ab`

Prompt iteration.

Purpose:

compare old prompt vs new prompt on the same dataset
block regressions before rollout

diff = ClassifyTicketV2.compare_with(
  ClassifyTicketV1,
  eval: "regression",
  model: "gpt-4.1-mini"
)

diff.safe_to_switch?

This is the cleanest eval-first move in the gem: same eval, same cases, two prompt versions.
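To show why comparing on the same cases matters, here is a plain-Ruby sketch of one possible "safe to switch" rule: no case that passed before may fail now, and the pass count must not drop. This rule is an assumption made for illustration. The source does not say how `safe_to_switch?` is computed, and the result hashes below are invented.

```ruby
# Illustrative only: per-case pass/fail results keyed by case name.
# The rule below is an assumption, not the gem's definition of safe_to_switch?.
def regressed_cases(old_results, new_results)
  old_results.select { |name, passed| passed && !new_results[name] }.keys
end

def switch_looks_safe?(old_results, new_results)
  regressed_cases(old_results, new_results).empty? &&
    new_results.values.count(true) >= old_results.values.count(true)
end

# Invented results for two prompt versions on the same two cases.
OLD_RESULTS  = { "billing dispute" => true,  "outage" => false }.freeze
NEW_BETTER   = { "billing dispute" => true,  "outage" => true }.freeze
NEW_SIDEWAYS = { "billing dispute" => false, "outage" => true }.freeze

puts switch_looks_safe?(OLD_RESULTS, NEW_BETTER)     # true
puts switch_looks_safe?(OLD_RESULTS, NEW_SIDEWAYS)   # false
puts regressed_cases(OLD_RESULTS, NEW_SIDEWAYS).inspect # ["billing dispute"]
```

The second comparison has the same pass count as the old prompt but breaks a case that used to work. An aggregate score alone would hide that.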

What Counts As Eval-First In This Gem

Good

ClassifyTicket.define_eval("regression") do
  add_case "refund", input: "Refund me", expected: { category: "billing" }
end

# Prompt changes happen after the eval exists
diff = NewPrompt.compare_with(OldPrompt, eval: "regression", model: "gpt-4.1-mini")

Bad

# Tweak prompt for an hour
# Maybe add an example
# Maybe tighten a rule
# Then manually eyeball one or two responses

That is not eval-first. That is prompt guessing.

`sample_response`: Useful, But Not The Main Thing

sample_response is excellent for:

offline smoke tests
local development
testing evaluator wiring
verifying schema + validate behavior with zero API calls

It is not enough for real prompt decisions.

For real eval-first work:

use run_eval(..., context: { model: "..." })
or pass an explicit adapter

And for prompt A/B:

use compare_with with a real model: or explicit adapters

compare_with intentionally ignores sample_response, because canned data would make both sides look the same.
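The point about canned data can be seen in a few lines of plain Ruby. In this sketch, `run_with_canned` stands in for a step that always returns the sample response, so the prompt text never influences the output. All names here are hypothetical and nothing calls the gem.

```ruby
# Illustrative only: two hypothetical prompt versions "run" against canned data.
CANNED = { priority: "high", category: "billing" }.freeze

# With a canned response, neither the prompt nor the input changes the output.
def run_with_canned(_prompt, _input)
  CANNED
end

# Fraction of cases whose expected keys match the (canned) output.
def score(prompt, cases)
  passed = cases.count do |eval_case|
    output = run_with_canned(prompt, eval_case[:input])
    eval_case[:expected].all? { |key, value| output[key] == value }
  end
  passed.to_f / cases.size
end

CASES = [
  { input: "I was charged twice this month", expected: { category: "billing" } },
  { input: "Database is down for all customers", expected: { priority: "urgent" } }
].freeze

puts score("prompt v1", CASES) # 0.5
puts score("prompt v2", CASES) # 0.5
```

Both versions score the same no matter how different the prompts are, so a comparison built on canned data carries no information.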

The Minimal Team Workflow

Step 1. Build one eval that matters

Start with 10 to 30 cases that represent real mistakes and important business paths.

ClassifyTicket.define_eval("regression") do
  add_case "invoice", input: "Invoice is wrong", expected: { category: "billing" }
  add_case "feature", input: "Please add dark mode", expected: { priority: "low" }
  add_case "outage", input: "Everything is down", expected: { priority: "urgent" }
end

Step 2. Gate it in CI

expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-mini")
  .with_minimum_score(0.8)

Now prompt changes stop being opinion-based.
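As a worked example of the 0.8 threshold, the sketch below assumes the score is simply passed cases divided by total cases. The source does not confirm that the gem computes it this way, and the helper names are hypothetical.

```ruby
# Assumption for illustration: score = passed cases / total cases.
def eval_score(results)
  return 0.0 if results.empty?

  results.count(true).to_f / results.size
end

# The gate passes only when the score reaches the minimum.
def gate_passes?(results, minimum_score)
  eval_score(results) >= minimum_score
end

eight_of_ten = [true] * 8 + [false] * 2
seven_of_ten = [true] * 7 + [false] * 3

puts gate_passes?(eight_of_ten, 0.8) # true
puts gate_passes?(seven_of_ten, 0.8) # false
```

Under that assumption, a 10-case eval tolerates two failures at 0.8, and a 30-case eval tolerates six.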

Step 3. Save a baseline

report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-mini" })
report.save_baseline!(model: "gpt-4.1-mini")

This makes quality drift visible.
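To illustrate what a baseline buys you, here is a plain-Ruby sketch that stores a score in a JSON file and reports how far a later run has drifted from it. The file format, helper names, and scores are all assumptions. The source does not describe how `save_baseline!` stores its data.

```ruby
require "json"
require "tmpdir"

# Hypothetical baseline format for illustration: {"model": ..., "score": ...}.
def save_baseline(path, model:, score:)
  File.write(path, JSON.generate("model" => model, "score" => score))
end

# Positive drift means the current run scores lower than the baseline.
def drift(path, current_score)
  baseline = JSON.parse(File.read(path))
  (baseline["score"] - current_score).round(4)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "regression_baseline.json")
  # Placeholder scores, invented for this sketch.
  save_baseline(path, model: "gpt-4.1-mini", score: 0.9)
  puts drift(path, 0.85) # 0.05
end
```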

Step 4. Change prompts only through comparison

expect(ClassifyTicketV2).to pass_eval("regression")
  .with_context(model: "gpt-4.1-mini")
  .compared_with(ClassifyTicketV1)
  .with_minimum_score(0.8)

If the new prompt regresses, the change does not merge.

Step 5. Add every production failure back into the eval

This is the flywheel:

1. failure in prod
2. add a case
3. improve prompt
4. rerun eval
5. lock it with CI

That is how the eval gets stronger over time.

Few-Shot Examples Fit Naturally

If you add:

example input: "My invoice is wrong", output: '{"priority":"high","category":"billing"}'

that is still just a prompt change.

The eval-first way to use few-shot is:

add examples to the prompt
rerun the existing regression eval
compare against the old prompt with compare_with

Few-shot is not the proof. The eval is the proof.

Model Selection Comes After Prompt Stability

Do not optimize model cost before you stabilize prompt quality.

Recommended order:

1. Build regression
2. Improve prompt with compare_with
3. Lock quality in CI
4. Then run compare_models

comparison = ClassifyTicket.compare_models(
  "regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
)

comparison.best_for(min_score: 0.95)

This keeps cost optimization downstream from quality.
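The selection step can be pictured as "pick the cheapest model that clears the quality bar". The sketch below is plain Ruby and only guesses at what a call like `best_for(min_score:)` might be expected to do. The scores are placeholders invented for illustration, not measurements, and the cheapest-first ordering is an assumption.

```ruby
# Placeholder scores invented for this sketch; they are not measurements.
# Models are listed cheapest first, which is an assumption about pricing.
PLACEHOLDER_SCORES = {
  "gpt-4.1-nano" => 0.88,
  "gpt-4.1-mini" => 0.96,
  "gpt-4.1" => 0.98
}.freeze

# Returns the first (cheapest) model whose score meets the bar, or nil.
def cheapest_meeting(scores_by_model, min_score:)
  scores_by_model.find { |_model, score| score >= min_score }&.first
end

puts cheapest_meeting(PLACEHOLDER_SCORES, min_score: 0.95)         # gpt-4.1-mini
puts cheapest_meeting(PLACEHOLDER_SCORES, min_score: 0.99).inspect # nil
```

Because the quality bar is fixed first, the cheaper model wins only when it is good enough.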

Strong Defaults For Teams

If you want one simple standard:

smoke uses sample_response
regression uses real model calls
every prompt change uses compare_with
every merge runs pass_eval
every production failure becomes a new add_case

That is enough to make the gem work in a real eval-first loop.

Short Version

Use the gem like this:

1. Write define_eval before touching the prompt.
2. Treat sample_response as smoke only.
3. Use run_eval("name", context: { model: "..." }) for real quality measurement.
4. Use compare_with for every serious prompt change.
5. Gate merges with pass_eval.
6. Feed every production miss back into the eval dataset.

If you do that consistently, prompts stop being vibes and start being engineering.
