If you change prompts by feel, you ship regressions by feel.
`ruby_llm-contract` works best when you treat evals as the source of truth:
- Capture real failures from production.
- Turn them into eval cases.
- Change the prompt.
- Re-run the same eval.
- Merge only if the eval says quality improved or stayed safe.
That is the practical version of eval-first.
Do not start with the prompt. Start with the eval.
In this gem, that means:
```ruby
ClassifyTicket.define_eval("regression") do
  add_case "billing dispute",
    input: "I was charged twice this month",
    expected: { priority: "high", category: "billing" }

  add_case "outage",
    input: "Database is down for all customers",
    expected: { priority: "urgent", category: "technical" }
end
```

Then and only then:
- add or change `system`
- tighten `rule`
- add `example`
- change `validate`
- compare prompt versions
Use the gem in three layers:
The first layer is the smoke eval: fast, local, often offline.
Purpose:
- verify that the step still parses
- verify schema and validates
- catch obvious contract breakage
```ruby
ClassifyTicket.define_eval("smoke") do
  default_input "My invoice is wrong"
  sample_response({ priority: "high", category: "billing" })
end
```

`sample_response` is good here.
It is not your main quality signal.
The second layer is the regression eval. This is your real eval-first dataset.
Purpose:
- represent real user traffic
- capture known failures and expensive mistakes
- gate merges and prompt changes
Good sources:
- support tickets
- bad completions from logs
- incidents
- edge cases found in QA
- cases where a human had to correct the output
Every time the model fails in production, the default response should be:
`add_case`, then fix.
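That reflex can be wired up with a few lines of plain Ruby. Here is a minimal sketch: the `capture_failure` helper, the JSONL file name, and the record shape are illustrative assumptions, not part of the gem.

```ruby
require "json"

# Append a production miss to a JSONL dataset that later feeds add_case.
# Helper name, file name, and record shape are assumptions for illustration.
def capture_failure(path, name:, input:, expected:)
  record = { name: name, input: input, expected: expected }
  File.open(path, "a") { |f| f.puts(JSON.generate(record)) }
end

capture_failure(
  "regression_cases.jsonl",
  name: "double charge",
  input: "I was charged twice this month",
  expected: { priority: "high", category: "billing" }
)
```

Each captured record then becomes a new `add_case` line the next time you touch the eval.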
The third layer is comparison: prompt iteration.
Purpose:
- compare old prompt vs new prompt on the same dataset
- block regressions before rollout
```ruby
diff = ClassifyTicketV2.compare_with(
  ClassifyTicketV1,
  eval: "regression",
  model: "gpt-4.1-mini"
)
diff.safe_to_switch?
```

This is the cleanest eval-first move in the gem: same eval, same cases, two prompt versions.
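Conceptually, that switch decision reduces to a score comparison. A plain-Ruby sketch of the idea, not the gem's actual rule (the minimum score and tolerance here are illustrative assumptions):

```ruby
# Decide whether a new prompt's eval score justifies switching.
# Scores are fractions of cases passed, in 0.0..1.0.
def safe_to_switch?(old_score:, new_score:, min_score: 0.8, tolerance: 0.0)
  new_score >= min_score && new_score >= old_score - tolerance
end

safe_to_switch?(old_score: 0.85, new_score: 0.90)  # => true, improved
safe_to_switch?(old_score: 0.85, new_score: 0.70)  # => false, regression
```

The useful property is that the decision is symmetric and repeatable: any two prompt versions, same cases, one boolean out.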
ClassifyTicket.define_eval("regression") do
add_case "refund", input: "Refund me", expected: { category: "billing" }
end
# Prompt changes happen after the eval exists
diff = NewPrompt.compare_with(OldPrompt, eval: "regression", model: "gpt-4.1-mini")# Tweak prompt for an hour
# Maybe add an example
# Maybe tighten a rule
# Then manually eyeball one or two responsesThat is not eval-first. That is prompt guessing.
`sample_response` is excellent for:
- offline smoke tests
- local development
- testing evaluator wiring
- verifying schema + validate behavior with zero API calls
It is not enough for real prompt decisions.
For real eval-first work:
- use `run_eval(..., context: { model: "..." })`
- or pass an explicit adapter
And for prompt A/B:
- use `compare_with`
- with a real `model:` or explicit adapters
`compare_with` intentionally ignores `sample_response`, because canned data would make both sides look the same.
Start with 10 to 30 cases that represent real mistakes and important business paths.
```ruby
ClassifyTicket.define_eval("regression") do
  add_case "invoice", input: "Invoice is wrong", expected: { category: "billing" }
  add_case "feature", input: "Please add dark mode", expected: { priority: "low" }
  add_case "outage", input: "Everything is down", expected: { priority: "urgent" }
end
```

Then gate it in your spec suite:

```ruby
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-mini")
  .with_minimum_score(0.8)
```

Now prompt changes stop being opinion-based.
```ruby
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-mini" })
report.save_baseline!(model: "gpt-4.1-mini")
```

This makes quality drift visible.
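Under the hood, drift detection is just a fresh score checked against a stored number. A hand-rolled sketch of that mechanic, not the gem's implementation (the JSON baseline file and the allowed-drop threshold are assumptions):

```ruby
require "json"

# Persist a baseline score for a model.
def save_baseline(path, model:, score:)
  File.write(path, JSON.generate({ "model" => model, "score" => score }))
end

# Flag a run whose score falls meaningfully below the stored baseline.
def drifted?(path, current_score:, allowed_drop: 0.02)
  baseline = JSON.parse(File.read(path))
  current_score < baseline["score"] - allowed_drop
end

save_baseline("baseline.json", model: "gpt-4.1-mini", score: 0.92)
drifted?("baseline.json", current_score: 0.85)  # => true, quality dropped
```

The point of the small `allowed_drop` is to tolerate run-to-run noise while still catching real regressions.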
```ruby
expect(ClassifyTicketV2).to pass_eval("regression")
  .with_context(model: "gpt-4.1-mini")
  .compared_with(ClassifyTicketV1)
  .with_minimum_score(0.8)
```

If the new prompt regresses, the change does not merge.
This is the flywheel:
- failure in prod
- add a case
- improve prompt
- rerun eval
- lock it with CI
That is how the eval gets stronger over time.
If you add:

```ruby
example input: "My invoice is wrong", output: '{"priority":"high","category":"billing"}'
```

that is still just a prompt change.
The eval-first way to use few-shot is:
- add examples to the prompt
- rerun the existing regression eval
- compare against the old prompt with `compare_with`
Few-shot is not the proof. The eval is the proof.
Do not optimize model cost before you stabilize prompt quality.
Recommended order:
- Build `regression`
- Improve prompt with `compare_with`
- Lock quality in CI
- Then run `compare_models`
```ruby
comparison = ClassifyTicket.compare_models(
  "regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
)
comparison.best_for(min_score: 0.95)
```

This keeps cost optimization downstream from quality.
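The selection logic boils down to: cheapest model that still clears the quality bar. A plain-Ruby sketch with made-up scores and an assumed cheapest-first ordering (none of this is the gem's internals):

```ruby
# Pick the cheapest model whose eval score meets the minimum.
# `results` is ordered cheapest-first; scores are illustrative.
def best_for(results, min_score:)
  results.find { |r| r[:score] >= min_score }&.fetch(:model)
end

results = [
  { model: "gpt-4.1-nano", score: 0.91 },  # cheapest
  { model: "gpt-4.1-mini", score: 0.96 },
  { model: "gpt-4.1",      score: 0.98 }   # most expensive
]

best_for(results, min_score: 0.95)  # => "gpt-4.1-mini"
```

If no model clears the bar, the method returns `nil`, which is itself a useful signal: fix the prompt before shopping for a cheaper model.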
If you want one simple standard:
- `smoke` uses `sample_response`
- `regression` uses real model calls
- every prompt change uses `compare_with`
- every merge runs `pass_eval`
- every production failure becomes a new `add_case`
That is enough to make the gem work in a real eval-first loop.
Use the gem like this:
- Write `define_eval` before touching the prompt.
- Treat `sample_response` as smoke only.
- Use `run_eval("name", context: { model: "..." })` for real quality measurement.
- Use `compare_with` for every serious prompt change.
- Gate merges with `pass_eval`.
- Feed every production miss back into the eval dataset.
If you do that consistently, prompts stop being vibes and start being engineering.