experiment: Agent evaluation via MLflow + OpenTelemetry#30
Experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents: trace capture, mechanical + LLM-judge scoring, PR quality gates, regression detection, and prompt versioning.

Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
02dfefe to
9203dbc
🤖 Review · Started 5:25 PM UTC
Signed-off-by: Adam Scerra <ascerra@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
🤖 Finished Review · ✅ Success · Started 5:28 PM UTC · Completed 5:43 PM UTC
Review findings: Medium, Low, Info
maruiz93 left a comment
This is a PoC, so my review comments aren't blockers, but I think they should at least be noted somewhere in the README.
@scorer
def cost_within_budget(*, trace) -> Feedback:
    """Is the run cost within acceptable bounds?
[Medium] Two of the five mechanical scorers have methodology concerns:
cost_within_budget uses a fixed hard ceiling ($2.00 for explore) rather than a deviation from a reference baseline. A binary pass/fail against a generous cap doesn't surface meaningful cost trends — an agent costing $1.90 passes the same as one costing $0.04. Consider scoring based on deviation from a known reference (e.g., percentage above/below baseline mean), which provides a continuous signal rather than a binary one.
confidence_coherence operates on the agent's self-reported confidence scores, which are LLM-generated. This places non-deterministic data at the base of the analysis pipeline — if these scores feed into baselines and regression detection, you get compounding non-determinism across layers. LLM-based judgment should sit at the top layer only (as the LLM-as-judge scorers already do), not be laundered through mechanical scorers into downstream analysis. Consider restricting mechanical scorers to objective trace properties (cost, duration, tool counts, schema validation) and moving any assessment of agent confidence to the LLM judge tier.
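The deviation-based cost scoring suggested above could look roughly like the following. This is a minimal sketch, not the PR's scorer: the function names, the linear falloff, the `tolerance` parameter, and the $0.50 baseline mean used in the usage lines are all hypothetical. Only the $1.90 and $0.04 run costs come from the comment.

```python
def cost_deviation(cost_usd: float, baseline_mean_usd: float) -> float:
    """Return the fractional deviation of a run's cost from the baseline mean.

    0.0 means on baseline, 0.5 means 50% above, -0.5 means 50% below.
    """
    if baseline_mean_usd <= 0:
        raise ValueError("baseline_mean_usd must be positive")
    return (cost_usd - baseline_mean_usd) / baseline_mean_usd


def cost_score(cost_usd: float, baseline_mean_usd: float, tolerance: float = 1.0) -> float:
    """Map the deviation to a continuous score in [0, 1].

    Runs at or below baseline score 1.0. The score falls linearly to 0.0
    once the cost is `tolerance` (a fraction of the baseline) above it.
    The linear shape is an assumption made for illustration.
    """
    deviation = cost_deviation(cost_usd, baseline_mean_usd)
    return max(0.0, min(1.0, 1.0 - deviation / tolerance))


# Hypothetical baseline mean of $0.50 per run.
cheap_run = cost_score(0.04, 0.50)      # well below baseline
expensive_run = cost_score(1.90, 0.50)  # far above baseline, yet under the $2.00 cap
```

With a reference baseline, the two runs that both pass the fixed $2.00 ceiling receive different scores, which is the continuous signal the comment asks for.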
gates:
  min_validation_rate: 0.80
  min_quality_score: 3.0
  max_cost: 2.00
[Medium] The harness YAML defines min_quality_score: 3.0 (implying a 1–5 scale), but the LLM judge scorers normalize to 0–1 (result["score"] / 5.0). The README's observed ranges confirm the 0–1 scale (e.g., reasoning_coherence mean 0.71). Either the gate compares 3.0 against 0–1 values (nothing passes), or the gate uses raw scores before normalization (inconsistent with the scorer output). Aligning the scales — or documenting which scale the gate expects — would make the example self-consistent.
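One way to align the scales is to convert the gate threshold to the scorers' 0–1 range before comparing. The sketch below assumes the gate is written on the 1–5 scale and is applied to the mean of the normalized judge scores. Both are assumptions, and `normalize_judge_score` and `passes_quality_gate` are hypothetical helper names, not functions from the PR.

```python
JUDGE_SCALE_MAX = 5.0  # judge scorers divide raw scores by 5.0, per the comment above


def normalize_judge_score(raw_score: float) -> float:
    """Convert a raw 1-5 judge score to the 0-1 scale the scorers emit."""
    return raw_score / JUDGE_SCALE_MAX


def passes_quality_gate(normalized_scores: list[float], min_quality_score_raw: float) -> bool:
    """Compare like with like: normalize the YAML threshold, then test the mean."""
    if not normalized_scores:
        raise ValueError("no scores to evaluate")
    threshold = normalize_judge_score(min_quality_score_raw)
    mean_score = sum(normalized_scores) / len(normalized_scores)
    return mean_score >= threshold


# min_quality_score: 3.0 becomes a 0.6 threshold on the 0-1 scale.
gate_ok = passes_quality_gate([0.71], min_quality_score_raw=3.0)
```

The alternative is to write the YAML value directly on the 0–1 scale (0.6 instead of 3.0) and document that choice. Either way, the threshold and the scores end up on the same scale.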
def compute_means(entries: list[dict]) -> dict[str, float]:
    """Compute mean score per scorer from golden entries."""
[Medium] The golden baselines are curated from production traces against different work items with different complexity. compute_means() averages across these heterogeneous traces, then compares against recent traces from yet other work items. A score drop could mean the agent regressed — or that recent work items were harder. Production trace means are useful for observability, metrics, and trend analysis, but they shouldn't be the baseline used for evaluation — too many uncontrolled factors can cause deviation: infrastructure outages (e.g., GitHub downtime), more complex inputs, human activity interlaced with agentic activity, poorly described issues providing bad input. Reliable evaluation baselines need fixed, controlled inputs to isolate agent quality from environmental variance. This PR already has fixtures (fixtures/input.yaml, fixtures/rubric.yaml) — consider using them to build the golden baselines rather than curating from production traces. Fixture-based baselines provide the controlled reference that production traces can't.
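A fixture-based comparison could be sketched as follows. The entry shape (`fixture_id` plus a `scores` mapping), the function names, and the 0.05 tolerance are assumptions for illustration. The body of the PR's `compute_means()` is not shown here, so this is not a reconstruction of it.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(entries: list[dict]) -> dict[tuple[str, str], float]:
    """Mean score per (fixture_id, scorer) pair, so inputs are never mixed."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for entry in entries:
        for scorer, value in entry["scores"].items():
            buckets[(entry["fixture_id"], scorer)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def find_regressions(
    baseline_entries: list[dict],
    candidate_entries: list[dict],
    tolerance: float = 0.05,
) -> dict[tuple[str, str], tuple[float, float]]:
    """Return the pairs whose candidate mean dropped more than `tolerance`.

    Only pairs present in both runs are compared, so every comparison
    is made on the same fixed input.
    """
    baseline = mean_scores(baseline_entries)
    candidate = mean_scores(candidate_entries)
    regressions = {}
    for key, base_value in baseline.items():
        if key in candidate and candidate[key] < base_value - tolerance:
            regressions[key] = (base_value, candidate[key])
    return regressions


# Hypothetical entries: the same fixture scored before and after a change.
golden = [{"fixture_id": "f1", "scores": {"reasoning_coherence": 0.8}}]
recent = [{"fixture_id": "f1", "scores": {"reasoning_coherence": 0.6}}]
flagged = find_regressions(golden, recent)
```

Because both runs use the same fixture, a drop in the mean points at the agent rather than at a harder work item.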
Summary
agent-eval-mlflow-otel/ experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents

Contents
README.md
examples/scorer_mechanical.py
examples/scorer_llm_judge.py
examples/run_eval.py (mlflow.genai.evaluate())
examples/check_regression.py
examples/register_prompts.py
examples/send_trace_example.py
examples/harness-explore.yaml
fixtures/

Security
.gitignore covers .env, venv/, results/, output/

Made with Cursor