experiment: Agent evaluation via MLflow + OpenTelemetry by ascerra · Pull Request #30 · fullsend-ai/experiments

ascerra · 2026-06-09T17:23:26Z

Summary

Adds agent-eval-mlflow-otel/ experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents
Includes simplified example scripts: mechanical scorers, LLM-as-judge scorers, trace export, prompt registration, regression detection
All 5 hypotheses validated: trace capture, scoring, PR gates, regression detection, prompt versioning

Contents

| File | Purpose |
| --- | --- |
| `README.md` | Full experiment write-up with architecture, results, and analysis |
| `examples/scorer_mechanical.py` | 5 pure-Python scorers (validation, cost, efficiency, confidence, iterations) |
| `examples/scorer_llm_judge.py` | 4 Claude Opus semantic quality scorers via Vertex AI |
| `examples/run_eval.py` | Score traces via `mlflow.genai.evaluate()` |
| `examples/check_regression.py` | Compare recent traces against golden baselines |
| `examples/register_prompts.py` | MLflow Prompts Registry with @staging/@production aliases |
| `examples/send_trace_example.py` | Minimal OTLP trace export to MLflow |
| `examples/harness-explore.yaml` | Example harness config with eval section |
| `fixtures/` | Example fixture input and LLM judge rubric |

Security

No hardcoded secrets — all credentials via environment variables
No internal IPs or hostnames
.gitignore covers .env, venv/, results/, output/

Made with Cursor

Experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents: trace capture, mechanical + LLM-judge scoring, PR quality gates, regression detection, and prompt versioning. Signed-off-by: Adam Scerra <ascerra@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review · 2026-06-09T17:25:11Z

🤖 Review · Started 5:25 PM UTC
Commit: ba204cb · View workflow run →

Signed-off-by: Adam Scerra <ascerra@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review · 2026-06-09T17:28:26Z

🤖 Finished Review · ✅ Success · Started 5:28 PM UTC · Completed 5:43 PM UTC
Commit: ba204cb · View workflow run →

fullsend-ai-review · 2026-06-09T17:43:13Z

Review

Findings

Medium

[error-handling] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:34 — _llm_judge() parses LLM response as JSON with no error handling. If the judge model returns malformed JSON or an unexpected structure (e.g., missing score key), json.loads raises JSONDecodeError and crashes the entire evaluation run. The markdown-fence stripping logic is also fragile.
Remediation: Wrap json.loads(content) in a try/except and return a fallback Feedback (e.g., score=0 with rationale indicating parse failure). Validate the returned dict contains the expected score key before accessing it.
[error-handling] agent-eval-mlflow-otel/examples/scorer_mechanical.py:28 — In tool_efficiency, int() cast on get_attribute values may raise ValueError if the attribute is a non-numeric string. The or 0 fallback only handles None/falsy, not arbitrary strings.
Remediation: Use try/except around the int() casts, e.g., try: tools = int(...) except (ValueError, TypeError): tools = 0.
[api-contract] agent-eval-mlflow-otel/examples/run_eval.py:93 — mlflow.log_param and mlflow.log_metrics are called after mlflow.genai.evaluate() returns without an active MLflow run context. If evaluate() manages its own internal run, these calls will fail with MlflowException.
Remediation: Wrap the evaluation and logging in a with mlflow.start_run(): block.
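The two error-handling remediations above could look roughly like the following. This is a minimal sketch, not the PR's code: it assumes the judge is asked for a JSON object with a `score` on a 1–5 scale and an optional `rationale`, and `JudgeResult`, `strip_fences`, `parse_judge_response`, and `safe_int` are hypothetical names. `JudgeResult` stands in for the `Feedback` object a scorer would return so the snippet runs without MLflow installed.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code-fence marker


@dataclass
class JudgeResult:
    """Hypothetical stand-in for the Feedback a scorer would return."""
    score: float
    rationale: str


def strip_fences(content: str) -> str:
    """Remove one surrounding markdown code fence, if present."""
    text = content.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop opening fence (may carry a language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop closing fence
        text = "\n".join(lines)
    return text.strip()


def parse_judge_response(content: str) -> JudgeResult:
    """Parse the judge reply; return a score-0 fallback instead of raising."""
    try:
        data = json.loads(strip_fences(content))
    except json.JSONDecodeError as exc:
        return JudgeResult(0.0, f"judge response was not valid JSON: {exc.msg}")
    if not isinstance(data, dict) or "score" not in data:
        return JudgeResult(0.0, "judge response is missing the 'score' key")
    try:
        raw = float(data["score"])
    except (TypeError, ValueError):
        return JudgeResult(0.0, "judge 'score' was not numeric")
    # Assumed 1-5 rubric scale, normalized to 0-1 and clamped.
    return JudgeResult(min(max(raw / 5.0, 0.0), 1.0), str(data.get("rationale", "")))


def safe_int(value, default: int = 0) -> int:
    """int() that falls back to a default for None or non-numeric strings."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

With this shape, a malformed judge reply shows up as a zero-score result with an explanatory rationale rather than aborting the whole evaluation run, and a non-numeric span attribute is counted as the default instead of raising.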

Low

[logic-error] agent-eval-mlflow-otel/examples/check_regression.py:80 — The regressions list is always empty (comparison logic is commented out). The script prints "To complete: fetch recent traces..." acknowledging the stub, but still reports "All scorers within threshold" which could be misleading if used in CI without reading the output carefully.
[prompt-injection] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:72 — _get_trace_summary() interpolates trace data (reasoning text, agent name) directly into LLM judge prompts without delimiting. An adversarial trace could influence scoring. Risk is low since this is an internal evaluation tool scoring the team's own agent traces.
[credential-handling] agent-eval-mlflow-otel/examples/check_regression.py:35 — connect() hardcodes admin as default MLflow username via setdefault. Could lead to unintended admin-level access if MLFLOW_TRACKING_USERNAME is unset while MLFLOW_OTLP_TOKEN is set.
[edge-case] agent-eval-mlflow-otel/examples/register_prompts.py:68 — client.search_prompt_versions() may raise RestException for non-existent prompt names rather than returning an empty list. First-time registration could fail.
[logic-error] agent-eval-mlflow-otel/examples/harness-explore.yaml:22 — iteration_count returns a raw count (e.g., 3) not a normalized 0–1 score. Gating logic like min_quality_score: 3.0 would interact confusingly with unnormalized values in metrics aggregation.
[naming-convention] agent-eval-mlflow-otel/examples/send_trace_example.py — The _example suffix is redundant when the file is already in the examples/ directory.
[missing-authorization] agent-eval-mlflow-otel/README.md — This PR adds 12 new files with no linked issue. For an experiments repo this is a minor process gap — the thorough README provides sufficient context — but linking to an authorizing issue improves traceability.
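For the `check_regression.py` stub flagged above, one option is to report "nothing compared" separately from "all within threshold", so a CI job cannot pass on an empty comparison. A minimal sketch under stated assumptions: `find_regressions`, `report`, the 0.10 threshold, the exit codes, and the `{scorer: mean}` dict shape are all hypothetical and not taken from the PR.

```python
def find_regressions(baseline: dict, recent: dict, threshold: float = 0.10) -> dict:
    """Return {scorer: (baseline_mean, recent_mean)} where the mean dropped by more than threshold."""
    regressions = {}
    for name, base in baseline.items():
        if name not in recent:
            continue  # nothing to compare for this scorer
        if base - recent[name] > threshold:
            regressions[name] = (base, recent[name])
    return regressions


def report(baseline: dict, recent: dict, threshold: float = 0.10) -> tuple:
    """Return (exit_code, message) for a CI step."""
    compared = [name for name in baseline if name in recent]
    if not compared:
        # Distinguish "no data" from "no regression".
        return 2, "No scorers compared; regression status unknown"
    regressions = find_regressions(baseline, recent, threshold)
    if regressions:
        return 1, "Regression detected in: " + ", ".join(sorted(regressions))
    return 0, f"All {len(compared)} compared scorers within threshold"
```

A caller could pass the exit code to `sys.exit()` so the three outcomes are visible to the pipeline without reading the log.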

Info

[secrets-handling] .gitignore correctly excludes .env, venv/, results/, output/. All credentials loaded from environment variables. No hardcoded secrets found.
[scope-alignment] README clearly states production versions live at fullsend-ai/features and these are simplified standalone excerpts. Scope is well-documented.
[architectural-coherence] Post-hoc trace export design (avoiding coupling agents to observability libraries) is architecturally sound and well-justified.

maruiz93

~~Retracted~~ — posted prematurely, will follow up with a proper inline review.

maruiz93

This is a PoC, so my review comments aren't blockers, but I think they should at least be noted somewhere in the README.

maruiz93 · 2026-06-16T12:03:59Z

+
+@scorer
+def cost_within_budget(*, trace) -> Feedback:
+    """Is the run cost within acceptable bounds?


[Medium] Two of the five mechanical scorers have methodology concerns:

cost_within_budget uses a fixed hard ceiling ($2.00 for explore) rather than a deviation from a reference baseline. A binary pass/fail against a generous cap doesn't surface meaningful cost trends — an agent costing $1.90 passes the same as one costing $0.04. Consider scoring based on deviation from a known reference (e.g., percentage above/below baseline mean), which provides a continuous signal rather than a binary one.

confidence_coherence operates on the agent's self-reported confidence scores, which are LLM-generated. This places non-deterministic data at the base of the analysis pipeline — if these scores feed into baselines and regression detection, you get compounding non-determinism across layers. LLM-based judgment should sit at the top layer only (as the LLM-as-judge scorers already do), not be laundered through mechanical scorers into downstream analysis. Consider restricting mechanical scorers to objective trace properties (cost, duration, tool counts, schema validation) and moving any assessment of agent confidence to the LLM judge tier.
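The deviation-based cost scoring suggested in this comment could be sketched as follows. The function names, the linear mapping, and the `tolerance` parameter are assumptions for illustration; the baseline mean would have to come from a known reference set of runs.

```python
def cost_deviation_pct(cost_usd: float, baseline_mean_usd: float) -> float:
    """Percentage above (+) or below (-) the reference baseline mean."""
    if baseline_mean_usd <= 0:
        raise ValueError("baseline_mean_usd must be positive")
    return (cost_usd - baseline_mean_usd) / baseline_mean_usd * 100.0


def cost_deviation_score(cost_usd: float, baseline_mean_usd: float,
                         tolerance: float = 1.0) -> float:
    """Continuous 0-1 score: 1.0 at or below baseline, falling linearly
    to 0.0 once cost reaches baseline * (1 + tolerance)."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    excess = cost_deviation_pct(cost_usd, baseline_mean_usd) / 100.0
    if excess <= 0:
        return 1.0
    return max(0.0, 1.0 - excess / tolerance)
```

With a hypothetical baseline mean of $0.50, a $0.04 run scores 1.0 and a $1.90 run scores 0.0, whereas both would pass a fixed $2.00 ceiling.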

maruiz93 · 2026-06-16T12:03:59Z

+  gates:
+    min_validation_rate: 0.80
+    min_quality_score: 3.0
+    max_cost: 2.00


[Medium] The harness YAML defines min_quality_score: 3.0 (implying a 1–5 scale), but the LLM judge scorers normalize to 0–1 (result["score"] / 5.0). The README's observed ranges confirm the 0–1 scale (e.g., reasoning_coherence mean 0.71). Either the gate compares 3.0 against 0–1 values (nothing passes), or the gate uses raw scores before normalization (inconsistent with the scorer output). Aligning the scales — or documenting which scale the gate expects — would make the example self-consistent.
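One way to make the scales line up is to convert both the gate and the scores to 0–1 before comparing. A minimal sketch: `normalize_judge_score`, `normalize_iterations`, `passes_gate`, and `max_iterations=10` are hypothetical, the 1–5 rubric scale is inferred from `result["score"] / 5.0`, and treating fewer iterations as better is an assumption.

```python
def normalize_judge_score(raw: float, scale_max: float = 5.0) -> float:
    """Map a rubric score (assumed 1-5) to 0-1, clamped."""
    return min(max(raw / scale_max, 0.0), 1.0)


def normalize_iterations(count: int, max_iterations: int = 10) -> float:
    """Map a raw iteration count to 0-1: 1.0 for a single iteration,
    0.0 at or beyond max_iterations (assumes fewer iterations is better)."""
    if max_iterations <= 1:
        raise ValueError("max_iterations must be greater than 1")
    clamped = min(max(count, 1), max_iterations)
    return 1.0 - (clamped - 1) / (max_iterations - 1)


def passes_gate(normalized_score: float, min_quality_score: float,
                gate_scale_max: float = 5.0) -> bool:
    """Compare a 0-1 score against a gate written on the rubric scale."""
    return normalized_score >= normalize_judge_score(min_quality_score, gate_scale_max)
```

Under these assumptions the README's observed `reasoning_coherence` mean of 0.71 passes a `min_quality_score: 3.0` gate (3.0 / 5.0 = 0.6), while a direct comparison of 0.71 against 3.0 would fail every run.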

maruiz93 · 2026-06-16T12:03:59Z

+
+
+def compute_means(entries: list[dict]) -> dict[str, float]:
+    """Compute mean score per scorer from golden entries."""


[Medium] The golden baselines are curated from production traces against different work items with different complexity. compute_means() averages across these heterogeneous traces, and the script then compares the result against recent traces from yet other work items. A score drop could mean the agent regressed, or that recent work items were harder.

Production trace means are useful for observability, metrics, and trend analysis, but they shouldn't be the baseline used for evaluation. Too many uncontrolled factors can cause deviation: infrastructure outages (e.g., GitHub downtime), more complex inputs, human activity interlaced with agentic activity, and poorly described issues providing bad input. Reliable evaluation baselines need fixed, controlled inputs to isolate agent quality from environmental variance.

This PR already has fixtures (fixtures/input.yaml, fixtures/rubric.yaml). Consider using them to build the golden baselines rather than curating from production traces. Fixture-based baselines provide the controlled reference that production traces can't.
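A fixture-based baseline along these lines might be sketched as below. The entry shape (`{"fixture": ..., "scores": {scorer: value}}`) and the `fixture_baseline` helper are assumptions for illustration; only the `compute_means` signature and docstring come from the diff.

```python
from collections import defaultdict
from statistics import fmean


def compute_means(entries: list[dict]) -> dict[str, float]:
    """Compute mean score per scorer from golden entries."""
    buckets = defaultdict(list)
    for entry in entries:
        # Assumed shape: each entry carries a {scorer_name: score} mapping.
        for scorer, score in entry.get("scores", {}).items():
            buckets[scorer].append(float(score))
    return {scorer: fmean(values) for scorer, values in buckets.items()}


def fixture_baseline(entries: list[dict], fixture_id: str) -> dict[str, float]:
    """Baseline means from runs against one fixed fixture input only."""
    controlled = [e for e in entries if e.get("fixture") == fixture_id]
    return compute_means(controlled)
```

Restricting the baseline to runs on the same fixture means a later drop on that fixture reflects a change in the agent rather than a change in the input.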

ascerra requested a review from a team as a code owner June 9, 2026 17:23

ascerra force-pushed the experiment/agent-eval-mlflow-otel branch from 02dfefe to 9203dbc on June 9, 2026 17:24

Add architecture diagram to README

28f6e39

Signed-off-by: Adam Scerra <ascerra@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>

fullsend-ai-review (bot) added the requires-manual-review label ("Review requires human judgment") on Jun 9, 2026

maruiz93 reviewed Jun 16, 2026

ascerra wants to merge 2 commits into main from experiment/agent-eval-mlflow-otel
