Skip to content

experiment: Agent evaluation via MLflow + OpenTelemetry#30

Open
ascerra wants to merge 2 commits into
mainfrom
experiment/agent-eval-mlflow-otel
Open

experiment: Agent evaluation via MLflow + OpenTelemetry#30
ascerra wants to merge 2 commits into
mainfrom
experiment/agent-eval-mlflow-otel

Conversation

@ascerra

@ascerra ascerra commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds agent-eval-mlflow-otel/ experiment validating MLflow 3.x + OTLP as a complete eval platform for autonomous AI agents
  • Includes simplified example scripts: mechanical scorers, LLM-as-judge scorers, trace export, prompt registration, regression detection
  • All 5 hypotheses validated: trace capture, scoring, PR gates, regression detection, prompt versioning

Contents

File Purpose
README.md Full experiment write-up with architecture, results, and analysis
examples/scorer_mechanical.py 5 pure-Python scorers (validation, cost, efficiency, confidence, iterations)
examples/scorer_llm_judge.py 4 Claude Opus semantic quality scorers via Vertex AI
examples/run_eval.py Score traces via mlflow.genai.evaluate()
examples/check_regression.py Compare recent traces against golden baselines
examples/register_prompts.py MLflow Prompts Registry with @staging/@production aliases
examples/send_trace_example.py Minimal OTLP trace export to MLflow
examples/harness-explore.yaml Example harness config with eval section
fixtures/ Example fixture input and LLM judge rubric

Security

  • No hardcoded secrets — all credentials via environment variables
  • No internal IPs or hostnames
  • .gitignore covers .env, venv/, results/, output/

Made with Cursor

@ascerra ascerra requested a review from a team as a code owner June 9, 2026 17:23
Experiment validating MLflow 3.x + OTLP as a complete eval platform
for autonomous AI agents: trace capture, mechanical + LLM-judge scoring,
PR quality gates, regression detection, and prompt versioning.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@ascerra ascerra force-pushed the experiment/agent-eval-mlflow-otel branch from 02dfefe to 9203dbc Compare June 9, 2026 17:24
@fullsend-ai-review

Copy link
Copy Markdown

🤖 Review · Started 5:25 PM UTC
Commit: ba204cb · View workflow run →

Signed-off-by: Adam Scerra <ascerra@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@fullsend-ai-review

fullsend-ai-review Bot commented Jun 9, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 5:28 PM UTC · Completed 5:43 PM UTC
Commit: ba204cb · View workflow run →

@fullsend-ai-review

Copy link
Copy Markdown

Review

Findings

Medium

  • [error-handling] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:34_llm_judge() parses LLM response as JSON with no error handling. If the judge model returns malformed JSON or an unexpected structure (e.g., missing score key), json.loads raises JSONDecodeError and crashes the entire evaluation run. The markdown-fence stripping logic is also fragile.
    Remediation: Wrap json.loads(content) in a try/except and return a fallback Feedback (e.g., score=0 with rationale indicating parse failure). Validate the returned dict contains the expected score key before accessing it.

  • [error-handling] agent-eval-mlflow-otel/examples/scorer_mechanical.py:28 — In tool_efficiency, int() cast on get_attribute values may raise ValueError if the attribute is a non-numeric string. The or 0 fallback only handles None/falsy, not arbitrary strings.
    Remediation: Use try/except around the int() casts, e.g., try: tools = int(...) except (ValueError, TypeError): tools = 0.

  • [api-contract] agent-eval-mlflow-otel/examples/run_eval.py:93mlflow.log_param and mlflow.log_metrics are called after mlflow.genai.evaluate() returns without an active MLflow run context. If evaluate() manages its own internal run, these calls will fail with MlflowException.
    Remediation: Wrap the evaluation and logging in a with mlflow.start_run(): block.

Low

  • [logic-error] agent-eval-mlflow-otel/examples/check_regression.py:80 — The regressions list is always empty (comparison logic is commented out). The script prints "To complete: fetch recent traces..." acknowledging the stub, but still reports "All scorers within threshold" which could be misleading if used in CI without reading the output carefully.

  • [prompt-injection] agent-eval-mlflow-otel/examples/scorer_llm_judge.py:72_get_trace_summary() interpolates trace data (reasoning text, agent name) directly into LLM judge prompts without delimiting. An adversarial trace could influence scoring. Risk is low since this is an internal evaluation tool scoring the team's own agent traces.

  • [credential-handling] agent-eval-mlflow-otel/examples/check_regression.py:35connect() hardcodes admin as default MLflow username via setdefault. Could lead to unintended admin-level access if MLFLOW_TRACKING_USERNAME is unset while MLFLOW_OTLP_TOKEN is set.

  • [edge-case] agent-eval-mlflow-otel/examples/register_prompts.py:68client.search_prompt_versions() may raise RestException for non-existent prompt names rather than returning an empty list. First-time registration could fail.

  • [logic-error] agent-eval-mlflow-otel/examples/harness-explore.yaml:22iteration_count returns a raw count (e.g., 3) not a normalized 0–1 score. Gating logic like min_quality_score: 3.0 would interact confusingly with unnormalized values in metrics aggregation.

  • [naming-convention] agent-eval-mlflow-otel/examples/send_trace_example.py — The _example suffix is redundant when the file is already in the examples/ directory.

  • [missing-authorization] agent-eval-mlflow-otel/README.md — This PR adds 12 new files with no linked issue. For an experiments repo this is a minor process gap — the thorough README provides sufficient context — but linking to an authorizing issue improves traceability.

Info

  • [secrets-handling] .gitignore correctly excludes .env, venv/, results/, output/. All credentials loaded from environment variables. No hardcoded secrets found.

  • [scope-alignment] README clearly states production versions live at fullsend-ai/features and these are simplified standalone excerpts. Scope is well-documented.

  • [architectural-coherence] Post-hoc trace export design (avoiding coupling agents to observability libraries) is architecturally sound and well-justified.

@fullsend-ai-review fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label Jun 9, 2026

@maruiz93 maruiz93 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retracted — posted prematurely, will follow up with a proper inline review.

@maruiz93 maruiz93 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a PoC so my reviews aren't blockers but I think at least those could be noticed somewhere in the readme


@scorer
def cost_within_budget(*, trace) -> Feedback:
"""Is the run cost within acceptable bounds?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Two of the five mechanical scorers have methodology concerns:

cost_within_budget uses a fixed hard ceiling ($2.00 for explore) rather than a deviation from a reference baseline. A binary pass/fail against a generous cap doesn't surface meaningful cost trends — an agent costing $1.90 passes the same as one costing $0.04. Consider scoring based on deviation from a known reference (e.g., percentage above/below baseline mean), which provides a continuous signal rather than a binary one.

confidence_coherence operates on the agent's self-reported confidence scores, which are LLM-generated. This places non-deterministic data at the base of the analysis pipeline — if these scores feed into baselines and regression detection, you get compounding non-determinism across layers. LLM-based judgment should sit at the top layer only (as the LLM-as-judge scorers already do), not be laundered through mechanical scorers into downstream analysis. Consider restricting mechanical scorers to objective trace properties (cost, duration, tool counts, schema validation) and moving any assessment of agent confidence to the LLM judge tier.

gates:
min_validation_rate: 0.80
min_quality_score: 3.0
max_cost: 2.00

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] The harness YAML defines min_quality_score: 3.0 (implying a 1–5 scale), but the LLM judge scorers normalize to 0–1 (result["score"] / 5.0). The README's observed ranges confirm the 0–1 scale (e.g., reasoning_coherence mean 0.71). Either the gate compares 3.0 against 0–1 values (nothing passes), or the gate uses raw scores before normalization (inconsistent with the scorer output). Aligning the scales — or documenting which scale the gate expects — would make the example self-consistent.



def compute_means(entries: list[dict]) -> dict[str, float]:
"""Compute mean score per scorer from golden entries."""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] The golden baselines are curated from production traces against different work items with different complexity. compute_means() averages across these heterogeneous traces, then compares against recent traces from yet other work items. A score drop could mean the agent regressed — or that recent work items were harder. Production trace means are useful for observability, metrics, and trend analysis, but they shouldn't be the baseline used for evaluation — too many uncontrolled factors can cause deviation: infrastructure outages (e.g., GitHub downtime), more complex inputs, human activity interlaced with agentic activity, poorly described issues providing bad input. Reliable evaluation baselines need fixed, controlled inputs to isolate agent quality from environmental variance. This PR already has fixtures (fixtures/input.yaml, fixtures/rubric.yaml) — consider using them to build the golden baselines rather than curating from production traces. Fixture-based baselines provide the controlled reference that production traces can't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

requires-manual-review Review requires human judgment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants