feat: JudgeLLM evaluation with ProposalAmender by rioloc · Pull Request #248 · lightspeed-core/lightspeed-evaluation

rioloc · 2026-05-27T14:56:28Z

Depends on #232 — this PR must be merged after #232.

Summary

This PR adds LLM-as-judge evaluation and enriched data capture for the ProposalDriver evaluation pipeline, along with per-scenario test infrastructure and a new CrashLoopBackOff test fixture.

Core changes

- ProposalAmender: after a Proposal CR reaches terminal state, fetches child Result CRs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) from the cluster and enriches TurnData with:
  - proposal_results: structured dict with the complete .status of each child CR
  - response: a Markdown workflow summary suitable for both human review and LLM-as-judge evaluation
- custom:proposal_evaluation_correctness: a new LLM-as-judge metric (score 0–1) with a multi-dimensional evaluation prompt:
  - SRE persona: judge evaluates as a senior Site Reliability Engineer on OpenShift/Kubernetes
  - Per-dimension scoring: separate scores for Diagnosis, Action, and Verification (+ average), with N/A for absent dimensions
  - Reasoning-before-score: forces the LLM to reason before committing to a number, reducing post-hoc rationalization
  - 3 calibration examples: anchors scoring on concrete K8s scenarios (high score, low score, infrastructure failure with correct diagnosis)
  - Dedicated parser (_parse_proposal_eval_response): extracts sub-scores and average from the multi-dimensional output format
- Per-scenario test infrastructure: setup/cleanup scripts refactored from monolithic per-provider scripts to:
  - Shared infra scripts (_setup_infra-openai.sh, _setup_infra-claude-vertex.sh) sourced by scenario scripts
  - Per-scenario setup/cleanup (setup_oomkill-openai.sh, setup_crashloop_probe-openai.sh, etc.)
  - New crashloop-probe-demo fixture (nginx with misconfigured liveness probe at /nonexistent-health)
- Shellcheck compliance: export for variables consumed by sourced scripts (SC2034), exclude SC1091 for dynamic source paths in Makefile

Why

The existing ProposalDriver only extracts condition message fields via _extract_summary, losing the rich structured data from child Result CRs. In particular, the Diagnosis from AnalysisResult (root cause analysis, confidence level, detailed summary) is never captured, making it impossible to evaluate whether the agentic workflow diagnosed and remediated the issue correctly.

custom:proposal_status provides deterministic pass/fail on workflow phase, but cannot assess the quality of diagnosis, actions, or verification. The new proposal_evaluation_correctness metric fills this gap using an LLM judge with a structured, multi-dimensional prompt.

Design choices

CLIClient abstraction

CLI operations (run, get_resource, apply, delete) are extracted into a CLIClient ABC with a KubeCLI implementation. Both ProposalDriver and ProposalAmender use the same interface; tests inject a mock CLIClient without patching subprocess internals.
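
To illustrate why an injectable interface helps testing, here is a minimal sketch of the pattern. The method signatures, the default `oc` binary, and the dict-backed `MockCLI` body are assumptions made for illustration; the PR's actual `CLIClient` also defines `apply` and `delete`, which are omitted here.

```python
from __future__ import annotations

import json
import subprocess
from abc import ABC, abstractmethod
from typing import Any, Optional


class CLIClient(ABC):
    """Cluster CLI interface (signatures are assumed, not taken from the PR)."""

    @abstractmethod
    def run(self, args: list[str], timeout: int = 60) -> subprocess.CompletedProcess:
        """Run a CLI command and return the completed process."""

    def get_resource(self, kind: str, name: str, namespace: str) -> Optional[dict[str, Any]]:
        """Fetch one resource as a dict, or None when the command fails."""
        result = self.run(["get", kind, name, "-n", namespace, "-o", "json"])
        if result.returncode != 0:
            return None
        return json.loads(result.stdout)


class KubeCLI(CLIClient):
    """Real implementation that shells out to oc or kubectl."""

    def __init__(self, binary: str = "oc") -> None:
        self._binary = binary

    def run(self, args: list[str], timeout: int = 60) -> subprocess.CompletedProcess:
        return subprocess.run(
            [self._binary, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )


class MockCLI(CLIClient):
    """Test double that serves canned resources keyed by (kind, name)."""

    def __init__(self, resources: dict[tuple[str, str], dict[str, Any]]) -> None:
        self._resources = resources
        self.calls: list[list[str]] = []

    def run(self, args: list[str], timeout: int = 60) -> subprocess.CompletedProcess:
        self.calls.append(args)
        key = (args[1], args[2])  # args look like ["get", kind, name, ...]
        if key in self._resources:
            body = json.dumps(self._resources[key])
            return subprocess.CompletedProcess(args, 0, stdout=body, stderr="")
        return subprocess.CompletedProcess(args, 1, stdout="", stderr="not found")


mock = MockCLI({("AnalysisResult", "ar-1"): {"status": {"confidence": "high"}}})
found = mock.get_resource("AnalysisResult", "ar-1", "demo")
missing = mock.get_resource("AnalysisResult", "nope", "demo")
```

Because the driver and amender only see `CLIClient`, a unit test can pass `MockCLI` and assert on `mock.calls` without touching `subprocess`.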

ProposalAmender as a separate class

Follows the APIDataAmender pattern — a dedicated class composed into the driver, responsible for enriching TurnData in-place. Navigates proposal_status.steps.<step>.results[] to read StepResultRef entries, then fetches each child CR via CLIClient.get_resource().
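
The traversal described above can be sketched as follows. This is a simplified illustration, not the PR's code: the `kind` and `name` keys on each result reference are assumed field names for StepResultRef, and `fetch` stands in for a call such as `CLIClient.get_resource()`.

```python
from typing import Any, Callable, Optional

# Stand-in for CLIClient.get_resource(); takes (kind, name), returns the CR or None.
Fetcher = Callable[[str, str], Optional[dict[str, Any]]]


def collect_child_results(
    proposal_status: dict[str, Any], fetch: Fetcher
) -> dict[str, list[dict[str, Any]]]:
    """Return {step_name: [child .status, ...]} for every resolvable result ref."""
    collected: dict[str, list[dict[str, Any]]] = {}
    for step_name, step in (proposal_status.get("steps") or {}).items():
        for ref in step.get("results") or []:
            # "kind" and "name" are assumed StepResultRef fields.
            child = fetch(ref["kind"], ref["name"])
            if child is None:
                continue  # skip refs whose CR could not be fetched
            collected.setdefault(step_name, []).append(child.get("status", {}))
    return collected


# Hypothetical data: one analysis result, an empty execution step.
status = {
    "steps": {
        "analysis": {"results": [{"kind": "AnalysisResult", "name": "ar-1"}]},
        "execution": {"results": []},
    }
}
store = {("AnalysisResult", "ar-1"): {"status": {"rootCause": "memory limit too low"}}}
results = collect_child_results(status, lambda kind, name: store.get((kind, name)))
```

Steps with no results are simply absent from the returned dict, which keeps the "analysis-only" and "empty results" cases from the test plan free of special handling.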

Multi-dimensional judge prompt

The prompt produces per-dimension scores instead of a single holistic score:

- Diagnosis, Action, Verification scored independently (0.0–1.0 or N/A)
- Average computed from present dimensions
- Calibration Example C explicitly anchors the "correct diagnosis but infrastructure failure" edge case: Action is marked N/A when execution fails for infra reasons (timeout, sandbox crash), not penalized as agent reasoning failure
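
The "average from present dimensions" rule can be sketched as a small parser. The judge's exact output format is not shown in this PR description, so the `Label: value` line layout below is an assumption, and `parse_scores` is a hypothetical name rather than the PR's `_parse_proposal_eval_response`.

```python
import re
from typing import Optional

DIMENSIONS = ("Diagnosis", "Action", "Verification")

# Assumed line format: "Diagnosis: 0.9", "Action: N/A", "Average: 0.85".
_SCORE_LINE = re.compile(
    r"^[ \t]*(Diagnosis|Action|Verification|Average)[ \t]*:[ \t]*(N/A|[01](?:\.\d+)?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_scores(text: str) -> tuple[Optional[float], dict[str, Optional[float]]]:
    """Return (average, sub_scores); N/A or missing dimensions map to None."""
    found: dict[str, Optional[float]] = {}
    for label, raw in _SCORE_LINE.findall(text):
        value = None if raw.upper() == "N/A" else float(raw)
        if value is not None and not 0.0 <= value <= 1.0:
            continue  # ignore out-of-range numbers such as 1.5
        found[label.capitalize()] = value

    sub_scores = {dim: found.get(dim) for dim in DIMENSIONS}
    present = [score for score in sub_scores.values() if score is not None]

    average = found.get("Average")
    if average is None and present:
        # Fallback: average only the dimensions that were actually scored.
        average = sum(present) / len(present)
    return average, sub_scores


# Infra-failure style output: Action is N/A, so it does not drag the average down.
judge_output = "Reasoning: diagnosis was correct.\nDiagnosis: 1.0\nAction: N/A\nVerification: 0.5\n"
average, sub_scores = parse_scores(judge_output)
```

With Action excluded, the fallback average here is taken over two dimensions rather than three, which is the behaviour Calibration Example C is meant to anchor.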

Per-scenario scripts

Each scenario (oomkill, crashloop-probe) × provider (openai, claude-vertex) has its own setup/cleanup script that sources a shared _setup_infra-{provider}.sh / _cleanup_infra-{provider}.sh. This avoids deploying all fixtures for every conversation and makes adding new scenarios mechanical.

Relationship between the two proposal metrics

| Metric | Type | Evaluates | Requires |
|---|---|---|---|
| `custom:proposal_status` | Deterministic | Phase matches expected (Completed/Failed/Denied/Escalated) | `expected_proposal_status` |
| `custom:proposal_evaluation_correctness` | LLM judge | Quality of diagnosis, actions, verification | `response` (from amender) + `expected_response` |

They are complementary: proposal_status checks what happened, proposal_evaluation_correctness checks how well it was done.

Test plan

- Existing test_proposal_driver.py tests pass unchanged (regression)
- ProposalAmender unit tests: analysis-only, analysis+execution, full pipeline, failed step, empty results, Markdown summary formatting
- proposal_evaluation_correctness metric unit tests: mock LLM, multi-dimensional score parsing, missing response handling, conversation-level skip, SRE persona verification
- _parse_proposal_eval_response parser unit tests: all dimensions, N/A dimensions, fallback average computation, unparseable input
- Integration tests updated: test_oomkill_full_lifecycle, test_analysis_only, test_oomkill_claude_vertex
- New crashloop-probe-demo fixture and integration eval data for both providers
- Shellcheck passes with per-scenario scripts
- make pre-commit && make test green

🤖 Generated with Claude Code

coderabbitai · 2026-05-27T14:56:38Z

Walkthrough

This pull request introduces an LLM-judged evaluation framework for agentic remediation workflows. It extracts phase derivation into a shared utility, creates a Kubernetes CLI abstraction, implements a ProposalAmender component that enriches turn data with structured child CR results, refactors ProposalDriver to use these abstractions, and adds a new proposal_evaluation_correctness custom metric. Comprehensive unit and integration tests cover all new components.

Changes

Proposal Evaluation and Enrichment Pipeline

| Layer / File(s) | Summary |
|---|---|
| Phase derivation extraction and public API: `src/lightspeed_evaluation/core/proposal/phase.py`, `src/lightspeed_evaluation/core/proposal/__init__.py`, `src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py`, `tests/unit/core/metrics/custom/test_proposal_eval.py` | Phase derivation logic is extracted into a new public `derive_phase()` function that maps CRD conditions to terminal/in-progress states (`Denied`, `Escalated`, `Failed`, `Completed`, `InProgress`). The private `_derive_phase` helper is removed and replaced with public API imports across metrics and test modules. |
| TurnData model extension for proposal results: `src/lightspeed_evaluation/core/models/data.py` | `TurnData` gains a new optional `proposal_results` field to store structured results from child Result CRs populated by `ProposalAmender`. |
| Kubernetes CLI abstraction and implementation: `src/lightspeed_evaluation/pipeline/evaluation/cli.py` | New `CLIClient` abstract interface defines Kubernetes operations (`run`, `get_resource`, `apply`, `delete`) with timeout handling. `KubeCLI` implementation executes kubectl/oc commands via subprocess with JSON serialization and error handling. |
| ProposalAmender, workflow enrichment from child CRs: `src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`, `tests/unit/pipeline/evaluation/test_proposal_amender.py` | `ProposalAmender` uses `CLIClient` to fetch analysis/execution/verification/escalation Result CRs, populate `TurnData.proposal_results`, and generate comprehensive Markdown workflow summaries with sections for each step type and outcome details. Tests cover analysis-only, full pipeline, and edge-case workflows via `MockCLI` test double. |
| ProposalDriver refactoring, CLI abstraction and amender integration: `src/lightspeed_evaluation/pipeline/evaluation/driver.py` | `ProposalDriver` now uses `KubeCLI` for cluster operations instead of raw subprocess calls, integrates `ProposalAmender` to enrich turn data, uses public `derive_phase()` for terminal detection, and applies fallback logic when amending fails. Initialization now constructs `KubeCLI` and `ProposalAmender` instances. |
| Proposal evaluation correctness custom metric: `src/lightspeed_evaluation/core/metrics/custom/custom.py`, `src/lightspeed_evaluation/core/metrics/custom/prompts.py`, `tests/unit/core/metrics/custom/test_custom.py` | New `proposal_evaluation_correctness` LLM-judged metric evaluates workflow quality across diagnosis, action appropriateness, risk management, and verification. Implements `_evaluate_proposal_evaluation_correctness()`, adds `PROPOSAL_EVALUATION_CORRECTNESS_PROMPT` template, and provides comprehensive unit tests for success paths, validation errors, and LLM failures. |
| System configuration, validation, and documentation: `config/system.yaml`, `src/lightspeed_evaluation/core/system/validator.py`, `README.md`, `docs/EVALUATION_GUIDE.md` | Registers `custom:proposal_evaluation_correctness` in system config with 0.75 threshold, adds validation entry requiring `response` field, updates README with metric subsection, and extends EVALUATION_GUIDE.md with evaluation criteria, scoring rubric, examples, and quick-reference table entry. |
| Integration test fixtures and end-to-end scenarios: `tests/integration/system-config-agents-proposal.yaml`, `tests/integration/test_evaluation_data_proposal.yaml`, `tests/integration/test_proposal_evaluation.py`, `tests/unit/pipeline/evaluation/test_proposal_driver.py` | Test fixtures now define three proposal scenarios (`proposal_full_lifecycle`, `proposal_analysis_only`, `proposal_judge_evaluation`) with expected responses and metric thresholds. New `test_judge_evaluation()` validates end-to-end LLM scoring. Agent timeout increases from 900s to 1200s, debug logging is enabled, and driver tests update assertions to use substring matching for generated workflow summaries and add `cli_timeout` configuration validation. |
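
As a rough illustration of the phase derivation row above, the sketch below maps a list of CRD conditions to one of the five states. The real `derive_phase()` is not shown on this page, so the condition type names (assumed here to match the phase names), the `"True"` status string, and the precedence order are all assumptions.

```python
from typing import Any

# Terminal phases in the order they are checked; this precedence is assumed.
TERMINAL_PHASES = ("Denied", "Escalated", "Failed", "Completed")


def derive_phase_sketch(conditions: list[dict[str, Any]]) -> str:
    """Return the first terminal phase whose condition is True, else InProgress."""
    # Hypothetical convention: each condition's "type" equals a phase name.
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    for phase in TERMINAL_PHASES:
        if phase in true_types:
            return phase
    return "InProgress"


running = derive_phase_sketch([{"type": "Completed", "status": "False"}])
done = derive_phase_sketch([{"type": "Completed", "status": "True"}])
```

Keeping this logic in one public function means the driver's terminal-state check and the `custom:proposal_status` metric cannot drift apart.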

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

lightspeed-core/lightspeed-evaluation#232: Both PRs touch the CRD proposal evaluation path—PR #248 refactors phase derivation to delegate to a shared derive_phase utility introduced alongside proposal-status evaluation logic from PR #232.

Suggested reviewers

asamal4
VladimirKadlec

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat: JudgeLLM evaluation with ProposalAmender' directly and clearly summarizes the main changes: introduction of JudgeLLM-based evaluation capability and the ProposalAmender component. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

rioloc · 2026-05-29T14:16:50Z

@CodeRabbit review

coderabbitai · 2026-05-29T14:16:57Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/integration/test_evaluation_data_proposal.yaml (1)
95-104: 💤 Low value

expected_response is likely unused for this metric.

custom:proposal_evaluation_correctness is a turn-level LLM-as-judge metric that requires only response (no ground truth), so expected_response here will not be consulted during scoring. It's harmless but can mislead readers into thinking the judge compares against it. Consider dropping it or adding a brief comment clarifying it's documentation-only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_evaluation_data_proposal.yaml` around lines 95 - 104,
The test includes an unnecessary expected_response alongside the turn-level
judge metric custom:proposal_evaluation_correctness (defined in turn_metrics and
turn_metrics_metadata) which doesn't use ground truth; remove the
expected_response block from the test or, if you want to keep it for
human-readable documentation, add a short inline comment next to
expected_response stating it is documentation-only and not used by the
custom:proposal_evaluation_correctness metric so readers aren’t misled.

tests/integration/test_proposal_evaluation.py (1)
204-236: ⚡ Quick win

Test asserts less than its docstring claims.

The docstring states this verifies that custom:proposal_evaluation_correctness runs against the response and the pipeline completes, but the body only checks that turn.response is populated — identical to test_full_lifecycle's response check. Nothing confirms the judge metric actually produced a result.

Since live LLM scores are nondeterministic, asserting a specific score is fragile, but you can confirm the metric path executed by checking the emitted results (e.g., the JSON output written to tmp_path / "eval_output") contain a custom:proposal_evaluation_correctness entry for the turn. This makes the test meaningfully distinct from the lifecycle test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_proposal_evaluation.py` around lines 204 - 236, The
test_judge_evaluation currently only asserts that ProposalDriver populated
turn.response; update it to also verify that the judge metric ran by reading the
evaluation output written to the configured storage (the FileBackend output_dir
set to tmp_path / "eval_output") after calling evaluate(system_config,
eval_data) and assert that a result entry for
custom:proposal_evaluation_correctness exists and is associated with the
evaluated turn; locate this logic near the test_judge_evaluation function and
use the same identifiers (system_config, evaluate, eval_data, tmp_path, and the
turn from eval_data[0].turns[0]) to load the JSON results and assert the
presence of the custom:proposal_evaluation_correctness metric for that turn.
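
One way the suggested assertion could look is sketched below. The layout of the evaluation output is not shown in this thread, so the helper assumes each JSON file holds either a top-level list of records or a dict with a `results` list, and that each record carries a `metric_identifier` key; all three are assumptions to adapt to the real FileBackend output.

```python
import json
from pathlib import Path
from typing import Any

METRIC_ID = "custom:proposal_evaluation_correctness"


def metric_entries(output_dir: Path, metric_id: str) -> list[dict[str, Any]]:
    """Collect result records for metric_id from every JSON file in output_dir."""
    entries: list[dict[str, Any]] = []
    for path in sorted(output_dir.glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumed layout: a top-level list, or a dict with a "results" list.
        records = data if isinstance(data, list) else data.get("results", [])
        entries.extend(
            record
            for record in records
            if isinstance(record, dict) and record.get("metric_identifier") == metric_id
        )
    return entries
```

In the integration test this would be called as `metric_entries(tmp_path / "eval_output", METRIC_ID)` after `evaluate(system_config, eval_data)`, asserting the list is non-empty rather than asserting a specific score.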

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py`:
- Around line 61-75: The KubeCLI.run method can raise subprocess.TimeoutExpired
which escapes callers like KubeCLI.get_resource and ProposalAmender.amend;
modify KubeCLI.run to catch subprocess.TimeoutExpired and normalize it by
returning a failing subprocess.CompletedProcess (non-zero returncode, empty
stdout, stderr describing the timeout and including the timeout value/command)
so callers always receive a CompletedProcess rather than an exception; update
any references in get_resource/ProposalAmender.amend to rely on CompletedProcess
return semantics (or alternatively, raise the project-specific EvaluationError
consistently if your codebase prefers exceptions).

In `@src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`:
- Around line 36-39: The try/except in ProposalAmender that calls self._do_amend
currently only catches KeyError/TypeError/ValueError and therefore misses
subprocess errors from self._cli.get_resource (via KubeCLI.run); update the
except clause in ProposalAmender.execute (the block wrapping self._do_amend) to
also catch subprocess.SubprocessError and subprocess.TimeoutExpired (or broaden
to Exception if preferred), or alternatively normalize CLI exceptions inside
KubeCLI.run/_cli.get_resource so they raise a common custom exception that
ProposalAmender can catch; reference _do_amend, ProposalAmender,
_cli.get_resource, KubeCLI.run and ProposalDriver.execute_turn when making the
change.
- Line 80: Remove the stray stdout dump by replacing the print call that outputs
turn_data.response with structured logging: call logger.debug(...) (using the
module logger or create one via logging.getLogger(__name__) if absent) so the
Markdown summary is logged at debug level instead of printed; update the
location where print(turn_data.response) appears in proposal_amender.py (the
code handling turn_data response/amend flow) to use logger.debug and ensure
imports/logger declaration are present.
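
The first inline comment's suggestion (normalize a timeout into a failing result instead of letting the exception escape) can be sketched with the standard library alone. `run_normalized` is a hypothetical free function, not the PR's `KubeCLI.run`, and the return code 124 is an arbitrary non-zero choice borrowed from the `timeout(1)` convention.

```python
import subprocess
import sys


def run_normalized(cmd: list[str], timeout: float) -> subprocess.CompletedProcess:
    """Run cmd; on timeout, return a failing CompletedProcess instead of raising."""
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=False
        )
    except subprocess.TimeoutExpired:
        # Callers can now branch on returncode alone, with no except clause.
        return subprocess.CompletedProcess(
            args=cmd,
            returncode=124,
            stdout="",
            stderr=f"command timed out after {timeout}s: {' '.join(cmd)}",
        )


# A child that sleeps longer than the timeout yields a failing result, not an exception.
slow = run_normalized([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
fast = run_normalized([sys.executable, "-c", "print('ok')"], timeout=30)
```

The alternative named in the comment, raising a project-specific error, moves the same decision to the caller; either way the goal is that `get_resource` and the amender see one consistent failure mode.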



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e2a40ded-3b7b-4b8b-9ed7-9c9473a5bf72

📥 Commits

Reviewing files that changed from the base of the PR and between 9ada2a8 and c47a46f.

📒 Files selected for processing (20)

README.md
config/system.yaml
docs/EVALUATION_GUIDE.md
src/lightspeed_evaluation/core/metrics/custom/custom.py
src/lightspeed_evaluation/core/metrics/custom/prompts.py
src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/proposal/__init__.py
src/lightspeed_evaluation/core/proposal/phase.py
src/lightspeed_evaluation/core/system/validator.py
src/lightspeed_evaluation/pipeline/evaluation/cli.py
src/lightspeed_evaluation/pipeline/evaluation/driver.py
src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
tests/integration/system-config-agents-proposal.yaml
tests/integration/test_evaluation_data_proposal.yaml
tests/integration/test_proposal_evaluation.py
tests/unit/core/metrics/custom/test_custom.py
tests/unit/core/metrics/custom/test_proposal_eval.py
tests/unit/pipeline/evaluation/test_proposal_amender.py
tests/unit/pipeline/evaluation/test_proposal_driver.py

Extract CLI operations (run, get_resource, apply, delete) into an injectable CLIClient interface with KubeCLI implementation backed by oc/kubectl. ProposalDriver now delegates to KubeCLI instead of internal subprocess calls, enabling dependency injection for the upcoming ProposalAmender.

ProposalAmender fetches AnalysisResult, ExecutionResult, VerificationResult, and EscalationResult CRs via CLIClient and populates turn_data.proposal_results with structured status data. It also builds a Markdown workflow summary into turn_data.response.

- Add proposal_results field to TurnData model
- Create ProposalAmender with CLIClient dependency injection
- Integrate ProposalAmender into ProposalDriver (always enabled)
- Fallback to _extract_summary if amender fails

add custom:proposal_evaluation_correctness LLM-as-judge metric

New metric that evaluates agentic remediation workflow quality using an LLM judge. Scores 0.0-1.0 based on diagnosis quality, action appropriateness, risk management, and verification thoroughness.

- Add PROPOSAL_EVALUATION_CORRECTNESS_PROMPT template
- Register metric in CustomMetrics.supported_metrics
- Add METRIC_REQUIREMENTS entry (requires response field)
- Add metrics_metadata threshold (0.75) in system.yaml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-advanced-security AI found potential problems May 27, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Fixed

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Fixed

rioloc force-pushed the feat/judge-evaluation branch 2 times, most recently from 354b221 to 5c2f4b1 Compare May 28, 2026 10:58

github-advanced-security AI found potential problems May 28, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py

import json
import os
import subprocess

rioloc force-pushed the feat/judge-evaluation branch from ed62f0f to 5492575 Compare May 28, 2026 14:36

rioloc marked this pull request as ready for review May 29, 2026 10:18

rioloc force-pushed the feat/judge-evaluation branch from 5492575 to c47a46f Compare May 29, 2026 10:20

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Outdated

Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py

Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py Outdated

rioloc force-pushed the feat/judge-evaluation branch 2 times, most recently from 37b4196 to 4a5dc4c Compare May 29, 2026 17:03

github-advanced-security AI found potential problems May 29, 2026

View reviewed changes

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Dismissed

Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py Dismissed

rioloc force-pushed the feat/judge-evaluation branch from 4a5dc4c to 20926e7 Compare May 29, 2026 17:09

rioloc wants to merge 1 commit into lightspeed-core:main from rioloc:feat/judge-evaluation

2 participants