plexusone/structured-evaluation

Structured Evaluation

A reusable evaluation framework for LLM-as-Judge and multi-agent workflows.

Overview

structured-evaluation provides standardized types for evaluation reports, enabling:

  • βš–οΈ LLM-as-Judge assessments with categorical scoring and severity-based findings
  • πŸ“Š Dual-scale support with Likert (1-5) scales for human comparison studies
  • πŸ“ˆ Inter-rater reliability metrics for LLM calibration and quality assurance
  • βœ… GO/NO-GO summary reports for deterministic checks (CI, tests, validation)
  • πŸ”— Multi-agent coordination with DAG-based report aggregation
  • πŸ“‹ Claims validation for factual claim extraction and source verification

Architecture

┌───────────────────────────────────────────────────────────┐
│                    SummaryReport (GO/NO-GO)               │
│  ┌──────────────────────┐  ┌──────────────────────┐       │
│  │  Embedded Reports    │  │   Team Sections      │       │
│  │  (Full-Fidelity)     │  │   (Task Results)     │       │
│  └──────────────────────┘  └──────────────────────┘       │
└───────────────────────────────────────────────────────────┘
                              ▲
              ┌───────────────┴───────────────┐
              │                               │
┌─────────────┴─────────────┐   ┌─────────────┴─────────────┐
│     Rubric (rubric/)      │   │   ClaimsReport (claims/)  │
│  ┌─────────────────────┐  │   │  ┌─────────────────────┐  │
│  │ Category Results    │  │   │  │ Claims + Validation │  │
│  │ (pass/partial/fail) │  │   │  │ (verified/rejected) │  │
│  ├─────────────────────┤  │   │  ├─────────────────────┤  │
│  │ Findings            │  │   │  │ Sources             │  │
│  │ (severity-based)    │  │   │  │ (external/internal) │  │
│  └─────────────────────┘  │   │  └─────────────────────┘  │
│  LLM-as-Judge scoring     │   │  Fact verification        │
└───────────────────────────┘   └───────────────────────────┘

Three complementary report types:

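| Package  | Purpose                           | Evaluation Type           |
|----------|-----------------------------------|---------------------------|
| rubric/  | Categorical scoring with findings | Subjective (LLM-as-Judge) |
| claims/  | Fact verification with sources    | Objective (source-backed) |
| summary/ | GO/NO-GO aggregation              | Deterministic             |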

Installation

go get github.com/plexusone/structured-evaluation

Packages

| Package         | Description                                                       |
|-----------------|-------------------------------------------------------------------|
| rubric          | Rubric, CategoryResult, Finding, Severity types for LLM-as-Judge  |
| claims          | ClaimsReport, Claim, Validation, Verdict for source verification  |
| summary         | SummaryReport, TeamSection, TaskResult for GO/NO-GO checks        |
| combine         | DAG-based report aggregation using Kahn's algorithm               |
| render/box      | Box-format terminal renderer for summary reports                  |
| render/detailed | Detailed terminal renderer for rubric reports                     |
| render/terminal | ANSI-colored terminal renderer with UTF-8 icons                   |
| render/markdown | Markdown report renderer                                          |
| schema          | JSON Schema generation and embedding                              |

Report Types

Rubric (LLM-as-Judge)

For subjective quality assessments with detailed findings:

import "github.com/plexusone/structured-evaluation/rubric"

report := rubric.NewRubric("prd", "document.md")
report.AddCategoryResult(rubric.CategoryResult{
    Category:  "problem_definition",
    Score:     rubric.ScorePass,
    Reasoning: "Clear problem statement with measurable goals",
})
report.AddFinding(rubric.Finding{
    Severity:       rubric.SeverityMedium,
    Category:       "metrics",
    Title:          "Missing baseline metrics",
    Recommendation: "Add current baseline measurements",
})
report.Finalize(nil, "sevaluation check document.md")

Summary Report (GO/NO-GO)

For deterministic checks with pass/fail status:

import "github.com/plexusone/structured-evaluation/summary"

report := summary.NewSummaryReport("my-service", "v1.0.0", "Release Validation")
report.AddTeam(summary.TeamSection{
    ID:   "qa",
    Name: "Quality Assurance",
    Tasks: []summary.TaskResult{
        {ID: "unit-tests", Status: summary.StatusGo, Detail: "Coverage: 92%"},
        {ID: "e2e-tests", Status: summary.StatusWarn, Detail: "2 flaky tests"},
    },
})
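
The roll-up from task statuses to an overall GO/NO-GO decision can be pictured as "the worst status wins". The sketch below is an assumption about how such a roll-up might work, not the `summary` package's implementation; the `Status` type, its ordering, the `NoGo` value, and the `Overall` function are hypothetical names.

```go
package main

import "fmt"

// Status is a hypothetical task status, ordered from best to worst.
type Status int

const (
	Go Status = iota
	Warn
	NoGo
)

// String renders the status the way a report might print it.
func (s Status) String() string {
	switch s {
	case Go:
		return "GO"
	case Warn:
		return "WARN"
	default:
		return "NO-GO"
	}
}

// Overall returns the most severe status in the list.
// An empty list is treated as GO (an assumption).
func Overall(tasks []Status) Status {
	worst := Go
	for _, s := range tasks {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(Overall([]Status{Go, Warn}))       // WARN
	fmt.Println(Overall([]Status{Go, Warn, NoGo})) // NO-GO
}
```

Ordering the statuses as integers keeps the roll-up a single pass with no special cases.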

Claims Report (v0.6.0)

For factual claim extraction and source validation:

import "github.com/plexusone/structured-evaluation/claims"

report := claims.NewClaimsReport("security-advisory.md")

// External source: CVE from NVD
claim := claims.NewClaim("cvss", "CVSS 8.8 High", claims.ClaimRiskAssessment,
    claims.Location{Section: "severity"})
claim.SetValidation(claims.NewExternalValidation(
    "https://nvd.nist.gov/vuln/detail/CVE-2026-25253",
    claims.ExternalNVD,
))
report.AddClaim(*claim)

// Internal validation: exploit confirmed via code
exploit := claims.NewClaim("exploit", "RCE confirmed", claims.ClaimTechnicalFinding,
    claims.Location{Section: "impact"})
exploit.SetValidation(claims.NewInternalValidation(
    claims.MethodCodeExecution, "poc.py", true,
))
report.AddClaim(*exploit)

report.Finalize()
// report.Decision.Passed, report.Summary.Counts

Severity Levels

Following InfoSec conventions:

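| Severity | Icon | Blocking | Description              |
|----------|------|----------|--------------------------|
| Critical | 🔴   | Yes      | Must fix before approval |
| High     | 🔴   | Yes      | Must fix before approval |
| Medium   | 🟡   | No       | Should fix, tracked      |
| Low      | 🟢   | No       | Nice to fix              |
| Info     | ⚪   | No       | Informational only       |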
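
The blocking rule in the table above fits in a few lines of Go. This is a minimal sketch: the `Severity` type, its constants, and the `Blocking` and `CountBlocking` helpers are hypothetical names for illustration and may not match the `rubric` package's API.

```go
package main

import "fmt"

// Severity is a hypothetical stand-in for the library's severity type.
type Severity string

const (
	Critical Severity = "critical"
	High     Severity = "high"
	Medium   Severity = "medium"
	Low      Severity = "low"
	Info     Severity = "info"
)

// Blocking reports whether a finding of this severity must be fixed
// before approval, following the table above.
func (s Severity) Blocking() bool {
	return s == Critical || s == High
}

// CountBlocking returns how many of the given severities are blocking.
func CountBlocking(found []Severity) int {
	n := 0
	for _, s := range found {
		if s.Blocking() {
			n++
		}
	}
	return n
}

func main() {
	found := []Severity{High, Medium, Info, Critical}
	fmt.Println(CountBlocking(found)) // 2
}
```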

Pass Criteria

Two presets are available. The default criteria allow no blocking (Critical or High) findings; the strict criteria additionally cap Medium findings and require every category to pass:

criteria := rubric.DefaultPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: -1 (unlimited), RequireAllPass: false

strict := rubric.StrictPassCriteria()
// MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true
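
To make the thresholds concrete, here is a sketch of how criteria with these fields could be checked against finding counts. The field names come from the comments above, and `-1` is treated as "unlimited" as stated there. The `Counts` struct and the `Evaluate` and `within` functions are hypothetical, and the library's real evaluation logic may differ.

```go
package main

import "fmt"

// PassCriteria mirrors the fields named in the comments above.
// A limit of -1 means "unlimited".
type PassCriteria struct {
	MaxCritical    int
	MaxHigh        int
	MaxMedium      int
	RequireAllPass bool
}

// Counts is a hypothetical summary of a report: findings per severity
// and whether every category scored "pass".
type Counts struct {
	Critical, High, Medium int
	AllCategoriesPass      bool
}

// within reports whether n respects limit, treating a negative limit as
// "no limit".
func within(n, limit int) bool {
	return limit < 0 || n <= limit
}

// Evaluate returns true when the counts satisfy the criteria.
func Evaluate(c PassCriteria, got Counts) bool {
	if !within(got.Critical, c.MaxCritical) ||
		!within(got.High, c.MaxHigh) ||
		!within(got.Medium, c.MaxMedium) {
		return false
	}
	return !c.RequireAllPass || got.AllCategoriesPass
}

func main() {
	def := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: -1, RequireAllPass: false}
	strict := PassCriteria{MaxCritical: 0, MaxHigh: 0, MaxMedium: 3, RequireAllPass: true}

	// Five medium findings and one category that did not pass.
	got := Counts{Medium: 5, AllCategoriesPass: false}
	fmt.Println(Evaluate(def, got), Evaluate(strict, got)) // true false
}
```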

CLI Tool

# Install
go install github.com/plexusone/structured-evaluation/cmd/sevaluation@latest

# Render reports
sevaluation render report.json --format=detailed
sevaluation render report.json --format=terminal   # ANSI colors + UTF-8 icons
sevaluation render report.json --format=markdown   # Markdown output
sevaluation render report.json --format=box
sevaluation render report.json --format=json

# Check pass/fail (exit code 0/1)
sevaluation check report.json

# Validate structure
sevaluation validate report.json

# Generate JSON Schema
sevaluation schema generate -o ./schema/

DAG-Based Aggregation

For multi-agent workflows with dependencies:

import "github.com/plexusone/structured-evaluation/combine"

results := []combine.AgentResult{
    {TeamID: "qa", Tasks: qaTasks},
    {TeamID: "security", Tasks: secTasks, DependsOn: []string{"qa"}},
    {TeamID: "release", Tasks: relTasks, DependsOn: []string{"qa", "security"}},
}

report := combine.AggregateResults(results, "my-project", "v1.0.0", "Release")
// Teams are topologically sorted: qa → security → release
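
The ordering step can be illustrated with a generic version of Kahn's algorithm, which the `combine` package is described as using. The sketch below is not the package's code: `TopoSort` and its map-based input are hypothetical, and ties are broken alphabetically only to make the output deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// TopoSort orders node IDs so that every node appears after its dependencies.
// deps maps a node ID to the IDs it depends on. It returns an error if a
// dependency is not itself a key in deps or if the graph contains a cycle.
func TopoSort(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int, len(deps))
	dependents := make(map[string][]string)
	for id, ds := range deps {
		indegree[id] = len(ds)
		for _, d := range ds {
			if _, ok := deps[d]; !ok {
				return nil, errors.New("unknown dependency: " + d)
			}
			dependents[d] = append(dependents[d], id)
		}
	}

	// Seed the queue with nodes that have no dependencies.
	var queue []string
	for id, n := range indegree {
		if n == 0 {
			queue = append(queue, id)
		}
	}
	sort.Strings(queue)

	order := make([]string, 0, len(deps))
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		order = append(order, id)

		// Release every node whose last unmet dependency was id.
		next := dependents[id]
		sort.Strings(next)
		for _, dep := range next {
			indegree[dep]--
			if indegree[dep] == 0 {
				queue = append(queue, dep)
			}
		}
	}
	if len(order) != len(deps) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := TopoSort(map[string][]string{
		"qa":       nil,
		"security": {"qa"},
		"release":  {"qa", "security"},
	})
	fmt.Println(order, err) // [qa security release] <nil>
}
```

If some nodes never reach an in-degree of zero, the output is shorter than the input, which is how the sketch detects cycles.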

JSON Schema

Schemas are embedded for runtime validation:

import "github.com/plexusone/structured-evaluation/schema"

rubricSchema := schema.RubricSchemaJSON
claimsSchema := schema.ClaimsSchemaJSON
summarySchema := schema.SummarySchemaJSON

RubricSet (v0.4.0)

Define explicit criteria for consistent categorical evaluations:

cat := rubric.NewCategory("quality", "Output Quality", "Overall quality assessment").
    WithPassPartialFail(
        []string{"Meets all requirements, no significant issues"},
        []string{"Meets most requirements, minor issues"},
        []string{"Missing key requirements or major issues"},
    )

// Use default PRD rubric
rubricSet := rubric.DefaultPRDRubricSet()

Judge Metadata (v0.2.0)

Track LLM judge configuration for reproducibility:

judge := rubric.NewJudgeMetadata("claude-3-opus").
    WithProvider("anthropic").
    WithPrompt("prd-eval-v1", "1.0").
    WithTemperature(0.0).
    WithTokenUsage(1500, 800)

report.SetJudge(judge)

Pairwise Comparison (v0.2.0)

Compare two outputs instead of absolute scoring:

comparison := rubric.NewPairwiseComparison(input, outputA, outputB)
comparison.SetWinner(rubric.WinnerA, "A is more accurate", 0.9)

// Aggregate multiple comparisons
result := rubric.ComputePairwiseResult(comparisons)
// result.WinRateA, result.OverallWinner
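
As a rough illustration of what aggregating comparisons involves, the sketch below computes win rates from a list of outcomes. It assumes that ties count toward the denominator and that the overall winner is the side with more wins. The `Winner` values (including `WinnerTie`), `PairwiseResult`, and `Summarize` are hypothetical and may not match how `ComputePairwiseResult` behaves.

```go
package main

import "fmt"

// Winner is a hypothetical outcome label for one pairwise comparison.
type Winner string

const (
	WinnerA   Winner = "A"
	WinnerB   Winner = "B"
	WinnerTie Winner = "tie"
)

// PairwiseResult summarizes a batch of comparisons.
type PairwiseResult struct {
	WinRateA, WinRateB float64
	OverallWinner      Winner
}

// Summarize computes win rates over all comparisons (ties count toward the
// denominator) and picks the side with more wins, or a tie if equal.
func Summarize(outcomes []Winner) PairwiseResult {
	res := PairwiseResult{OverallWinner: WinnerTie}
	if len(outcomes) == 0 {
		return res
	}
	var a, b int
	for _, w := range outcomes {
		switch w {
		case WinnerA:
			a++
		case WinnerB:
			b++
		}
	}
	total := float64(len(outcomes))
	res.WinRateA = float64(a) / total
	res.WinRateB = float64(b) / total
	switch {
	case a > b:
		res.OverallWinner = WinnerA
	case b > a:
		res.OverallWinner = WinnerB
	}
	return res
}

func main() {
	res := Summarize([]Winner{WinnerA, WinnerA, WinnerB, WinnerTie})
	fmt.Printf("%.2f %.2f %s\n", res.WinRateA, res.WinRateB, res.OverallWinner) // 0.50 0.25 A
}
```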

Multi-Judge Aggregation (v0.4.0)

Combine evaluations from multiple judges:

result := rubric.AggregateEvaluations(evaluations, rubric.AggregationMajority)

// Methods: AggregationMajority, AggregationConservative, AggregationOptimistic
// result.Agreement - inter-judge agreement (0-1)
// result.Disagreements - categories with significant disagreement
// result.ConsolidatedDecision - final aggregated decision
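
The three method names suggest the following per-category behavior: majority takes the most common score, conservative takes the worst, and optimistic takes the best. The sketch below implements that reading for a single category. It is an assumption based on the names alone, and the `Score`, `Method`, and `Aggregate` identifiers are hypothetical, not the library's.

```go
package main

import "fmt"

// Score is ordered from worst to best so that min and max are meaningful.
type Score int

const (
	Fail Score = iota
	Partial
	Pass
)

// Method selects how judges' scores are combined.
type Method int

const (
	Majority Method = iota
	Conservative
	Optimistic
)

// Aggregate combines one category's scores from several judges. It returns
// the combined score and the fraction of judges that gave exactly that score.
func Aggregate(scores []Score, m Method) (Score, float64) {
	if len(scores) == 0 {
		return Fail, 0
	}
	counts := map[Score]int{}
	lo, hi := scores[0], scores[0]
	for _, s := range scores {
		counts[s]++
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}

	var chosen Score
	switch m {
	case Conservative:
		chosen = lo
	case Optimistic:
		chosen = hi
	default:
		// Majority: most frequent score; ties resolve to the lower score.
		chosen = lo
		for s := Fail; s <= Pass; s++ {
			if counts[s] > counts[chosen] {
				chosen = s
			}
		}
	}
	return chosen, float64(counts[chosen]) / float64(len(scores))
}

func main() {
	scores := []Score{Pass, Pass, Partial, Fail}
	s, agreement := Aggregate(scores, Majority)
	fmt.Println(s == Pass, agreement) // true 0.5
}
```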

Likert Scales (v0.5.0)

Use 1-5 numeric scales for human comparison studies:

// Create a Likert-scale category
cat := rubric.NewCategory("quality", "Content Quality", "Overall quality").
    WithLikert5(rubric.StandardLikert5Anchors())

// Record a Likert score (automatically maps to categorical)
result := rubric.NewCategoryResultFromLikert("quality", 4, config, "Good quality")
// result.Score = ScorePass, result.NumericScore = 4.0

// Or record both categorical and numeric
both := rubric.NewCategoryResultWithNumeric("quality", rubric.ScorePass, 4.5, "reasoning")
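
The automatic mapping from a Likert score to a categorical score can be sketched as a simple threshold function. Only "4 maps to pass" is shown in the example above; the remaining thresholds (5 is pass, 3 is partial, 1-2 is fail) are assumptions, and `LikertToCategory` is a hypothetical helper rather than part of the library.

```go
package main

import (
	"errors"
	"fmt"
)

// LikertToCategory maps a 1-5 Likert score to pass/partial/fail.
// Thresholds other than "4 is pass" are assumed for illustration.
func LikertToCategory(score int) (string, error) {
	switch {
	case score < 1 || score > 5:
		return "", errors.New("likert score must be between 1 and 5")
	case score >= 4:
		return "pass", nil
	case score == 3:
		return "partial", nil
	default:
		return "fail", nil
	}
}

func main() {
	for score := 1; score <= 5; score++ {
		cat, _ := LikertToCategory(score)
		fmt.Println(score, cat)
	}
}
```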

Inter-Rater Reliability (v0.5.0)

Compare LLM evaluations with human ground truth:

// Compute IRR metrics
metrics := rubric.ComputeIRRFromResults(humanResults, llmResults)

fmt.Printf("Exact Agreement: %.1f%%\n", metrics.ExactAgreement*100)
fmt.Printf("Adjacent Agreement: %.1f%%\n", metrics.AdjacentAgreement*100)
fmt.Printf("Pearson r: %.3f\n", metrics.PearsonCorrelation)

// Categorical agreement with confusion matrix
agreement := rubric.ComputeCategoricalAgreement(humanResults, llmResults)
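
For readers unfamiliar with these metrics, the sketch below computes them from two parallel slices of scores using their standard definitions: exact agreement is the share of identical scores, adjacent agreement is the share within one point, and Pearson r is the covariance divided by the product of the standard deviations. The `IRR` struct and `ComputeIRR` function are hypothetical and independent of the library's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// IRR holds three common inter-rater reliability metrics.
type IRR struct {
	ExactAgreement    float64 // share of items with identical scores
	AdjacentAgreement float64 // share of items within one point
	Pearson           float64 // NaN when either rater has zero variance
}

// ComputeIRR compares two raters' scores item by item.
func ComputeIRR(human, llm []float64) (IRR, error) {
	n := len(human)
	if n == 0 || n != len(llm) {
		return IRR{}, errors.New("need two non-empty score slices of equal length")
	}
	var exact, adjacent int
	var sumH, sumL float64
	for i := range human {
		d := math.Abs(human[i] - llm[i])
		if d == 0 {
			exact++
		}
		if d <= 1 {
			adjacent++
		}
		sumH += human[i]
		sumL += llm[i]
	}
	meanH, meanL := sumH/float64(n), sumL/float64(n)

	var cov, varH, varL float64
	for i := range human {
		dh, dl := human[i]-meanH, llm[i]-meanL
		cov += dh * dl
		varH += dh * dh
		varL += dl * dl
	}
	r := math.NaN()
	if varH > 0 && varL > 0 {
		r = cov / math.Sqrt(varH*varL)
	}
	return IRR{
		ExactAgreement:    float64(exact) / float64(n),
		AdjacentAgreement: float64(adjacent) / float64(n),
		Pearson:           r,
	}, nil
}

func main() {
	human := []float64{5, 4, 3, 2, 4}
	llm := []float64{5, 3, 3, 4, 4}
	m, err := ComputeIRR(human, llm)
	if err != nil {
		panic(err)
	}
	// Prints: exact 60.0% adjacent 80.0% r 0.419
	fmt.Printf("exact %.1f%% adjacent %.1f%% r %.3f\n",
		m.ExactAgreement*100, m.AdjacentAgreement*100, m.Pearson)
}
```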

Claims Validation (v0.6.0)

Validate factual claims have proper source backing:

import "github.com/plexusone/structured-evaluation/claims"

report := claims.NewClaimsReport("article.md")

// Source types: external (URL), internal (code/lab), derived, subjective
// Reliability tiers: authoritative, high, medium, low
// Verdicts: verified, unverified, needs-review, rejected

// Configure pass criteria
report.SetCriteria(claims.ClaimsCriteria{
    RequireAllVerified:           true,
    AllowSubjectiveWithDisclaimer: false,
    MinReliabilityTier:           claims.ReliabilityHigh,
})

report.Finalize()
if report.IsPassing() {
    fmt.Println("Ready for publication")
}
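
One way to picture how such criteria might be applied is sketched below. The tier and verdict names follow the comments above. The ordering of tiers, the rule that a rejected claim always fails, and the `Tier`, `Verdict`, `Claim`, `Criteria`, and `Failing` identifiers are all assumptions for illustration, not the `claims` package's behavior.

```go
package main

import "fmt"

// Tier orders source reliability from weakest to strongest.
type Tier int

const (
	TierLow Tier = iota
	TierMedium
	TierHigh
	TierAuthoritative
)

// Verdict values follow the list in the snippet above.
type Verdict string

const (
	Verified    Verdict = "verified"
	Unverified  Verdict = "unverified"
	NeedsReview Verdict = "needs-review"
	Rejected    Verdict = "rejected"
)

// Claim is a hypothetical, trimmed-down claim record.
type Claim struct {
	ID      string
	Verdict Verdict
	Tier    Tier
}

// Criteria is a hypothetical subset of the pass criteria.
type Criteria struct {
	RequireAllVerified bool
	MinTier            Tier
}

// Failing returns the IDs of claims that violate the criteria, in input order.
func Failing(claims []Claim, c Criteria) []string {
	var bad []string
	for _, cl := range claims {
		switch {
		case cl.Verdict == Rejected:
			bad = append(bad, cl.ID)
		case c.RequireAllVerified && cl.Verdict != Verified:
			bad = append(bad, cl.ID)
		case cl.Tier < c.MinTier:
			bad = append(bad, cl.ID)
		}
	}
	return bad
}

func main() {
	claims := []Claim{
		{ID: "cvss", Verdict: Verified, Tier: TierAuthoritative},
		{ID: "exploit", Verdict: Verified, Tier: TierMedium},
		{ID: "impact", Verdict: NeedsReview, Tier: TierHigh},
	}
	bad := Failing(claims, Criteria{RequireAllVerified: true, MinTier: TierHigh})
	fmt.Println(bad, len(bad) == 0) // [exploit impact] false
}
```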

Embedded Reports (v0.6.0)

Archive full-fidelity reports within SummaryReport:

report := summary.NewSummaryReport("project", "v1.0.0", "RELEASE")

// Embed detailed reports
report.EmbedRubricReport("quality-review", rubricReport)
report.EmbedClaimsReport("source-validation", claimsReport)

// Retrieve later
var r rubric.Rubric
report.GetEmbeddedRubricReport("quality-review", &r)
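
The retrieve-into-a-pointer pattern above resembles a JSON round trip. The sketch below shows one way full-fidelity embedding could work, by storing each report as raw JSON keyed by ID. It is an assumption about the approach, not the library's implementation; `Envelope`, `Embed`, `Get`, and `MiniRubric` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Envelope is a hypothetical container that stores each embedded report as
// raw JSON keyed by ID, so no fields are lost when the summary is serialized.
type Envelope struct {
	Embedded map[string]json.RawMessage `json:"embedded,omitempty"`
}

// Embed marshals any report value and stores it under id.
func (e *Envelope) Embed(id string, report any) error {
	raw, err := json.Marshal(report)
	if err != nil {
		return err
	}
	if e.Embedded == nil {
		e.Embedded = make(map[string]json.RawMessage)
	}
	e.Embedded[id] = raw
	return nil
}

// Get decodes the report stored under id into out, which must be a pointer.
func (e *Envelope) Get(id string, out any) error {
	raw, ok := e.Embedded[id]
	if !ok {
		return errors.New("no embedded report with id " + id)
	}
	return json.Unmarshal(raw, out)
}

// MiniRubric stands in for a full rubric report.
type MiniRubric struct {
	Document string `json:"document"`
	Passed   bool   `json:"passed"`
}

func main() {
	var env Envelope
	if err := env.Embed("quality-review", MiniRubric{Document: "document.md", Passed: true}); err != nil {
		panic(err)
	}
	var r MiniRubric
	if err := env.Get("quality-review", &r); err != nil {
		panic(err)
	}
	fmt.Println(r.Document, r.Passed) // document.md true
}
```

Keeping the payload as `json.RawMessage` lets the container hold different report types without knowing their Go types in advance.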

OmniObserve Integration

Export evaluations to Opik, Phoenix, or Langfuse:

import "github.com/plexusone/omniobserve/integrations/sevaluation"

// Export to observability platform
err := sevaluation.Export(ctx, provider, traceID, report)

Integration

Designed to work with:

  • github.com/plexusone/omniobserve - LLM observability (Opik, Phoenix, Langfuse)
  • github.com/grokify/structured-requirements - PRD evaluation templates
  • github.com/plexusone/multi-agent-spec - Agent coordination
  • github.com/grokify/structured-changelog - Release validation

License

MIT License - see LICENSE for details.
