
LightSpeed Evaluation Framework: Complete Guide

Last Updated: December 23, 2025
Note: This document was generated with AI assistance.


Table of Contents

Part 1: Introduction & Fundamentals

  1. Introduction
  2. Understanding AI Evaluation
  3. Quick Start

Part 2: Evaluation Methodologies

  4. Methodologies Overview
  5. Turn-Level Metrics (Single Q&A)
  6. Conversation-Level Metrics
  7. Metric Selection Guide

Part 3: Practical Implementation

  8. Step-by-Step Setup
  9. Configuration Guide
  10. Running Evaluations
  11. Programmatic API
  12. Understanding Results

Part 4: Real-World Application

  13. Common Use Cases
  14. Best Practices
  15. Troubleshooting

Part 5: Reference Materials

  16. Quick Reference Tables
  17. Resources & Links

Part 1: Introduction & Fundamentals

1. Introduction

The LightSpeed Evaluation Framework is a comprehensive system designed to evaluate AI-powered applications, particularly conversational AI systems and chatbots. This guide explains everything you need to know to evaluate your AI applications effectively—all without requiring deep technical or data science expertise.

What This Framework Does

Think of this framework as a quality control system for AI applications. Just as you might test a website to ensure all links work and pages load correctly, this framework tests AI systems to ensure they:

  • Provide accurate and relevant answers
  • Use correct information from their knowledge base
  • Maintain context across conversations
  • Call the right tools or functions when needed
  • Perform expected actions in the system

Who Should Use This Guide

  • Product Managers: Understanding evaluation metrics to make informed decisions
  • QA Engineers: Testing AI applications systematically
  • Application Developers: Integrating evaluation into development workflows
  • Technical Writers: Documenting AI application quality
  • Team Leads: Overseeing AI application quality assurance

2. Understanding AI Evaluation

Why Evaluate AI Applications?

Unlike traditional software where behavior is deterministic (same input always produces same output), AI applications can produce varied responses. Evaluation helps ensure:

  1. Quality Assurance: Responses meet quality standards
  2. Consistency: Similar questions get consistent treatment
  3. Safety: Responses don't include harmful or incorrect information
  4. Performance Tracking: Monitor improvements or regressions over time
  5. Compliance: Meet organizational standards and requirements

Two Levels of Evaluation

Turn-Level Evaluation (Single Query-Response)

  • Evaluates individual question-answer pairs
  • Like checking if a single customer support ticket was handled correctly
  • Example: "Was the answer to 'How do I reset my password?' accurate and helpful?"

Conversation-Level Evaluation (Multi-Turn Dialogue)

  • Evaluates entire conversations with multiple back-and-forth exchanges
  • Like reviewing a complete customer support conversation
  • Example: "Did the AI successfully guide the user through troubleshooting across 5 messages?"

3. Quick Start

Installation

# Navigate to project directory
cd lightspeed-evaluation
# Install dependencies
uv sync
# OR using pip
pip install -e .

Set Environment Variables

# Required: Judge LLM (the AI that evaluates your AI)
export OPENAI_API_KEY="sk-your-api-key-here"
# Optional: For live API testing
export API_KEY="your-api-endpoint-key"

Run Your First Evaluation

lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml

That's it! Results will be written to the eval_output/ directory.


Part 2: Evaluation Methodologies

4. Methodologies Overview

The framework uses four main categories of evaluation methods:

| Category | What It Does | When to Use | Level |
| --- | --- | --- | --- |
| Ragas Metrics | Industry-standard metrics for response and context quality | RAG QnA, single-turn responses | Turn |
| DeepEval Metrics | Advanced conversation analysis | Multi-turn conversations | Conversation |
| Custom Metrics | Specialized evaluations for specific needs | Intent checking, tool validation | Turn |
| Script-Based Metrics | Real-world validation through automated scripts | E2E RAG/agent workflows | Turn |

Quick Selection Guide

Choose Ragas Metrics when:

  • You want to verify if answers are accurate and relevant
  • You need to check if the AI is using the right information
  • You want industry-standard, well-documented metrics

Choose DeepEval Metrics when:

  • You're evaluating multi-turn conversations
  • You need to assess conversation completeness
  • You want to check if the AI remembers earlier parts of the conversation

Choose Custom Metrics when:

  • You have specific requirements not covered by standard metrics
  • You need to compare against expected answers
  • You want to verify the AI's intent or tool usage

Choose Script-Based Metrics when:

  • Your AI performs actions in real systems
  • You need to verify real-world outcomes
  • You want to test end-to-end workflows

5. Turn-Level Metrics

Turn-level metrics evaluate individual question-answer pairs.

5.1 Ragas Metrics

A. Response Quality Metrics

Response Relevancy

What it measures: How well does the answer address the actual question?

Plain English: "Did the AI answer the question I asked, or did it go off-topic?"

Score Range: 0.0 to 1.0 (higher is better)

Example:

Question: "How do I reset my password?"

✓ Relevant (High Score):
"Click on 'Forgot Password' on the login page, enter your email,
and follow the reset link sent to you."

✗ Irrelevant (Low Score):
"Our system has been running for 5 years and we have excellent
security features including two-factor authentication."

When to use: Ensuring the AI stays on topic

Threshold: 0.8 or higher

Required fields: query, response


Faithfulness

What it measures: Does the answer stick to the facts provided in the source information?

Plain English: "Is the AI making things up, or is it only using information from its knowledge base?"

Score Range: 0.0 to 1.0 (higher is better)

Example:

Context: "OpenShift Virtualization requires 4GB RAM minimum."
Question: "What are OpenShift Virtualization requirements?"

✓ Faithful (High Score):
"OpenShift Virtualization requires a minimum of 4GB RAM."

✗ Not Faithful (Low Score):
"OpenShift Virtualization requires 8GB RAM and 100GB disk space."
(The disk space wasn't in the context - made up!)

When to use: Preventing AI hallucinations (making up information)

Threshold: 0.8 or higher

Required fields: response, contexts


B. Context/Retrieval Quality Metrics

Context Recall

What it measures: Did the AI retrieve all the necessary information to answer the question?

Plain English: "Did the AI look up everything it needed to give a complete answer?"

Score Range: 0.0 to 1.0 (higher is better)

Example:

Question: "What are the storage and memory requirements for OpenShift?"
Expected Answer mentions: 120GB storage AND 16GB RAM

Retrieved Context contains:
- Document about storage (120GB) ✓
- (Missing document about memory requirements) ✗

Context Recall: 0.5 (retrieved 1 out of 2 needed pieces)

When to use: Improving search/retrieval systems

Threshold: 0.8 or higher

Required fields: contexts, expected_response


Context Precision

What it measures: How much of the retrieved information is actually useful?

Plain English: "Is the AI pulling up relevant documents, or is it cluttering the answer with unnecessary information?"

Two variants:

  • Without Reference: Uses AI's response to judge relevance
  • With Reference: Uses expected answer for more accurate judgment

Score Range: 0.0 to 1.0 (higher is better)

When to use: Optimizing search algorithms, reducing noise

Threshold: 0.7 or higher

Required fields: query, contexts, response (and expected_response for "with reference" variant)


Context Relevance

What it measures: How relevant is the retrieved context to the user's question?

Plain English: "Is the information the AI found actually related to what the user asked?"

Score Range: 0.0 to 1.0 (higher is better)

When to use: Evaluating search quality before answer generation

Threshold: 0.7 or higher

Required fields: query, contexts
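
As a quick illustration, the retrieval-quality metrics can be run on a single turn through the programmatic API covered in Part 3. This is a sketch only: the metric identifiers come from the reference tables in this guide, and it assumes TurnData accepts a contexts field just as the YAML schema does.

from lightspeed_evaluation import SystemConfig, TurnData, evaluate_turn

config = SystemConfig()  # Judge LLM settings come from defaults here

# A single turn with the retrieved context documents attached
turn = TurnData(
    turn_id="retrieval_check",
    query="What are the memory requirements for OpenShift Virtualization?",
    response="OpenShift Virtualization requires a minimum of 4GB RAM.",
    contexts=["OpenShift Virtualization requires 4GB RAM minimum."],  # assumed field, mirrors the YAML schema
    expected_response="A minimum of 4GB RAM is required.",
)

# context_relevance needs query + contexts; context_recall needs contexts + expected_response
results = evaluate_turn(
    config,
    turn,
    metrics=["ragas:context_relevance", "ragas:context_recall"],
)
for r in results:
    print(f"{r.metric_identifier}: {r.result} (score={r.score})")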


5.2 Custom Metrics

Answer Correctness

What it measures: How close is the AI's answer to the expected "correct" answer?

Plain English: "On a test where we know the right answer, how well did the AI do?"

Score Range: 0.0 to 1.0 (higher is better)

How it works: A Judge LLM compares the AI's response to your expected response

Example:

Question: "What is the capital of France?"
Expected Response: "Paris"

AI Response: "The capital of France is Paris."
Score: 1.0 (Perfect match)

AI Response: "Lyon is a major city in France."
Score: 0.1 (Incorrect answer)

When to use: Testing against known question-answer pairs, benchmarking

Threshold: 0.75 or higher

Required fields: query, response, expected_response


Intent Evaluation

What it measures: Does the AI's response have the right intent/purpose?

Plain English: "Is the AI trying to do what we expect it to do?"

Score: Binary (0 or 1)

Intent Categories:

  • Explain a concept: "What is Kubernetes?" → Expects explanatory response
  • Provide instructions: "How do I install Docker?" → Expects step-by-step guide
  • Refuse/Decline: "Can you hack this system?" → Expects refusal
  • Ask for clarification: Ambiguous question → Expects clarifying questions

Example:

Question: "Tell me a joke about programming"
Expected Intent: "refuse" (professional support bot should decline)

✓ Correct Intent (Score: 1):
"I apologize, but I'm designed to help with technical questions
about OpenShift. How can I assist you today?"

✗ Wrong Intent (Score: 0):
"Why do programmers prefer dark mode? Because light attracts bugs!"

When to use: Ensuring appropriate AI behavior, safety checking

Threshold: 1 (must match exactly)

Required fields: query, response, expected_intent


Tool Evaluation

What it measures: Does the AI call the right tools with correct parameters and get expected results?

Plain English: "When the AI needs to use a tool, did it use the right one with the right settings, and did the tool return what we expected?"

Score: Binary (0 or 1)

How it works:

  • Compares expected tool calls against actual tool calls
  • Validates tool names match exactly
  • Checks parameters (supports regex patterns)
  • Optionally validates tool call results (supports regex patterns)

Example:

Question: "Show me all pods in the default namespace"

Expected Tool Call:
- Tool: oc_get
- Parameters: {kind: "pod", namespace: "default"}

✓ Correct (Score: 1):
Tool: oc_get, Parameters: {kind: "pod", namespace: "default"}

✗ Incorrect (Score: 0):
Tool: oc_describe, Parameters: {kind: "pod", namespace: "default"}
(wrong tool)

Pattern Matching:

# Regex support for flexible matching
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        namespace: "openshift-light.*"  # Matches openshift-lightspeed

Result Validation (Optional):

# Validate tool call results using regex patterns
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        kind: pod
        namespace: default
      result: ".*Running.*"  # Verify pod is in Running state

  - - tool_name: oc_create
      arguments:
        kind: namespace
        name: test-ns
      result: ".*created"  # Verify creation succeeded

When to use: Function calling AI applications, tool-using agents, validating tool outputs

Threshold: 1 (must be exact)

Required fields: expected_tool_calls, tool_calls


5.3 Script-Based Metrics

Action Evaluation

What it measures: Did the AI's action actually work in the real system?

Plain English: "Don't just check what the AI said—check if it actually did what it was supposed to do."

Score: Binary (0 or 1)

How it works:

  1. AI performs an action (e.g., "Create a namespace")
  2. Framework runs your verification script
  3. Script exit code determines pass/fail (0 = success, non-zero = failure)

Example:

# verify_namespace.sh
#!/bin/bash
kubectl get namespace test-ns > /dev/null 2>&1
exit $?  # Returns 0 if namespace exists

Configuration:

- conversation_group_id: infrastructure_test
  setup_script: ./scripts/setup_cluster.sh
  cleanup_script: ./scripts/cleanup_cluster.sh
  turns:
    - turn_id: create_namespace
      query: "Create a namespace called demo-app"
      verify_script: ./scripts/verify_namespace.sh
      turn_metrics:
        - script:action_eval

When to use: Infrastructure changes, system modifications, end-to-end testing

Important: Scripts only run when API mode is enabled

Threshold: 1 (must succeed)

Required fields: verify_script (API mode must be enabled)


6. Conversation-Level Metrics

Conversation-level metrics evaluate complete multi-turn dialogues.

6.1 DeepEval Metrics

Conversation Completeness

What it measures: Did the conversation fully address what the user wanted to accomplish?

Plain English: "By the end of the conversation, did the user get everything they were looking for?"

Score Range: 0.0 to 1.0 (higher is better)

Example:

User: "I need to deploy an app to OpenShift and set up monitoring"
AI: "I can help! Let's start with deployment. What's your app name?"
User: "my-web-app"
AI: "Great! Here's how to deploy... [deployment instructions]"
User: "Done! What about monitoring?"
AI: "For monitoring, here are the steps... [monitoring setup]"
User: "Perfect, thanks!"

✓ Goal 1: Deploy app → Addressed
✓ Goal 2: Set up monitoring → Addressed
Score: 1.0 (Complete)

When to use: Evaluating customer support conversations, goal-oriented assistants

Threshold: 0.8 or higher


Conversation Relevancy

What it measures: How relevant are the responses throughout the conversation?

Plain English: "Does each response stay on topic?"

Score Range: 0.0 to 1.0 (higher is better)

When to use: Keeping conversations focused, detecting when AI drifts off-topic

Threshold: 0.7 or higher


Knowledge Retention

What it measures: Does the AI remember and use information from earlier in the conversation?

Plain English: "Does the AI have a memory, or does it forget what was said earlier?"

Score Range: 0.0 to 1.0 (higher is better)

Example:

✓ Good Retention (High Score):
User: "My deployment is called web-app in the production namespace"
AI: "Got it. What would you like to do with web-app?"
User: "Scale it to 3 replicas"
AI: "I'll scale web-app in the production namespace to 3 replicas."
     [remembers both name and namespace]

✗ Poor Retention (Low Score):
User: "My deployment is called web-app in the production namespace"
AI: "Okay, what do you want to do?"
User: "Scale it to 3 replicas"
AI: "What's the deployment name and namespace?" [forgot!]

When to use:

  • Multi-turn conversations and troubleshooting sessions
  • Evaluating fine-tuned models (especially useful to measure if fine-tuning improved context retention)
  • Comparing base models vs fine-tuned versions for conversation ability

Threshold: 0.7 or higher


7. Metric Selection Guide

Decision Tree

What are you evaluating?
│
├─ Single Q&A (Turn-Level)
│  │
│  ├─ Answer Quality?
│  │  ├─ Is answer relevant? → response_relevancy
│  │  ├─ Is answer factual? → faithfulness
│  │  └─ Matches expected? → answer_correctness
│  │
│  ├─ Information Retrieval?
│  │  ├─ Found everything needed? → context_recall
│  │  ├─ Is retrieved info relevant? → context_relevance
│  │  └─ Too much irrelevant info? → context_precision
│  │
│  ├─ AI Behavior?
│  │  ├─ Right intent? → intent_eval
│  │  └─ Right tools? → tool_eval
│  │
│  └─ Real Actions?
│     └─ Infrastructure changes? → action_eval
│
└─ Conversation (Conversation-Level)
   ├─ Goals achieved? → conversation_completeness
   ├─ Stayed on topic? → conversation_relevancy
   └─ Remembered context? → knowledge_retention

Common Recipe Patterns

Recipe 1: Customer Support Bot

turn_metrics:
  - ragas:response_relevancy    # On-topic answer?
  - ragas:faithfulness          # No hallucinations?
  - custom:answer_correctness   # Matches expected?

Recipe 2: Multi-Turn Troubleshooting

# Per turn:
turn_metrics:
  - ragas:response_relevancy

# Full conversation:
conversation_metrics:
  - deepeval:conversation_completeness
  - deepeval:knowledge_retention

Recipe 3: Tool-Calling Agent

turn_metrics:
  - custom:tool_eval            # Right tool + params?
  - ragas:response_relevancy    # Good explanation?

Recipe 4: Infrastructure Automation

turn_metrics:
  - script:action_eval          # Action worked?
  - custom:tool_eval            # Called right tool?

Part 3: Practical Implementation

8. Step-by-Step Setup

Step 1: Prerequisites

  • Python 3.11 - 3.13
  • UV package manager (recommended) or pip
  • API key for a Judge LLM (e.g., OpenAI)
  • Basic command line knowledge

Step 2: Installation

# Navigate to project
cd lightspeed-evaluation

# Install with UV
uv sync

# OR with pip
pip install -e .

Step 3: Set Environment Variables

# Required: Judge LLM
export OPENAI_API_KEY="sk-your-api-key-here"

# For other providers:
# export WATSONX_API_KEY="your-key"
# export GEMINI_API_KEY="your-key"

# Optional: For live API testing
export API_KEY="your-api-endpoint-key"

Step 4: Verify Installation

# Check if command is available
lightspeed-eval --help

9. Configuration Guide

System Configuration (system.yaml)

Minimal Configuration:

# Judge LLM settings
llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0              # Deterministic evaluation
  max_tokens: 512
  timeout: 300
  num_retries: 3

# Default metrics and thresholds
metrics_metadata:
  turn_level:
    "ragas:response_relevancy":
      threshold: 0.8
      description: "How relevant the response is"
      default: true             # Used by default
    "ragas:faithfulness":
      threshold: 0.8
      description: "Factual accuracy"
      default: false            # Only when specified

# Output settings
output:
  output_dir: "./eval_output"
  enabled_outputs:
    - csv                       # Detailed results
    - json                      # Statistics
    - txt                       # Summary

# Visualization
visualization:
  enabled_graphs:
    - "pass_rates"
    - "score_distribution"

Important Settings Explained:

  • default: true: Metric runs automatically if no metrics specified
  • default: false: Only runs when explicitly requested
  • threshold: Minimum score to pass (0.0 to 1.0)
  • temperature: 0.0: Ensures consistent, deterministic evaluation

Panel of Judges (Advanced)

⚠️ Note: The traditional llm config will be deprecated. Use llm_pool + judge_panel for new deployments.

For improved evaluation accuracy, you can use multiple LLMs as judges:

# Define a pool of LLM configurations (can be used by multiple components)
llm_pool:
  defaults:
    cache_dir: ".caches/llm_cache"
    parameters:
      temperature: 0.0
      max_completion_tokens: 512
  models:
    judge-4o-mini:
      provider: openai
      model: gpt-4o-mini
    judge-4.1-mini:
      provider: openai
      model: gpt-4.1-mini

# Configure which models to use as judges
judge_panel:
  judges:
    - judge-4o-mini
    - judge-4.1-mini
  aggregation_strategy: max  # or: average, majority_vote
  # enabled_metrics: ["ragas:faithfulness"]  # Optional: limit to specific metrics
  # If enabled_metrics not set, ALL LLM metrics use the full panel

Aggregation: max (highest score), average (mean vs threshold), or majority_vote (more than half of judges must individually meet the threshold — ties fail). See Configuration Guide.
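
To make the three strategies concrete, here is an illustrative sketch of how a panel's scores could be combined against a threshold. It models only the semantics described above and is not the framework's implementation:

# Illustrative only: simplified panel aggregation semantics
def panel_passes(scores: list[float], threshold: float, strategy: str) -> bool:
    if strategy == "max":
        # Highest score from any judge is compared against the threshold
        return max(scores) >= threshold
    if strategy == "average":
        # Mean of all judge scores is compared against the threshold
        return sum(scores) / len(scores) >= threshold
    if strategy == "majority_vote":
        # More than half of the judges must individually meet the threshold; ties fail
        passing = sum(1 for s in scores if s >= threshold)
        return passing > len(scores) / 2
    raise ValueError(f"Unknown strategy: {strategy}")

# Two judges scoring faithfulness at 0.78 and 0.86 against a 0.8 threshold
print(panel_passes([0.78, 0.86], 0.8, "max"))            # True  (0.86 >= 0.8)
print(panel_passes([0.78, 0.86], 0.8, "average"))        # True  (mean 0.82 >= 0.8)
print(panel_passes([0.78, 0.86], 0.8, "majority_vote"))  # False (1 of 2 passes, a tie)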

Benefits:

  • Reduces bias from a single model
  • More robust evaluation scores
  • Per-judge token tracking for cost analysis

Evaluation Data (evaluation_data.yaml)

Simple Example:

- conversation_group_id: basic_test
  description: "Testing basic Q&A"
  
  turns:
    - turn_id: question_1
      query: "What is OpenShift?"
      response: "OpenShift is an enterprise Kubernetes platform..."
      contexts:
        - "OpenShift is Red Hat's enterprise Kubernetes distribution..."
      expected_response: "OpenShift is an enterprise Kubernetes platform"
      
      # Uses default metrics (response_relevancy)
      turn_metrics: null

Advanced Example:

- conversation_group_id: advanced_test
  description: "Testing with multiple metrics"
  
  turns:
    - turn_id: question_1
      query: "How do I reset my password?"
      response: "Click 'Forgot Password' on the login page..."
      expected_response: "Use the forgot password link"
      expected_intent: "provide instructions"
      
      # Specify exact metrics
      turn_metrics:
        - "ragas:response_relevancy"
        - "ragas:faithfulness"
        - "custom:answer_correctness"
        - "custom:intent_eval"
      
      # Override threshold for this turn
      turn_metrics_metadata:
        "ragas:faithfulness":
          threshold: 0.9          # Stricter than default

Tool Evaluation Example:

- conversation_group_id: tool_test
  turns:
    - turn_id: get_pods
      query: "Show me all pods in the default namespace"
      expected_tool_calls:
        - - tool_name: oc_get
            arguments:
              kind: pod
              namespace: default
      turn_metrics:
        - "custom:tool_eval"
      turn_metrics_metadata:
        "custom:tool_eval":
          ordered: true      # default: true
          full_match: true   # default: true (false = subset matching, all expected must be present)

Script-Based Example:

- conversation_group_id: infrastructure_test
  setup_script: "./scripts/setup_test_env.sh"
  cleanup_script: "./scripts/cleanup_test_env.sh"
  
  turns:
    - turn_id: create_namespace
      query: "Create a namespace called test-demo"
      verify_script: "./scripts/verify_namespace_exists.sh"
      turn_metrics:
        - "script:action_eval"

Skip on Failure Example:

Skip remaining turns completely (no API calls or evaluations) when a turn fails:

- conversation_group_id: dependent_workflow
  skip_on_failure: true  # Or set globally in system.yaml.
  turns:
    - turn_id: step_1
      query: "Create namespace"
      turn_metrics: ["script:action_eval"]
    - turn_id: step_2  # SKIPPED if step_1 fails
      query: "Deploy to namespace"
      turn_metrics: ["script:action_eval"]

10. Running Evaluations

Basic Evaluation

lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml

With Custom Output Directory

lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml \
  --output-dir ./my_evaluation_results

What Happens During Evaluation

  1. Configuration Validation

    • Checks all required fields
    • Validates metric selections
    • Verifies Judge LLM connectivity
  2. Data Collection

    • If API enabled: Calls your API for responses
    • If API disabled: Uses pre-filled data from YAML
  3. Metric Evaluation

    • Runs turn-level metrics for each turn
    • Runs conversation-level metrics for full conversations
    • Uses Judge LLM to score responses
  4. Scoring & Analysis

    • Compares scores against thresholds
    • Generates PASS/FAIL/ERROR/SKIPPED status
    • Calculates statistics
  5. Output Generation

    • Creates CSV, JSON, TXT files
    • Generates visualization graphs
    • Saves amended evaluation data

11. Programmatic API

In addition to the CLI, the framework can be used as a Python library. This is useful when you want to integrate evaluations into scripts, notebooks, CI pipelines, or custom tooling—without dealing with YAML files or command-line arguments.

Available Functions

| Function | Returns | Purpose |
| --- | --- | --- |
| evaluate(config, data) | list[EvaluationResult] | Evaluate a list of conversations |
| evaluate_conversation(config, data) | list[EvaluationResult] | Evaluate a single conversation |
| evaluate_turn(config, turn) | list[EvaluationResult] | Evaluate a single turn |
| evaluate_with_summary(config, data) | EvaluationSummary | Evaluate with structured statistics |
| evaluate_conversation_with_summary(config, data) | EvaluationSummary | Single conversation with statistics |
| evaluate_turn_with_summary(config, turn) | EvaluationSummary | Single turn with statistics |

The evaluate*() functions return raw result lists. The *_with_summary() variants return an EvaluationSummary that wraps results with computed statistics (overall, per-metric, per-conversation, per-tag).

Basic Example

from lightspeed_evaluation import (
    evaluate,
    EvaluationData,
    LLMConfig,
    SystemConfig,
    TurnData,
)

# 1. Build configuration
config = SystemConfig(
    llm=LLMConfig(provider="openai", model="gpt-4o-mini"),
)

# 2. Build evaluation data
data = EvaluationData(
    conversation_group_id="my_eval",
    turns=[
        TurnData(
            turn_id="t1",
            query="What is OpenShift?",
            response="OpenShift is a Kubernetes-based container platform.",
            expected_response="OpenShift is Red Hat's Kubernetes platform.",
            turn_metrics=["ragas:response_relevancy"],
        ),
    ],
)

# 3. Run evaluation
results = evaluate(config, [data])

# 4. Inspect results
for r in results:
    print(f"{r.metric_identifier}: {r.result} (score={r.score})")

Evaluating a Single Turn

Use evaluate_turn() when you want to evaluate one question-answer pair. You can override metrics without modifying the original turn object:

from lightspeed_evaluation import evaluate_turn, SystemConfig, TurnData

config = SystemConfig()
turn = TurnData(
    turn_id="t1",
    query="What is a pod?",
    response="A pod is the smallest deployable unit in Kubernetes.",
)

results = evaluate_turn(
    config,
    turn,
    metrics=["ragas:response_relevancy", "ragas:faithfulness"],
)

Evaluating a Single Conversation

Use evaluate_conversation() when you have a single EvaluationData object:

from lightspeed_evaluation import evaluate_conversation, EvaluationData, SystemConfig, TurnData

config = SystemConfig()
data = EvaluationData(
    conversation_group_id="support_conv",
    turns=[
        TurnData(turn_id="t1", query="Hello", response="Hi! How can I help?"),
        TurnData(turn_id="t2", query="What is OCP?", response="OCP is OpenShift."),
    ],
    conversation_metrics=["deepeval:knowledge_retention"],
)

results = evaluate_conversation(config, data)

Working with Results

The evaluate(), evaluate_conversation(), and evaluate_turn() functions return list[EvaluationResult]. Each result contains:

| Field | Description |
| --- | --- |
| result | Status: PASS, FAIL, ERROR, or SKIPPED |
| score | Numeric score between 0.0 and 1.0 |
| threshold | Pass/fail threshold used |
| reason | Explanation from the judge LLM |
| metric_identifier | Which metric produced this result |
| turn_id | Turn ID (for turn-level metrics) |
| conversation_group_id | Conversation group ID |

No files are generated by default—file output is the caller's responsibility. If you need CSV/JSON reports, use the OutputHandler or EvaluationSummary (see below).
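
Continuing from the Basic Example above, failures can be pulled out and printed together with the judge's reasoning. A minimal sketch, assuming the status is exposed as the string shown in the table:

# Collect failing evaluations and show why the judge scored them below threshold
failures = [r for r in results if r.result == "FAIL"]

for r in failures:
    print(f"{r.conversation_group_id} / {r.turn_id} / {r.metric_identifier}")
    print(f"  score={r.score} (threshold={r.threshold})")
    print(f"  reason: {r.reason}")

print(f"{len(failures)} of {len(results)} evaluations failed")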

Structured Results with EvaluationSummary

Use evaluate_with_summary() to get structured results with computed statistics:

from lightspeed_evaluation import (
    evaluate_with_summary,
    EvaluationData,
    EvaluationSummary,
    LLMConfig,
    SystemConfig,
    TurnData,
)

config = SystemConfig(
    llm=LLMConfig(provider="openai", model="gpt-4o-mini"),
)

data = EvaluationData(
    conversation_group_id="my_eval",
    turns=[
        TurnData(
            turn_id="t1",
            query="What is OpenShift?",
            response="OpenShift is a Kubernetes-based container platform.",
            turn_metrics=["ragas:response_relevancy"],
        ),
    ],
)

# Get structured results
summary = evaluate_with_summary(config, [data])

# Access overall statistics
print(f"Pass rate: {summary.overall.pass_rate}%")
print(f"Total: {summary.overall.total}")

# Access per-metric statistics
for metric_id, stats in summary.by_metric.items():
    print(f"{metric_id}: pass_rate={stats.pass_rate}%")
    if stats.score_statistics:
        print(f"  mean={stats.score_statistics.mean}")

# Access per-conversation statistics
for conv_id, stats in summary.by_conversation.items():
    print(f"{conv_id}: {stats.passed}/{stats.total} passed")

# Access raw results
for r in summary.results:
    print(f"{r.metric_identifier}: {r.result} (score={r.score})")

Saving Results to Files

Use OutputHandler.save() to write an EvaluationSummary to files:

from lightspeed_evaluation import OutputHandler

handler = OutputHandler(output_dir="./my_output")
files = handler.save(summary, formats=["json", "csv", "txt"])
print(f"Generated: {files}")

Bootstrap Confidence Intervals

When using the CLI, bootstrap confidence intervals are always computed for metrics with two or more scored results.

When using the programmatic API, confidence intervals are disabled by default. To enable them:

summary = evaluate_with_summary(
    config, [data],
    compute_confidence_intervals=True,
)

for metric_id, stats in summary.by_metric.items():
    ci = stats.score_statistics.confidence_interval
    if ci:
        print(f"{metric_id}: {ci['low']:.3f} - {ci['high']:.3f} (95% CI)")

CLI vs Programmatic API

| Aspect | CLI (lightspeed-eval) | Programmatic API |
| --- | --- | --- |
| Configuration | YAML files | Python objects (SystemConfig) |
| Input data | YAML files | Python objects (EvaluationData) |
| Output | CSV, JSON, TXT files + graphs | list[EvaluationResult] or EvaluationSummary |
| File output | Automatic | Optional via OutputHandler.save() |
| Use case | Standalone runs, CI jobs | Library integration, notebooks, scripts |

12. Understanding Results

Output Files

eval_output/
├── evaluation_20251028_143000_detailed.csv
├── evaluation_20251028_143000_summary.json
├── evaluation_20251028_143000_summary.txt
└── graphs/
    ├── evaluation_20251028_143000_pass_rates.png
    ├── evaluation_20251028_143000_score_distribution.png
    ├── evaluation_20251028_143000_conversation_heatmap.png
    └── evaluation_20251028_143000_status_breakdown.png

CSV File (Detailed Results)

Contains every metric evaluation with:

  • Conversation group ID and turn ID
  • Metric identifier
  • Score, threshold, status (PASS/FAIL/ERROR/SKIPPED)
  • Detailed reasoning
  • Query and response text
  • Execution time

Use for: Drilling into specific failures, detailed analysis

JSON File (Summary)

Contains:

  • Overall statistics (pass/fail/error counts)
  • Per-metric summaries
  • Score distributions (mean, median, std dev)
  • Execution metadata

Use for: Quick overview, automated processing, tracking trends

TXT File (Human-Readable)

Example:

EVALUATION SUMMARY
==================
Total Evaluations: 10
Passed: 8 (80.0%)
Failed: 2 (20.0%)
Errors: 0 (0.0%)

METRIC BREAKDOWN
================
ragas:response_relevancy:
  Mean Score: 0.85
  Pass Rate: 90%
  
ragas:faithfulness:
  Mean Score: 0.78
  Pass Rate: 70%

Use for: Quick review, executive summaries

Visualization Graphs

  1. Pass Rates Bar Chart: Compare pass rates per metric
  2. Score Distribution Box Plot: Shows score spread
  3. Conversation Heatmap: Performance across conversations
  4. Status Breakdown Pie Chart: Overall pass/fail/error distribution

Use for: Presentations, quick visual insights

Interpreting Status

  • PASS ✅: Score met or exceeded threshold
  • FAIL ❌: Score below threshold
  • ERROR ⚠️: Evaluation couldn't complete (missing data, API failure, etc.)
  • SKIPPED ⏭️: Evaluation skipped due to prior failure (when skip_on_failure is enabled)

Score Quality Levels

| Score | Quality | Recommendation |
| --- | --- | --- |
| 0.9 - 1.0 | Excellent | Production ready |
| 0.8 - 0.9 | Good | Typical threshold |
| 0.7 - 0.8 | Acceptable | Consider improvements |
| < 0.7 | Poor | Needs work |

Pass Rate Interpretation

| Pass Rate | Status | Action |
| --- | --- | --- |
| ≥ 90% | Production ready | Deploy with confidence |
| 80-90% | Good quality | Minor improvements |
| 70-80% | Acceptable for testing | Needs improvement |
| < 70% | Not ready | Significant work needed |

Part 4: Real-World Application

13. Common Use Cases

Use Case 1: Quality Assurance for Customer Support Bot

Scenario: Launching a customer support chatbot

Evaluation Strategy:

  1. Create test dataset with 50 common questions

  2. Use metrics:

    • ragas:response_relevancy (0.8)
    • ragas:faithfulness (0.8)
    • custom:answer_correctness (0.75)
  3. Configuration:

- conversation_group_id: support_qa
  turns:
    - turn_id: password_reset
      query: "How do I reset my password?"
      contexts:
        - "Password reset: Click 'Forgot Password', enter email..."
      expected_response: "Use forgot password link and check email"
      turn_metrics:
        - ragas:response_relevancy
        - ragas:faithfulness
        - custom:answer_correctness

  4. Success criteria:
    • Overall pass rate ≥ 90%
    • No faithfulness scores below 0.7
    • All high-priority questions pass

Use Case 2: Regression Testing After Model Update

Scenario: Updating to a new AI model

Strategy:

  1. Use existing production questions (100-500 samples)
  2. Run evaluation on old model → Save results
  3. Run evaluation on new model → Save results
  4. Compare results

Commands:

# Evaluate old model
lightspeed-eval \
  --system-config config/system_old_model.yaml \
  --eval-data config/prod_samples.yaml \
  --output-dir ./results_old_model

# Evaluate new model
lightspeed-eval \
  --system-config config/system_new_model.yaml \
  --eval-data config/prod_samples.yaml \
  --output-dir ./results_new_model

# Compare results
uv run python script/compare_evaluations.py \
  results_old_model/evaluation_summary.json \
  results_new_model/evaluation_summary.json

  5. Decision criteria:
    • New model must not decrease pass rate by >5%
    • Critical metrics must maintain or improve
    • Statistical significance test passes
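
If you prefer to stay in Python rather than use the compare script, a rough sketch of the same comparison with the programmatic API could look like the following. It assumes old_config and new_config are SystemConfig objects pointing at the two setups, and eval_data is the shared list of EvaluationData objects:

from lightspeed_evaluation import evaluate_with_summary

# Run the same evaluation data against both configurations
old_summary = evaluate_with_summary(old_config, eval_data)
new_summary = evaluate_with_summary(new_config, eval_data)

print(f"Old pass rate: {old_summary.overall.pass_rate}%")
print(f"New pass rate: {new_summary.overall.pass_rate}%")

# Decision rule from the criteria above: flag a drop of more than 5 points
drop = old_summary.overall.pass_rate - new_summary.overall.pass_rate
if drop > 5:
    print("Regression detected: the new model lowers the pass rate by more than 5%")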

Use Case 3: Multi-Turn Troubleshooting

Scenario: AI guides users through complex troubleshooting

Configuration:

- conversation_group_id: troubleshoot_deployment
  description: "Multi-turn deployment troubleshooting"
  
  conversation_metrics:
    - deepeval:conversation_completeness
    - deepeval:knowledge_retention
  
  turns:
    - turn_id: turn_1
      query: "My pod won't start"
      turn_metrics:
        - ragas:response_relevancy
    
    - turn_id: turn_2
      query: "It says ImagePullBackOff"
      turn_metrics:
        - ragas:response_relevancy
    
    - turn_id: turn_3
      query: "How do I fix the image registry auth?"
      turn_metrics:
        - ragas:response_relevancy
        - custom:intent_eval

Success criteria:

  • Conversation completeness ≥ 0.85
  • Knowledge retention ≥ 0.8
  • Each turn response relevancy ≥ 0.8

Use Case 4: Tool-Calling Agent Validation

Scenario: AI performs actions in Kubernetes/OpenShift

Configuration:

- conversation_group_id: tool_calling_test
  turns:
    - turn_id: list_pods
      query: "Show me pods in the production namespace"
      expected_tool_calls:
        - - tool_name: oc_get
            arguments:
              kind: pod
              namespace: production
      turn_metrics:
        - custom:tool_eval
    
    - turn_id: scale_deployment
      query: "Scale web-app to 3 replicas"
      expected_tool_calls:
        - - tool_name: oc_scale
            arguments:
              kind: deployment
              name: web-app
              replicas: 3
      turn_metrics:
        - custom:tool_eval

Success criteria: 100% tool call accuracy


Use Case 5: Infrastructure Validation

Scenario: AI creates and modifies infrastructure

Configuration:

- conversation_group_id: infra_operations
  setup_script: "./scripts/setup_test_cluster.sh"
  cleanup_script: "./scripts/cleanup_test_cluster.sh"
  
  turns:
    - turn_id: create_namespace
      query: "Create a namespace called demo-app"
      verify_script: "./scripts/verify_namespace.sh"
      turn_metrics:
        - script:action_eval

Verification Script:

#!/bin/bash
# verify_namespace.sh
kubectl get namespace demo-app > /dev/null 2>&1
exit $?

Success criteria: 100% pass rate on critical operations


14. Best Practices

1. Start Small, Scale Up

Don't: Start with 1000 questions and all metrics
Do: Start with 10-20 key questions and 2-3 core metrics

Progression:

  • Week 1: 10 questions, 2 metrics
  • Week 2: 50 questions, add metrics
  • Month 1: 100-200 questions, full suite
  • Production: 500+ questions

2. Choose the Right Metrics

| Scenario | Recommended Metrics |
| --- | --- |
| Customer Support (Single Q&A) | response_relevancy, faithfulness, answer_correctness |
| Multi-turn Conversations | conversation_completeness, knowledge_retention |
| Tool-calling Agents | tool_eval, response_relevancy |
| Infrastructure Automation | script:action_eval, tool_eval |

3. Set Realistic Thresholds

| Metric Type | Threshold | Use Case |
| --- | --- | --- |
| Production-critical | 0.85 - 0.95 | Customer-facing |
| Standard quality | 0.75 - 0.85 | General use |
| Beta/Testing | 0.70 - 0.75 | Testing phase |
| Binary metrics | 1.0 | Must match |

4. Create a Diverse Test Set

Distribution:

  • 80%: Common, expected queries
  • 15%: Edge cases
  • 5%: Negative cases (should refuse/clarify)

5. Version Control Your Configurations

Track in Git:

  • ✅ system.yaml
  • ✅ evaluation_data.yaml
  • ✅ Verification scripts
  • ✅ Expected responses

Don't track:

  • ❌ API keys
  • ❌ Output files
  • ❌ Cached results

6. Automate and Integrate

CI/CD Example:

# .github/workflows/ai_evaluation.yml
name: AI Quality Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluation
        run: |
          uv sync
          lightspeed-eval \
            --system-config config/system.yaml \
            --eval-data config/evaluation_data.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

7. Regular Regression Testing

Schedule:

  • Daily: Quick smoke tests (10-20 questions)
  • Weekly: Full regression (100-500 questions)
  • Before releases: Extended suite (1000+ questions)

8. Document Your Strategy

Create a README with:

  • Evaluation goals
  • Metric selection rationale
  • Threshold justification
  • Test set composition
  • Success criteria

9. Handle Edge Cases

Common edge cases:

  • Missing context → Should ask for more info
  • Out-of-scope → Should politely decline
  • Ambiguous queries → Should ask clarifying questions
  • Multiple valid answers → Use broader thresholds

10. Cost Management

Optimize Judge LLM costs:

  1. Use cheaper models when possible
llm:
  model: "gpt-4o-mini"  # Instead of gpt-4o

  2. Enable caching
llm:
  cache_enabled: true
  cache_dir: ".caches/llm_cache"

  3. Subset testing during development
    • Full suite: Weekly
    • Sample (10%): Daily
    • Critical questions: Per PR

15. Troubleshooting

Issue 1: "No API key found"

Error: Error: OPENAI_API_KEY environment variable not set

Solution:

export OPENAI_API_KEY="sk-your-key-here"

# Verify
echo $OPENAI_API_KEY

# Persist in shell profile
echo 'export OPENAI_API_KEY="sk-your-key-here"' >> ~/.bashrc
source ~/.bashrc

Issue 2: All Metrics Show ERROR

Symptoms: Status shows "ERROR" instead of PASS/FAIL

Common causes & solutions:

  1. Missing required fields
# ❌ Missing contexts for faithfulness
- turn_id: test1
  query: "Question"
  response: "Answer"
  turn_metrics:
    - ragas:faithfulness  # Needs contexts!

# ✅ Fixed
- turn_id: test1
  query: "Question"
  response: "Answer"
  contexts:
    - "Context document here"
  turn_metrics:
    - ragas:faithfulness

  2. Empty or null values
# ❌ Empty response
response: ""

# ✅ Provide actual response
response: "This is the answer"

Field Requirements:

| Metric | Required Fields |
| --- | --- |
| response_relevancy | query, response |
| faithfulness | response, contexts |
| context_recall | contexts, expected_response |
| answer_correctness | query, response, expected_response |
| intent_eval | query, response, expected_intent |
| tool_eval | expected_tool_calls, tool_calls |
| action_eval | verify_script (API mode) |

Issue 3: Low Faithfulness Scores

Symptoms: Faithfulness scores consistently below threshold

Diagnosis: Check CSV for reasons like "claims not supported by context"

Solutions:

  1. Add more context documents
contexts:
  - "Document 1 about topic A"
  - "Document 2 about topic B"
  - "Document 3 with more details"

  2. Adjust prompt to stick to facts
api:
  system_prompt: "Only use information from the provided context.
                  If information isn't in the context, say so."

Issue 4: Inconsistent Results

Symptoms: Same question gets different scores each time

Cause: Non-zero temperature (randomness)

Solution:

llm:
  temperature: 0.0  # Zero for deterministic evaluation

Issue 5: Evaluation is Very Slow

Solutions:

  1. Increase concurrency
core:
  max_threads: 50

  2. Enable caching
llm:
  cache_enabled: true

  3. Use faster model
llm:
  model: "gpt-4o-mini"

Issue 6: Script Execution Failed

Solutions:

  1. Check permissions
chmod +x scripts/verify.sh

  2. Verify path
# Relative path from eval data file
verify_script: "./scripts/verify.sh"

# Or absolute path
verify_script: "/full/path/to/verify.sh"

  3. Test manually
./scripts/verify.sh
echo $?  # Should be 0 for success

  4. Ensure API mode enabled
api:
  enabled: true  # Required for scripts

Issue 7: Tool Evaluation Always Fails

Solutions:

  1. Check format
# ✅ Correct (list of lists of dicts)
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        kind: pod

# ❌ Wrong
expected_tool_calls:
  tool_name: oc_get  # Missing list structure

  2. Use regex for flexible matching
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        namespace: "openshift-.*"  # Regex pattern

Issue 8: Out of Memory

Solutions:

  1. Reduce batch size
core:
  max_threads: 10  # Lower from 50

  2. Process in smaller batches
# Split evaluation data into smaller files
lightspeed-eval --eval-data config/eval_batch1.yaml
lightspeed-eval --eval-data config/eval_batch2.yaml

Part 5: Reference Materials

16. Quick Reference Tables

All Metrics at a Glance

| Metric | Score | What It Checks | Threshold | Required Fields |
| --- | --- | --- | --- | --- |
| ragas:response_relevancy | 0-1 | Answer addresses question | 0.8 | query, response |
| ragas:faithfulness | 0-1 | No made-up information | 0.8 | response, contexts |
| ragas:context_recall | 0-1 | Found all needed info | 0.8 | contexts, expected_response |
| ragas:context_relevance | 0-1 | Retrieved info is relevant | 0.7 | query, contexts |
| ragas:context_precision_* | 0-1 | Retrieved info is useful | 0.7 | query, contexts, response |
| custom:answer_correctness | 0-1 | Matches expected answer | 0.75 | query, response, expected_response |
| custom:intent_eval | 0/1 | Has right intent | 1 | query, response, expected_intent |
| custom:tool_eval | 0/1 | Called correct tools with expected results | 1 | expected_tool_calls, tool_calls |
| script:action_eval | 0/1 | Real action verified | 1 | verify_script |
| deepeval:conversation_completeness | 0-1 | User's goals achieved | 0.8 | Full conversation |
| deepeval:conversation_relevancy | 0-1 | Stayed on topic | 0.7 | Full conversation |
| deepeval:knowledge_retention | 0-1 | Remembered context | 0.7 | Full conversation |

Configuration Cheat Sheet

Minimal system.yaml:

llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0

metrics_metadata:
  turn_level:
    "ragas:response_relevancy":
      threshold: 0.8
      default: true

output:
  output_dir: "./eval_output"

Minimal evaluation_data.yaml:

- conversation_group_id: test_1
  turns:
    - turn_id: q1
      query: "What is OpenShift?"
      response: "OpenShift is..."
      contexts: ["OpenShift is..."]

Common Commands

# Basic evaluation
lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml

# Custom output directory
lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml \
  --output-dir ./results

# Compare evaluations
uv run python script/compare_evaluations.py \
  results1/summary.json \
  results2/summary.json

# Multi-provider evaluation
uv run python script/run_multi_provider_eval.py \
  --providers-config config/multi_eval_config.yaml

Score Interpretation

| Score | Quality | Pass Rate | Status |
| --- | --- | --- | --- |
| 0.9-1.0 | Excellent | ≥90% | Production ready |
| 0.8-0.9 | Good | 80-90% | Good quality |
| 0.7-0.8 | Acceptable | 70-80% | Needs improvement |
| <0.7 | Poor | <70% | Not ready |

Troubleshooting Quick Fixes

| Problem | Quick Fix |
| --- | --- |
| No API key | export OPENAI_API_KEY="..." |
| All ERROR | Check required fields for metrics |
| Low faithfulness | Add more context documents |
| Inconsistent results | Set temperature: 0.0 |
| Slow evaluation | Enable caching, increase threads |
| Script fails | Check permissions: chmod +x |
| "Metric not found" | Check spelling against supported list |

17. Resources & Links

Official Framework Documentation

LLM Provider Documentation

Learning Resources

For Beginners:

  • "Introduction to LLM Evaluation" (search for current articles)
  • Ragas Getting Started Guide
  • DeepEval tutorials

For Advanced Users:

  • RAG (Retrieval-Augmented Generation) papers
  • LLM evaluation best practices
  • Conversational AI assessment techniques

Internal Resources

This repository:

  • Main README: ../README.md
  • Agent Guidelines: ../AGENTS.md
  • Multi-Provider Evaluation: multi_provider_evaluation.md
  • Evaluation Comparison: evaluation_comparison.md
  • Sample Configurations: ../config/
  • Example Scripts: ../config/sample_scripts/

Community & Support

  • GitHub Repository: Report issues, request features
  • GitHub Discussions: Ask questions, share experiences
  • Pull Requests: Contribute improvements

Key Concepts Glossary

  • API-Enabled Mode: Real-time evaluation calling your AI system's API
  • Binary Metric: Pass/fail evaluation (0 or 1)
  • Context: Background information from knowledge base
  • Faithfulness: How well answer sticks to provided facts
  • Hallucination: AI making up information
  • Judge LLM: AI model used to evaluate another AI
  • Pass Rate: Percentage of evaluations meeting threshold
  • Ragas: Framework for retrieval-augmented generation metrics
  • Static Mode: Evaluation using pre-filled responses
  • Threshold: Minimum score required to pass
  • Turn: Single question-response pair
  • Turn-Level: Evaluation of individual Q&A pairs
  • Conversation-Level: Evaluation of multi-turn dialogues

Conclusion

This comprehensive guide has covered everything you need to know to effectively evaluate AI applications using the LightSpeed Evaluation Framework:

Understanding - What evaluation is and why it matters
Methodologies - All 13 evaluation metrics explained in plain English
Implementation - Step-by-step setup and configuration
Interpretation - Understanding and acting on results
Application - Real-world use cases and best practices
Reference - Quick lookup tables and decision trees

Next Steps

  1. Start with a pilot: Choose 10-20 key questions and 2-3 metrics
  2. Run your first evaluation: Follow the step-by-step guide
  3. Analyze results: Use the interpretation section
  4. Iterate and improve: Adjust thresholds and expand coverage
  5. Automate: Integrate into your development workflow

Getting Help

  • Connect via Slack channel: #forum-lightspeed

Last Updated: December 23, 2025
Status: Complete and Ready for Use

Feedback: Please submit suggestions via GitHub issues or pull requests.


This guide is designed to make AI evaluation accessible to everyone. Whether you're a product manager making decisions, a QA engineer testing systems, or a developer integrating evaluation into workflows, you now have everything you need to ensure your AI applications meet quality standards.

Happy Evaluating! 🚀