Last Updated: December 23, 2025
Assisted by: AI (document generation)
- Methodologies Overview
- Turn-Level Metrics (Single Q&A)
- Conversation-Level Metrics
- Metric Selection Guide
The LightSpeed Evaluation Framework is a comprehensive system designed to evaluate AI-powered applications, particularly conversational AI systems and chatbots. This guide explains everything you need to know to evaluate your AI applications effectively—all without requiring deep technical or data science expertise.
Think of this framework as a quality control system for AI applications. Just as you might test a website to ensure all links work and pages load correctly, this framework tests AI systems to ensure they:
- Provide accurate and relevant answers
- Use correct information from their knowledge base
- Maintain context across conversations
- Call the right tools or functions when needed
- Perform expected actions in the system
- Product Managers: Understanding evaluation metrics to make informed decisions
- QA Engineers: Testing AI applications systematically
- Application Developers: Integrating evaluation into development workflows
- Technical Writers: Documenting AI application quality
- Team Leads: Overseeing AI application quality assurance
Unlike traditional software where behavior is deterministic (same input always produces same output), AI applications can produce varied responses. Evaluation helps ensure:
- Quality Assurance: Responses meet quality standards
- Consistency: Similar questions get consistent treatment
- Safety: Responses don't include harmful or incorrect information
- Performance Tracking: Monitor improvements or regressions over time
- Compliance: Meet organizational standards and requirements
- Evaluates individual question-answer pairs
- Like checking if a single customer support ticket was handled correctly
- Example: "Was the answer to 'How do I reset my password?' accurate and helpful?"
- Evaluates entire conversations with multiple back-and-forth exchanges
- Like reviewing a complete customer support conversation
- Example: "Did the AI successfully guide the user through troubleshooting across 5 messages?"
# Navigate to project directory
cd lightspeed-evaluation
# Install dependencies
uv sync
# OR using pip
pip install -e .
# Required: Judge LLM (the AI that evaluates your AI)
export OPENAI_API_KEY="sk-your-api-key-here"
# Optional: For live API testing
export API_KEY="your-api-endpoint-key"
lightspeed-eval \
--system-config config/system.yaml \
--eval-data config/evaluation_data.yaml
That's it! Results will be in the eval_output/ directory.
The framework uses four main categories of evaluation methods:
| Category | What It Does | When to Use | Level |
|---|---|---|---|
| Ragas Metrics | Industry-standard metrics for response and context quality | RAG QnA, Single-turn responses | Turn |
| DeepEval Metrics | Advanced conversation analysis | Multi-turn conversations | Conversation |
| Custom Metrics | Specialized evaluations for specific needs | Intent checking, tool validation | Turn |
| Script-Based Metrics | Real-world validation through automated scripts | E2E RAG/agent workflows | Turn |
Choose Ragas Metrics when:
- You want to verify if answers are accurate and relevant
- You need to check if the AI is using the right information
- You want industry-standard, well-documented metrics
Choose DeepEval Metrics when:
- You're evaluating multi-turn conversations
- You need to assess conversation completeness
- You want to check if the AI remembers earlier parts of the conversation
Choose Custom Metrics when:
- You have specific requirements not covered by standard metrics
- You need to compare against expected answers
- You want to verify the AI's intent or tool usage
Choose Script-Based Metrics when:
- Your AI performs actions in real systems
- You need to verify real-world outcomes
- You want to test end-to-end workflows
Turn-level metrics evaluate individual question-answer pairs.
What it measures: How well does the answer address the actual question?
Plain English: "Did the AI answer the question I asked, or did it go off-topic?"
Score Range: 0.0 to 1.0 (higher is better)
Example:
Question: "How do I reset my password?"
✓ Relevant (High Score):
"Click on 'Forgot Password' on the login page, enter your email,
and follow the reset link sent to you."
✗ Irrelevant (Low Score):
"Our system has been running for 5 years and we have excellent
security features including two-factor authentication."
When to use: Ensuring the AI stays on topic
Threshold: 0.8 or higher
Required fields: query, response
What it measures: Does the answer stick to the facts provided in the source information?
Plain English: "Is the AI making things up, or is it only using information from its knowledge base?"
Score Range: 0.0 to 1.0 (higher is better)
Example:
Context: "OpenShift Virtualization requires 4GB RAM minimum."
Question: "What are OpenShift Virtualization requirements?"
✓ Faithful (High Score):
"OpenShift Virtualization requires a minimum of 4GB RAM."
✗ Not Faithful (Low Score):
"OpenShift Virtualization requires 8GB RAM and 100GB disk space."
(The disk space wasn't in the context - made up!)
When to use: Preventing AI hallucinations (making up information)
Threshold: 0.8 or higher
Required fields: response, contexts
What it measures: Did the AI retrieve all the necessary information to answer the question?
Plain English: "Did the AI look up everything it needed to give a complete answer?"
Score Range: 0.0 to 1.0 (higher is better)
Example:
Question: "What are the storage and memory requirements for OpenShift?"
Expected Answer mentions: 120GB storage AND 16GB RAM
Retrieved Context contains:
- Document about storage (120GB) ✓
- (Missing document about memory requirements) ✗
Context Recall: 0.5 (retrieved 1 out of 2 needed pieces)
When to use: Improving search/retrieval systems
Threshold: 0.8 or higher
Required fields: contexts, expected_response
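To make the ratio concrete, here is a minimal keyword-matching sketch. The real metric uses a Judge LLM rather than substring checks; `context_recall`, `needed_facts`, and `retrieved_contexts` are hypothetical names for illustration only.

```python
def context_recall(needed_facts: list[str], retrieved_contexts: list[str]) -> float:
    """Fraction of needed facts that appear in at least one retrieved context.

    Crude substring matching stands in for the Judge LLM's semantic check.
    """
    found = sum(
        any(fact.lower() in ctx.lower() for ctx in retrieved_contexts)
        for fact in needed_facts
    )
    return found / len(needed_facts)

needed = ["120GB storage", "16GB RAM"]
retrieved = ["OpenShift requires 120GB storage for installation."]
print(context_recall(needed, retrieved))  # 0.5 (memory requirement missing)
```

Adding a context that mentions the memory requirement would bring the score to 1.0.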
What it measures: How much of the retrieved information is actually useful?
Plain English: "Is the AI pulling up relevant documents, or is it cluttering the answer with unnecessary information?"
Two variants:
- Without Reference: Uses AI's response to judge relevance
- With Reference: Uses expected answer for more accurate judgment
Score Range: 0.0 to 1.0 (higher is better)
When to use: Optimizing search algorithms, reducing noise
Threshold: 0.7 or higher
Required fields: query, contexts, response (and expected_response for "with reference" variant)
What it measures: How relevant is the retrieved context to the user's question?
Plain English: "Is the information the AI found actually related to what the user asked?"
Score Range: 0.0 to 1.0 (higher is better)
When to use: Evaluating search quality before answer generation
Threshold: 0.7 or higher
Required fields: query, contexts
What it measures: How close is the AI's answer to the expected "correct" answer?
Plain English: "On a test where we know the right answer, how well did the AI do?"
Score Range: 0.0 to 1.0 (higher is better)
How it works: A Judge LLM compares the AI's response to your expected response
Example:
Question: "What is the capital of France?"
Expected Response: "Paris"
AI Response: "The capital of France is Paris."
Score: 1.0 (Perfect match)
AI Response: "Lyon is a major city in France."
Score: 0.1 (Incorrect answer)
When to use: Testing against known question-answer pairs, benchmarking
Threshold: 0.75 or higher
Required fields: query, response, expected_response
What it measures: Does the AI's response have the right intent/purpose?
Plain English: "Is the AI trying to do what we expect it to do?"
Score: Binary (0 or 1)
Intent Categories:
- Explain a concept: "What is Kubernetes?" → Expects explanatory response
- Provide instructions: "How do I install Docker?" → Expects step-by-step guide
- Refuse/Decline: "Can you hack this system?" → Expects refusal
- Ask for clarification: Ambiguous question → Expects clarifying questions
Example:
Question: "Tell me a joke about programming"
Expected Intent: "refuse" (professional support bot should decline)
✓ Correct Intent (Score: 1):
"I apologize, but I'm designed to help with technical questions
about OpenShift. How can I assist you today?"
✗ Wrong Intent (Score: 0):
"Why do programmers prefer dark mode? Because light attracts bugs!"
When to use: Ensuring appropriate AI behavior, safety checking
Threshold: 1 (must match exactly)
Required fields: query, response, expected_intent
What it measures: Does the AI call the right tools with correct parameters and get expected results?
Plain English: "When the AI needs to use a tool, did it use the right one with the right settings, and did the tool return what we expected?"
Score: Binary (0 or 1)
How it works:
- Compares expected tool calls against actual tool calls
- Validates tool names match exactly
- Checks parameters (supports regex patterns)
- Optionally validates tool call results (supports regex patterns)
Example:
Question: "Show me all pods in the default namespace"
Expected Tool Call:
- Tool: oc_get
- Parameters: {kind: "pod", namespace: "default"}
✓ Correct (Score: 1):
Tool: oc_get, Parameters: {kind: "pod", namespace: "default"}
✗ Incorrect (Score: 0):
Tool: oc_describe, Parameters: {kind: "pod", namespace: "default"}
(wrong tool)
Pattern Matching:
# Regex support for flexible matching
expected_tool_calls:
- - tool_name: oc_get
arguments:
namespace: "openshift-light.*"  # Matches openshift-lightspeed
Result Validation (Optional):
# Validate tool call results using regex patterns
expected_tool_calls:
- - tool_name: oc_get
arguments:
kind: pod
namespace: default
result: ".*Running.*" # Verify pod is in Running state
- - tool_name: oc_create
arguments:
kind: namespace
name: test-ns
result: ".*created"  # Verify creation succeeded
When to use: Function calling AI applications, tool-using agents, validating tool outputs
Threshold: 1 (must be exact)
Required fields: expected_tool_calls, tool_calls
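The matching rules above can be sketched as follows. This is an illustrative approximation, not the framework's actual code: `match_tool_call` is a hypothetical helper, and it assumes each expected argument value is a regex pattern that must match the full actual value.

```python
import re

def match_tool_call(expected: dict, actual: dict) -> bool:
    """Return True if the actual tool call satisfies the expected one.

    Tool names must match exactly; each expected argument value is
    treated as a regex that must fully match the actual value.
    """
    if expected["tool_name"] != actual["tool_name"]:
        return False
    for key, pattern in expected.get("arguments", {}).items():
        value = str(actual.get("arguments", {}).get(key, ""))
        if not re.fullmatch(str(pattern), value):
            return False
    return True

expected = {"tool_name": "oc_get",
            "arguments": {"kind": "pod", "namespace": "openshift-light.*"}}
actual = {"tool_name": "oc_get",
          "arguments": {"kind": "pod", "namespace": "openshift-lightspeed"}}
print(match_tool_call(expected, actual))  # True
```

A call to a different tool, such as oc_describe, would fail the check even with identical arguments.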
What it measures: Did the AI's action actually work in the real system?
Plain English: "Don't just check what the AI said—check if it actually did what it was supposed to do."
Score: Binary (0 or 1)
How it works:
- AI performs an action (e.g., "Create a namespace")
- Framework runs your verification script
- Script exit code determines pass/fail (0 = success, non-zero = failure)
Example:
# verify_namespace.sh
#!/bin/bash
kubectl get namespace test-ns > /dev/null 2>&1
exit $?  # Returns 0 if namespace exists
Configuration:
- conversation_group_id: infrastructure_test
setup_script: ./scripts/setup_cluster.sh
cleanup_script: ./scripts/cleanup_cluster.sh
turns:
- turn_id: create_namespace
query: "Create a namespace called demo-app"
verify_script: ./scripts/verify_namespace.sh
turn_metrics:
- script:action_eval
When to use: Infrastructure changes, system modifications, end-to-end testing
Important: Scripts only run when API mode is enabled
Threshold: 1 (must succeed)
Required fields: verify_script (API mode must be enabled)
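The pass/fail rule is simply the script's exit code. A minimal sketch of that check, assuming a `run_verify_script` helper (hypothetical, not the framework's API):

```python
import subprocess
import sys

def run_verify_script(command: list[str]) -> int:
    """Return score 1 if the command exits 0, else 0 (mirrors action_eval's rule)."""
    completed = subprocess.run(command, capture_output=True)
    return 1 if completed.returncode == 0 else 0

# Demo with trivial commands in place of a real kubectl check:
passing = run_verify_script([sys.executable, "-c", "raise SystemExit(0)"])
failing = run_verify_script([sys.executable, "-c", "raise SystemExit(3)"])
print(passing, failing)  # 1 0
```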
Conversation-level metrics evaluate complete multi-turn dialogues.
What it measures: Did the conversation fully address what the user wanted to accomplish?
Plain English: "By the end of the conversation, did the user get everything they were looking for?"
Score Range: 0.0 to 1.0 (higher is better)
Example:
User: "I need to deploy an app to OpenShift and set up monitoring"
AI: "I can help! Let's start with deployment. What's your app name?"
User: "my-web-app"
AI: "Great! Here's how to deploy... [deployment instructions]"
User: "Done! What about monitoring?"
AI: "For monitoring, here are the steps... [monitoring setup]"
User: "Perfect, thanks!"
✓ Goal 1: Deploy app → Addressed
✓ Goal 2: Set up monitoring → Addressed
Score: 1.0 (Complete)
When to use: Evaluating customer support conversations, goal-oriented assistants
Threshold: 0.8 or higher
What it measures: How relevant are the responses throughout the conversation?
Plain English: "Does each response stay on topic?"
Score Range: 0.0 to 1.0 (higher is better)
When to use: Keeping conversations focused, detecting when AI drifts off-topic
Threshold: 0.7 or higher
What it measures: Does the AI remember and use information from earlier in the conversation?
Plain English: "Does the AI have a memory, or does it forget what was said earlier?"
Score Range: 0.0 to 1.0 (higher is better)
Example:
✓ Good Retention (High Score):
User: "My deployment is called web-app in the production namespace"
AI: "Got it. What would you like to do with web-app?"
User: "Scale it to 3 replicas"
AI: "I'll scale web-app in the production namespace to 3 replicas."
[remembers both name and namespace]
✗ Poor Retention (Low Score):
User: "My deployment is called web-app in the production namespace"
AI: "Okay, what do you want to do?"
User: "Scale it to 3 replicas"
AI: "What's the deployment name and namespace?" [forgot!]
When to use:
- Multi-turn conversations and troubleshooting sessions
- Evaluating fine-tuned models (especially useful to measure if fine-tuning improved context retention)
- Comparing base models vs fine-tuned versions for conversation ability
Threshold: 0.7 or higher
What are you evaluating?
│
├─ Single Q&A (Turn-Level)
│ │
│ ├─ Answer Quality?
│ │ ├─ Is answer relevant? → response_relevancy
│ │ ├─ Is answer factual? → faithfulness
│ │ └─ Matches expected? → answer_correctness
│ │
│ ├─ Information Retrieval?
│ │ ├─ Found everything needed? → context_recall
│ │ ├─ Is retrieved info relevant? → context_relevance
│ │ └─ Too much irrelevant info? → context_precision
│ │
│ ├─ AI Behavior?
│ │ ├─ Right intent? → intent_eval
│ │ └─ Right tools? → tool_eval
│ │
│ └─ Real Actions?
│ └─ Infrastructure changes? → action_eval
│
└─ Conversation (Conversation-Level)
├─ Goals achieved? → conversation_completeness
├─ Stayed on topic? → conversation_relevancy
└─ Remembered context? → knowledge_retention
turn_metrics:
- ragas:response_relevancy # On-topic answer?
- ragas:faithfulness # No hallucinations?
- custom:answer_correctness # Matches expected?
# Per turn:
turn_metrics:
- ragas:response_relevancy
# Full conversation:
conversation_metrics:
- deepeval:conversation_completeness
- deepeval:knowledge_retention
turn_metrics:
- custom:tool_eval # Right tool + params?
- ragas:response_relevancy # Good explanation?
turn_metrics:
- script:action_eval # Action worked?
- custom:tool_eval # Called right tool?
- Python 3.11 - 3.13
- UV package manager (recommended) or pip
- API key for a Judge LLM (e.g., OpenAI)
- Basic command line knowledge
# Navigate to project
cd lightspeed-evaluation
# Install with UV
uv sync
# OR with pip
pip install -e .
# Required: Judge LLM
export OPENAI_API_KEY="sk-your-api-key-here"
# For other providers:
# export WATSONX_API_KEY="your-key"
# export GEMINI_API_KEY="your-key"
# Optional: For live API testing
export API_KEY="your-api-endpoint-key"
# Check if command is available
lightspeed-eval --help
Minimal Configuration:
# Judge LLM settings
llm:
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.0 # Deterministic evaluation
max_tokens: 512
timeout: 300
num_retries: 3
# Default metrics and thresholds
metrics_metadata:
turn_level:
"ragas:response_relevancy":
threshold: 0.8
description: "How relevant the response is"
default: true # Used by default
"ragas:faithfulness":
threshold: 0.8
description: "Factual accuracy"
default: false # Only when specified
# Output settings
output:
output_dir: "./eval_output"
enabled_outputs:
- csv # Detailed results
- json # Statistics
- txt # Summary
# Visualization
visualization:
enabled_graphs:
- "pass_rates"
- "score_distribution"
Important Settings Explained:
- `default: true`: Metric runs automatically if no metrics are specified
- `default: false`: Metric runs only when explicitly requested
- `threshold`: Minimum score required to pass (0.0 to 1.0)
- `temperature: 0.0`: Ensures consistent, deterministic evaluation
⚠️ Note: The traditional `llm` config will be deprecated. Use `llm_pool` + `judge_panel` for new deployments.
For improved evaluation accuracy, you can use multiple LLMs as judges:
# Define a pool of LLM configurations (can be used by multiple components)
llm_pool:
defaults:
cache_dir: ".caches/llm_cache"
parameters:
temperature: 0.0
max_completion_tokens: 512
models:
judge-4o-mini:
provider: openai
model: gpt-4o-mini
judge-4.1-mini:
provider: openai
model: gpt-4.1-mini
# Configure which models to use as judges
judge_panel:
judges:
- judge-4o-mini
- judge-4.1-mini
aggregation_strategy: max # or: average, majority_vote
# enabled_metrics: ["ragas:faithfulness"] # Optional: limit to specific metrics
# If enabled_metrics is not set, ALL LLM metrics use the full panel
Aggregation: max (highest score), average (mean compared against the threshold), or majority_vote (more than half of the judges must individually meet the threshold; ties fail). See the Configuration Guide.
Benefits:
- Reduces bias from a single model
- More robust evaluation scores
- Per-judge token tracking for cost analysis
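The three aggregation strategies can be sketched as follows, assuming per-judge scores are already collected. `aggregate` is a hypothetical helper illustrating the rules described above, where majority_vote requires a strict majority, so ties fail:

```python
def aggregate(scores: list[float], threshold: float, strategy: str) -> bool:
    """Decide pass/fail for a panel of judge scores."""
    if strategy == "max":
        return max(scores) >= threshold
    if strategy == "average":
        return sum(scores) / len(scores) >= threshold
    if strategy == "majority_vote":
        votes = sum(score >= threshold for score in scores)
        return votes * 2 > len(scores)  # strict majority; a tie fails
    raise ValueError(f"unknown strategy: {strategy}")

scores = [0.9, 0.75]
print(aggregate(scores, 0.8, "max"))            # True  (0.9 passes)
print(aggregate(scores, 0.8, "average"))        # True  (mean 0.825)
print(aggregate(scores, 0.8, "majority_vote"))  # False (1 of 2 judges; tie)
```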
Simple Example:
- conversation_group_id: basic_test
description: "Testing basic Q&A"
turns:
- turn_id: question_1
query: "What is OpenShift?"
response: "OpenShift is an enterprise Kubernetes platform..."
contexts:
- "OpenShift is Red Hat's enterprise Kubernetes distribution..."
expected_response: "OpenShift is an enterprise Kubernetes platform"
# Uses default metrics (response_relevancy)
turn_metrics: null
Advanced Example:
- conversation_group_id: advanced_test
description: "Testing with multiple metrics"
turns:
- turn_id: question_1
query: "How do I reset my password?"
response: "Click 'Forgot Password' on the login page..."
expected_response: "Use the forgot password link"
expected_intent: "provide instructions"
# Specify exact metrics
turn_metrics:
- "ragas:response_relevancy"
- "ragas:faithfulness"
- "custom:answer_correctness"
- "custom:intent_eval"
# Override threshold for this turn
turn_metrics_metadata:
"ragas:faithfulness":
threshold: 0.9 # Stricter than default
Tool Evaluation Example:
- conversation_group_id: tool_test
turns:
- turn_id: get_pods
query: "Show me all pods in the default namespace"
expected_tool_calls:
- - tool_name: oc_get
arguments:
kind: pod
namespace: default
turn_metrics:
- "custom:tool_eval"
turn_metrics_metadata:
"custom:tool_eval":
ordered: true # default: true
full_match: true # default: true (false = subset matching, all expected must be present)
Script-Based Example:
- conversation_group_id: infrastructure_test
setup_script: "./scripts/setup_test_env.sh"
cleanup_script: "./scripts/cleanup_test_env.sh"
turns:
- turn_id: create_namespace
query: "Create a namespace called test-demo"
verify_script: "./scripts/verify_namespace_exists.sh"
turn_metrics:
- "script:action_eval"
Skip on Failure Example:
Skip remaining turns completely (no API calls or evaluations) when a turn fails:
- conversation_group_id: dependent_workflow
skip_on_failure: true # Or set globally in system.yaml.
turns:
- turn_id: step_1
query: "Create namespace"
turn_metrics: ["script:action_eval"]
- turn_id: step_2 # SKIPPED if step_1 fails
query: "Deploy to namespace"
turn_metrics: ["script:action_eval"]
lightspeed-eval \
--system-config config/system.yaml \
--eval-data config/evaluation_data.yaml
lightspeed-eval \
--system-config config/system.yaml \
--eval-data config/evaluation_data.yaml \
--output-dir ./my_evaluation_results
1. Configuration Validation
   - Checks all required fields
   - Validates metric selections
   - Verifies Judge LLM connectivity
2. Data Collection
   - If API enabled: Calls your API for responses
   - If API disabled: Uses pre-filled data from YAML
3. Metric Evaluation
   - Runs turn-level metrics for each turn
   - Runs conversation-level metrics for full conversations
   - Uses Judge LLM to score responses
4. Scoring & Analysis
   - Compares scores against thresholds
   - Generates PASS/FAIL/ERROR/SKIPPED status
   - Calculates statistics
5. Output Generation
   - Creates CSV, JSON, TXT files
   - Generates visualization graphs
   - Saves amended evaluation data
In addition to the CLI, the framework can be used as a Python library. This is useful when you want to integrate evaluations into scripts, notebooks, CI pipelines, or custom tooling—without dealing with YAML files or command-line arguments.
| Function | Returns | Purpose |
|---|---|---|
| evaluate(config, data) | list[EvaluationResult] | Evaluate a list of conversations |
| evaluate_conversation(config, data) | list[EvaluationResult] | Evaluate a single conversation |
| evaluate_turn(config, turn) | list[EvaluationResult] | Evaluate a single turn |
| evaluate_with_summary(config, data) | EvaluationSummary | Evaluate with structured statistics |
| evaluate_conversation_with_summary(config, data) | EvaluationSummary | Single conversation with statistics |
| evaluate_turn_with_summary(config, turn) | EvaluationSummary | Single turn with statistics |
The evaluate*() functions return raw result lists. The *_with_summary() variants return an EvaluationSummary that wraps results with computed statistics (overall, per-metric, per-conversation, per-tag).
from lightspeed_evaluation import (
evaluate,
EvaluationData,
LLMConfig,
SystemConfig,
TurnData,
)
# 1. Build configuration
config = SystemConfig(
llm=LLMConfig(provider="openai", model="gpt-4o-mini"),
)
# 2. Build evaluation data
data = EvaluationData(
conversation_group_id="my_eval",
turns=[
TurnData(
turn_id="t1",
query="What is OpenShift?",
response="OpenShift is a Kubernetes-based container platform.",
expected_response="OpenShift is Red Hat's Kubernetes platform.",
turn_metrics=["ragas:response_relevancy"],
),
],
)
# 3. Run evaluation
results = evaluate(config, [data])
# 4. Inspect results
for r in results:
print(f"{r.metric_identifier}: {r.result} (score={r.score})")
Use evaluate_turn() when you want to evaluate one question-answer pair. You can override metrics without modifying the original turn object:
from lightspeed_evaluation import evaluate_turn, SystemConfig, TurnData
config = SystemConfig()
turn = TurnData(
turn_id="t1",
query="What is a pod?",
response="A pod is the smallest deployable unit in Kubernetes.",
)
results = evaluate_turn(
config,
turn,
metrics=["ragas:response_relevancy", "ragas:faithfulness"],
)
Use evaluate_conversation() when you have a single EvaluationData object:
from lightspeed_evaluation import evaluate_conversation, EvaluationData, SystemConfig, TurnData
config = SystemConfig()
data = EvaluationData(
conversation_group_id="support_conv",
turns=[
TurnData(turn_id="t1", query="Hello", response="Hi! How can I help?"),
TurnData(turn_id="t2", query="What is OCP?", response="OCP is OpenShift."),
],
conversation_metrics=["deepeval:knowledge_retention"],
)
results = evaluate_conversation(config, data)
The evaluate() functions return list[EvaluationResult]. Each result contains:
| Field | Description |
|---|---|
| result | Status: PASS, FAIL, ERROR, or SKIPPED |
| score | Numeric score between 0.0 and 1.0 |
| threshold | Pass/fail threshold used |
| reason | Explanation from the judge LLM |
| metric_identifier | Which metric produced this result |
| turn_id | Turn ID (for turn-level metrics) |
| conversation_group_id | Conversation group ID |
No files are generated by default—file output is the caller's responsibility. If you need CSV/JSON reports, use the OutputHandler or EvaluationSummary (see below).
Use evaluate_with_summary() to get structured results with computed statistics:
from lightspeed_evaluation import (
evaluate_with_summary,
EvaluationData,
EvaluationSummary,
LLMConfig,
SystemConfig,
TurnData,
)
config = SystemConfig(
llm=LLMConfig(provider="openai", model="gpt-4o-mini"),
)
data = EvaluationData(
conversation_group_id="my_eval",
turns=[
TurnData(
turn_id="t1",
query="What is OpenShift?",
response="OpenShift is a Kubernetes-based container platform.",
turn_metrics=["ragas:response_relevancy"],
),
],
)
# Get structured results
summary = evaluate_with_summary(config, [data])
# Access overall statistics
print(f"Pass rate: {summary.overall.pass_rate}%")
print(f"Total: {summary.overall.total}")
# Access per-metric statistics
for metric_id, stats in summary.by_metric.items():
print(f"{metric_id}: pass_rate={stats.pass_rate}%")
if stats.score_statistics:
print(f" mean={stats.score_statistics.mean}")
# Access per-conversation statistics
for conv_id, stats in summary.by_conversation.items():
print(f"{conv_id}: {stats.passed}/{stats.total} passed")
# Access raw results
for r in summary.results:
print(f"{r.metric_identifier}: {r.result} (score={r.score})")
Use OutputHandler.save() to write an EvaluationSummary to files:
from lightspeed_evaluation import OutputHandler
handler = OutputHandler(output_dir="./my_output")
files = handler.save(summary, formats=["json", "csv", "txt"])
print(f"Generated: {files}")
When using the CLI, bootstrap confidence intervals are always computed for metrics with two or more scored results.
When using the programmatic API, confidence intervals are disabled by default. To enable them:
summary = evaluate_with_summary(
config, [data],
compute_confidence_intervals=True,
)
for metric_id, stats in summary.by_metric.items():
ci = stats.score_statistics.confidence_interval
if ci:
print(f"{metric_id}: {ci['low']:.3f} - {ci['high']:.3f} (95% CI)")
| Aspect | CLI (lightspeed-eval) | Programmatic API |
|---|---|---|
| Configuration | YAML files | Python objects (SystemConfig) |
| Input data | YAML files | Python objects (EvaluationData) |
| Output | CSV, JSON, TXT files + graphs | list[EvaluationResult] or EvaluationSummary |
| File output | Automatic | Optional via OutputHandler.save() |
| Use case | Standalone runs, CI jobs | Library integration, notebooks, scripts |
eval_output/
├── evaluation_20251028_143000_detailed.csv
├── evaluation_20251028_143000_summary.json
├── evaluation_20251028_143000_summary.txt
└── graphs/
├── evaluation_20251028_143000_pass_rates.png
├── evaluation_20251028_143000_score_distribution.png
├── evaluation_20251028_143000_conversation_heatmap.png
└── evaluation_20251028_143000_status_breakdown.png
Contains every metric evaluation with:
- Conversation group ID and turn ID
- Metric identifier
- Score, threshold, status (PASS/FAIL/ERROR/SKIPPED)
- Detailed reasoning
- Query and response text
- Execution time
Use for: Drilling into specific failures, detailed analysis
Contains:
- Overall statistics (pass/fail/error counts)
- Per-metric summaries
- Score distributions (mean, median, std dev)
- Execution metadata
Use for: Quick overview, automated processing, tracking trends
Example:
EVALUATION SUMMARY
==================
Total Evaluations: 10
Passed: 8 (80.0%)
Failed: 2 (20.0%)
Errors: 0 (0.0%)
METRIC BREAKDOWN
================
ragas:response_relevancy:
Mean Score: 0.85
Pass Rate: 90%
ragas:faithfulness:
Mean Score: 0.78
Pass Rate: 70%
Use for: Quick review, executive summaries
- Pass Rates Bar Chart: Compare pass rates per metric
- Score Distribution Box Plot: Shows score spread
- Conversation Heatmap: Performance across conversations
- Status Breakdown Pie Chart: Overall pass/fail/error distribution
Use for: Presentations, quick visual insights
- PASS ✅: Score met or exceeded threshold
- FAIL ❌: Score below threshold
- ERROR ⚠️: Evaluation couldn't complete (missing data, API failure, etc.)
- SKIPPED ⏭️: Evaluation skipped due to a prior failure (when `skip_on_failure` is enabled)
| Score | Quality | Recommendation |
|---|---|---|
| 0.9 - 1.0 | Excellent | Production ready |
| 0.8 - 0.9 | Good | Typical threshold |
| 0.7 - 0.8 | Acceptable | Consider improvements |
| < 0.7 | Poor | Needs work |
| Pass Rate | Status | Action |
|---|---|---|
| ≥ 90% | Production ready | Deploy with confidence |
| 80-90% | Good quality | Minor improvements |
| 70-80% | Acceptable for testing | Needs improvement |
| < 70% | Not ready | Significant work needed |
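As a rough illustration of how a pass rate is derived from result statuses, here is a hypothetical helper. Whether ERROR and SKIPPED results count toward the denominator is an assumption here; this sketch excludes them:

```python
def pass_rate(statuses: list[str]) -> float:
    """Percentage of PASS among scored (PASS/FAIL) results."""
    scored = [s for s in statuses if s in ("PASS", "FAIL")]
    if not scored:
        return 0.0
    return 100.0 * scored.count("PASS") / len(scored)

statuses = ["PASS"] * 8 + ["FAIL"] * 2 + ["SKIPPED"]
print(pass_rate(statuses))  # 80.0
```

A result of 80.0 falls in the "Good quality" band of the table above.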
Scenario: Launching a customer support chatbot
Evaluation Strategy:
1. Create a test dataset with 50 common questions
2. Use metrics:
   - ragas:response_relevancy (0.8)
   - ragas:faithfulness (0.8)
   - custom:answer_correctness (0.75)
3. Configuration:
- conversation_group_id: support_qa
turns:
- turn_id: password_reset
query: "How do I reset my password?"
contexts:
- "Password reset: Click 'Forgot Password', enter email..."
expected_response: "Use forgot password link and check email"
turn_metrics:
- ragas:response_relevancy
- ragas:faithfulness
- custom:answer_correctness
Success criteria:
- Overall pass rate ≥ 90%
- No faithfulness scores below 0.7
- All high-priority questions pass
Scenario: Updating to a new AI model
Strategy:
- Use existing production questions (100-500 samples)
- Run evaluation on old model → Save results
- Run evaluation on new model → Save results
- Compare results
Commands:
# Evaluate old model
lightspeed-eval \
--system-config config/system_old_model.yaml \
--eval-data config/prod_samples.yaml \
--output-dir ./results_old_model
# Evaluate new model
lightspeed-eval \
--system-config config/system_new_model.yaml \
--eval-data config/prod_samples.yaml \
--output-dir ./results_new_model
# Compare results
uv run python script/compare_evaluations.py \
results_old_model/evaluation_summary.json \
results_new_model/evaluation_summary.json
Decision criteria:
- New model must not decrease pass rate by >5%
- Critical metrics must maintain or improve
- Statistical significance test passes
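The first decision criterion can be automated with a small check. `regression_check` is a hypothetical helper for illustration, separate from the bundled compare_evaluations.py script:

```python
def regression_check(old_pass_rate: float, new_pass_rate: float,
                     max_drop: float = 5.0) -> bool:
    """Accept the new model only if its pass rate drops by at most max_drop points."""
    return (old_pass_rate - new_pass_rate) <= max_drop

print(regression_check(90.0, 87.0))  # True: a 3-point drop is acceptable
print(regression_check(90.0, 80.0))  # False: a 10-point drop exceeds the limit
```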
Scenario: AI guides users through complex troubleshooting
Configuration:
- conversation_group_id: troubleshoot_deployment
description: "Multi-turn deployment troubleshooting"
conversation_metrics:
- deepeval:conversation_completeness
- deepeval:knowledge_retention
turns:
- turn_id: turn_1
query: "My pod won't start"
turn_metrics:
- ragas:response_relevancy
- turn_id: turn_2
query: "It says ImagePullBackOff"
turn_metrics:
- ragas:response_relevancy
- turn_id: turn_3
query: "How do I fix the image registry auth?"
turn_metrics:
- ragas:response_relevancy
- custom:intent_eval
Success criteria:
- Conversation completeness ≥ 0.85
- Knowledge retention ≥ 0.8
- Each turn response relevancy ≥ 0.8
Scenario: AI performs actions in Kubernetes/OpenShift
Configuration:
- conversation_group_id: tool_calling_test
turns:
- turn_id: list_pods
query: "Show me pods in the production namespace"
expected_tool_calls:
- - tool_name: oc_get
arguments:
kind: pod
namespace: production
turn_metrics:
- custom:tool_eval
- turn_id: scale_deployment
query: "Scale web-app to 3 replicas"
expected_tool_calls:
- - tool_name: oc_scale
arguments:
kind: deployment
name: web-app
replicas: 3
turn_metrics:
- custom:tool_eval
Success criteria: 100% tool call accuracy
Scenario: AI creates and modifies infrastructure
Configuration:
- conversation_group_id: infra_operations
setup_script: "./scripts/setup_test_cluster.sh"
cleanup_script: "./scripts/cleanup_test_cluster.sh"
turns:
- turn_id: create_namespace
query: "Create a namespace called demo-app"
verify_script: "./scripts/verify_namespace.sh"
turn_metrics:
- script:action_eval
Verification Script:
#!/bin/bash
# verify_namespace.sh
kubectl get namespace demo-app > /dev/null 2>&1
exit $?
Success criteria: 100% pass rate on critical operations
❌ Don't: Start with 1000 questions and all metrics
✅ Do: Start with 10-20 key questions and 2-3 core metrics
Progression:
- Week 1: 10 questions, 2 metrics
- Week 2: 50 questions, add metrics
- Month 1: 100-200 questions, full suite
- Production: 500+ questions
| Scenario | Recommended Metrics |
|---|---|
| Customer Support (Single Q&A) | response_relevancy, faithfulness, answer_correctness |
| Multi-turn Conversations | conversation_completeness, knowledge_retention |
| Tool-calling Agents | tool_eval, response_relevancy |
| Infrastructure Automation | script:action_eval, tool_eval |
| Metric Type | Threshold | Use Case |
|---|---|---|
| Production-critical | 0.85 - 0.95 | Customer-facing |
| Standard quality | 0.75 - 0.85 | General use |
| Beta/Testing | 0.70 - 0.75 | Testing phase |
| Binary metrics | 1.0 | Must match |
Distribution:
- 80%: Common, expected queries
- 15%: Edge cases
- 5%: Negative cases (should refuse/clarify)
Track in Git:
- ✅ system.yaml
- ✅ evaluation_data.yaml
- ✅ Verification scripts
- ✅ Expected responses
Don't track:
- ❌ API keys
- ❌ Output files
- ❌ Cached results
CI/CD Example:

```yaml
# .github/workflows/ai_evaluation.yml
name: AI Quality Evaluation
on:
  pull_request:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evaluation
        run: |
          uv sync
          lightspeed-eval \
            --system-config config/system.yaml \
            --eval-data config/evaluation_data.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Schedule:
- Daily: Quick smoke tests (10-20 questions)
- Weekly: Full regression (100-500 questions)
- Before releases: Extended suite (1000+ questions)
Create a README with:
- Evaluation goals
- Metric selection rationale
- Threshold justification
- Test set composition
- Success criteria
Common edge cases:
- Missing context → Should ask for more info
- Out-of-scope → Should politely decline
- Ambiguous queries → Should ask clarifying questions
- Multiple valid answers → Use broader thresholds
Optimize Judge LLM costs:

- Use a cheaper model when possible:

```yaml
llm:
  model: "gpt-4o-mini"  # Instead of gpt-4o
```

- Enable caching:

```yaml
llm:
  cache_enabled: true
  cache_dir: ".caches/llm_cache"
```

- Use subset testing during development:
  - Full suite: Weekly
  - Sample (10%): Daily
  - Critical questions: Per PR
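One simple way to get a stable 10% daily sample is to hash conversation IDs: the same subset is selected on every run, so day-to-day results stay comparable. A sketch (hypothetical helper; the framework itself has no such built-in):

```python
import hashlib

def sample_subset(conversation_ids, fraction=0.10):
    """Deterministically keep roughly `fraction` of conversations by
    hashing each ID, so repeated runs pick the identical subset."""
    cutoff = int(fraction * 100)
    return [cid for cid in conversation_ids
            if int(hashlib.sha256(cid.encode()).hexdigest(), 16) % 100 < cutoff]
```

You could then write the sampled conversation groups out to a smaller evaluation data file for the daily run.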
Error: `OPENAI_API_KEY environment variable not set`

Solution:

```bash
export OPENAI_API_KEY="sk-your-key-here"

# Verify
echo $OPENAI_API_KEY

# Persist in shell profile
echo 'export OPENAI_API_KEY="sk-your-key-here"' >> ~/.bashrc
source ~/.bashrc
```

Symptoms: Status shows "ERROR" instead of PASS/FAIL
Common causes & solutions:

- Missing required fields:

```yaml
# ❌ Missing contexts for faithfulness
- turn_id: test1
  query: "Question"
  response: "Answer"
  turn_metrics:
    - ragas:faithfulness  # Needs contexts!

# ✅ Fixed
- turn_id: test1
  query: "Question"
  response: "Answer"
  contexts:
    - "Context document here"
  turn_metrics:
    - ragas:faithfulness
```

- Empty or null values:

```yaml
# ❌ Empty response
response: ""

# ✅ Provide an actual response
response: "This is the answer"
```

Field Requirements:
| Metric | Required Fields |
|---|---|
| response_relevancy | query, response |
| faithfulness | response, contexts |
| context_recall | contexts, expected_response |
| answer_correctness | query, response, expected_response |
| intent_eval | query, response, expected_intent |
| tool_eval | expected_tool_calls, tool_calls |
| action_eval | verify_script (API mode) |
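The table above can double as a pre-flight check: before running an evaluation, confirm that each turn carries the fields its requested metrics need. A sketch (field map copied from the table; the `missing_fields` helper is hypothetical, not part of the framework):

```python
# Required fields per turn-level metric, as listed in the table above.
REQUIRED_FIELDS = {
    "ragas:response_relevancy": ["query", "response"],
    "ragas:faithfulness": ["response", "contexts"],
    "ragas:context_recall": ["contexts", "expected_response"],
    "custom:answer_correctness": ["query", "response", "expected_response"],
    "custom:intent_eval": ["query", "response", "expected_intent"],
    "custom:tool_eval": ["expected_tool_calls", "tool_calls"],
}

def missing_fields(turn: dict) -> dict:
    """Map each metric requested on a turn to the fields it is missing."""
    problems = {}
    for metric in turn.get("turn_metrics", []):
        missing = [f for f in REQUIRED_FIELDS.get(metric, []) if not turn.get(f)]
        if missing:
            problems[metric] = missing
    return problems
```

Running a check like this over your evaluation data before a full run catches most "ERROR" statuses early, without spending any Judge LLM tokens.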
Symptoms: Faithfulness scores consistently below threshold

Diagnosis: Check the CSV output for reasons like "claims not supported by context"

Solutions:

- Add more context documents:

```yaml
contexts:
  - "Document 1 about topic A"
  - "Document 2 about topic B"
  - "Document 3 with more details"
```

- Adjust the system prompt to stick to the facts:

```yaml
api:
  system_prompt: "Only use information from the provided context.
    If information isn't in the context, say so."
```

Symptoms: Same question gets different scores each time

Cause: Non-zero temperature (randomness)

Solution:

```yaml
llm:
  temperature: 0.0  # Zero for deterministic evaluation
```

Solutions:
- Increase concurrency:

```yaml
core:
  max_threads: 50
```

- Enable caching:

```yaml
llm:
  cache_enabled: true
```

- Use a faster model:

```yaml
llm:
  model: "gpt-4o-mini"
```

Solutions:
- Check permissions:

```bash
chmod +x scripts/verify.sh
```

- Verify the path:

```yaml
# Relative path from eval data file
verify_script: "./scripts/verify.sh"

# Or absolute path
verify_script: "/full/path/to/verify.sh"
```

- Test manually:

```bash
./scripts/verify.sh
echo $?  # Should be 0 for success
```

- Ensure API mode is enabled:

```yaml
api:
  enabled: true  # Required for scripts
```

Solutions:
- Check the format:

```yaml
# ✅ Correct (list of lists of dicts)
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        kind: pod

# ❌ Wrong
expected_tool_calls:
  tool_name: oc_get  # Missing list structure
```

- Use regex for flexible matching:

```yaml
expected_tool_calls:
  - - tool_name: oc_get
      arguments:
        namespace: "openshift-.*"  # Regex pattern
```

Solutions:
- Reduce batch size:

```yaml
core:
  max_threads: 10  # Lower from 50
```

- Process in smaller batches:

```bash
# Split evaluation data into smaller files
lightspeed-eval --eval-data config/eval_batch1.yaml
lightspeed-eval --eval-data config/eval_batch2.yaml
```

| Metric | Score | What It Checks | Threshold | Required Fields |
|---|---|---|---|---|
| ragas:response_relevancy | 0-1 | Answer addresses question | 0.8 | query, response |
| ragas:faithfulness | 0-1 | No made-up information | 0.8 | response, contexts |
| ragas:context_recall | 0-1 | Found all needed info | 0.8 | contexts, expected_response |
| ragas:context_relevance | 0-1 | Retrieved info is relevant | 0.7 | query, contexts |
| ragas:context_precision_* | 0-1 | Retrieved info is useful | 0.7 | query, contexts, response |
| custom:answer_correctness | 0-1 | Matches expected answer | 0.75 | query, response, expected_response |
| custom:intent_eval | 0/1 | Has right intent | 1 | query, response, expected_intent |
| custom:tool_eval | 0/1 | Called correct tools with expected results | 1 | expected_tool_calls, tool_calls |
| script:action_eval | 0/1 | Real action verified | 1 | verify_script |
| deepeval:conversation_completeness | 0-1 | User's goals achieved | 0.8 | Full conversation |
| deepeval:conversation_relevancy | 0-1 | Stayed on topic | 0.7 | Full conversation |
| deepeval:knowledge_retention | 0-1 | Remembered context | 0.7 | Full conversation |
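`custom:tool_eval` is binary: the actual tool calls must line up with the expected ones, and expected argument values may be regex patterns for flexible matching. A simplified sketch of one step's comparison — assuming exact tool-name matching and full-regex matching on each argument (the framework's real matching logic may differ):

```python
import re

def step_matches(expected_calls, actual_calls) -> bool:
    """Binary check for one step: each expected call must match the
    actual call in the same position; expected argument values are
    treated as regex patterns against the stringified actual values."""
    if len(expected_calls) != len(actual_calls):
        return False
    for exp, act in zip(expected_calls, actual_calls):
        if exp["tool_name"] != act.get("tool_name"):
            return False
        for arg, pattern in exp.get("arguments", {}).items():
            value = str(act.get("arguments", {}).get(arg, ""))
            if not re.fullmatch(str(pattern), value):
                return False
    return True
```

Under these assumptions, an expected `namespace: "openshift-.*"` accepts `openshift-monitoring` but rejects `default`, which is why regex patterns help when exact argument values vary between runs.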
Minimal `system.yaml`:

```yaml
llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0
metrics_metadata:
  turn_level:
    "ragas:response_relevancy":
      threshold: 0.8
      default: true
output:
  output_dir: "./eval_output"
```

Minimal `evaluation_data.yaml`:

```yaml
- conversation_group_id: test_1
  turns:
    - turn_id: q1
      query: "What is OpenShift?"
      response: "OpenShift is..."
      contexts: ["OpenShift is..."]
```

```bash
# Basic evaluation
lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml

# Custom output directory
lightspeed-eval \
  --system-config config/system.yaml \
  --eval-data config/evaluation_data.yaml \
  --output-dir ./results

# Compare evaluations
uv run python script/compare_evaluations.py \
  results1/summary.json \
  results2/summary.json

# Multi-provider evaluation
uv run python script/run_multi_provider_eval.py \
  --providers-config config/multi_eval_config.yaml
```

| Score | Quality | Pass Rate | Status |
|---|---|---|---|
| 0.9-1.0 | Excellent | ≥90% | Production ready |
| 0.8-0.9 | Good | 80-90% | Good quality |
| 0.7-0.8 | Acceptable | 70-80% | Needs improvement |
| <0.7 | Poor | <70% | Not ready |
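The readiness buckets above reduce to a small calculation over per-turn scores: count how many meet the threshold, then map the resulting pass rate to a status. A sketch (hypothetical helper, mirroring the table):

```python
def pass_rate_status(scores, threshold=0.8):
    """Compute the pass rate against a threshold and bucket it
    according to the readiness table above."""
    rate = sum(s >= threshold for s in scores) / len(scores)
    if rate >= 0.90:
        return rate, "Production ready"
    if rate >= 0.80:
        return rate, "Good quality"
    if rate >= 0.70:
        return rate, "Needs improvement"
    return rate, "Not ready"
```

For example, nine passing turns out of ten gives a 90% pass rate and lands in the "Production ready" bucket.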
| Problem | Quick Fix |
|---|---|
| No API key | export OPENAI_API_KEY="..." |
| All ERROR | Check required fields for metrics |
| Low faithfulness | Add more context documents |
| Inconsistent results | Set temperature: 0.0 |
| Slow evaluation | Enable caching, increase threads |
| Script fails | Check permissions: chmod +x |
| "Metric not found" | Check spelling against supported list |
- Ragas Framework: https://docs.ragas.io/
- DeepEval Framework: https://deepeval.com/docs/
  - Metrics Introduction: https://deepeval.com/docs/metrics-introduction
  - Conversation Completeness: https://deepeval.com/docs/metrics-conversation-completeness
  - Knowledge Retention: https://deepeval.com/docs/metrics-knowledge-retention
- OpenAI: https://platform.openai.com/docs/
- IBM Watsonx: https://www.ibm.com/docs/en/watsonx-as-a-service
- Google Gemini: https://ai.google.dev/docs
- LiteLLM (unified interface): https://docs.litellm.ai/
For Beginners:
- "Introduction to LLM Evaluation" (search for current articles)
- Ragas Getting Started Guide
- DeepEval tutorials
For Advanced Users:
- RAG (Retrieval-Augmented Generation) papers
- LLM evaluation best practices
- Conversational AI assessment techniques
This repository:
- Main README: `../README.md`
- Agent Guidelines: `../AGENTS.md`
- Multi-Provider Evaluation: `multi_provider_evaluation.md`
- Evaluation Comparison: `evaluation_comparison.md`
- Sample Configurations: `../config/`
- Example Scripts: `../config/sample_scripts/`
- GitHub Repository: Report issues, request features
- GitHub Discussions: Ask questions, share experiences
- Pull Requests: Contribute improvements
- API-Enabled Mode: Real-time evaluation calling your AI system's API
- Binary Metric: Pass/fail evaluation (0 or 1)
- Context: Background information from knowledge base
- Faithfulness: How well answer sticks to provided facts
- Hallucination: AI making up information
- Judge LLM: AI model used to evaluate another AI
- Pass Rate: Percentage of evaluations meeting threshold
- Ragas: Framework for retrieval-augmented generation metrics
- Static Mode: Evaluation using pre-filled responses
- Threshold: Minimum score required to pass
- Turn: Single question-response pair
- Turn-Level: Evaluation of individual Q&A pairs
- Conversation-Level: Evaluation of multi-turn dialogues
This comprehensive guide has covered everything you need to know to effectively evaluate AI applications using the LightSpeed Evaluation Framework:
✅ Understanding - What evaluation is and why it matters
✅ Methodologies - All 13 evaluation metrics explained in plain English
✅ Implementation - Step-by-step setup and configuration
✅ Interpretation - Understanding and acting on results
✅ Application - Real-world use cases and best practices
✅ Reference - Quick lookup tables and decision trees
- Start with a pilot: Choose 10-20 key questions and 2-3 metrics
- Run your first evaluation: Follow the step-by-step guide
- Analyze results: Use the interpretation section
- Iterate and improve: Adjust thresholds and expand coverage
- Automate: Integrate into your development workflow
- Connect via Slack channel: #forum-lightspeed
Status: Complete and Ready for Use
Feedback: Please submit suggestions via GitHub issues or pull requests.
This guide is designed to make AI evaluation accessible to everyone. Whether you're a product manager making decisions, a QA engineer testing systems, or a developer integrating evaluation into workflows, you now have everything you need to ensure your AI applications meet quality standards.
Happy Evaluating! 🚀