This guide will help you get started writing evaluation configurations for your AI agents and tools.
Create a `.agentci/evals` directory in your project root:

```bash
mkdir -p .agentci/evals
```

Your project structure should look like:

```
your-project/
├── .agentci/
│   └── evals/        # Evaluation TOML files
├── src/
└── tests/
```
Create a TOML file in `.agentci/evals/` to define test cases for your agents:

```toml
# .agentci/evals/test_accuracy.toml
[eval]
description = "Test basic agent responses"
type = "accuracy"
targets.agents = ["*"]  # Test all agents
targets.tools = []      # Skip tools

[[eval.cases]]
prompt = "What is 2+2?"
output = "4"

[[eval.cases]]
prompt = "What is the capital of France?"
output.contains = "Paris"
```

AgentCI supports six types of evaluations:
Test that outputs match expected values:
```toml
[eval]
description = "Test response accuracy"
type = "accuracy"
targets.agents = ["my-agent"]
targets.tools = []

[[eval.cases]]
prompt = "What is the capital of Japan?"
output.contains = "Tokyo"
```

Measure response time and token usage:
```toml
[eval]
description = "Test response speed"
type = "performance"
targets.agents = ["my-agent"]
targets.tools = []

[[eval.cases]]
prompt = "Quick question"
latency = { max_ms = 2000 }  # Must respond within 2 seconds
tokens = { max = 500 }       # Max 500 tokens
```

Test for harmful content and security issues:
```toml
[eval]
description = "Test prompt injection resistance"
type = "safety"
template = "prompt_injection"  # Use built-in safety tests
targets.agents = ["*"]
targets.tools = []
```

Verify reproducible outputs across multiple runs:
```toml
[eval]
description = "Test output consistency"
type = "consistency"
targets.agents = ["my-agent"]
targets.tools = []
iterations = 5  # Run each test 5 times

[[eval.cases]]
prompt = "Calculate 15 * 23"
min_similarity = 1.0  # Expect the exact same answer every time
```
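One way to read `min_similarity` is as a floor on how alike the repeated outputs must be to one another. A rough Python sketch of that interpretation, where the `similarity` metric is a stand-in rather than AgentCI's actual scoring:

```python
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Stand-in metric: 1.0 for identical strings, 0.0 otherwise.
    AgentCI's real scoring is described in the Consistency guide."""
    return 1.0 if a == b else 0.0

def passes_consistency(outputs: list[str], min_similarity: float) -> bool:
    """Require every pair of repeated outputs to meet the threshold."""
    return all(
        similarity(a, b) >= min_similarity
        for a, b in combinations(outputs, 2)
    )

# Five runs of "Calculate 15 * 23" must agree exactly when min_similarity = 1.0.
runs = ["345", "345", "345", "345", "345"]
assert passes_consistency(runs, min_similarity=1.0)
```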
Use an LLM to evaluate subjective quality:

```toml
[eval]
description = "Evaluate response quality with AI"
type = "llm"
targets.agents = ["customer-support"]
targets.tools = []

[eval.llm]
model = "gpt-4"
prompt = "Rate the helpfulness of this response (1-10)"

[eval.llm.output_schema]
score = { type = "int", min = 1, max = 10 }
reasoning = { type = "str" }

[[eval.cases]]
prompt = "I need help with my account"
score = { min = 7 }  # Expect a score of 7 or higher
```

Use your own Python evaluation logic:
```toml
[eval]
description = "Custom business logic validation"
type = "custom"
targets.agents = ["sales-agent"]
targets.tools = []

[eval.custom]
module = "my_evaluations.sales"
function = "validate_quote"

[[eval.cases]]
prompt = "Create a quote for 100 units"
parameters = { max_discount = 0.15 }
```
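The referenced function lives in your own code, here `my_evaluations/sales.py`. The signature and return type AgentCI expects are covered in the Custom guide; as an illustration only, a `validate_quote` that enforces the discount cap passed in `parameters` could look like:

```python
# my_evaluations/sales.py: illustrative sketch; see the Custom guide
# for the signature and return type AgentCI actually expects.
import re

def validate_quote(output: str, parameters: dict) -> bool:
    """Pass only if every discount quoted stays within the configured cap."""
    max_discount = parameters.get("max_discount", 0.0)
    # Hypothetical business rule: pull percentages like "12%" out of the
    # quote text and check that none of them exceeds max_discount.
    discounts = [
        float(value) / 100
        for value in re.findall(r"(\d+(?:\.\d+)?)\s*%", output)
    ]
    return all(d <= max_discount for d in discounts)
```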
You can also test tools directly by providing `context` instead of `prompt`:

```toml
[eval]
description = "Test weather API tool"
type = "accuracy"
targets.agents = []
targets.tools = ["weather-api"]

[[eval.cases]]
context = { city = "San Francisco" }  # Tool parameters

[eval.cases.output.schema]
temperature = { type = "float" }
condition = { type = "str" }
humidity = { type = "int" }
```
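Here `context` supplies the tool's arguments, and the schema validates the shape of whatever the tool returns. A hypothetical implementation that would satisfy the schema above (the function name and values are invented for illustration):

```python
# Hypothetical stand-in for whatever your project registers as "weather-api".
def get_weather(city: str) -> dict:
    """Return a payload matching the declared schema:
    temperature (float), condition (str), humidity (int)."""
    # A real tool would query a weather service; these values are canned.
    return {"temperature": 18.5, "condition": "Foggy", "humidity": 81}

result = get_weather(**{"city": "San Francisco"})  # context becomes the call arguments
assert isinstance(result["temperature"], float)
assert isinstance(result["condition"], str)
assert isinstance(result["humidity"], int)
```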
The filename becomes the evaluation name:

- `accuracy_test.toml` → evaluation name `accuracy_test`
- `performance_test.toml` → evaluation name `performance_test`
- `safety_checks.toml` → evaluation name `safety_checks`
Use descriptive names that indicate what you're testing.
Target specific agents by name, or use the `"*"` wildcard to cover everything AgentCI discovers:

```toml
[eval]
targets.agents = ["customer-support", "sales-agent"]  # Only test these
targets.tools = []
```

```toml
[eval]
targets.agents = ["*"]  # Test all discovered agents
targets.tools = []
```

A single file can also combine several cases with different matchers:

```toml
[eval]
description = "Comprehensive accuracy tests"
type = "accuracy"
targets.agents = ["*"]
targets.tools = []

[[eval.cases]]
prompt = "Test case 1"
output = "Expected result 1"

[[eval.cases]]
prompt = "Test case 2"
output.contains = "key phrase"

[[eval.cases]]
prompt = "Test case 3"
output = { similar = "semantic match", threshold = 0.8 }
```

Now that you've created your first evaluation, explore the detailed guides for each type:
- Accuracy - Exact matching, regex, semantic similarity, schema validation
- Performance - Latency, token usage, resource constraints
- Safety - Security testing with built-in templates
- Consistency - Reliability testing across multiple runs
- LLM - AI-powered quality assessment
- Custom - Write your own evaluation logic in Python
- Evaluations Overview - Complete TOML schema reference
- Python API - Programmatic usage (optional)
- Framework Configuration - Only needed for custom frameworks; LangChain, LlamaIndex, Pydantic AI, OpenAI Agents, Google ADK, and Agno are supported out of the box