A complete voice agent testing framework that simulates customer support conversations between:
- Synthetic customers with different personalities (angry, confused, technical, etc.)
- Mock support agents using your actual Acme prompts
Included test scenarios (see the persona sketch after this list):
- Angry refund seeker (Karen)
- Confused elderly user (Harold)
- Technical bug reporter (Alex)
- Friendly billing inquiry (Sarah)
- Edge case nightmare (Jordan)
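Internally, a persona is essentially a system prompt plus a few behavioral knobs. A minimal sketch of what one could look like; `CustomerPersona` and its fields are illustrative, not the actual `customer_agent.py` API:

```python
from dataclasses import dataclass

@dataclass
class CustomerPersona:
    """Illustrative persona shape; the real customer_agent.py may differ."""
    name: str
    system_prompt: str        # drives the LLM's behavior for this customer
    interruption_rate: float  # 0.0-1.0, how often they talk over the agent
    patience_turns: int       # turns before frustration escalates

KAREN = CustomerPersona(
    name="angry_refund",
    system_prompt=(
        "You are Karen, a furious customer demanding an immediate refund. "
        "Interrupt, escalate, and threaten to leave bad reviews until the "
        "agent offers a concrete resolution."
    ),
    interruption_rate=0.4,
    patience_turns=3,
)
```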
Metrics tracked:
- Conversation quality: interruptions, silence gaps, gibberish detection (see the detection sketch after this list)
- Performance: response times, speech rates, turn-taking
- Behavioral analysis: task completion, sentiment progression
- Audio quality: STT confidence, overlapping speech
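Interruptions and silence gaps fall straight out of turn timestamps: an interruption is a turn that starts before the other speaker finished; a silence gap is a pause longer than some threshold. A rough sketch of the detection logic (the turn shape and the 2-second threshold are assumptions, not the real `metrics_collector.py` internals):

```python
def detect_turn_events(turns, gap_threshold=2.0):
    """Scan (speaker, start, end) turn dicts for interruptions and silence gaps."""
    interruptions, silence_gaps = [], []
    for prev, curr in zip(turns, turns[1:]):
        if curr["start"] < prev["end"] and curr["speaker"] != prev["speaker"]:
            # Next speaker started talking before the previous one finished.
            interruptions.append({"at": curr["start"], "by": curr["speaker"]})
        elif curr["start"] - prev["end"] > gap_threshold:
            # Nobody spoke for longer than the threshold.
            silence_gaps.append({"after": prev["end"],
                                 "duration": curr["start"] - prev["end"]})
    return {"interruptions": interruptions, "silence_gaps": silence_gaps}
```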
Output and tooling:
- JSON transcripts with timestamps
- HTML dashboards with visual metrics
- Batch testing capabilities
- Real-time metric streaming
Project layout:

```
src/
├── customer_agent.py             # 5 customer personas ready to use
├── support_agent.py              # Mock Acme agent with your prompts
├── conversation_orchestrator.py  # Connects agents in rooms
├── metrics_collector.py          # Captures 20+ metrics per conversation
├── test_runner.py                # Parallel test execution
└── results_viewer.py             # HTML report generation
prompts/
└── acme_system_prompt.txt        # 200+ line comprehensive support prompt
run_test.py                       # Main entry point with interactive menu
TEST_FRAMEWORK.md                 # Complete documentation
```
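How the pieces connect: stripped of the room plumbing, the orchestrator's core job reduces to alternating turns between the two simulated agents until a stop condition. A simplified local sketch; the `respond` method and the hang-up check are illustrative, not the actual `conversation_orchestrator.py` API:

```python
async def run_conversation(customer, support, max_turns=20):
    """Alternate turns between two simulated agents; local text only, no audio."""
    transcript = []
    message = await support.respond("<call start>")   # support opens the call
    transcript.append({"speaker": "support", "text": message})
    for _ in range(max_turns):
        message = await customer.respond(message)
        transcript.append({"speaker": "customer", "text": message})
        if "goodbye" in message.lower():              # naive hang-up check
            break
        message = await support.respond(message)
        transcript.append({"speaker": "support", "text": message})
    return transcript
```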
Replace the content in `prompts/acme_system_prompt.txt` with your actual support agent instructions.
```bash
# Interactive mode (easiest)
uv run python run_test.py

# Test specific scenario
uv run python run_test.py test angry_refund

# Run all scenarios
uv run python run_test.py test all

# View results
uv run python run_test.py results

# Generate HTML report
uv run python run_test.py report
```

Each test generates:
- Full conversation transcript with timestamps
- Quality metrics (interruptions, latency, etc.)
- Performance scores
- HTML dashboard for visualization
Key capabilities:
- ✅ Both agents simulated locally for maximum control
- ✅ 5 customer scenarios covering common support cases
- ✅ 20+ metrics captured per conversation
- ✅ Parallel testing capability (run 10+ conversations simultaneously; see the sketch after this list)
- ✅ Rich transcripts with audio quality metadata
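The parallel capability is plain asyncio fan-out. A minimal sketch reusing `run_conversation` from the orchestrator sketch above; `make_agents` and the semaphore bound are assumptions, not the real `test_runner.py` interface:

```python
import asyncio

async def run_batch(scenarios, make_agents, max_concurrent=10):
    """Run many simulated conversations at once, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrent)

    async def run_one(scenario):
        async with sem:
            customer, support = make_agents(scenario)  # build the agent pair
            return scenario, await run_conversation(customer, support)

    return await asyncio.gather(*(run_one(s) for s in scenarios))

# Usage: asyncio.run(run_batch(["angry_refund", "confused_elderly"], make_agents))
```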
From here you can:
- Connect to real Acme agents via Twilio/SIP
- A/B test different prompts objectively (see the sketch after this list)
- Discover edge cases automatically
- Measure improvement quantitatively
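For A/B testing, the objective comparison can be as simple as averaging one metric across each batch of results (the result JSON shape is shown further below). A sketch, assuming each batch is a list of parsed result dicts; `compare_prompts` is a hypothetical helper, not part of the framework:

```python
from statistics import mean

def compare_prompts(results_a, results_b, metric="avg_response_time"):
    """Compare one support-side metric across two batches of result dicts."""
    score_a = mean(r["performance"]["support"][metric] for r in results_a)
    score_b = mean(r["performance"]["support"][metric] for r in results_b)
    winner = "A" if score_a < score_b else "B"  # lower latency wins here
    return {"prompt_a": score_a, "prompt_b": score_b, "winner": winner}
```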
Why it matters:
- Test prompt changes in minutes: no manual calling required
- Objective quality metrics: quantify improvements
- Edge case discovery: find failure modes automatically
- Regression testing: ensure changes don't break existing behavior
- Performance baselines: track metrics over time
Example result (abridged):

```json
{
  "conversation_id": "angry_refund_1234567",
  "duration": 87.3,
  "turns": 14,
  "quality_metrics": {
    "interruptions": {
      "count": 3,
      "details": [...]
    },
    "audio_quality_events": {
      "gibberish_count": 0,
      "silence_gaps": 2
    }
  },
  "performance": {
    "customer": {
      "avg_response_time": 0.8,
      "speech_rate": 145
    },
    "support": {
      "avg_response_time": 1.2,
      "first_response_time": 2.1
    }
  }
}
```

Next steps:
- Immediate: Run `uv run python run_test.py` to see the framework in action
- Today: Add your actual support prompt to `prompts/acme_system_prompt.txt`
- This Week: Run 50+ test conversations to establish baselines (see the aggregation sketch below)
- Next Week: Connect to your real agents via Twilio when ready
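Establishing baselines from those 50+ runs can be a few lines over the saved JSON. A sketch, assuming each result is written as an individual `*.json` file under a `results/` directory (the path and file layout are assumptions):

```python
import json
from pathlib import Path
from statistics import mean, stdev

def load_baselines(results_dir="results"):
    """Aggregate saved result JSON files into per-metric baselines."""
    runs = [json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")]
    durations = [r["duration"] for r in runs]
    interruptions = [r["quality_metrics"]["interruptions"]["count"] for r in runs]
    return {
        "runs": len(runs),
        # stdev requires at least two runs
        "duration": {"mean": mean(durations), "stdev": stdev(durations)},
        "interruptions_per_call": mean(interruptions),
    }
```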
The bottom line:
- No more manual testing: automated voice conversations at scale
- Data-driven optimization: measure, don't guess
- Faster iteration: test → measure → improve in minutes
- Quality assurance: catch issues before customers do
Under the hood:
- Built on LiveKit's production-grade infrastructure
- Uses GPT-4o-mini for cost-effective testing
- Supports parallel execution for rapid testing
- All data saved as JSON for integration with your eval framework
- Time to Build: 4 hours
- Lines of Code: ~2000
- Test Scenarios: 5 (easily expandable)
- Metrics Captured: 20+ per conversation
Ready to revolutionize your voice agent testing! 🚀